Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-UDF table logic creates unnecessary copies #457

Open
rlamy opened this issue Sep 18, 2024 · 2 comments
Open

Pre-UDF table logic creates unnecessary copies #457

rlamy opened this issue Sep 18, 2024 · 2 comments

Comments

@rlamy
Copy link
Contributor

rlamy commented Sep 18, 2024

The pre-UDF logic in

def process_input_query(self, query: Select) -> tuple[Select, list["Table"]]:
if os.getenv("DATACHAIN_DISABLE_QUERY_CACHE", "") not in ("", "0"):
return query, []
table = self.catalog.warehouse.create_pre_udf_table(query)
q: Select = sqlalchemy.select(*table.c)
if query._order_by_clauses:
# we are adding ordering only if it's explicitly added by user in
# query part before adding signals
q = q.order_by(table.c.sys__id)
return q, [table]
unconditionally copies the input query into a new table, which is expensive and useless in most cases.

For context, this was introduced in https://github.com/iterative/dvcx/pull/1068

@ilongin
Copy link
Contributor

ilongin commented Sep 18, 2024

How will we overcome the issue that was solved in that https://github.com/iterative/dvcx/pull/1068 you mentioned if we are not going to copy query to temp table?

@rlamy
Copy link
Contributor Author

rlamy commented Sep 19, 2024

IIUC, that issue only happens when the result-set of the query is non-deterministic, i.e. only when there's a non-deterministic filter on the rows (such as LIMIT). So I can see 2 solutions:

  • Detect deterministic cases and avoid copying in that case.
  • Rely on row ids to avoid running the filter twice: create the UDF table based on the original query, and then inner-join on row ids with an unfiltered version of the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants