Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty save() does not works as query #327

Closed
dreadatour opened this issue Aug 20, 2024 · 25 comments
Closed

Empty save() does not works as query #327

dreadatour opened this issue Aug 20, 2024 · 25 comments
Assignees
Labels
bug Something isn't working

Comments

@dreadatour
Copy link
Contributor

Empty save() in datachain query fails with Internal error on creating dataset error.

Query example:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

This happens because if name is not set, temp dataset name will be used. At the same time, after save() is finishes, temp datasets are cleaned up.

Some explanation:

Empty save is needed because operations are lazy. Chains are not executed immediately and intermediate results are not stored.
For example, if we want to create dc_filtered_1 and dc_embeddings from dc, without save() intermediate dc chain will be executed twice, for each children.

Couple options here:

  1. Skip saved temp dataset on cleanup
  2. Use exec() instead of empty save(). Need to check if this is proper replace + need to add exec() to docs.
@dreadatour dreadatour added the bug Something isn't working label Aug 20, 2024
@skshetry skshetry self-assigned this Aug 21, 2024
@skshetry
Copy link
Member

Note that this happens as a part of session cleanup, not during DatasetQuery.cleanup().

self._cleanup_temp_datasets()

@skshetry
Copy link
Member

If I remember correctly, this was intended to work as it is.

Btw, can you share what the "Internal Error" is? It's working fine for me on CLI.

@dreadatour
Copy link
Contributor Author

If I remember correctly, this was intended to work as it is.

Interesting, @mnrozhkov says it works before, also he have workflows based in this behavior. May be he can provide more details on this? 🙏

Btw, can you share what the "Internal Error" is? It's working fine for me on CLI.

This is literally the error 👀 From this line:

DATASET_INTERNAL_ERROR_MESSAGE = "Internal error on creating dataset"

@skshetry
Copy link
Member

Looks like this is raised on Studio side.

https://github.com/iterative/studio/blob/6e7ed3fe9b30ef2631b7cee97e5eb621263e155b/backend/datachain_worker/tasks/run_job.py#L367

@skshetry
Copy link
Member

I think we should disable Session cleanup on Studio inside the scripts.

We can decide if we want to delete the datasets or hide it, but that should happen after the script runs.

@shcheklein
Copy link
Member

from datachain import DataChain
DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

folks, do you remember what was the expected product behavior in this case? may be there was a discussion on this?

I feel it's better for this to be consistent with CLI? I assume in CLI we would drop the dataset (it won't be saved). Should it be the same here?

What is happening on the Studio side when we just return a datachain w/o save() - how do we name those tables and prevent then from being collected?

@mattseddon
Copy link
Member

@shcheklein the expected behaviour is in the docstring:

def save( # type: ignore[override]
self, name: Optional[str] = None, version: Optional[int] = None
) -> "Self":
"""Save to a Dataset. It returns the chain itself.
Parameters:
name : dataset name. Empty name saves to a temporary dataset that will be
removed after process ends. Temp dataset are useful for optimization.
version : version of a dataset. Default - the last version that exist.
"""

@shcheklein
Copy link
Member

okay, so, on the Studio side - what do we do with a query result (if there is no save() at the end) - how / where do we save it?

@skshetry
Copy link
Member

okay, so, on the Studio side - what do we do with a query result (if there is no save() at the end) - how / where do we save it?

On Studio, we always save the dataset (with catalog.query(save=True).

If save is set to True, we are creating new dataset with results
from dataset query. If it's set to False, we will just print results
without saving anything

We need this for showing preview results.

Here the fix is to disable cleanup on session, which fixes the error.

We can discuss what we should do about dangling datasets. But that is already the case.

@shcheklein
Copy link
Member

thanks @skshetry! let me step back a bit:

  • is this issue (in the ticket description( happening only on Studio (just to make sure I understand it right)?
  • if it happening on the Studio side and we have save=True - what is the difference between:

chain.save()

and just

chain

why can't we just save the result in both cases in the same way and cleanup all the temp tables as we usually do?


Here the fix is to disable cleanup on session, which fixes the error.

yep, it just seems a bit strange that we have to disable GC globally to solve (what seems to me at least - I might be wrong) a "local" / specific situation with the save() at the end

We need this for showing preview results.

I think btw that preview is going into a separate model / table (in addition to the result). Are we saving the resulting dataset itself besides that preview?

But that is already the case.

could you clarify please?

@dreadatour
Copy link
Contributor Author

Random idea: may be we should store query result dataset in session and skip it while deleting temporary datasets on session cleanup? Because it is not temporary anymore, as it is a query result. This should solve the issue.

Other possible solution might be to remove query result dataset from session temporary datasets list if it is exists in this list.

@ilongin
Copy link
Contributor

ilongin commented Aug 23, 2024

This is expected behavior, just error message is bad / wrong.

Temp datasets should be used inside a query for optimization purposed, and not as a result of a query. We should just catch this error and write better error message.

@dmpetrov
Copy link
Member

We should just catch this error and write better error message.

Yes, it's just a messaging and docs issue.

A radical solution - rename save() without params to persist()

@shcheklein
Copy link
Member

I think @ilongin @dmpetrov you are on the right track and one way is to improve the message - probably the simplest solution here.

Alternative - I think we can allow running a query with a "save() without params" at the end. It can still "persist" that dataset (or we can optimize it), return datachain and then process the result in the same way as any other query result. Not clear why it is not possible or why it is not working this way now to be honest.

We can put a warning "save() at the end of query is not meaningful" or something.

@dmpetrov
Copy link
Member

dmpetrov commented Aug 24, 2024

why it is not possible or why it is not working this way now to be honest.

I'm not sure about the use case. If dataset is needed - user has to assign a name.
Without params it serves a whole different purpose - avoid re-computing. That's why renaming might be best option.

@shcheklein
Copy link
Member

I'm not sure about the use case. If dataset is needed - user has to assign a name.
Without params it serves a whole different purpose - avoid re-computing. That's why renaming might be best option.

Yep, I think I understand the difference. And there is not use case for this (you don't need indeed that extra "check point"). It still doesn't explain tbh why it is not working. It might be inefficient (extra table that will GCed almost right away), but it's not clear why it should not be possible.

So, if we make it permissive (allow save() at the end and ignore it or have that extra table):

  • less chances people will be getting an exception at the very end (and have to rerun the query)
  • to my mind it is more general - cleaner semantics. If save() returns a DC there is no immediate reason to not process it as a regular result.

If we make it strict:

  • better for education - people will know that save() has a specific purpose (?)
  • anything else??

@dmpetrov
Copy link
Member

It might be inefficient (extra table that will GCed almost right away), but it's not clear why it should not be possible.

I mean... yes, adjusting an error message is just a shortcut. The assumption is - we don't want spend time on this now (does not seem a priority for now) and we can assume that users do not do save() when compute-persist is not needed.

from_storage(...).save() or save("mycats").save() ideally should work just fine (a copy of a table is not even needed - can be optimized).

@shcheklein
Copy link
Member

from_storage(...).save() or save("mycats").save() ideally should work just fine (a copy of a table is not even needed - can be optimized).

yes, exactly

I mean... yes, adjusting an error message is just a shortcut.

yep, agreed if it saves time ... I would also not spend time optimizing it now or changing it if it requires any significant complicated logic.

I was thinking that it could be a small fix tbh - not sure on this.

@dmpetrov
Copy link
Member

I was thinking that it could be a small fix tbh - not sure on this.

If it's small fix - yes, that's wold be great. Otherwise, I'd just improve the message.

@skshetry
Copy link
Member

skshetry commented Aug 26, 2024

Note that we also have "Register Dataset" function in Studio. So we do need to save the datasets in Studio (if it has to work like a detached query).

@skshetry
Copy link
Member

Even if we do ignore that functionality, if we want to do this inside query script itself,
we'll have to add the functionality of following function inside the script:

def update_dataset_version_with_warehouse_info(

This function saves stats, preview, schema and other information of the dataset. Note that the preview has to be raw database results, so that sys columns are also included. (CLI and Studio are currently not aligned on how they use preview results, so we'll need to also fix that).

@shcheklein
Copy link
Member

shcheklein commented Aug 27, 2024

Decided to postpone this and come back later.

First steps before that to take cc @skshetry @dreadatour please create tickets for these:

  • Rename exec or save() into persist()
  • Remove the logic that requires DC at the end (as the last statement). We don't show preview in CLI / Studio if there is no explicit save(name) at the end.

@dreadatour
Copy link
Contributor Author

dreadatour commented Aug 27, 2024

Smallest fix possible for empty save() last line:

diff --git a/src/datachain/query/dataset.py b/src/datachain/query/dataset.py
index 027a713..7a0a79c 100644
--- a/src/datachain/query/dataset.py
+++ b/src/datachain/query/dataset.py
@@ -1778,6 +1778,8 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
     save = bool(os.getenv("DATACHAIN_QUERY_SAVE"))
     save_as = os.getenv("DATACHAIN_QUERY_SAVE_AS")
 
+    is_session_dataset = dataset_query.name.startswith(Session.DATASET_PREFIX)
+
     if save_as:
         if dataset_query.attached:
             dataset_name = dataset_query.name
@@ -1804,7 +1806,7 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
             )
         else:
             dataset_query = dataset_query.save(save_as)
-    elif save and not dataset_query.attached:
+    elif save and (is_session_dataset or not dataset_query.attached):
         name = catalog.generate_query_dataset_name()
         dataset_query = dataset_query.save(name)

With this fix this query will works as expected in Studio too:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

Basically, this query will be equivalent to this:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True)

@skshetry
Copy link
Member

Closed by #357.

@dreadatour
Copy link
Contributor Author

dreadatour commented Aug 27, 2024

Decided to postpone this and come back later.

Hopefully it was solved by #357 using existing mechanics.

First steps before that to take cc @skshetry @dreadatour please create tickets for these:

  • Rename exec or save() into persist()
  • Remove the logic that requires DC at the end (as the last statement). We don't show preview in CLI / Studio if there is no explicit save(name) at the end.

Follow-up issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants