Empty save() does not works as query #327

dreadatour · 2024-08-20T14:53:15Z

Empty save() in datachain query fails with Internal error on creating dataset error.

Query example:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

This happens because if name is not set, temp dataset name will be used. At the same time, after save() is finishes, temp datasets are cleaned up.

Some explanation:

Empty save is needed because operations are lazy. Chains are not executed immediately and intermediate results are not stored.
For example, if we want to create dc_filtered_1 and dc_embeddings from dc, without save() intermediate dc chain will be executed twice, for each children.

Couple options here:

Skip saved temp dataset on cleanup
Use exec() instead of empty save(). Need to check if this is proper replace + need to add exec() to docs.

The text was updated successfully, but these errors were encountered:

skshetry · 2024-08-22T12:22:56Z

Note that this happens as a part of session cleanup, not during DatasetQuery.cleanup().

datachain/src/datachain/query/session.py

Line 70 in 0eb7959

self._cleanup_temp_datasets()

skshetry · 2024-08-22T12:27:57Z

If I remember correctly, this was intended to work as it is.

Btw, can you share what the "Internal Error" is? It's working fine for me on CLI.

dreadatour · 2024-08-22T13:49:14Z

If I remember correctly, this was intended to work as it is.

Interesting, @mnrozhkov says it works before, also he have workflows based in this behavior. May be he can provide more details on this? 🙏

Btw, can you share what the "Internal Error" is? It's working fine for me on CLI.

This is literally the error 👀 From this line:

datachain/src/datachain/catalog/catalog.py

Line 100 in 0eb7959

DATASET_INTERNAL_ERROR_MESSAGE = "Internal error on creating dataset"

skshetry · 2024-08-22T14:00:24Z

Looks like this is raised on Studio side.

https://github.com/iterative/studio/blob/6e7ed3fe9b30ef2631b7cee97e5eb621263e155b/backend/datachain_worker/tasks/run_job.py#L367

skshetry · 2024-08-22T14:43:57Z

I think we should disable Session cleanup on Studio inside the scripts.

We can decide if we want to delete the datasets or hide it, but that should happen after the script runs.

shcheklein · 2024-08-22T18:26:10Z

from datachain import DataChain
DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

folks, do you remember what was the expected product behavior in this case? may be there was a discussion on this?

I feel it's better for this to be consistent with CLI? I assume in CLI we would drop the dataset (it won't be saved). Should it be the same here?

What is happening on the Studio side when we just return a datachain w/o save() - how do we name those tables and prevent then from being collected?

mattseddon · 2024-08-22T21:48:54Z

@shcheklein the expected behaviour is in the docstring:

datachain/src/datachain/lib/dc.py

Lines 562 to 571 in 49141a5

    
               def save(  # type: ignore[override] 
        
                   self, name: Optional[str] = None, version: Optional[int] = None 
        
               ) -> "Self": 
        
                   """Save to a Dataset. It returns the chain itself. 
        
                   Parameters: 
        
                       name : dataset name. Empty name saves to a temporary dataset that will be 
        
                           removed after process ends. Temp dataset are useful for optimization. 
        
                       version : version of a dataset. Default - the last version that exist. 
        
                   """

shcheklein · 2024-08-22T21:51:22Z

okay, so, on the Studio side - what do we do with a query result (if there is no save() at the end) - how / where do we save it?

skshetry · 2024-08-23T03:10:56Z

okay, so, on the Studio side - what do we do with a query result (if there is no save() at the end) - how / where do we save it?

On Studio, we always save the dataset (with catalog.query(save=True).

datachain/src/datachain/catalog/catalog.py

Lines 1988 to 1990 in 49141a5

    
                   If save is set to True, we are creating new dataset with results 
        
                   from dataset query. If it's set to False, we will just print results 
        
                   without saving anything

We need this for showing preview results.

Here the fix is to disable cleanup on session, which fixes the error.

We can discuss what we should do about dangling datasets. But that is already the case.

shcheklein · 2024-08-23T05:36:58Z

thanks @skshetry! let me step back a bit:

is this issue (in the ticket description( happening only on Studio (just to make sure I understand it right)?
if it happening on the Studio side and we have save=True - what is the difference between:

chain.save()

and just

chain

why can't we just save the result in both cases in the same way and cleanup all the temp tables as we usually do?

Here the fix is to disable cleanup on session, which fixes the error.

yep, it just seems a bit strange that we have to disable GC globally to solve (what seems to me at least - I might be wrong) a "local" / specific situation with the save() at the end

We need this for showing preview results.

I think btw that preview is going into a separate model / table (in addition to the result). Are we saving the resulting dataset itself besides that preview?

But that is already the case.

could you clarify please?

dreadatour · 2024-08-23T08:56:37Z

Random idea: may be we should store query result dataset in session and skip it while deleting temporary datasets on session cleanup? Because it is not temporary anymore, as it is a query result. This should solve the issue.

Other possible solution might be to remove query result dataset from session temporary datasets list if it is exists in this list.

ilongin · 2024-08-23T23:31:08Z

This is expected behavior, just error message is bad / wrong.

Temp datasets should be used inside a query for optimization purposed, and not as a result of a query. We should just catch this error and write better error message.

dmpetrov · 2024-08-24T00:14:14Z

We should just catch this error and write better error message.

Yes, it's just a messaging and docs issue.

A radical solution - rename save() without params to persist()

shcheklein · 2024-08-24T00:21:16Z

I think @ilongin @dmpetrov you are on the right track and one way is to improve the message - probably the simplest solution here.

Alternative - I think we can allow running a query with a "save() without params" at the end. It can still "persist" that dataset (or we can optimize it), return datachain and then process the result in the same way as any other query result. Not clear why it is not possible or why it is not working this way now to be honest.

We can put a warning "save() at the end of query is not meaningful" or something.

dmpetrov · 2024-08-24T01:17:19Z

why it is not possible or why it is not working this way now to be honest.

I'm not sure about the use case. If dataset is needed - user has to assign a name.
Without params it serves a whole different purpose - avoid re-computing. That's why renaming might be best option.

shcheklein · 2024-08-24T01:26:22Z

I'm not sure about the use case. If dataset is needed - user has to assign a name.
Without params it serves a whole different purpose - avoid re-computing. That's why renaming might be best option.

Yep, I think I understand the difference. And there is not use case for this (you don't need indeed that extra "check point"). It still doesn't explain tbh why it is not working. It might be inefficient (extra table that will GCed almost right away), but it's not clear why it should not be possible.

So, if we make it permissive (allow save() at the end and ignore it or have that extra table):

less chances people will be getting an exception at the very end (and have to rerun the query)
to my mind it is more general - cleaner semantics. If save() returns a DC there is no immediate reason to not process it as a regular result.

If we make it strict:

better for education - people will know that save() has a specific purpose (?)
anything else??

dmpetrov · 2024-08-24T01:38:02Z

It might be inefficient (extra table that will GCed almost right away), but it's not clear why it should not be possible.

I mean... yes, adjusting an error message is just a shortcut. The assumption is - we don't want spend time on this now (does not seem a priority for now) and we can assume that users do not do save() when compute-persist is not needed.

from_storage(...).save() or save("mycats").save() ideally should work just fine (a copy of a table is not even needed - can be optimized).

shcheklein · 2024-08-24T01:42:32Z

from_storage(...).save() or save("mycats").save() ideally should work just fine (a copy of a table is not even needed - can be optimized).

yes, exactly

I mean... yes, adjusting an error message is just a shortcut.

yep, agreed if it saves time ... I would also not spend time optimizing it now or changing it if it requires any significant complicated logic.

I was thinking that it could be a small fix tbh - not sure on this.

dmpetrov · 2024-08-24T01:49:12Z

I was thinking that it could be a small fix tbh - not sure on this.

If it's small fix - yes, that's wold be great. Otherwise, I'd just improve the message.

skshetry · 2024-08-26T11:34:24Z

Note that we also have "Register Dataset" function in Studio. So we do need to save the datasets in Studio (if it has to work like a detached query).

skshetry · 2024-08-26T11:39:12Z

Even if we do ignore that functionality, if we want to do this inside query script itself,
we'll have to add the functionality of following function inside the script:

datachain/src/datachain/catalog/catalog.py

Line 1144 in f0db01d

def update_dataset_version_with_warehouse_info(

This function saves stats, preview, schema and other information of the dataset. Note that the preview has to be raw database results, so that sys columns are also included. (CLI and Studio are currently not aligned on how they use preview results, so we'll need to also fix that).

shcheklein · 2024-08-27T04:31:21Z

Decided to postpone this and come back later.

First steps before that to take cc @skshetry @dreadatour please create tickets for these:

Rename exec or save() into persist()
Remove the logic that requires DC at the end (as the last statement). We don't show preview in CLI / Studio if there is no explicit save(name) at the end.

dreadatour · 2024-08-27T05:47:52Z

Smallest fix possible for empty save() last line:

diff --git a/src/datachain/query/dataset.py b/src/datachain/query/dataset.py
index 027a713..7a0a79c 100644
--- a/src/datachain/query/dataset.py
+++ b/src/datachain/query/dataset.py
@@ -1778,6 +1778,8 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
     save = bool(os.getenv("DATACHAIN_QUERY_SAVE"))
     save_as = os.getenv("DATACHAIN_QUERY_SAVE_AS")
 
+    is_session_dataset = dataset_query.name.startswith(Session.DATASET_PREFIX)
+
     if save_as:
         if dataset_query.attached:
             dataset_name = dataset_query.name
@@ -1804,7 +1806,7 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
             )
         else:
             dataset_query = dataset_query.save(save_as)
-    elif save and not dataset_query.attached:
+    elif save and (is_session_dataset or not dataset_query.attached):
         name = catalog.generate_query_dataset_name()
         dataset_query = dataset_query.save(name)

With this fix this query will works as expected in Studio too:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()

Basically, this query will be equivalent to this:

from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True)

skshetry · 2024-08-27T07:08:00Z

Closed by #357.

dreadatour · 2024-08-27T07:20:22Z

Decided to postpone this and come back later.

Hopefully it was solved by #357 using existing mechanics.

First steps before that to take cc @skshetry @dreadatour please create tickets for these:

Rename exec or save() into persist()

Remove the logic that requires DC at the end (as the last statement). We don't show preview in CLI / Studio if there is no explicit save(name) at the end.

Follow-up issues:

dreadatour added the bug Something isn't working label Aug 20, 2024

skshetry self-assigned this Aug 21, 2024

skshetry mentioned this issue Aug 22, 2024

session: support disabling temp datasets cleanup on exit via an envvar #343

Closed

shcheklein assigned dreadatour Aug 27, 2024

dreadatour mentioned this issue Aug 27, 2024

Fix empty 'save()' as query last statement #357

Merged

skshetry removed their assignment Aug 27, 2024

skshetry closed this as completed Aug 27, 2024

dreadatour mentioned this issue Aug 27, 2024

New persist() method #361

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empty save() does not works as query #327

Empty save() does not works as query #327

dreadatour commented Aug 20, 2024

skshetry commented Aug 22, 2024

skshetry commented Aug 22, 2024

dreadatour commented Aug 22, 2024

skshetry commented Aug 22, 2024

skshetry commented Aug 22, 2024

shcheklein commented Aug 22, 2024

mattseddon commented Aug 22, 2024

shcheklein commented Aug 22, 2024

skshetry commented Aug 23, 2024

shcheklein commented Aug 23, 2024

dreadatour commented Aug 23, 2024

ilongin commented Aug 23, 2024

dmpetrov commented Aug 24, 2024

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024 •

edited

Loading

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024

skshetry commented Aug 26, 2024 •

edited

Loading

skshetry commented Aug 26, 2024

shcheklein commented Aug 27, 2024 •

edited by dreadatour

Loading

dreadatour commented Aug 27, 2024 •

edited

Loading

skshetry commented Aug 27, 2024

dreadatour commented Aug 27, 2024 •

edited

Loading

Empty save() does not works as query #327

Empty save() does not works as query #327

Comments

dreadatour commented Aug 20, 2024

skshetry commented Aug 22, 2024

skshetry commented Aug 22, 2024

dreadatour commented Aug 22, 2024

skshetry commented Aug 22, 2024

skshetry commented Aug 22, 2024

shcheklein commented Aug 22, 2024

mattseddon commented Aug 22, 2024

shcheklein commented Aug 22, 2024

skshetry commented Aug 23, 2024

shcheklein commented Aug 23, 2024

dreadatour commented Aug 23, 2024

ilongin commented Aug 23, 2024

dmpetrov commented Aug 24, 2024

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024 • edited Loading

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024

shcheklein commented Aug 24, 2024

dmpetrov commented Aug 24, 2024

skshetry commented Aug 26, 2024 • edited Loading

skshetry commented Aug 26, 2024

shcheklein commented Aug 27, 2024 • edited by dreadatour Loading

dreadatour commented Aug 27, 2024 • edited Loading

skshetry commented Aug 27, 2024

dreadatour commented Aug 27, 2024 • edited Loading

dmpetrov commented Aug 24, 2024 •

edited

Loading

skshetry commented Aug 26, 2024 •

edited

Loading

shcheklein commented Aug 27, 2024 •

edited by dreadatour

Loading

dreadatour commented Aug 27, 2024 •

edited

Loading

dreadatour commented Aug 27, 2024 •

edited

Loading