Empty `save()` does not work as query #327
Note that this happens as a part of session cleanup, not during `save()` itself (see `datachain/src/datachain/query/session.py`, line 70 in `0eb7959`).
If I remember correctly, this was intended to work as it is. Btw, can you share what the "Internal Error" is? It's working fine for me on CLI.
Interesting, @mnrozhkov says it worked before, and he has workflows based on this behavior. Maybe he can provide more details on this? 🙏
This is literally the error 👀 From this line: `datachain/src/datachain/catalog/catalog.py`, line 100 in `0eb7959`.
Looks like this is raised on the Studio side.
I think we should disable session cleanup on Studio inside the scripts. We can decide whether we want to delete the datasets or hide them, but that should happen after the script runs.
folks, do you remember what the expected product behavior was in this case? Maybe there was a discussion on this? I feel it's better for this to be consistent with CLI. I assume in CLI we would drop the dataset (it won't be saved). Should it be the same here? What is happening on the Studio side when we just return a datachain w/o `save()`?
@shcheklein the expected behaviour is in the docstring: `datachain/src/datachain/lib/dc.py`, lines 562 to 571 in `49141a5`.
okay, so, on the Studio side, what do we do with a query result (if there is no `save()`)?
On Studio, we always save the dataset (see `datachain/src/datachain/catalog/catalog.py`, lines 1988 to 1990 in `49141a5`). We need this for showing preview results. Here the fix is to disable cleanup on the session, which fixes the error. We can discuss what we should do about dangling datasets, but that is already the case.
thanks @skshetry! let me step back a bit:

- why can't we just save the result in both cases in the same way and clean up all the temp tables as we usually do?
- yep, it just seems a bit strange that we have to disable GC globally to solve (what seems to me at least, I might be wrong) a "local" / specific situation with the empty `save()`
- I think btw that the preview goes into a separate model / table (in addition to the result). Are we saving the resulting dataset itself besides that preview? could you clarify please?
Random idea: maybe we should store the query result dataset in the session and skip it while deleting temporary datasets on session cleanup? Because it is not temporary anymore, as it is a query result. This should solve the issue. Another possible solution might be to remove the query result dataset from the session's temporary datasets list if it exists in this list.
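A minimal sketch of that idea, for illustration only: remember which dataset is the query result and skip it during session cleanup. All names here (`Session`, `mark_as_result`, etc.) are hypothetical, not the actual datachain API.

```python
# Sketch: remember which dataset is the query result and skip it during
# session cleanup. All names here are hypothetical, not datachain's API.


class Session:
    DATASET_PREFIX = "session_"

    def __init__(self):
        self.temp_datasets = []     # names of session-scoped temp datasets
        self.result_dataset = None  # the one dataset we must not drop

    def add_temp_dataset(self, name):
        self.temp_datasets.append(name)

    def mark_as_result(self, name):
        # A query result is not temporary anymore, so remember it.
        self.result_dataset = name

    def cleanup(self):
        # Drop every temp dataset except the one holding the query result.
        dropped = [n for n in self.temp_datasets if n != self.result_dataset]
        self.temp_datasets = [
            n for n in self.temp_datasets if n == self.result_dataset
        ]
        return dropped


session = Session()
session.add_temp_dataset("session_tmp_1")
session.add_temp_dataset("session_tmp_2")
session.mark_as_result("session_tmp_2")
print(session.cleanup())  # ['session_tmp_1']: the result survives cleanup
```

The second option from the comment (removing the result from the temp list instead of skipping it) would be an equivalent one-liner at `save()` time rather than a check at cleanup time.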
This is expected behavior, just the error message is bad / wrong. Temp datasets should be used inside a query for optimization purposes, and not as the result of a query. We should just catch this error and write a better error message.
Yes, it's just a messaging and docs issue. A radical solution: rename …
I think @ilongin @dmpetrov you are on the right track, and one way is to improve the message (probably the simplest solution here). Alternative: I think we can allow running a query with an empty `save()`. We can put a warning in that case.
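The "warning" variant could be as simple as the sketch below. `save` here is an illustrative stand-in, not the real `DataChain.save()` implementation, and the message text is made up:

```python
# Sketch of the "warning" variant: an unnamed save() still runs, but warns
# that the result will not be persisted. Illustrative stand-in only.
import warnings


def save(name=None):
    if name is None:
        warnings.warn(
            "save() without a name computes the chain, but the resulting "
            "dataset is temporary and will be removed on session cleanup; "
            "pass a name to persist it.",
            UserWarning,
            stacklevel=2,
        )
    # ... compute and persist the dataset here ...


save()              # emits the UserWarning above
save("my_dataset")  # no warning
```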
I'm not sure about the use case. If the dataset is needed, the user has to assign a name.
Yep, I think I understand the difference. And there is no use case for this (indeed, you don't need that extra "checkpoint"). It still doesn't explain, tbh, why it is not working. It might be inefficient (an extra table that will be GCed almost right away), but it's not clear why it should not be possible. So, if we make it permissive (allow the empty `save()`): …

If we make it strict: …
I mean... yes, adjusting an error message is just a shortcut. The assumption is: we don't want to spend time on this now (it does not seem a priority for now), and we can assume that users do not call `save()` when compute-and-persist is not needed.
yes, exactly
yep, agreed, if it saves time... I would also not spend time optimizing it now, or changing it if it requires any significant, complicated logic. I was thinking that it could be a small fix, tbh - not sure on this.
If it's a small fix, then yes, that would be great. Otherwise, I'd just improve the message.
Note that we also have a "Register Dataset" function in Studio. So we do need to save the datasets in Studio (if it has to work like a detached query).
Even if we do ignore that functionality, if we want to do this inside the query script itself, see `datachain/src/datachain/catalog/catalog.py`, line 1144 in `f0db01d`. This function saves stats, preview, schema, and other information about the dataset. Note that the preview has to be raw database results, so that sys columns are also included. (CLI and Studio are currently not aligned on how they use preview results, so we'll need to also fix that.)
Decided to postpone this and come back later. First steps to take before that (cc @skshetry @dreadatour, please create tickets for these):
Smallest fix possible for the empty `save()`:

```diff
diff --git a/src/datachain/query/dataset.py b/src/datachain/query/dataset.py
index 027a713..7a0a79c 100644
--- a/src/datachain/query/dataset.py
+++ b/src/datachain/query/dataset.py
@@ -1778,6 +1778,8 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
     save = bool(os.getenv("DATACHAIN_QUERY_SAVE"))
     save_as = os.getenv("DATACHAIN_QUERY_SAVE_AS")
 
+    is_session_dataset = dataset_query.name.startswith(Session.DATASET_PREFIX)
+
     if save_as:
         if dataset_query.attached:
             dataset_name = dataset_query.name
@@ -1804,7 +1806,7 @@ def query_wrapper(dataset_query: DatasetQuery) -> DatasetQuery:
             )
         else:
             dataset_query = dataset_query.save(save_as)
-    elif save and not dataset_query.attached:
+    elif save and (is_session_dataset or not dataset_query.attached):
         name = catalog.generate_query_dataset_name()
         dataset_query = dataset_query.save(name)
```

With this fix, this query will work as expected in Studio too:

```python
from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True).save()
```

Basically, this query will be equivalent to this:

```python
from datachain import DataChain

DataChain.from_storage("gs://dvcx-datacomp-small/metadata", anon=True)
```
Closed by #357. |
Hopefully it was solved by #357 using existing mechanics.
Follow-up issues:
Empty `save()` in a datachain query fails with an "Internal error on creating dataset" error.

Query example:

This happens because if `name` is not set, a temp dataset name is used. At the same time, after `save()` finishes, temp datasets are cleaned up.

Some explanation:

Couple of options here:

- `exec()` instead of empty `save()`. Need to check if this is a proper replacement + need to add `exec()` to docs.
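The failure mode described above can be reduced to a toy model (all names hypothetical, not the datachain internals): an unnamed `save()` result falls back to a session-temp name, and session cleanup then drops the very dataset the query just produced.

```python
# Toy model of the bug: an unnamed save() result gets a session-temp name,
# and session cleanup then drops the very dataset the query produced.
# Names are hypothetical, not the datachain internals.
import itertools

DATASET_PREFIX = "session_"
_counter = itertools.count(1)
datasets = {}  # name -> rows


def save(rows, name=None):
    if name is None:
        # No name given: fall back to a temp, session-scoped name.
        name = f"{DATASET_PREFIX}{next(_counter)}"
    datasets[name] = rows
    return name


def session_cleanup():
    # Drops *all* session-prefixed datasets, including the query result.
    for name in [n for n in datasets if n.startswith(DATASET_PREFIX)]:
        del datasets[name]


result = save([1, 2, 3])   # unnamed -> gets a "session_..." temp name
session_cleanup()          # wipes the temp datasets
print(result in datasets)  # False: the result is gone
```

The fix discussed in the thread amounts to making the result survive this cleanup, either by exempting it or by saving it under a non-session name.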