Add `--allow-missing` for `dvc commit` #10524

ermolaev94 · 2024-08-14T13:34:53Z

Sometimes it's necessary to update pipeline hashes and params without downloading large datasets. Currently I can't run dvc commit on a machine without the data, even if I'd only like to update my source code.

It would be great to have a flag, similar to the one in dvc repro, that provides a way to ignore files that are not in the cache.


shcheklein · 2024-08-14T18:50:32Z

Makes sense, should be quite straightforward to add. I hope someone from the community can pick it up.

anunayasri · 2024-09-10T18:35:08Z

I can give this a shot.

@ermolaev94 I am new to the community and need some help here. I understand that you need an --allow-missing flag for the dvc commit command. The flag will skip the downloading step mentioned in the pipeline definition in dvc.yaml. Am I correct? Do you want to add something to this? I see that the --allow-missing flag exists in other commands as well. Do you think you would need this feature in other CLI commands too?

shcheklein · 2024-09-10T22:34:22Z

@anunayasri thanks! The scope for this is to add --allow-missing to the dvc commit command only for now (we can see if we need it somewhere else later; I would wait for some demand).

The idea is that dvc commit would skip updating the .dvc and dvc.lock entries (keeping the previous values) for DVC-tracked outputs / data that don't exist (not dvc pull-ed) in the workspace.
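To make the intended semantics concrete, here is a minimal sketch of "keep the previous value for anything missing". It is a toy model, not DVC code: the function name `updated_lock_entries`, the dict-based "lock file", and the hash strings are all hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def updated_lock_entries(
    previous: Dict[str, str],
    workspace: Dict[str, Optional[str]],
    allow_missing: bool = False,
) -> Dict[str, str]:
    """Return new path -> hash entries (toy model, not DVC's implementation).

    `previous` holds the hashes already recorded; `workspace` maps each
    tracked path to its current hash, or None when the file is absent.
    """
    result = {}
    for path, old_hash in previous.items():
        current = workspace.get(path)
        if current is None:
            if not allow_missing:
                raise FileNotFoundError(f"output does not exist: {path}")
            # Missing in the workspace: keep the previously recorded value.
            result[path] = old_hash
        else:
            # Present in the workspace: record the current hash.
            result[path] = current
    return result


# Hypothetical state: the dataset was never pulled, the source file was edited.
previous = {"data/data.xml": "aaa111", "src/featurize.py": "bbb222"}
workspace = {"data/data.xml": None, "src/featurize.py": "ccc333"}
print(updated_lock_entries(previous, workspace, allow_missing=True))
```

With `allow_missing=False` the same call raises, which roughly mirrors the current failure when the data is not in the workspace.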

I think the use case comes from a practical issue like this. Let's say we have a pipeline that has a lot of data (input), output (models), etc. It also depends on some python files (source code tracked by Git).

Let's say we added a comment in that source file and we don't want to run the pipeline again (since it doesn't change the result but takes a lot of time to download the data and run it), but we want to update the dvc.lock and .dvc hashes (so that dvc status shows that everything is up to date). That's where dvc commit usually comes into play. And --allow-missing would allow people to avoid running dvc pull before that (which can be time-consuming).

Let me know if that makes sense.

anunayasri · 2024-09-11T18:12:59Z

Thanks for the detailed response, @shcheklein. This makes sense. I will try it out in some time and get back to you.

anunayasri · 2024-09-12T09:01:32Z

@shcheklein Can you guide me on how to reproduce this issue of downloading data on dvc commit? Or point me to a doc that describes this scenario?

I have tried the following:

1. I am following the docs for Get Started > Pipelines.
2. I added a comment in src/featurize.py.
3. dvc status shows the file as edited.
4. I deleted data/data.xml. I believe this is the scenario we are trying to replicate.
5. dvc commit fails, saying that the dep doesn't exist.

skshetry · 2024-09-12T10:44:02Z

I don't think this is a good issue for contributions, as there are a lot of open product questions.

@ermolaev94, if you are not aware, dvc add can already update datasets without completely downloading datasets.
See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.

We could not implement this for commit last time, because it was not clear what partial commit means for pipelines.

This PR implements virtual operation for commit: #9440.

The problem is that dvc commit is primarily meant to be used with dvc repro --no-commit to reduce cache transfers when you are quickly experimenting and don't want to save to the cache yet.

In that case, you do a lot of dvc repro --no-commit while experimenting and then do a dvc commit at the final stage, when you want to transfer all files to the cache. IIRC supporting virtual operations in commit would break this basic scenario.

shcheklein · 2024-09-12T22:50:40Z

@skshetry just to add a bit more color (and @ermolaev94 can correct me):

> if you are not aware, dvc add can already update datasets without completely downloading datasets.
> See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.

I think this is about pipelines primarily, not dvc add.

> The problem is that dvc commit is primarily meant to be used with dvc repro --no-commit to reduce cache transfers when you are quickly experimenting and don't want to save to the cache yet.

I think we advertise a few use cases, actually, including the one where you quickly change a dependency and don't want to run the whole pipeline. Here is the description: https://dvc.org/doc/command-reference/commit#description

> We could not implement this for commit last time, because it was not clear what partial commit means for pipelines.

Could you please clarify / do you remember where / when that discussion was happening?

skshetry · 2024-09-13T03:59:49Z

> I think this is about pipelines primarily, not dvc add.

Yup, I understand that. If you read #9440, the suggestion from Ruslan is to add support for updating pipelines through dvc add. But I prefer to keep it separate.

> I think we advertise a few use cases, actually, including the one where you quickly change a dependency and don't want to run the whole pipeline. Here is the description: dvc.org/doc/command-reference/commit#description

The documentation is not quite right. Please read this comment: #9389 (comment).
Ruslan explained to me (quite patiently) what commit is supposed to do. It took me quite some time to understand.

I had always thought of commit as a way to sync the workspace to the cache, but that was not quite right. dvc repro --no-commit creates a "staging", and commit commits the staging to the cache. That's a more correct explanation of commit.

> Could you please clarify / do you remember where / when that discussion was happening?

I hope the discussions in #9389 and #9440 capture most of it.
(cc @dberenbaum, maybe you remember some discussion with Ruslan on this).

shcheklein · 2024-09-13T04:12:39Z

> The documentation is not quite right. Please read this comment: #9389 (comment).
> Ruslan explained to me (quite patiently) what commit is supposed to do. It took me quite some time to understand.

Hmm 🤔 I'm pretty sure (I can probably find the original implementation discussion) that dvc commit was introduced (after some push from the community) to provide a faster way to update .dvc files within pipelines (there was no dvc.lock at that time). Probably Ruslan's thought process is some kind of evolution after the --no-commit flags were introduced; hard to tell. Maybe he was trying to reconcile names, semantics, etc. in some nice way?

Anyway, I think dvc commit does atm exactly what it says in the docs and is useful in those scenarios (like the one described in this ticket).

> I had always thought of commit as a way to sync the workspace to the cache, but that was not quite right. dvc repro --no-commit creates a "staging", and commit commits the staging to the cache. That's a more correct explanation of commit.

I would start not from Git analogs but from practical DVC-specific scenarios if possible.

E.g. pin down the existing state to .dvc and dvc.lock if you don't want to run dvc repro (too expensive).

It's way easier for me to think in these terms tbh, and then, if needed, map to Git analogs. It might well be that the name is not perfect, and we might well have a discrepancy in semantics between --no-commit and dvc commit. I would need to check this tbh. It's been a while :)

skshetry · 2024-09-13T04:44:20Z

> Sometimes it's necessary to update pipeline hashes and params without downloading large datasets. Currently I can't run dvc commit on a machine without the data, even if I'd only like to update my source code.

Sorry, I don't think I read it very carefully. I guess the source code itself is a dependency. Currently, dvc commit supports specifying outputs or stage names, but it does not support specifying a dependency, so a virtual operation is not needed here.

It looks like we already support allow_missing in Repo.commit(), so it is only a matter of exposing it through the CLI.

dvc/dvc/repo/commit.py

Lines 45 to 54 in f56343d

    
    def commit(
        self,
        target=None,
        with_deps=False,
        recursive=False,
        force=False,
        allow_missing=False,
        data_only=False,
        relink=True,
    ):

Although virtual operation would help for large datasets here.

skshetry · 2024-09-13T04:52:10Z

> Anyway, I think dvc commit does atm exactly what it says in the docs and is useful in those scenarios (like the one described in this ticket).

But it does not, not by default.

By default, dvc commit commits the files specified in .dvc and dvc.lock file.
If the workspace matches with the hash in the .dvc/dvc.lock file, it silently commits it to the cache. If it does not, it asks you what to do. Or, in case of --force, it force commits to reflect the changes from your workspace.

So, what dvc commit does by default, without prompt and without --force flag, is not mentioned in the docs or is not explicit.
These days, most of the commit usage is likely to commit changes from your workspace. But that's not how dvc works by default.
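The three cases above (hashes match, mismatch without --force, mismatch with --force) can be sketched as a small decision function. This is only a toy model of the behavior as described in this comment, not DVC's implementation; the function name `commit_action` and the returned labels are made up for illustration.

```python
def commit_action(recorded_hash: str, workspace_hash: str, force: bool = False) -> str:
    """Toy model of the default behavior described above (not DVC's code)."""
    if workspace_hash == recorded_hash:
        # Workspace matches .dvc / dvc.lock: save to the cache without asking.
        return "commit"
    if force:
        # --force: accept the workspace state and update the recorded hash.
        return "force-commit"
    # Mismatch without --force: the user is asked what to do.
    return "prompt"


# Hypothetical hash values, one row per case.
cases = [("abc", "abc", False), ("abc", "xyz", False), ("abc", "xyz", True)]
for recorded, current, force in cases:
    print(recorded, current, force, "->", commit_action(recorded, current, force))
```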

> I'm pretty sure (I can probably find the original implementation discussion) that dvc commit was introduced (after some push from the community) to provide a faster way to update .dvc files within pipelines (there was no dvc.lock at that time).

Reading #919 (and #1601), Ruslan is right here. dvc commit and repro --no-commit were introduced together.

anunayasri · 2024-09-13T07:12:52Z

Looks like this issue needs more discussion. I think I should stop working on it for now.

> I don't think this is a good issue for contributions, as there are a lot of open product questions.

@skshetry I am looking to contribute to the repo. Could you please point me to beginner-friendly product questions that I can work on?

skshetry · 2024-09-13T07:27:50Z

> @skshetry I am looking to contribute to the repo. Could you please point me to beginner-friendly product questions that I can work on?

Hi, I think it's okay to implement --allow-missing, which is what is being asked here.
As I mentioned above, Repo.commit() already implements this; we only need to expose it in the CLI.

To give you more information: internally, every command in dvc mirrors an API of the same name in the Repo class (with a few exceptions). E.g., dvc add calls Repo.add(), and dvc commit calls Repo.commit().

dvc/dvc/commands/commit.py

Line 9 in f56343d

class CmdCommit(CmdBase):

dvc/dvc/repo/commit.py

Line 45 in f56343d

def commit(

Repo.commit() already seems to implement allow_missing, but we also need to make sure it works.

You can look at how dvc checkout implements --allow-missing as an example.

dvc/dvc/commands/checkout.py

Lines 104 to 109 in f56343d

    
    checkout_parser.add_argument(
        "--allow-missing",
        action="store_true",
        default=False,
        help="Ignore errors if some of the files or directories are missing.",
    )
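As a rough, self-contained sketch of what "exposing it in the CLI" could look like, the snippet below wires an `--allow-missing` flag through argparse and forwards it to a stand-in function. Everything here is an assumption for illustration: `fake_repo_commit`, `build_parser`, and `run` are hypothetical names, and the real change would go through CmdCommit and Repo.commit() in the DVC codebase rather than this stub.

```python
import argparse


def fake_repo_commit(target=None, force=False, allow_missing=False):
    """Stand-in for Repo.commit(); only echoes the arguments it received."""
    return {"target": target, "force": force, "allow_missing": allow_missing}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the flag definition copies the checkout example.
    parser = argparse.ArgumentParser(prog="commit-sketch")
    parser.add_argument("targets", nargs="*")
    parser.add_argument("-f", "--force", action="store_true", default=False)
    parser.add_argument(
        "--allow-missing",
        action="store_true",
        default=False,
        help="Ignore errors if some of the files or directories are missing.",
    )
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    # With no explicit targets, call once with target=None.
    targets = args.targets or [None]
    return [
        fake_repo_commit(
            target=target,
            force=args.force,
            allow_missing=args.allow_missing,  # argparse maps the dash to an underscore
        )
        for target in targets
    ]


print(run(["--allow-missing", "train"]))
```

The part that matters is the last keyword argument: once the parser defines the flag, the command only has to pass `args.allow_missing` along.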


shcheklein added the "feature request" (Requesting a new feature) label Aug 14, 2024

shcheklein added the "help wanted" and "A: data-management" (Related to dvc add/checkout/commit/move/remove) labels Aug 14, 2024

skshetry removed the help wanted label Sep 12, 2024

anunayasri added a commit to anunayasri/dvc that referenced this issue Sep 13, 2024

cli: add allow-missing flag to commit command

71d2341

Fixes iterative#10524

anunayasri linked a pull request Sep 13, 2024 that will close this issue

cli: add allow-missing flag to commit command #10555

Open

2 tasks
