Only half of the parameters are saved when PP is applied #474
Hmm, we shouldn't really need DTensor to solve the problem of layer0 being saved and layer1 not being saved. The FQNs should be preserved and not conflict, so we should be able to save both. From the pattern, I assume this is using virtual pipeline stages, with layers 0, 2, 4, ... on gpu0, and only gpu0 is correctly saving things? In the 3D case with PP, we expect that gpu0 would save DTensors including any TP/DP replication/sharding. However, we do not rely on DTensor for dealing with layer0 vs layer1.
dcp.save() works with both DTensor and plain Tensor. Rank 0 will determine what to save on each rank. If tensors are not duplicated (i.e., their FQNs are different), all of the tensors will be saved.
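A minimal standalone sketch of that behavior (not the torchtitan code; the `layers.{i}.weight` FQNs and `/tmp/pp_ckpt` path are made up for illustration, and older PyTorch versions use `dcp.save_state_dict()` instead of `dcp.save()`):

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

def main():
    dist.init_process_group("gloo")  # CPU backend so the sketch runs anywhere
    rank = dist.get_rank()

    # Emulate virtual pipeline stages: rank 0 holds the even layers and
    # rank 1 holds the odd layers, so FQNs never collide across ranks.
    state_dict = {
        f"layers.{i}.weight": torch.randn(4, 4)
        for i in range(rank, 16, 2)
    }

    # Each rank passes only its own shard of the model; the save planner
    # coordinates globally and writes every non-duplicated FQN.
    dcp.save(state_dict, checkpoint_id="/tmp/pp_ckpt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=2 sketch.py`, the resulting checkpoint should contain all 16 layer FQNs, not just one rank's half.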
I tested this case and figured out the following:
I think there is a key conflict in the _save_state_dict() method, so _save_state_dict() inside dcp.save() misbehaves.
Could you share the exact repro command so we can debug?
I ran run_llama_train.sh, setting pipeline_parallel_degree to 2 and all other parallel degrees to 1.
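For reference, the invocation would look roughly like this (a sketch; the NGPU/CONFIG_FILE variables and the override flag name are assumptions and may differ across torchtitan versions):

```sh
# Hypothetical repro; the flag and config names are assumptions.
NGPU=2 CONFIG_FILE=./train_configs/llama3_8b.toml ./run_llama_train.sh \
  --experimental.pipeline_parallel_degree 2
```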
I'm currently training a Llama-3-8B model on 2 GPUs with pipeline parallelism only.
However, when I save a checkpoint on each rank, only half of the checkpoint is saved (layer 1 is saved, layer 2 is not, layer 3 is saved, layer 4 is not, ..., layer 15 is saved).
I think dcp.save only works well with DTensor, not plain Tensor. I need your insight on this. Thanks a lot!