
DDP (replicate) + TP? #577

Open
yzs981130 opened this issue Sep 13, 2024 · 2 comments
Labels
question Further information is requested

Comments


yzs981130 commented Sep 13, 2024

Currently, when there are two device meshes (tp and dp), torchtitan forces FSDP as the only backend for DP. Ref:

```python
if world_mesh.ndim > 1:
    raise RuntimeError("DDP has not supported > 1D parallelism")
```

However, replicate should support a >1D mesh and be usable with TP enabled. Ref.
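
Roughly, the combination I have in mind looks like the sketch below (illustrative only: a toy module, a dp=4 / tp=2 mesh, and the assumption that replicate() forwards device_mesh to DDP; this is not torchtitan's actual code path):

```python
# Hypothetical sketch: DDP (replicate) on the "dp" mesh dim + TP on the "tp" dim.
# Assumes an 8-GPU launch via torchrun; module and mesh sizes are illustrative.
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.replicate import replicate
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FFN(nn.Module):
    """Toy feed-forward block used only for illustration."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim)
        self.w2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# Bind each rank to its local GPU (LOCAL_RANK is set by torchrun).
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh: replicate across "dp", shard across "tp" (4 x 2 = 8 GPUs).
world_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = world_mesh["dp"], world_mesh["tp"]

model = FFN().cuda()

# Apply TP first so the Linear weights become DTensors sharded over tp_mesh ...
parallelize_module(
    model,
    tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# ... then apply the composable replicate() (DDP) over the dp sub-mesh.
# Assumes replicate() forwards device_mesh to DDP, which is what the workaround
# in torch/distributed/_composable/replicate.py appears intended to enable.
replicate(model, device_mesh=dp_mesh)
```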

Q1: Why does torchtitan not support DDP (replicate) + TP? Is it only an implementation choice?

I have hand-written DDP + TP in torchtitan and, surprisingly, found that the loss never goes down. It seems there are no gradients after loss.backward().

[screenshot: training loss curve staying flat]

To reproduce, use the branch above and run run_llama_train.sh on an 8-GPU machine.

Q2: Is the missing-gradient behavior a bug, or is it expected given that DDP + TP is not supported?
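
For reference, the missing-gradient symptom can be confirmed with a quick check like this right after loss.backward() (an illustrative snippet, not torchtitan code):

```python
# With a healthy setup, every trainable parameter should have a non-None .grad
# after loss.backward(); the symptom reported above is that they stay None
# when replicate() is combined with TP.
missing = [
    name
    for name, param in model.named_parameters()
    if param.requires_grad and param.grad is None
]
if missing:
    print(f"{len(missing)} parameters have no gradient, e.g. {missing[:5]}")
else:
    print("all parameters received gradients")
```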

Output of collect_env:

Collecting environment information...
PyTorch version: 2.5.0.dev20240903+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 9.13 (stretch) (x86_64)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Clang version: Could not collect
CMake version: version 3.21.2
Libc version: glibc-2.24

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.56.bsk.2-amd64-x86_64-with-glibc2.24
Is CUDA available: True
CUDA runtime version: 12.6.20
CUDA_MODULE_LOADING set to: LAZY
...

Nvidia driver version: 560.28.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
...

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] optree==0.12.1
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.5.0.dev20240903+cu118
[pip3] torchaudio==2.5.0.dev20240903+cu118
[pip3] torchdata==0.8.0
[pip3] torchvision==0.20.0.dev20240903+cu118
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] pytorch-triton            3.0.0+dedb7bdf33          pypi_0    pypi
[conda] torch                     2.5.0.dev20240903+cu118          pypi_0    pypi
[conda] torchaudio                2.5.0.dev20240903+cu118          pypi_0    pypi
[conda] torchdata                 0.8.0                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240903+cu118          pypi_0    pypi

P.S.

  • Torch 2.4.0 shows similar abnormal results
  • Using the DistributedDataParallel class directly, rather than replicate, works fine (rough sketch below)
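
Roughly, for the second point (reusing model and dp_mesh from the earlier sketch, and assuming DDP's device_mesh argument):

```python
# Hypothetical alternative: wrap the TP-parallelized model with the
# DistributedDataParallel class directly on the dp sub-mesh, instead of
# going through the composable replicate() API.
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, device_mesh=dp_mesh)
```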

Thanks in advance!

Contributor

fegin commented Sep 13, 2024

We do not plan to support DDP + TP, as we have not identified any major use cases for this combination. When working with large models, it is more common to use FSDP + TP instead of DDP + TP. Additionally, FSDP offers several features that are not available in DDP, such as fp8. Therefore, we believe that DDP is better suited for smaller models.
In TorchTitan, we enabled DDP primarily for sanity-check purposes, such as verifying parallelism with an 8B model and a very small batch size, so we did not verify the correctness of DDP + TP.

yzs981130 (Author) commented


Thanks for the reply! I take it that FSDP + TP should be the primary (if not only) combination to use, especially for LLMs.

Just to double-check: I am wondering about the comment "This is a temporary work around to enable DDP + TP." in https://github.com/pytorch/pytorch/blob/7dc1788396fc9e2860c0c236e0c0e108e96b83c8/torch/distributed/_composable/replicate.py#L225-L237. Doesn't that suggest DDP + TP already works?

tianyu-l added the question (Further information is requested) label on Sep 16, 2024