
DDP (replicate) + TP? #577

Open
yzs981130 opened this issue Sep 13, 2024 · 2 comments
Labels
question Further information is requested

Comments


yzs981130 commented Sep 13, 2024

Currently, when there are two device meshes (tp and dp), torchtitan forces FSDP as the only backend for DP. Ref:

```python
if world_mesh.ndim > 1:
    raise RuntimeError("DDP has not supported > 1D parallelism")
```

However, replicate should support a >1D mesh and be usable with TP enabled. Ref.
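
Roughly, the combination I have in mind looks like the sketch below (illustrative only: a toy module, a dp=4 / tp=2 mesh, and the assumption that replicate() forwards device_mesh to DDP; this is not torchtitan's actual code path):

```python
# Hypothetical sketch: DDP (replicate) on the "dp" mesh dim + TP on the "tp" dim.
# Assumes an 8-GPU launch via torchrun; module and mesh sizes are illustrative.
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.replicate import replicate
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FFN(nn.Module):
    """Toy feed-forward block used only for illustration."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim)
        self.w2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# Bind each rank to its local GPU (LOCAL_RANK is set by torchrun).
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2D mesh: replicate across "dp", shard across "tp" (4 x 2 = 8 GPUs).
world_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = world_mesh["dp"], world_mesh["tp"]

model = FFN().cuda()

# Apply TP first so the Linear weights become DTensors sharded over tp_mesh ...
parallelize_module(
    model,
    tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# ... then apply the composable replicate() (DDP) over the dp sub-mesh.
# Assumes replicate() forwards device_mesh to DDP, which is what the workaround
# in torch/distributed/_composable/replicate.py appears intended to enable.
replicate(model, device_mesh=dp_mesh)
```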

Q1: Why does torchtitan not support DDP (replicate) + TP? Is it only an implementation choice?

I have hand-written DDP + TP in torchtitan and, surprisingly, found that the loss never goes down. It seems there are no gradients after loss.backward().

[screenshot: training loss curve staying flat]

To reproduce, use the branch above and run run_llama_train.sh on an 8-GPU machine.

Q2: Is the missing-gradient behavior a bug, or is it expected given that DDP + TP is not supported?
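
For reference, the missing-gradient symptom can be confirmed with a quick check like this right after loss.backward() (an illustrative snippet, not torchtitan code):

```python
# With a healthy setup, every trainable parameter should have a non-None .grad
# after loss.backward(); the symptom reported above is that they stay None
# when replicate() is combined with TP.
missing = [
    name
    for name, param in model.named_parameters()
    if param.requires_grad and param.grad is None
]
if missing:
    print(f"{len(missing)} parameters have no gradient, e.g. {missing[:5]}")
else:
    print("all parameters received gradients")
```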

Output of collect_env:

Collecting environment information...
PyTorch version: 2.5.0.dev20240903+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 9.13 (stretch) (x86_64)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Clang version: Could not collect
CMake version: version 3.21.2
Libc version: glibc-2.24

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.56.bsk.2-amd64-x86_64-with-glibc2.24
Is CUDA available: True
CUDA runtime version: 12.6.20
CUDA_MODULE_LOADING set to: LAZY
...

Nvidia driver version: 560.28.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
...

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] optree==0.12.1
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.5.0.dev20240903+cu118
[pip3] torchaudio==2.5.0.dev20240903+cu118
[pip3] torchdata==0.8.0
[pip3] torchvision==0.20.0.dev20240903+cu118
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] pytorch-triton            3.0.0+dedb7bdf33          pypi_0    pypi
[conda] torch                     2.5.0.dev20240903+cu118          pypi_0    pypi
[conda] torchaudio                2.5.0.dev20240903+cu118          pypi_0    pypi
[conda] torchdata                 0.8.0                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240903+cu118          pypi_0    pypi

P.S.

  • Torch 2.4.0 shows similar abnormal results
  • Using the DistributedDataParallel class directly, rather than replicate, works fine (rough sketch below)
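
Roughly, for the second point (reusing model and dp_mesh from the earlier sketch, and assuming DDP's device_mesh argument):

```python
# Hypothetical alternative: wrap the TP-parallelized model with the
# DistributedDataParallel class directly on the dp sub-mesh, instead of
# going through the composable replicate() API.
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, device_mesh=dp_mesh)
```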

Thanks in advance!

Contributor

fegin commented Sep 13, 2024

We do not plan to support DDP + TP, as we have not identified any major use cases for this combination. When working with large models, it is more common to use FSDP + TP instead of DDP + TP. Additionally, FSDP offers several features that are not available in DDP, such as fp8. Therefore, we believe that DDP is better suited for smaller models.
In TorchTitan, we enabled DDP primarily for sanity-check purposes, such as verifying parallelism with an 8B model and a very small batch size, so we did not verify the correctness of DDP + TP.

yzs981130 (Author) commented


Thanks for the reply! I take it that FSDP + TP should be the primary (if not only) combination to use, especially for LLMs.

Just to double-check: I am wondering about the comment "This is a temporary work around to enable DDP + TP." in https://github.com/pytorch/pytorch/blob/7dc1788396fc9e2860c0c236e0c0e108e96b83c8/torch/distributed/_composable/replicate.py#L225-L237. Doesn't that suggest DDP + TP already works?

tianyu-l added the question (Further information is requested) label on Sep 16, 2024