Process got stuck when trying to optimize different groups of parameters using different types of data #584

Open · Yangyi-Chen opened this issue Sep 18, 2024 · 3 comments

Comments


Yangyi-Chen commented Sep 18, 2024

Hi,

I'm adding a new linear projection layer (nn.Linear) to the original Llama-3 architecture to process a new type of data. During training I use two types of data: language-only and multimodal. On language-only data, all of the Llama-3 parameters are fine-tuned. On multimodal data, all of the Llama-3 parameters plus the parameters of the added linear layer are fine-tuned. Both setups work fine independently.

However, when I combine these two types of data for multi-task learning, the process just gets stuck without printing any further information. Does the current torchtitan not support this kind of setup? Thanks.
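
For reference, the structure described above is roughly the following (a minimal, self-contained sketch; the class and attribute names such as mm_proj and mm_features are illustrative, not the actual code):

import torch.nn as nn

class ToyMultiTaskModel(nn.Module):
    """Stand-in for the described setup: a base model (Llama-3 in the issue)
    plus one extra nn.Linear that only runs on multimodal batches."""
    def __init__(self, hidden_dim: int = 16, feat_dim: int = 8):
        super().__init__()
        self.base = nn.Linear(hidden_dim, hidden_dim)   # stands in for the Llama-3 trunk
        self.mm_proj = nn.Linear(feat_dim, hidden_dim)  # the added projection layer

    def forward(self, x, mm_features=None):
        if mm_features is not None:
            # Multimodal batch: mm_proj participates, so its parameters
            # receive gradients on this step.
            x = x + self.mm_proj(mm_features)
        # On a language-only batch (mm_features is None), mm_proj is never
        # called and receives no gradients.
        return self.base(x)

The key property is that mm_proj only participates on some steps, which starts to matter once the model is sharded for data parallelism.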

Yangyi-Chen added a commit to Duet-LVLM/torchtitan that referenced this issue Sep 18, 2024
Yangyi-Chen (Author) commented

Some further information: I'm using single-node, multi-GPU distributed training. After waiting for a long time, I received the following messages:

[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]:[rank0]:[E918 17:48:37.892017038 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank0]:[rank0]:[E918 17:48:37.892143284 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from rank 3 and we will try our best to dump the debug info. Last enqueued NCCL work: 108, last completed NCCL work: 107.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[rank0]:[E918 17:48:37.892317119 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[rank0]:[E918 17:48:37.935023931 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=108, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=300000) ran for 300032 milliseconds before timing out.
[rank0]:[rank0]:[E918 17:48:37.938135753 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 108, last enqueued NCCL work: 108, last completed NCCL work: 107.
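
The timed-out all-reduce, together with the hint that "the order of collectives is not same for all ranks", is consistent with ranks drawing different data types in the same step: a rank with a multimodal batch issues gradient collectives for the added projection layer while a rank with a language-only batch does not, and the process group desynchronizes. One common workaround (a sketch built on the toy model above and under that assumption, not a torchtitan API) is to make the extra layer participate in every forward/backward, contributing zero when it is not needed:

    def forward(self, x, mm_features=None):
        if mm_features is not None:
            x = x + self.mm_proj(mm_features)
        else:
            # Run mm_proj on a dummy input and scale its output by 0 so it
            # takes part in every backward pass on every rank. Its gradients
            # are exactly zero, but the set and order of gradient collectives
            # stay identical across ranks regardless of the batch type.
            dummy = x.new_zeros(x.shape[0], self.mm_proj.in_features)
            x = x + 0.0 * self.mm_proj(dummy)
        return self.base(x)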

awgu (Contributor) commented Sep 19, 2024

It may help if you can provide a repro of some kind and/or give some more information about what parallelism you are using.

Yangyi-Chen (Author) commented

Hi,
Thanks for the follow-up question. I basically use the default settings from the ./train_configs/llama3_8b.toml file:

[training]
batch_size = 1
seq_len = 8192 # 8192 # 16384
warmup_steps = 200 # lr scheduler warm up
max_norm = 1.0 # grad norm clipping
steps = 3000
data_parallel_degree = -1
tensor_parallel_degree = 1
enable_fp8_linear = false
compile = true
dataset = "imagenet+dclm"

[experimental]
pipeline_parallel_degree = 1
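
With data_parallel_degree = -1 and both tensor and pipeline parallel degrees set to 1, this is pure data parallelism, so each rank draws its own batch. A quick way to check whether ranks actually diverge on data type within a step is a small diagnostic like the following (a sketch; is_multimodal is a placeholder for however the batch type is determined, and it assumes the default NCCL process group is already initialized on CUDA devices):

import torch
import torch.distributed as dist

def check_batch_type_consistency(is_multimodal: bool) -> None:
    # Gather every rank's batch type for the current step; if they disagree,
    # the ranks are taking different code paths through the model, which is
    # the situation that can desynchronize gradient collectives.
    flag = torch.tensor([1 if is_multimodal else 0], device="cuda")
    flags = [torch.zeros_like(flag) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, flag)
    if len({int(f.item()) for f in flags}) > 1 and dist.get_rank() == 0:
        print(f"mixed batch types across ranks this step: {[int(f.item()) for f in flags]}")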
