Is it possible that the tokenizer is corrupted? Can you re-download the tokenizer and try again?
The problem occurs again. Oddly, both the llama2 tokenizer and test_tiktoken run successfully.
Root Cause (first observed failure):
[0]:
time : 2024-08-05_10:01:43
host : iZuf6ct0ygsd4zjh2lit8uZ
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 46669)
error_file: /tmp/torchelastic_i4d4ivao/none_jzj2c4lc/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/ncluster/dushuai/torchtitan/train.py", line 207, in main
tokenizer = create_tokenizer(tokenizer_type, job_config.model.tokenizer_path)
File "/ncluster/dushuai/torchtitan/torchtitan/datasets/tokenizer/__init__.py", line 19, in create_tokenizer
return TikTokenizer(tokenizer_path)
File "/ncluster/dushuai/torchtitan/torchtitan/datasets/tokenizer/tiktoken.py", line 52, in __init__
mergeable_ranks = load_tiktoken_bpe(model_path)
File "/usr/local/lib/python3.10/dist-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
return {
File "/usr/local/lib/python3.10/dist-packages/tiktoken/load.py", line 149, in
base64.b64decode(token): int(rank)
ValueError: invalid literal for int() with base 10: b'coding=utf-8'
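The ValueError suggests that load_tiktoken_bpe read a file that is not in the plain-text BPE vocabulary format it expects (one "<base64 token> <integer rank>" pair per line): b'coding=utf-8' looks like the first line of a Python source file, so tokenizer_path may be pointing at the wrong file or a corrupted download. A minimal sanity check is sketched below (check_tiktoken_bpe is a hypothetical helper, not part of tiktoken):

```python
import base64

def check_tiktoken_bpe(path: str) -> bool:
    """Verify that each non-empty line of a tiktoken BPE vocab file
    is '<base64 token> <integer rank>', the format load_tiktoken_bpe parses."""
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                token, rank = line.split()  # exactly two fields expected
                base64.b64decode(token)     # token must be valid base64
                int(rank)                   # rank must be an integer
            except ValueError as e:
                print(f"line {lineno} is not valid BPE data: {line[:40]!r} ({e})")
                return False
    return True
```

Running this against the file at job_config.model.tokenizer_path should immediately reveal whether the file on disk is the expected vocabulary or something else (e.g. a Python script saved under the tokenizer's name), in which case re-downloading the correct tokenizer.model should fix the crash.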