Match Transformers RoPE implementation #214
Merged
Match RoPE with Transformers
What does this PR do?
Why?
In Nanotron, we currently use the interleaved version of RoPE, which differs from the Transformers implementation. This discrepancy appears to cause a performance gap between Nanotron and Transformers after converting the weights.
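The difference can be shown with a small sketch (illustrative only, not Nanotron or Transformers code): the interleaved variant rotates adjacent pairs `(x0, x1), (x2, x3), ...`, while the Transformers "rotate_half" variant pairs `x_i` with `x_{i+d/2}`. The two agree only up to a fixed permutation of the head dimension, which is why converted weights need their columns reordered (or the matching RoPE variant):

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    # Standard RoPE frequencies; one angle per rotated pair.
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return pos * inv_freq  # shape (dim/2,)

def rope_interleaved(x, theta):
    # Interleaved layout: rotate adjacent pairs (x0, x1), (x2, x3), ...
    pairs = x.reshape(-1, 2)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(pairs)
    out[:, 0] = pairs[:, 0] * cos - pairs[:, 1] * sin
    out[:, 1] = pairs[:, 0] * sin + pairs[:, 1] * cos
    return out.reshape(-1)

def rope_half(x, theta):
    # Transformers "rotate_half" layout: pair x_i with x_{i + d/2}.
    d = x.shape[0] // 2
    cos = np.concatenate([np.cos(theta)] * 2)
    sin = np.concatenate([np.sin(theta)] * 2)
    rotated = np.concatenate([-x[d:], x[:d]])
    return x * cos + rotated * sin

dim = 8
x = np.arange(dim, dtype=np.float64)
theta = rope_angles(pos=3, dim=dim)

a = rope_interleaved(x, theta)
b = rope_half(x, theta)
assert not np.allclose(a, b)  # same weights, different outputs

# Equivalent only after permuting the head dim: [0, 2, 4, 6, 1, 3, 5, 7]
perm = np.arange(dim).reshape(-1, 2).T.reshape(-1)
assert np.allclose(a[perm], rope_half(x[perm], theta))
```

So a model trained with one layout produces different logits when run with the other, unless the weights are permuted accordingly.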
Evaluation with lighteval
At least for LLaMA 3/3.1, the evaluation results are very close.
Note
I used this converter to maintain the order of the columns
--------------------------------- Previously ---------------------------------
It's no longer relevant to the PR, but I'm keeping the information here for reference
What has been done?
Changed the LLaMA modeling code in Nanotron so that it produces the same output as the Transformers library.
Why?
What has been changed?
-> Exact logits match during generation.
How to test it?
CUDA_LAUNCH_BLOCKING=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 --nnodes=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:29600 --max_restarts=0 --tee=3 tests/test_llama_generation.py --ckpt-path /fsx/haojun/lighteval_evaluation_model/Llama-3-8B-split
This script compares the output logits and asserts that Nanotron's output is exactly the same as Transformers'.
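The kind of check involved can be sketched as follows (the function name and data here are illustrative, not the actual test code); note the claim is an exact match, not merely "close enough":

```python
import numpy as np

def assert_exact_logits(logits_nanotron, logits_hf):
    # Hypothetical helper: require bitwise-identical logits,
    # not an approximate (allclose-style) comparison.
    assert logits_nanotron.shape == logits_hf.shape
    assert np.array_equal(logits_nanotron, logits_hf)

logits = np.array([[0.1, 0.9], [0.3, 0.7]])
assert_exact_logits(logits, logits.copy())
```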
Here the LLaMA 3 weights are obtained with the converter script, but the fused q, k, v and gate, up projections still need to be split.
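The splitting step amounts to slicing the fused weight matrices apart. A minimal sketch, assuming the fused layout stacks q rows first, then k, then v (the layout and the tiny dimensions here are assumptions for illustration, not the real Llama-3-8B shapes):

```python
import numpy as np

# Hypothetical toy dimensions, not actual Llama-3-8B sizes.
hidden, q_dim, kv_dim, inter = 16, 16, 4, 8

rng = np.random.default_rng(0)
qkv_proj = rng.standard_normal((q_dim + 2 * kv_dim, hidden))
gate_up_proj = rng.standard_normal((2 * inter, hidden))

# Split the fused attention projection into q, k, v ...
q, k, v = np.split(qkv_proj, [q_dim, q_dim + kv_dim], axis=0)
# ... and the fused MLP projection into gate and up.
gate, up = np.split(gate_up_proj, 2, axis=0)

assert q.shape == (q_dim, hidden) and k.shape == v.shape == (kv_dim, hidden)
assert gate.shape == up.shape == (inter, hidden)
```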
Note
However, I found that change a bit overkill: the performance drop is most likely due to the different RoPE implementations. So, to match Transformers, set the default value to False!