Welcome to LLaMA, my library for training and fine-tuning the LLaMA model. I find that implementing something from scratch is a good way to understand it more deeply, and I hope the simplicity of this repo makes it a useful starting point for beginners.
Currently, this library supports:
- Flash Attention, Triton RMSNorm, and Flash RoPE (Triton/CUDA-accelerated kernels)
- KV Cache (see the sketch after this list)
- Tensor Parallelism
- DDP with gradient bucketing
- Speedup and loss benchmark results under `LLaMA/tools/benchmark`
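
For readers new to the KV cache idea mentioned above, here is a minimal, self-contained PyTorch sketch of the general technique: keys and values from earlier decoding steps are cached so each new step only runs attention for the newest token. The `KVCache` class, shapes, and variable names are hypothetical illustrations for this README, not this repo's actual API.

```python
import torch

class KVCache:
    """Toy KV cache: accumulates keys/values along the sequence axis."""

    def __init__(self):
        self.k = None  # (batch, n_heads, seq_len, head_dim)
        self.v = None

    def update(self, k_new, v_new):
        # Append this step's keys/values to the cache along dim=2 (sequence).
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# Hypothetical decode step: one new query token attends over the full cache.
cache = KVCache()
head_dim = 64
q = torch.randn(1, 8, 1, head_dim)                     # newest token's query
k, v = cache.update(torch.randn(1, 8, 1, head_dim),    # newest token's key/value
                    torch.randn(1, 8, 1, head_dim))
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v
```

The trade-off is memory for compute: the cache grows with the generated sequence, but each step avoids re-encoding the entire prefix.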
I'm actively working on integrating the following features:
- Training on real data
- More benchmarks
- ZeRO optimizer