Memory optimization in async tp-linear #208

AleHD · 2024-07-18T16:49:30Z

This PR brings the memory optimizations implemented in #203 to the async communication regime. It also includes the commits that fix the row-parallel tp-linear in #172, so both of those PRs should be merged first to make this one easier to review. Two modes are supported: recomputing the all_gather, or not recomputing it. Recomputing is more memory-efficient but slightly slower.
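To make the trade-off concrete, here is a rough accounting sketch. It assumes (none of this is stated in the PR) that the input to the column-parallel linear is sharded along the sequence dimension across the tp group, that the no-recompute mode keeps the gathered input for the backward pass, and that the recompute mode keeps only the local shard and runs the all_gather again. The function name and the shapes are hypothetical, and the sketch ignores every other activation.

```python
def saved_input_bytes(seq_len, micro_batch, hidden, tp, recompute_allgather,
                      bytes_per_elem=2):
    """Bytes of the linear input kept for backward on one rank (toy model).

    Assumption: the input is sharded along the sequence dimension, so each
    rank holds seq_len // tp positions before the all_gather.
    """
    shard_elems = (seq_len // tp) * micro_batch * hidden
    if recompute_allgather:
        # Keep only the local shard; the all_gather is repeated in backward.
        return shard_elems * bytes_per_elem
    # Keep the full gathered tensor, which is tp times larger than the shard.
    return shard_elems * tp * bytes_per_elem


# Hypothetical shapes, bf16 activations (2 bytes per element), tp=4.
no_recompute = saved_input_bytes(4096, 2, 4096, tp=4, recompute_allgather=False)
recompute = saved_input_bytes(4096, 2, 4096, tp=4, recompute_allgather=True)

print(f"no-recompute: {no_recompute / 2**20:.0f} MiB per linear input")
print(f"recompute:    {recompute / 2**20:.0f} MiB per linear input")
```

Under these assumptions the saved input shrinks by a factor of tp, and the price is one extra all_gather per layer in the backward pass, which matches the "slightly slower" observation qualitatively.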

Here is a table that summarizes my observations on a tp4 llama8b model on A100 GPUs:

| Method | Average throughput (tok/sec/gpu) | Max memory reserved (GB) |
| --- | --- | --- |
| Baseline (current main implementation) | 3415 | 72.6 |
| Sync no-recompute | 3525 (+3%) | 68.5 (-6%) |
| Sync recompute | 3427 | 61.9 (-15%) |
| Async no-recompute | 3587 (+5%) | 68.8 (-5%) |
| Async recompute | 3526 (+3%) | 60.9 (-16%) |
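The percentages in the table can be reproduced from the raw numbers. The short check below uses only the values reported above; the variable and function names are illustrative.

```python
# (throughput in tok/sec/gpu, max memory reserved in GB), copied from the table.
BASELINE = (3415, 72.6)
RESULTS = {
    "Sync no-recompute": (3525, 68.5),
    "Sync recompute": (3427, 61.9),
    "Async no-recompute": (3587, 68.8),
    "Async recompute": (3526, 60.9),
}


def relative_change_pct(value, baseline):
    """Relative change versus the baseline, rounded to a whole percent."""
    return round((value / baseline - 1) * 100)


changes = {
    name: (
        relative_change_pct(throughput, BASELINE[0]),
        relative_change_pct(memory, BASELINE[1]),
    )
    for name, (throughput, memory) in RESULTS.items()
}

for name, (throughput_pct, memory_pct) in changes.items():
    print(f"{name}: throughput {throughput_pct:+d}%, memory {memory_pct:+d}%")
```

Sync recompute has no throughput percentage in the table because its change against the baseline rounds to 0%.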

These changes should make training more efficient. I recommend the async-recompute setting, although async no-recompute may make more sense for the extra throughput when memory is not a concern. In addition, as dp and pp increase and the optimizer states and parameters become more sharded, the memory savings this PR brings should only become more significant, since they come from activation memory. This should be very useful for scaling to larger models.
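The recommendation above can be written down as a small selection helper. This is only a sketch: `tp_recompute_allgather` is the option name used in this PR's commits, while `tp_linear_async_communication` is an assumed name for the async switch, and the `TpLinearSettings` container and `recommended_settings` function are hypothetical and not part of nanotron.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpLinearSettings:
    """Hypothetical container for the two switches discussed in this PR."""

    tp_linear_async_communication: bool  # assumed name for the async switch
    tp_recompute_allgather: bool  # option name taken from the commit titles


def recommended_settings(memory_constrained: bool = True) -> TpLinearSettings:
    """Async in both cases; recompute the all_gather only when memory is tight."""
    return TpLinearSettings(
        tp_linear_async_communication=True,
        tp_recompute_allgather=memory_constrained,
    )


print(recommended_settings())  # async recompute
print(recommended_settings(memory_constrained=False))  # async no-recompute
```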

I attach the wandb logs for the llama8b experiments (top) and for a tiny 152M model (bottom) to study the effects on smaller models.

(blue = sync baseline, green = sync no-recompute, purple = sync recompute, yellow = async baseline, gray = async recompute, red = async no-recompute).

3outeille · 2024-07-22T14:48:58Z

src/nanotron/parallel/tensor_parallel/functional.py

@@ -141,22 +142,27 @@ def forward(ctx, tensor, weight, bias, group, tp_mode):
                # `tensor` can sometimes not be contiguous
                # https://cs.github.com/pytorch/pytorch/blob/2b267fa7f28e18ca6ea1de4201d2541a40411457/torch/distributed/nn/functional.py#L317
                tensor = tensor.contiguous()
-                ctx.save_for_backward(tensor, weight)
+                # ctx.save_for_backward(tensor, weight)


remove comments

src/nanotron/config/config.py

xrsrke

LGTM!

Minor restyling

3outeille · 2024-08-05T08:47:27Z

lgtm as well

AleHD added 17 commits June 27, 2024 11:56

Implemented global memory buffer to reduce activation memory of differentiable distributed operations

bcf405d

GLU fusion

ed1ca7d

precommit

9b0de5b

Merge branch 'main' into fix_tp_mem_cache

bbc259f

Wrong backward fixed

803b6da

Removed useless prints

59bfb6b

Minor fixes

2c69e9a

precommit

30439fd

Added tp_recompute_allgather option

1e02a9c

Changed recompute default

9cc81bb

Changed recompute default

956fbfd

Moved ColumnLinearNoAsync module for consistency

b9e9201

Merge branch 'fix-row-parallel' of github.com:C-TC/nanotron into async_fix

cd2ff64

Merge branch 'async_fix' into mem_fix_async

25acc0e

memory efficient async linear

7cc6653

precommit

cb0f260

Added no_recompute_allgather mode to async

6d85d03

3outeille reviewed Jul 22, 2024

AleHD added 4 commits July 23, 2024 09:07

Merge branch 'main' into fix_tp_mem_cache

49633df

Fixed List not found

2afd007

Merge branch 'main' into mem_fix_async

81e7a54

Fixed tp=1 case

7e758db

xrsrke self-requested a review July 29, 2024 16:10

AleHD marked this pull request as draft July 30, 2024 08:41

AleHD added 5 commits July 30, 2024 18:50

Merge branch 'main' into fix_tp_mem_cache

ce2a96b

Fixed column parallel

cd84d4f

Added tp_recompute_allgather test

d3db06a

Merge branch 'fix_tp_mem_cache' into mem_fix_async

6f82050

Added tp_recompute_allgather test

4c94b99

AleHD marked this pull request as ready for review July 30, 2024 17:43

xrsrke reviewed Aug 1, 2024

src/nanotron/config/config.py

xrsrke approved these changes Aug 1, 2024

xrsrke added the pull request label Aug 2, 2024

AleHD and others added 3 commits August 2, 2024 15:40

Minor restyling

7daa186

Fixed names

31c3c5a

Merge pull request #1 from AleHD/fix_tp_mem_cache

0adb368

3outeille merged commit 03d67f2 into huggingface:main Aug 5, 2024
2 of 3 checks passed
