Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

JoongunPark · 2024-05-10T22:10:43Z

Summary

This PR relies on Refactor KinetoOperator and TraceLinker for code quality and test coverage #119. It can be merged after Refactor KinetoOperator and TraceLinker for code quality and test coverage #119 is merged.

This PR introduces dependencies from GPU operators to CPU operators using the critical path analysis in HolisticTraceAnalysis (HTA). In the simulation flow of Chakra, postprocessors like the trace linker and the converter are required. They are responsible for merging Chakra host traces with Chakra device traces and encoding dependencies. Currently, the dependencies encoded by the postprocessors are limited to CPU operators to GPU operators. However, there can be dependencies from GPU operators to CPU operators if a CPU operator has a dependency on a GPU operator. To identify such dependencies, this PR utilizes the critical path analysis of HTA. More specifically, this PR uses the synchronization dependency of HTA. A synchronization dependency occurs when a CPU operator has to wait for a dispatched GPU operator to be completed. Therefore, synchronization dependency is the best for identifying such dependencies.

HTA Repository: https://github.com/facebookresearch/HolisticTraceAnalysis
HTA Critical Path Analysis Documentation: https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html
HTA Critical Path Analysis Example: https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/examples/experimental/critical_path_analysis.ipynb

Please note that:

The command has been changed. We need to specify the rank ID with --rank for chakra_trace_link.
When collecting the trace, the Kineto trace profiler should collect nodes where the 'cat' field is 'cuda_sync'. Please follow the instructions here: ([profiler] add option for kineto synchronization events in the trace pytorch/pytorch#105187).

from torch.autograd.profiler import profile, _ExperimentalConfig
with profile(use_kineto=True, use_cuda=True,
   experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
   workload()

Test Plan

Download and Install HTA.

git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git checkout d731cc2e2249976c97129d409a83bd53d93051f6
git submodule update --init
pip install -r requirements.txt
pip install -e .

Next, you need to collect traces by following the instructions here: pytorch/pytorch#105187.

After that, you can load sync dependencies and print them out with the following script:

import argparse
import logging
import os
from typing import Dict, List

from hta.analyzers.critical_path_analysis import CPEdgeType
from hta.trace_analysis import TraceAnalysis

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_sync_dependencies(
    rank: int, kineto_file: str, annotation: str = "ProfilerStep", instance_id: int = 0
) -> Dict[int, List[int]]:
    """
    Load synchronization dependencies using Holistic Trace Analysis (HTA).

    Args:
        rank (int): Rank for the input Kineto trace.
        kineto_file (str): Path to the Kineto trace file.
        annotation (str): Annotation to use for the analysis. Defaults to "ProfilerStep".
        instance_id (int): Instance ID for the analysis. Defaults to 0.

    Returns:
        Dict[int, List[int]]: A dictionary mapping end event's external ID to a list of start event's external IDs
            that have synchronization dependencies.
    """
    sync_dependencies = {}
    trace_analysis = TraceAnalysis(trace_dir=os.path.dirname(kineto_file))
    cp_graph, success = trace_analysis.critical_path_analysis(rank=rank, annotation=annotation, instance_id=instance_id)
    if not success:
        logger.error("Failed to load Critical Path Graph")
        return sync_dependencies

    raw_events = trace_analysis.t.get_raw_trace_for_one_rank(rank=rank)["traceEvents"]
    for edge in cp_graph.critical_path_edges_set:
        if edge.type in [CPEdgeType.SYNC_DEPENDENCY]:
            start_event_id, end_event_id = cp_graph.get_events_for_edge(edge)
            start_event, end_event = raw_events[start_event_id], raw_events[end_event_id]
            if "External id" in end_event["args"] and "External id" in start_event["args"]:
                start_event_external_id = start_event["args"]["External id"]
                end_event_external_id = end_event["args"]["External id"]
                start_event_name = start_event["name"]
                end_event_name = end_event["name"]
                if start_event_external_id != end_event_external_id:
                    print(
                        f"start_event_id {start_event_id}, end_event_id {end_event_id}, "
                        f"start_event_external_id {start_event_external_id}, end_event_external_id {end_event_external_id}, "
                        f"start_event_name '{start_event_name}', end_event_name '{end_event_name}'"
                    )
            else:
                logger.warning(
                    f"Synchronization dependency from event {start_event_id} to event {end_event_id} will "
                    "not be considered due to missing external IDs."
                )
    return sync_dependencies


def main() -> None:
    """
    Main function to parse arguments and load synchronization dependencies.
    """
    parser = argparse.ArgumentParser(description="Load and print critical paths from Kineto traces.")
    parser.add_argument("--input", type=str, help="Path to the Kineto trace file.")
    parser.add_argument("--rank", type=int, help="Rank for the input traces.")
    args = parser.parse_args()

    load_sync_dependencies(args.rank, args.input)


if __name__ == "__main__":
    main()

You can run it with the following command:

$ python sync_dep.py --input ~/Downloads/cuda-sync/kineto_0.json --rank 0 > /tmp/out

cuda-sync.zip

/tmp/out

start_event_id 24868, end_event_id 24874, start_event_external_id 94785, end_event_external_id 94792, start_event_name 'void multi_tensor_apply_kernel<TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float)', end_event_name 'cudaDeviceSynchronize'
start_event_id 24536, end_event_id 24650, start_event_external_id 13847, end_event_external_id 91874, start_event_name 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)', end_event_name 'cudaDeviceSynchronize'

Two synchronization dependencies are identified with the script. In this test, we focus on the dependency between 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)' and 'cudaDeviceSynchronize'.

Let's confirm our observation with a trace visualizer. You can read Kineto traces with https://perfetto.dev/. By searching for ncclDevKernel_ReduceScatter_Sum_f32_RING_LL, you can find that it is a GPU kernel (category field) with an external ID of 13847. Around the operator but in the CPU row of the visualization, you can find cudaDeviceSynchronize where the external ID is 94792. It is a cuda_runtime operator. As the cuda_runtime operator is not considered a simulatable operator in the toolchains, the closest but later CPU operator, aten::empty, with the external ID of 16392, should rely on the GPU kernel.

Let's see if the synchronization dependency is properly encoded in trace_link. Make sure you install Chakra.

$ pip install .

Run chakra_trace_link.

chakra_trace_link \
  --pytorch-et-file /Users/theo/Downloads/et_0.json\
  --kineto-file /Users/theo/Downloads/kineto_0.json\
  --output-file ~/megatron_0.json\
  --rank 0

You can review ~/megatron_0.json and find that sync dependencies are encoded.

        {
            "id": 15899,
            "name": "ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)",
            "ctrl_deps": 15898,
            "inputs": {
                "values": [
                    [
                        87,
                        49,
                        576716800,
                        52428800,
                        4,
                        "cuda:0"
                    ],
                    [
                        90,
                        49,
                        629145600,
                        52428800,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    [
                        52428800
                    ],
                    [
                        52428800
                    ]
                ],
                "types": [
                    "Tensor(float)",
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [],
                "shapes": [],
                "types": []
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13651
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": -1
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 7
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 4
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": ""
                }
            ],
            "inclusive_dur": 44160,
            "exclusive_dur": 44160,
            "ts": 1719249141376319,
            "inter_thread_dep": 15685,
            "cat": "kernel",
            "ph": "X",
            "stream": 64
        },
        {
            "id": 15982,
            "name": "aten::detach",
            "ctrl_deps": 29,
            "inputs": {
                "values": [
                    [
                        20832,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [
                    [
                        20842,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13722
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": 19404
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 1
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": "aten::detach(Tensor(a) self) -> Tensor(a)"
                }
            ],
            "inclusive_dur": 17,
            "exclusive_dur": 11,
            "ts": 1719249141527040,
            "inter_thread_dep": 15901,
            "sync_dep": 15899
        },

Run chakra_converter

chakra_converter --input_filename ~/megatron_0.json\
    --output_filename megatron_0.chakra\
    --input_type PyTorch\
     --log_filename /tmp/rank_0

Here are traces that I used.
cuda-sync.zip
Resnet-50.zip
llama2.zip

github-actions · 2024-05-10T22:10:55Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

JoongunPark · 2024-05-30T18:24:35Z

Hi, how is the review going?

TaekyungHeo · 2024-05-30T18:28:10Z

Hi, @JoongunPark. We did not get a chance to review and test because we have an urgent task internally. Thank you for your patience.

srinivas212 · 2024-05-31T02:13:07Z

@JoongunPark - we may need 1-2 more weeks since we are setting up integration tests internally as we speak. We will try to expedite this asap. Thank you for your patience.

JoongunPark · 2024-07-01T15:09:17Z

I have tested with Taekyung's lastest enhancement. It works well on my environment (Python 3.10.13, Linux 5.15.0-105-generic)
Below is the log that I have obtained after converting traces into Chakra HDT.

Llama2 
Node ID 14063 now has an synchonization dependency on Node ID 13783
Node ID 13782 now has an synchonization dependency on Node ID 13779
Node ID 13784 now has an synchonization dependency on Node ID 13779
Node ID 13837 now has an synchonization dependency on Node ID 13779
Node ID 13852 now has an synchonization dependency on Node ID 13783
Node ID 13779 now has an synchonization dependency on Node ID 13776
Node ID 13781 now has an synchonization dependency on Node ID 13776
Node ID 13831 now has an synchonization dependency on Node ID 13776
Node ID 13849 now has an synchonization dependency on Node ID 13780

Resnet-50
Node ID 4864 now has an synchonization dependency on Node ID 4861
Node ID 4866 now has an synchonization dependency on Node ID 4861
Node ID 5270 now has an synchonization dependency on Node ID 4865

Also, as he mentioned, now the code builds sync dependency with the closest next CPU operator instead of cuda_runtime op.

Update et_feeder for compatibility with Chakra schema v0.0.4

srinivas212 · 2024-10-05T02:20:22Z

@JoongunPark can you please resolve the merge conflicts? We can merge this PR.

Co-authored-by: Joongun Park <[email protected]>

…onverter Co-authored-by: Joongun Park <[email protected]>

…akra_trace_link Co-authored-by: Joongun Park <[email protected]>

Co-authored-by: Joongun Park <[email protected]>

srinivas212

Merging based on @TaekyungHeo's feedback and review. Thanks for the PR @JoongunPark and thanks for the review @TaekyungHeo.

JoongunPark · 2024-10-08T00:28:24Z

Merging based on @TaekyungHeo's feedback and review. Thanks for the PR @JoongunPark and thanks for the review @TaekyungHeo.

My apologies for the delayed recognition of the merge conflicts. Thank you so much for reviewing and managing this PR, @srinivas212 and @TaekyungHeo!

JoongunPark requested a review from a team as a code owner May 10, 2024 22:10

JoongunPark requested a review from TaekyungHeo May 10, 2024 22:10

JoongunPark force-pushed the HolisticTraceAnalysis branch 2 times, most recently from 0076451 to 742abd4 Compare May 10, 2024 22:41

JoongunPark mentioned this pull request Jun 25, 2024

Gpu cpu dep #103

Closed

TaekyungHeo force-pushed the HolisticTraceAnalysis branch from 742abd4 to 1d2289b Compare June 25, 2024 22:05

TaekyungHeo changed the title ~~Support Synchronization Dependency with Holistic Trace Analysis~~ Support Synchronization Dependency with HolisticTraceAnalysis Jun 25, 2024

TaekyungHeo changed the title ~~Support Synchronization Dependency with HolisticTraceAnalysis~~ Support GPU-to-CPU Synchronization Dependency with HolisticTraceAnalysis Jun 25, 2024

TaekyungHeo changed the title ~~Support GPU-to-CPU Synchronization Dependency with HolisticTraceAnalysis~~ Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis Jun 25, 2024

TaekyungHeo removed their request for review June 25, 2024 22:21

TaekyungHeo force-pushed the HolisticTraceAnalysis branch from 1d2289b to 3e54936 Compare June 26, 2024 13:54

TaekyungHeo added the enhancement New feature or request label Jun 27, 2024

TaekyungHeo force-pushed the HolisticTraceAnalysis branch 2 times, most recently from 0eae1ab to 882e10a Compare July 1, 2024 12:54

TaekyungHeo force-pushed the HolisticTraceAnalysis branch 3 times, most recently from eba5712 to eff2dc6 Compare July 3, 2024 09:45

TaekyungHeo force-pushed the HolisticTraceAnalysis branch 3 times, most recently from 3441a2c to c5db738 Compare July 13, 2024 15:44

TaekyungHeo force-pushed the HolisticTraceAnalysis branch from c5db738 to cdfa867 Compare July 23, 2024 17:51

rvinaybharadwaj pushed a commit to rvinaybharadwaj/chakra that referenced this pull request Sep 23, 2024

Merge pull request mlcommons#57 from astra-sim/schema_update_0.004

b892cbe

Update et_feeder for compatibility with Chakra schema v0.0.4

TaekyungHeo force-pushed the HolisticTraceAnalysis branch from cdfa867 to 7306c14 Compare October 7, 2024 17:27

Leverage HTA to support synchronization dependency in trace linking

aacfcff

Co-authored-by: Joongun Park <[email protected]>

TaekyungHeo and others added 4 commits October 7, 2024 14:18

Add synchronization dependency from GPU to CPU operators in PyTorch c…

afe6713

…onverter Co-authored-by: Joongun Park <[email protected]>

Update user guide to include HTA installation and new argument for ch…

8f37462

…akra_trace_link Co-authored-by: Joongun Park <[email protected]>

Add HTA installation step in end_to_end_tests workflow

bece42f

Co-authored-by: Joongun Park <[email protected]>

Resolve kineto_file path bug with absolute path conversion

6bdf24e

TaekyungHeo force-pushed the HolisticTraceAnalysis branch from 7306c14 to 6bdf24e Compare October 7, 2024 18:18

srinivas212 approved these changes Oct 7, 2024

View reviewed changes

srinivas212 merged commit 4a4fd09 into main Oct 7, 2024
10 checks passed

github-actions bot locked and limited conversation to collaborators Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

JoongunPark commented May 10, 2024 •

edited by TaekyungHeo

Loading

github-actions bot commented May 10, 2024 •

edited

Loading

JoongunPark commented May 30, 2024

TaekyungHeo commented May 30, 2024

srinivas212 commented May 31, 2024

JoongunPark commented Jul 1, 2024 •

edited

Loading

srinivas212 commented Oct 5, 2024

srinivas212 left a comment

JoongunPark commented Oct 8, 2024

Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

Conversation

JoongunPark commented May 10, 2024 • edited by TaekyungHeo Loading

Summary

Test Plan

github-actions bot commented May 10, 2024 • edited Loading

JoongunPark commented May 30, 2024

TaekyungHeo commented May 30, 2024

srinivas212 commented May 31, 2024

JoongunPark commented Jul 1, 2024 • edited Loading

srinivas212 commented Oct 5, 2024

srinivas212 left a comment

Choose a reason for hiding this comment

JoongunPark commented Oct 8, 2024

JoongunPark commented May 10, 2024 •

edited by TaekyungHeo

Loading

github-actions bot commented May 10, 2024 •

edited

Loading

JoongunPark commented Jul 1, 2024 •

edited

Loading