NCCL:Broadcast collectives are missing from the converted trace but present in the trace_link #161

Open
alexseceks opened this issue Oct 16, 2024 · 3 comments


@alexseceks

Describe the Bug

After running a ResNet50 or TinyLlama2 workload on 4 ranks, at least one nccl:broadcast collective is visible in the Kineto trace. The same collective is present in the trace_link file, but it no longer appears in the converted trace. Is this normal behavior, or is it an issue on the Chakra converter side?

I looked through the converter implementation but found no indication that broadcast collectives are intentionally dropped. Is there something I missed?

Steps to Reproduce

Using the Chakra version from Sept 6, after #140 was merged.

Expected Behavior

See the nccl:broadcast collective in the converted trace.

Screenshots

This is the trace_link file; the broadcast collective is present.
[Screenshot 2024-10-16 at 14:15:15]
This is the converted trace, in JSON format; no broadcast collective can be found (the search result is at the bottom of the picture).
[Screenshot 2024-10-16 at 14:17:39]
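
For reference, the same check can be done programmatically. This is a minimal sketch, not tooling from the Chakra repository: the file names are placeholders, and it assumes both dumps are JSON files with a top-level "nodes" list whose entries carry a "name" field.

    import json

    def find_nodes(path: str, keyword: str) -> list:
        """Return the names of all nodes in a JSON trace whose name contains `keyword`."""
        with open(path) as f:
            trace = json.load(f)
        # Assumes operator records live under a top-level "nodes" list with "name" fields;
        # adjust the keys if your dump is structured differently.
        return [node["name"] for node in trace.get("nodes", []) if keyword in node["name"]]

    # Placeholder file names for a single rank:
    print(find_nodes("rank0_trace_link.json", "broadcast"))   # non-empty for the trace_link file
    print(find_nodes("rank0_converted.json", "broadcast"))    # empty for the converted trace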

@alexseceks
Author

Tested the same steps with the latest version of Chakra, installed from the repository on Oct 16; the behavior is the same.

@alexseceks
Author

While looking into this issue, I observed that the nccl:broadcast operation is a CPU operation and therefore does not pass this check from pytorch_converter.py:

    def get_protobuf_node_type_from_json_node(
        self, json_node_map: Dict[int, PyTorchNode], json_node: PyTorchNode
    ) -> int:
        """
        Determine the Protobuf node type from a Chakra node.

        Args:
            json_node_map (Dict[int, PyTorchNode]): Dictionary of JSON nodes.
            json_node (PyTorchNode): The JSON node to determine the type of.

        Returns:
            int: The corresponding Chakra node type.
        """
        if json_node.is_gpu_op():
            if "ncclDevKernel_SendRecv" in json_node.name:
                parent_node = json_node_map[json_node.parent]
                keyword = (
                    json_node_map[parent_node.parent].name
                    if parent_node.name == "record_param_comms"
                    else parent_node.name
                )
                if "send" in keyword:
                    return COMM_SEND_NODE
                if "recv" in keyword:
                    return COMM_RECV_NODE
            if "ncclKernel" in json_node.name or "ncclDevKernel" in json_node.name:
                return COMM_COLL_NODE
        return COMP_NODE
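
Because nccl:broadcast is recorded as a CPU operation, is_gpu_op() returns False, so the node never reaches the ncclKernel/ncclDevKernel branches and falls through to COMP_NODE. For illustration only, this is a rough sketch of the kind of extra check that would be needed to keep CPU-side collectives; the helper name and keyword list here are made up and are not part of the converter:

    # Hypothetical helper (sketch only, not actual Chakra code): treat CPU-side
    # c10d records such as "nccl:broadcast" as collective communication nodes.
    CPU_COLLECTIVE_KEYWORDS = ("nccl:broadcast", "nccl:all_reduce", "nccl:all_gather")

    def looks_like_cpu_collective(node_name: str) -> bool:
        """Return True if a CPU-side operator name looks like a c10d collective call."""
        return any(keyword in node_name for keyword in CPU_COLLECTIVE_KEYWORDS)

    # In get_protobuf_node_type_from_json_node, something along these lines would have
    # to run before the final fallback to COMP_NODE:
    #     if looks_like_cpu_collective(json_node.name):
    #         return COMM_COLL_NODE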

Why is this collective not included in the converted trace?
Are there plans to include broadcast operations in the trace in the future?
Thanks!

@fh-TurbaAI

I'm seeing similar issues with nccl:gather, which seems to clear the if statement Alex showcased above, but then fails to match against the "known" collective operation strings in get_collective_comm_type().
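
For context, the failure seems to come from the substring matching against known collective names; a rough sketch of that pattern is below. This is illustrative only: the real keyword table and the Chakra protobuf enum live in pytorch_converter.py and may differ from this.

    from enum import Enum, auto

    # Illustrative stand-in for the Chakra protobuf collective types.
    class CollectiveCommType(Enum):
        ALL_REDUCE = auto()
        ALL_GATHER = auto()
        ALL_TO_ALL = auto()
        REDUCE_SCATTER = auto()
        BROADCAST = auto()

    # Illustrative keyword table; "gather" on its own is not in it, so nccl:gather fails.
    COMM_KEYWORDS = {
        "allreduce": CollectiveCommType.ALL_REDUCE,
        "allgather": CollectiveCommType.ALL_GATHER,
        "alltoall": CollectiveCommType.ALL_TO_ALL,
        "reducescatter": CollectiveCommType.REDUCE_SCATTER,
        "broadcast": CollectiveCommType.BROADCAST,
    }

    def sketch_get_collective_comm_type(name: str) -> CollectiveCommType:
        """Substring-match an operator name against the known collective keywords."""
        normalized = name.replace("_", "").replace("-", "").lower()
        for keyword, comm_type in COMM_KEYWORDS.items():
            if keyword in normalized:
                return comm_type
        raise ValueError(f"'{name}' does not match any known collective type.")

    sketch_get_collective_comm_type("nccl:all_gather")  # -> CollectiveCommType.ALL_GATHER
    sketch_get_collective_comm_type("nccl:gather")      # raises ValueError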

