
Utilize ctrl_deps for operator dependencies in simulation #25

Draft · wants to merge 1 commit into base: main

Conversation

TaekyungHeo
Contributor

@TaekyungHeo TaekyungHeo commented Feb 22, 2024

Summary

Previously, the data_deps field was used to encode operator dependencies for simulation. However, data_deps should be reserved for genuine data dependencies, not for encoding operator (scheduling) dependencies. This commit updates pytorch2chakra_converter.py to use ctrl_deps for operator dependencies instead.
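The split described above can be sketched with a minimal, hypothetical node representation. The plain dicts and the `convert_op` helper below are illustrative stand-ins, not the actual Chakra protobuf schema or converter code:

```python
# Sketch of the dependency split: scheduling order goes into ctrl_deps,
# while data_deps keeps only true data (tensor) dependencies.
# These dicts are hypothetical stand-ins for Chakra nodes.

def convert_op(op_id, scheduling_preds, tensor_preds):
    """Build a node for op_id given its predecessor op ids."""
    return {
        "id": op_id,
        "ctrl_deps": list(scheduling_preds),  # operator/scheduling dependencies
        "data_deps": list(tensor_preds),      # genuine data dependencies only
    }

# Op 3 must be scheduled after ops 1 and 2, but only consumes data from op 2.
node = convert_op(3, scheduling_preds=[1, 2], tensor_preds=[2])
```

Under the old behavior, the scheduling predecessors would have landed in `data_deps`; after this change they are carried in `ctrl_deps`, leaving `data_deps` free for real data flow.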

Test Plan

$ cd ~/param/train/comms/pt
$ pip install .
$ cd ../../compute/python
$ pip install -r requirements.txt
$ python setup.py install
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_0.json --kineto-file ~/llama_kineto/worker0_step_12.1697596714999.pt.trace.json --output-file ~/rank0.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_1.json --kineto-file ~/llama_kineto/worker1_step_12.1697596715001.pt.trace.json --output-file ~/rank1.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_2.json --kineto-file ~/llama_kineto/worker2_step_12.1697596714848.pt.trace.json --output-file ~/rank2.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_3.json --kineto-file ~/llama_kineto/worker3_step_12.1697596714880.pt.trace.json --output-file ~/rank3.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_4.json --kineto-file ~/llama_kineto/worker4_step_12.1697596714944.pt.trace.json --output-file ~/rank4.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_5.json --kineto-file ~/llama_kineto/worker5_step_12.1697596714871.pt.trace.json --output-file ~/rank5.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_6.json --kineto-file ~/llama_kineto/worker6_step_12.1697596714614.pt.trace.json --output-file ~/rank6.json &
$ python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_7.json --kineto-file ~/llama_kineto/worker7_step_12.1697596714853.pt.trace.json --output-file ~/rank7.json &
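The eight trace_link invocations above differ only in the rank index, so they can be generated in a loop. This is a hypothetical dry run (commands are printed, not executed); because the kineto trace names embed per-worker timestamps, a wildcard stands in for each worker's file, whereas the real commands above spell the names out:

```shell
# Dry-run sketch: print one trace_link command per rank 0..7.
cmds=$(for i in $(seq 0 7); do
  echo "python tools/trace_link.py --pytorch-et-file ~/llama_pytorch_et/llama_et_${i}.json --kineto-file ~/llama_kineto/worker${i}_step_12.*.pt.trace.json --output-file ~/rank${i}.json &"
done)
printf '%s\n' "$cmds"
```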

$ cd ~/chakra
$ pip install .
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank0.json --output_filename ~/rank.0.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank1.json --output_filename ~/rank.1.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank2.json --output_filename ~/rank.2.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank3.json --output_filename ~/rank.3.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank4.json --output_filename ~/rank.4.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank5.json --output_filename ~/rank.5.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank6.json --output_filename ~/rank.6.et --num_dims 1
$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank7.json --output_filename ~/rank.7.et --num_dims 1
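Likewise, the eight converter invocations can be expressed as a loop; this dry-run sketch prints the equivalent commands rather than running them:

```shell
# Dry-run sketch: print one et_converter command per rank 0..7.
convert_cmds=$(for i in $(seq 0 7); do
  echo "python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename ~/rank${i}.json --output_filename ~/rank.${i}.et --num_dims 1"
done)
printf '%s\n' "$convert_cmds"
```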

$ cd ~/astra-sim
$ ./build/astra_analytical/build.sh
$ ./build/astra_analytical/build/bin/AstraSim_Analytical_Congestion_Unaware \
    --workload-configuration=/Users/theo/rank \
    --system-configuration=./inputs/system/Switch.json \
    --network-configuration=./inputs/network/analytical/Switch.yml \
    --remote-memory-configuration=./inputs/remote_memory/analytical/no_memory_expansion.json
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
sys[2] finished, 7213509000 cycles                                
sys[6] finished, 7226613000 cycles                                
sys[0] finished, 7269182000 cycles                                
sys[4] finished, 7276689000 cycles                                
sys[1] finished, 7340042000 cycles                                
sys[7] finished, 7367494000 cycles                                
sys[5] finished, 7374663000 cycles                                
sys[3] finished, 7375565000 cycles


github-actions bot commented Feb 22, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@TaekyungHeo TaekyungHeo marked this pull request as ready for review February 23, 2024 01:30
@TaekyungHeo TaekyungHeo requested a review from a team as a code owner February 23, 2024 01:30
@github-actions github-actions bot locked and limited conversation to collaborators May 10, 2024
@JoongunPark JoongunPark reopened this May 10, 2024
@TaekyungHeo TaekyungHeo marked this pull request as draft August 2, 2024 10:31