
Plugin support #263

Open · wants to merge 50 commits into main from plugin-jan
Conversation

@TaekyungHeo (Member) commented Oct 14, 2024

Summary

This PR introduces plugins to CloudAI. Plugins are tests that run either before or after each test in a test scenario. They are defined globally within a test scenario and are executed automatically for each test. There are two types of plugins: prologues and epilogues. Prologues run before each test, while epilogues run after it. Multiple prologues and epilogues can be specified in each scenario.

An example of how plugins are defined within a test scenario:

name = "nccl-test"

prologue = "nccl_test_prologue"
epilogue = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
num_nodes = "2"
time_limit = "00:20:00"

[[Tests]]
id = "Tests.2"
test_name = "nccl_test_all_gather"
num_nodes = "2"
time_limit = "00:20:00"
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.1"

The prologue and epilogue fields are used to look up the corresponding plugin files. A plugin file is a separate test scenario file, as shown below:

name = "nccl_test_prologue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
time_limit = "00:20:00"

If any of the tests in the prologue fail, neither the main test nor the epilogue tests will run. In other words, the main test and the epilogue run conditionally on the prologue succeeding (see the sketch below). Tests in plugins have time limits, just as tests in the main scenario do. Output files are stored in the output directory, in a subdirectory named "prologue" or "epilogue", following a proper directory hierarchy. Plugins are not supported for NeMo 1.0 (NeMo launcher).
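The conditional flow can be summarized with a short sketch. This is a minimal illustration of the behavior described above, not the actual CloudAI implementation; run_scenario is a hypothetical callable supplied by the caller:

from typing import Callable, Optional

def run_with_plugins(
    run_scenario: Callable[[str], bool],  # executes a scenario by name, returns True on success
    main_test: str,
    prologue: Optional[str] = None,
    epilogue: Optional[str] = None,
) -> bool:
    # If any prologue test fails, skip both the main test and the epilogue.
    if prologue is not None and not run_scenario(prologue):
        return False
    ok = run_scenario(main_test)
    # The epilogue is gated on the prologue succeeding, not on the main test result.
    if epilogue is not None:
        run_scenario(epilogue)
    return ok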


Test Plan

  1. CI passes
  2. Manual run
    2.1 Success
$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1                                                                            
[INFO] Scheduler: slurm                                                                                 
[INFO] Test Scenario Name: nccl-test                                                                    
[INFO] Checking if test templates are installed.                                                        
[INFO] Test Scenario: nccl-test                                                                         

Section Name: Tests.1                                                                                   
  Test Name: nccl_test_all_reduce                                                                       
  Description: all_reduce                                                                               
  No dependencies                                                                                       
[INFO] Initializing Runner [RUN] mode                                                                   
[INFO] Creating SlurmRunner                                                                             
[INFO] Starting test scenario execution.                                                                
[INFO] Starting test: Tests.1                                                                           
[INFO] Running test: Tests.1                                                                            
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
$ cd /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0
$ ls
cloudai_sbatch_script.sh  epilogue  prologue  stderr.txt  stdout.txt

$ ls prologue/nccl_test_all_reduce/
stderr.txt  stdout.txt

$ ls epilogue/nccl_test_all_gather/
stderr.txt  stdout.txt

2.2 Failure

$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/cloudai_sbatch_script.sh
[ERROR] Job 383928 for test Tests.1 failed: Missing success indicators in /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt: '# Out of bounds values', '# Avg bus bandwidth'. These keywords are expected to be present in stdout.txt, usually towards the end of the file. Please review the NCCL test output and errors in the file. Ensure the NCCL test ran to completion. You can run the generated sbatch script manually and check if /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt is created and contains the expected keywords. If the issue persists, contact the system administrator.
[INFO] Terminating all jobs...
[INFO] All jobs have been killed.

@amaslenn (Contributor) left a comment

For the existing prologue we use a real NCCL run. In your examples it seems that we are switching to some predefined commands.

  1. How are we going to generate it?
  2. Will that cover our needs? cc @srivatsankrishnan

I do have some code-related notes, but let's leave them for a later discussion.

@TaekyungHeo (Member, Author) replied:
@amaslenn

> How are we going to generate it?

Yes, it is one of the main design choices that we need to make.

@amaslenn (Contributor) replied:

> Yes, it is one of the main design choices that we need to make.

Can we rely on existing mechanisms? Each plugin will be defined as a regular Test TOML, meaning we can generate a CLI for it for a particular system. This is what we do now, and it seems to cover all our needs for this feature.

@TaekyungHeo force-pushed the plugin-jan branch 15 times, most recently from 7594c19 to 852fee8 on October 24, 2024 at 19:54
@TaekyungHeo reopened this on Oct 29, 2024
@amaslenn (Contributor) left a comment

Have you tried verify-configs with these changes? Will prologue/epilogue names be checked for existence?

tests/ref_data/gpt-plugin.sbatch (outdated)
Contributor:

Should we name the file gpt-no-prologue.sbatch? We don't use an epilogue here; that could be a nice additional test, though.

Member Author:

Please find the updated names.

@@ -136,8 +139,14 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario:
        total_weight = sum(tr.weight for tr in ts_model.tests)
        normalized_weight = 0 if total_weight == 0 else 100 / total_weight

        prologue, epilogue = None, None
        if ts_model.prologue:
            prologue = self.plugin_mapping.get(ts_model.prologue)
Contributor:

Should we handle the situation when the specified name doesn't exist in the mapping?

Member Author:

I chose not to raise an exception when plugins are missing. The plugins will not be executed, and I have added warning messages.
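For illustration, a minimal sketch of this lookup-with-warning behavior, assuming a plugin_mapping dict as in the diff above; resolve_plugin is a hypothetical helper, not the actual code, and the warning text mirrors the verify-configs output shown later in this thread:

import logging
from typing import Optional

def resolve_plugin(name: Optional[str], plugin_mapping: dict) -> Optional[object]:
    # Return the plugin scenario if found; otherwise warn and return None,
    # so the run proceeds without the plugin instead of raising an exception.
    if not name:
        return None
    plugin = plugin_mapping.get(name)
    if plugin is None:
        logging.warning(
            f"Prologue '{name}' not found in plugin mapping. Ensure that a "
            "proper plugin directory is set under the working directory."
        )
    return plugin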

Contributor:

This is about meeting expectations, ours and our users'. A warning will likely go unnoticed, so we should answer for ourselves:

  1. If we raise an error here, it will make the plugin feature more visible to users, while we prefer it to "just work".
  2. However, plugins are meant to save time on costly runs. If we let a run proceed without plugins when they're enabled, the feature won't fulfil its intended purpose.

src/cloudai/_core/test_scenario_parser.py (outdated)
@@ -49,14 +49,21 @@ def __init__(self, system_config_path: Path) -> None:
        self.system_config_path = system_config_path

    def parse(
Contributor:

IMO this should be inlined into the parse function:

... # same as it is

plugin_test_path = Path("conf/plugin/test")
try:
    plugin_tests = (
        self.parse_tests(list(plugin_test_path.glob("*.toml")), system) if plugin_test_path.exists() else []
    )
except TestConfigParsingError:
    exit(1)  # exit right away to keep error message readable for users

filtered_tests = tests
test_scenario: Optional[TestScenario] = None
if test_scenario_path:
    test_mapping = {t.name: t for t in tests}
    plugin_test_scenario_mapping = ... # load mapping here
    try:
        test_scenario = self.parse_test_scenario(test_scenario_path, test_mapping, plugin_test_scenario_mapping)
    except TestScenarioParsingError:
        exit(1)  # exit right away to keep error message readable for users
    scenario_tests = set(tr.test.name for tr in test_scenario.test_runs)
    filtered_tests = [t for t in tests if t.name in scenario_tests]

return system, filtered_tests, test_scenario

Let's also define two constants for paths:

PLUGINS_ROOT = Path("conf/plugin")
PLUGINS_TESTS_ROOT = PLUGINS_ROOT / "tests"

Member Author:

Please find the updated code.

@TaekyungHeo (Member, Author):
@amaslenn, I ran verify-configs and got these warnings:

$ cloudai verify-configs conf
[WARNING] Test configuration directory not provided, using all found test TOMLs in the specified directory.
[INFO] Checked systems: 3, all passed
[INFO] Checked tests: 40, all passed
[WARNING] System configuration not provided, mocking it.
[WARNING] Prologue 'nccl_test_prologue' not found in plugin mapping. Ensure that a proper plugin directory is set under the working directory.
[WARNING] Epilogue 'nccl_test_epilogue' not found in plugin mapping. Ensure that a proper plugin directory is set under the working directory.
[INFO] Checked scenarios: 9, all passed
[INFO] Checked 52 configuration files, all passed


_, tests, _ = parser.parse(tests_dir, Path())

filtered_test_names = {t.name for t in tests}
Contributor:

Can we simplify to assert tests == {"test-1"}?
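A minimal sketch of the suggested simplification, assuming the same parser and tests_dir fixture as the snippet above and that "test-1" is the only test defined there:

_, tests, _ = parser.parse(tests_dir, Path())

# Compare the set of test names directly, without an intermediate variable.
assert {t.name for t in tests} == {"test-1"}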

_, filtered_tests, _ = parser.parse(tests_dir, test_scenario_path)

filtered_test_names = {t.name for t in filtered_tests}
assert len(filtered_tests) == 2
Contributor:

[minor] IMO it is OK to simplify to have only one regular test and one plugin.


test_mapping = {t.name: t for t in tests}
plugin_test_scenario_mapping = {}
if PLUGIN_ROOT.exists() and list(PLUGIN_ROOT.glob("*.toml")):
Contributor:

Let's log.debug the existence of these directories and files; it could be useful in the future.
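A sketch of the suggested debug logging, assuming PLUGIN_ROOT is defined as in the snippet above:

import logging

# Record whether the plugin directory and its TOMLs exist, for easier debugging.
logging.debug(f"Plugin root '{PLUGIN_ROOT}' exists: {PLUGIN_ROOT.exists()}")
plugin_tomls = list(PLUGIN_ROOT.glob("*.toml")) if PLUGIN_ROOT.exists() else []
logging.debug(f"Found {len(plugin_tomls)} plugin scenario TOMLs under '{PLUGIN_ROOT}'")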


Labels: feature, Jan25 (Jan'25 release feature)