-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plugin support #263
base: main
Are you sure you want to change the base?
Plugin support #263
Conversation
8e3e909
to
17f8388
Compare
17f8388
to
e5841d2
Compare
ec87e78
to
07fc86f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For existing prologue we use real NCCL run. In your examples it seems that we are switching to some predefined commands.
- How are we going to generate it?
- Will that cover our needs? cc @srivatsankrishnan
I do have some code related notes, but let's leave it for later discussion.
Yes, it is one of the main design choices that we need to make. |
Can we rely on existing mechanisms? Each plugin will be defined as a regular Test TOML, meaning we can generate a CLI for it for a particular system. This is what we do now and it seems to cover all our needs for this feature. |
7594c19
to
852fee8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried verify-configs
with these changes? Will prologue/epilogue names be checked for existence?
tests/ref_data/gpt-no-plugin.sbatch
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name file gpt-no-prologue.sbatch
? We don't use epilogue here, this could be a nice additional test though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please find the updated names.
@@ -136,8 +139,14 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario: | |||
total_weight = sum(tr.weight for tr in ts_model.tests) | |||
normalized_weight = 0 if total_weight == 0 else 100 / total_weight | |||
|
|||
prologue, epilogue = None, None | |||
if ts_model.prologue: | |||
prologue = self.plugin_mapping.get(ts_model.prologue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we handle situation when specified name doesn't exist in the mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose not to raise an exception when plugins are missing. The plugins will not be executed, and I have added warning messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about meeting expectations, ours and our users'. Warning will likely go unnoticed, so we should answer to ourselves:
- If we raise an error here, it will make the plugin feature more visible to users, while we prefer it to "just work".
- However, plugins are meant to save time on costly runs. If we let it proceed without plugins when they’re enabled, it won’t fulfil its intended purpose.
@@ -49,14 +49,21 @@ def __init__(self, system_config_path: Path) -> None: | |||
self.system_config_path = system_config_path | |||
|
|||
def parse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this should be inlined into parse
function:
... # same as it is
plugin_test_path = Path("conf/plugin/test")
try:
plugin_tests = (
self.parse_tests(list(plugin_test_path.glob("*.toml")), system) if plugin_test_path.exists() else []
)
except TestConfigParsingError:
exit(1) # exit right away to keep error message readable for users
filtered_tests = tests
test_scenario: Optional[TestScenario] = None
if test_scenario_path:
test_mapping = {t.name: t for t in tests}
plugin_test_scenario_mapping = ... # load mapping here
try:
test_scenario = self.parse_test_scenario(test_scenario_path, test_mapping, plugin_test_scenario_mapping)
except TestScenarioParsingError:
exit(1) # exit right away to keep error message readable for users
scenario_tests = set(tr.test.name for tr in test_scenario.test_runs)
filtered_tests = [t for t in tests if t.name in scenario_tests]
return system, filtered_tests, test_scenario
Let's also define two constants for paths:
PLUGINS_ROOT = Path("conf/plugin")
PLUGINS_TESTS_ROOT = PLUGINS_ROOT / "tests"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please find the updated code.
@amaslenn , I ran verify-configs and got this warning
|
|
||
_, tests, _ = parser.parse(tests_dir, Path()) | ||
|
||
filtered_test_names = {t.name for t in tests} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify to assert tests == {"test-1"}
?
_, filtered_tests, _ = parser.parse(tests_dir, test_scenario_path) | ||
|
||
filtered_test_names = {t.name for t in filtered_tests} | ||
assert len(filtered_tests) == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] IMO it is OK to simplify to have only one regular test and one plugin.
|
||
test_mapping = {t.name: t for t in tests} | ||
plugin_test_scenario_mapping = {} | ||
if PLUGIN_ROOT.exists() and list(PLUGIN_ROOT.glob("*.toml")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's log.debug the existence of these directories and files, could be useful in future.
@@ -136,8 +139,14 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario: | |||
total_weight = sum(tr.weight for tr in ts_model.tests) | |||
normalized_weight = 0 if total_weight == 0 else 100 / total_weight | |||
|
|||
prologue, epilogue = None, None | |||
if ts_model.prologue: | |||
prologue = self.plugin_mapping.get(ts_model.prologue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about meeting expectations, ours and our users'. Warning will likely go unnoticed, so we should answer to ourselves:
- If we raise an error here, it will make the plugin feature more visible to users, while we prefer it to "just work".
- However, plugins are meant to save time on costly runs. If we let it proceed without plugins when they’re enabled, it won’t fulfil its intended purpose.
Summary
This PR introduces plugins to CloudAI. Plugins are tests that run either before or after each test in a test scenario. They are defined globally within a test scenario and are automatically executed for each test. There are two types of plugins: prologues and epilogues. Prologues run before the tests, while epilogues are executed after the tests. Multiple prologues and epilogues can be specified in each scenario.
An example of how plugins are defined within a test scenario:
You can see the prologue and epilogue fields. These are used to look up the corresponding plugin file. A plugin file is a separate test scenario file as shown below:
If any of the tests in the prologue fail, the main test or the epilogue tests will not run. In other words, the main test and epilogue run conditionally when the prologue is successful. The tests in plugins have time limits, just as tests in the main scenario do. Output files should be stored in the output directory, in a subdirectory called "prologue" or "epilogue," following a proper directory hierarchy. Plugins are not supported for NeMo 1.0 (NeMo launcher).
Note
Test Plan
2.1 Success
2.2 Failure