`pymatviz` import cost way too high #209

janosh · 2024-09-25T12:48:50Z

takes almost 2 seconds to import the latest version of pymatviz. used to be a lot less. i suspect #194 may have caused this problem without me noticing at the time. current import waterfall can be generated with:

time python -X importtime -c 'import pymatviz'
...
fraction
import time:       157 |        157 |       pymatgen.analysis.diffraction.core
import time:       624 |        966 |     pymatgen.analysis.diffraction.xrd
import time:       276 |       1241 |   pymatviz.xrd
import time:       175 |        175 |   pymatviz.histogram
import time:       992 |    2308598 | pymatviz
python -X importtime -c 'import pymatviz'  1.88s user 0.33s system 82% cpu 2.705 total

should figure out which modules take the longest to load, and decide if they really need to be imported from pymatviz/__init__.py or if they could be lazy-loaded instead (i.e. only loaded if the user actually accesses them) and put some regression tests in place to ensure no future PRs blow up the import time again. might also make sense to see how large code bases like pandas keep their import time low

The text was updated successfully, but these errors were encountered:

janosh · 2024-09-25T13:15:39Z

here's a rough first draft of what the import time regression test might look like. doesn't work yet and the values in BASELINE_IMPORT_TIMES are just guesses, need to be changed to correct values once import times are shorter again

import sys
import time
from importlib.util import find_spec, module_from_spec
from types import ModuleType

import pytest


def uncached_import(module_name: str) -> ModuleType:
    # Find the module spec
    spec = find_spec(module_name)
    if spec is None:
        raise ImportError(f"No module named '{module_name}'")

    # Create a new module based on the spec
    module = module_from_spec(spec)

    # Add the module to sys.modules temporarily
    sys.modules[module_name] = module

    try:
        # Load the module
        spec.loader.exec_module(module)
    finally:
        # Remove the module from sys.modules
        sys.modules.pop(module_name, None)

    return module


def measure_import_time(module_name: str) -> float:
    # Remove the module and its submodules from sys.modules if they're already there
    to_remove = [
        name
        for name in sys.modules
        if name == module_name or name.startswith(f"{module_name}.")
    ]
    for name in to_remove:
        sys.modules.pop(name, None)

    # Measure import time
    start_time = time.perf_counter()
    uncached_import(module_name)
    end_time = time.perf_counter()

    return end_time - start_time


def measure_import_times(modules: dict[str, float]) -> dict[str, float]:
    return {module: measure_import_time(module) for module in modules}


# Hard-coded baseline import times (replace these with actual measured values)
BASELINE_IMPORT_TIMES = {
    "pymatviz": 0.8000,
    "pymatviz.ptable": 1.1,
    "pymatviz.io": 0.0300,
    "pymatviz.scatter": 0.0400,
    "pymatviz.phonons": 0.0200,
}


def test_import_time_regression() -> None:
    import_times = measure_import_times(BASELINE_IMPORT_TIMES)

    for module, baseline_time in BASELINE_IMPORT_TIMES.items():
        measured_time = import_times[module]

        # rel=0.2 allows up to 20% increase in import time (CI is slower than local)
        assert measured_time == pytest.approx(
            baseline_time, rel=0.2
        ), f"{module} import took {measured_time:.4f}s, expected {baseline_time:.4f}s"

    # Additional assertion to check if any module takes an unusually long time to import
    max_import_time = max(import_times.values())
    assert (
        max_import_time <= 2.0
    ), f"A module is taking an unusually long time to import: {max_import_time:.4f}s"

janosh · 2024-10-18T20:20:31Z

Running

pip install tuna
python -X importtime -c "import pymatviz" 2> pmv.log
tuna pmv.log

reveals that one major factor for the high import cost of pymatviz comes from monty:

pmv.log.zip

which imports

try:
    import torch
except ImportError:
    torch = None

at the module level rather than inside MontyEncoder.default where its only use is to check if isinstance(o, torch.Tensor). that import should be moved inside the actual function or perhaps replaced by a heuristic like if o.__class__.__qualname__ == 'Tensor' and o.__class__.__module__ == 'torch'

DanielYang59 · 2024-10-20T09:56:14Z

Thanks for sharing that tool (again), I would take over from here :)

janosh added ux User experience pkg Package labels Sep 25, 2024

DanielYang59 self-assigned this Sep 25, 2024

DanielYang59 mentioned this issue Oct 20, 2024

Lazily import torch/pydantic in json module, speedup from monty.json import by 10x materialsvirtuallab/monty#713

Merged

DanielYang59 linked a pull request Oct 20, 2024 that will close this issue

Speedup import and add regression check for import time #238

Open

2 tasks

janosh mentioned this issue Oct 30, 2024

bump monty to use the monty.json import speedup patch, add import time regression test, lazy load some rarely used but costly modules materialsproject/pymatgen#4128

Open

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`pymatviz` import cost way too high #209

`pymatviz` import cost way too high #209

janosh commented Sep 25, 2024

janosh commented Sep 25, 2024 •

edited

Loading

janosh commented Oct 18, 2024

DanielYang59 commented Oct 20, 2024

pymatviz import cost way too high #209

pymatviz import cost way too high #209

Comments

janosh commented Sep 25, 2024

janosh commented Sep 25, 2024 • edited Loading

janosh commented Oct 18, 2024

DanielYang59 commented Oct 20, 2024

`pymatviz` import cost way too high #209

`pymatviz` import cost way too high #209

janosh commented Sep 25, 2024 •

edited

Loading