Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pymatviz import cost way too high #209

Open
janosh opened this issue Sep 25, 2024 · 3 comments · May be fixed by #238
Open

pymatviz import cost way too high #209

janosh opened this issue Sep 25, 2024 · 3 comments · May be fixed by #238
Assignees
Labels
pkg Package ux User experience

Comments

@janosh
Copy link
Owner

janosh commented Sep 25, 2024

takes almost 2 seconds to import the latest version of pymatviz. used to be a lot less. i suspect #194 may have caused this problem without me noticing at the time. current import waterfall can be generated with:

time python -X importtime -c 'import pymatviz'
...
fraction
import time:       157 |        157 |       pymatgen.analysis.diffraction.core
import time:       624 |        966 |     pymatgen.analysis.diffraction.xrd
import time:       276 |       1241 |   pymatviz.xrd
import time:       175 |        175 |   pymatviz.histogram
import time:       992 |    2308598 | pymatviz
python -X importtime -c 'import pymatviz'  1.88s user 0.33s system 82% cpu 2.705 total

should figure out which modules take the longest to load, and decide if they really need to be imported from pymatviz/__init__.py or if they could be lazy-loaded instead (i.e. only loaded if the user actually accesses them) and put some regression tests in place to ensure no future PRs blow up the import time again. might also make sense to see how large code bases like pandas keep their import time low

@janosh janosh added ux User experience pkg Package labels Sep 25, 2024
@DanielYang59 DanielYang59 self-assigned this Sep 25, 2024
@janosh
Copy link
Owner Author

janosh commented Sep 25, 2024

here's a rough first draft of what the import time regression test might look like. doesn't work yet and the values in BASELINE_IMPORT_TIMES are just guesses, need to be changed to correct values once import times are shorter again

import sys
import time
from importlib.util import find_spec, module_from_spec
from types import ModuleType

import pytest


def uncached_import(module_name: str) -> ModuleType:
    # Find the module spec
    spec = find_spec(module_name)
    if spec is None:
        raise ImportError(f"No module named '{module_name}'")

    # Create a new module based on the spec
    module = module_from_spec(spec)

    # Add the module to sys.modules temporarily
    sys.modules[module_name] = module

    try:
        # Load the module
        spec.loader.exec_module(module)
    finally:
        # Remove the module from sys.modules
        sys.modules.pop(module_name, None)

    return module


def measure_import_time(module_name: str) -> float:
    # Remove the module and its submodules from sys.modules if they're already there
    to_remove = [
        name
        for name in sys.modules
        if name == module_name or name.startswith(f"{module_name}.")
    ]
    for name in to_remove:
        sys.modules.pop(name, None)

    # Measure import time
    start_time = time.perf_counter()
    uncached_import(module_name)
    end_time = time.perf_counter()

    return end_time - start_time


def measure_import_times(modules: dict[str, float]) -> dict[str, float]:
    return {module: measure_import_time(module) for module in modules}


# Hard-coded baseline import times (replace these with actual measured values)
BASELINE_IMPORT_TIMES = {
    "pymatviz": 0.8000,
    "pymatviz.ptable": 1.1,
    "pymatviz.io": 0.0300,
    "pymatviz.scatter": 0.0400,
    "pymatviz.phonons": 0.0200,
}


def test_import_time_regression() -> None:
    import_times = measure_import_times(BASELINE_IMPORT_TIMES)

    for module, baseline_time in BASELINE_IMPORT_TIMES.items():
        measured_time = import_times[module]

        # rel=0.2 allows up to 20% increase in import time (CI is slower than local)
        assert measured_time == pytest.approx(
            baseline_time, rel=0.2
        ), f"{module} import took {measured_time:.4f}s, expected {baseline_time:.4f}s"

    # Additional assertion to check if any module takes an unusually long time to import
    max_import_time = max(import_times.values())
    assert (
        max_import_time <= 2.0
    ), f"A module is taking an unusually long time to import: {max_import_time:.4f}s"

@janosh
Copy link
Owner Author

janosh commented Oct 18, 2024

Running

pip install tuna
python -X importtime -c "import pymatviz" 2> pmv.log
tuna pmv.log

reveals that one major factor for the high import cost of pymatviz comes from monty:

pmv.log.zip

which imports

try:
    import torch
except ImportError:
    torch = None

at the module level rather than inside MontyEncoder.default where its only use is to check if isinstance(o, torch.Tensor). that import should be moved inside the actual function or perhaps replaced by a heuristic like if o.__class__.__qualname__ == 'Tensor' and o.__class__.__module__ == 'torch'

Screenshot 2024-10-18 at 16 14 09

@DanielYang59
Copy link
Collaborator

Thanks for sharing that tool (again), I would take over from here :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg Package ux User experience
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants