Possible memory leak from NetCDFOutputWriter #3777

Open
Yixiao-Zhang opened this issue Sep 11, 2024 · 7 comments

@Yixiao-Zhang
Contributor

Recently, many of my simulations that run on clusters have crashed due to out-of-memory errors. I find that NetCDFOutputWriter seems to cause a memory leak, which can be reproduced by the code below:

using Printf
using Oceananigans
using Oceananigans.OutputWriters: write_output!

const arch = CPU()

const Nx = 50
const Ny = 50
const Nz = 50

grid = RectilinearGrid(
    arch,
    size = (Nx, Ny, Nz),
    x = (0., 1.),
    y = (0., 1.),
    z = (0., 1.),
    topology = (Bounded, Bounded, Bounded),
)

# Model
model = NonhydrostaticModel(;
    grid = grid,
)

output_writer = NetCDFOutputWriter(
    model,
    model.velocities,
    filename = "output.nc",
    schedule = TimeInterval(1.0),
)

for i in 1:1000
    write_output!(output_writer, model)
    # GC.gc()
    @info i
    @info Printf.@sprintf "Max. RSS:  %9.3f MiB\n" Sys.maxrss()/2^20
end

The total memory usage reported by Sys.maxrss keeps increasing over time, at a rate roughly equal to the output data size. Forcing GC.gc() slows down the trend but cannot stop the increase.

I believe it is a bug in NCDatasets. See Alexander-Barth/NCDatasets.jl#266. The version of NCDatasets is 0.14.5 in my case.

@glwagner
Member

Thanks for the x-ref with Alexander-Barth/NCDatasets.jl#266; let's keep an eye on how it develops over there.

@Alexander-Barth

As @Yixiao-Zhang found, there is already a reproducer in pure C about this issue (Unidata/netcdf-c#2626). As it is specific to the NetCDF 4 format, it might also be in libhdf5 (Unidata/netcdf-c#2626 (comment)).

I am wondering if NetCDFOutputWriter could keep the NetCDF file open between writes (and perhaps just call NCDatasets.sync to flush all changes to disk).
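
For reference, a minimal sketch of that pattern in plain NCDatasets (the file, dimension, and variable names here are hypothetical, not what Oceananigans actually writes):

using NCDatasets

# Open the file once, append and sync every iteration, close at the end.
ds = NCDataset("output.nc", "c")            # hypothetical file name

defDim(ds, "time", Inf)                     # unlimited record dimension
u = defVar(ds, "u", Float64, ("time",))     # hypothetical variable

for n in 1:1000
    u[n] = rand()                           # stand-in for the real output
    NCDatasets.sync(ds)                     # flush to disk without closing
end

close(ds)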

@Yixiao-Zhang
Contributor Author

Oceananigans used to keep NetCDF files open between writes (#568). However, this was changed in commit 9af559d. @ali-ramadhan might know the reason.

It seems to me that a quick fix is to downgrade NetCDF_jll. It is also astonishing to me that the memory leak seems to be amplified by allocations between ccalls. Maybe it is worth discussing on Julia Discourse?

@Yixiao-Zhang changed the title from "Possibly memory leak from NetCDFOutputWriter" to "Possible memory leak from NetCDFOutputWriter" on Sep 13, 2024
@ali-ramadhan
Member

I wish I had left more info in PR #1229, haha, but I think keeping the file open between writes was actually an oversight.

My thinking was that you don't want to leave the file open when you don't explicitly need it to be open, in case the simulation or Julia crashes. In some cases this could corrupt the file, or leave it unreadable by other processes until the Julia process exits. NetCDF files and NCDatasets.jl may not suffer from these issues, but I was probably doing it to be safe.
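
So the current writer effectively does something like this on every output (again a sketch with hypothetical names, the counterpart of the open-once sketch above):

using NCDatasets

for n in 1:1000
    # Re-open the existing file in append mode, write, and close it again,
    # so the file is left in a consistent state even if the run crashes.
    NCDataset("output.nc", "a") do ds
        ds["u"][n] = rand()
    end
end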

If downgrading NetCDF_jll works, that seems like a fine temporary fix? I don't think Oceananigans.jl needs super new features from NetCDF?

@Alexander-Barth

It is also astonishing to me that memory leak seems to be amplified by allocations

Yes, I think this is probably due to memory fragmentation (generated by the memory leak in netcdf-c). Julia needs a contiguous block of memory for its arrays. Even if netcdf-c leaks just a couple of bytes at a time, such large contiguous blocks become more and more difficult to find.
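
One rough way to see this with the reproducer above (not part of the original report, and reusing its output_writer and model) is to compare Julia's managed heap with the process RSS: if the heap stays flat while the RSS keeps growing, the extra memory is being held outside the GC, consistent with a leak in the C library rather than in Julia code.

using Printf

for i in 1:1000
    write_output!(output_writer, model)
    GC.gc()
    heap_mib = Base.gc_live_bytes() / 2^20   # memory tracked by Julia's GC
    rss_mib  = Sys.maxrss() / 2^20           # peak memory of the whole process
    @info Printf.@sprintf "Julia heap: %9.3f MiB   Max. RSS: %9.3f MiB" heap_mib rss_mib
end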

@Yixiao-Zhang
Contributor Author

NetCDF_jll <400.900 works for me. Adding the following lines to the Project.toml of Oceananigans, or of the environment used to run Oceananigans, should work, but it does not seem like best practice since it adds a stale dependency:

[deps]
NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"

[compat]
NetCDF_jll = "<400.900"

@glwagner
Member

This kind of fix has indeed been discouraged in the past... Perhaps we need a new patch for NCDatasets that we can add compat for?
