
Add metrics about the solver-service behaviour #70

Open · wants to merge 5 commits into main
Conversation

moyodiallo (Contributor)

No description provided.

@moyodiallo marked this pull request as ready for review on July 18, 2023 08:16
@tmcgilchrist (Member)

@moyodiallo Could you update this to the EIO version of the solver?

@moyodiallo (Contributor, Author)

The metrics are adapted to the EIO version in #71. They still need some testing with Grafana graphs.

cc: @mtelvers

@tmcgilchrist (Member) left a comment

Everything else looks good. @moyodiallo, co-ordinate with @mtelvers to deploy it to a solver worker and check that the statistics are exported correctly. Do we still have a test solver pool available?

service/pool.ml Outdated
handle request |> Promise.resolve set_reply;
Atomic.decr t.running;
Member

This pattern (incr, handle, decr) feels unsafe in the presence of exceptions from handle request. cc @talex5
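
As a minimal sketch (not the PR's code), one way to make the decrement exception-safe, reusing the names from the snippet above:

```ocaml
(* Sketch only: t.running, handle, request and set_reply are taken from the
   snippet above; the wrapper name run_one is illustrative. Fun.protect runs
   the [finally] action even if [handle request] raises, so the counter is
   always decremented. *)
let run_one t handle request set_reply =
  Atomic.incr t.running;
  Fun.protect
    ~finally:(fun () -> Atomic.decr t.running)
    (fun () -> handle request |> Promise.resolve set_reply)
```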

Contributor (Author)

handle request needs to catch those exceptions, otherwise we lose a worker, which would be a bug.

Contributor

Yes, if a worker crashes then the whole service exits, so it's not strictly necessary to handle exceptions here, although it would make the code more robust to future changes. However, I don't think you need this counter. The number of running workers is always min n_workers n_jobs, so you can just calculate it as needed.
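
As a minimal sketch of that calculation, with illustrative names:

```ocaml
(* Sketch only: with a fixed pool of [n_workers] and [n_jobs] outstanding
   jobs, the number of busy workers (and the queue depth) can be computed
   on demand instead of being tracked with an atomic counter. *)
let running_workers ~n_workers ~n_jobs = min n_workers n_jobs
let queued_jobs ~n_workers ~n_jobs = max 0 (n_jobs - n_workers)
```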

@moyodiallo (Contributor, Author) commented on Oct 4, 2023

Yes, you're right; I didn't get that min n_workers n_jobs. Thanks.

Contributor (Author)

Oh, we don't have n_jobs in the pool. All we have is the waiting jobs (a jobs Eio.Stream.t). It won't be precise if we consider the waiting jobs as n_jobs, e.g. 2 jobs waiting while all 8 workers are busy.

Member

What would be useful to see is:

  1. n_workers - static capacity of this solver_worker
  2. queued_requests - requests waiting for a Fiber to run on
  3. running_workers - Fibers actively running a solve job

That would let us answer questions like the total solver_worker capacity available, utilisation of that capacity, and saturation both overall and per solver_worker.
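
A minimal sketch of how these three values could be exposed with the ocaml-prometheus library; the metric names below are illustrative, not necessarily the ones used in this PR:

```ocaml
(* Sketch only: illustrative gauge definitions for the three values above,
   using the ocaml-prometheus API. *)
open Prometheus

let namespace = "solver"
let subsystem = "pool"

let n_workers =
  Gauge.v ~help:"Static capacity of this solver_worker"
    ~namespace ~subsystem "workers_total"

let queued_requests =
  Gauge.v ~help:"Requests waiting for a fiber to run on"
    ~namespace ~subsystem "queued_requests"

let running_workers =
  Gauge.v ~help:"Fibers actively running a solve job"
    ~namespace ~subsystem "running_workers"
```

The main domain would then update them as needed, e.g. Gauge.set n_workers (float_of_int capacity).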

> The number of running workers is always min n_workers n_jobs, so you can just calculate it as needed.

I thought from reading the code and https://github.com/ocaml-multicore/eio?search=1#multicore-support that Pool pre-forked a number of OS threads (Domains?) which would wake up when work was added to the Eio.Stream. Is that accurate, @talex5?

Contributor

That's right - there is a fixed pool of workers and they will all be busy unless there just aren't any jobs in the queue.

The issue here is that run_worker is running in a worker domain, and so can't (currently) update Prometheus metrics itself (which I guess is why it's updating an atomic instead and waiting for the main domain to report that). But the main domain can work out how many workers are running just by knowing how many jobs are outstanding, so this isn't necessary.

Contributor (Author)

That's right, there's a way to do that.

service/solver.ml Outdated (resolved)
moyodiallo and others added 2 commits October 18, 2023 15:39
solver instead of worker makes sense.

Co-authored-by: Mark Elvers <[email protected]>
The pool stream wasn't accumulating the requests, so it was incorrect to consider its length as the number of waiting requests.
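
A minimal sketch of the idea behind that fix, with illustrative names: since the stream hands requests straight to a waiting worker, a separate counter updated on submit and take can report the waiting requests instead of the stream length.

```ocaml
(* Sketch only: if Eio.Stream.take hands requests directly to workers, the
   stream length is not a usable queue depth; an explicit counter is. All
   names (t.queued, t.stream, submit, next_request) are illustrative. *)
let submit t request =
  Atomic.incr t.queued;
  Eio.Stream.add t.stream request

let next_request t =
  let request = Eio.Stream.take t.stream in
  Atomic.decr t.queued;
  request
```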