-
Hello @rustatian,
-
Hey @lanphan )) Nice to meet you again 😄
-
Also, it's better to use 2-3 RR instances on 1 node (with 10-20 workers each, set in the pool section as sketched below) (RR supports
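The per-instance worker count lives in the pool section of the config; a minimal sketch, assuming the RR v2 config layout:

```yaml
# Sketch only: worker pool size per RR instance (RR v2-style config assumed)
http:
  pool:
    num_workers: 16   # e.g. 10-20 workers per instance, as suggested above
```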
-
@rustatian, thanks so much again.
-
A metric is just a value; when to scale (the red point) depends on your application. If you decide that 300 waiting requests affect your users, try to start scaling at 250 queued requests 😃
RR has metrics for the number of free/busy/invalid workers, as well as consumed memory and worker state.
To activate these metrics, add the following to your config:

```yaml
http:
  address: 127.0.0.1:12811
  max_request_size: 1024
  middleware: [ "http_metrics" ] # <----
```
-
I recently adjusted HPA to scale based on free workers. This is the custom metric rule:

```yaml
- seriesQuery: 'rr_http_total_workers{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace:
        resource: "namespace"
      pod:
        resource: "pod"
  name:
    matches: "rr_http_total_workers"
    as: "php_workers_utilization"
  metricsQuery: 'avg(100 - (rr_http_workers_ready{<<.LabelMatchers>>} / rr_http_total_workers{<<.LabelMatchers>>} * 100)) by (<<.GroupBy>>)'
```

and then add this part to the HPA config:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: php_workers_utilization
    target:
      type: AverageValue
      averageValue: "% of not available workers"
```

Seems to be working sweet! Test setup: each pod has 1 worker. Started with a single pod, 50% scale on 50% of unavailable workers.
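For reference, a sketch of how the whole HPA object could look with that metric; the deployment name php-app, the replica bounds, and the averageValue of 50 (matching the 50% threshold from the test above) are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-app             # assumed name, adjust to your deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-app           # assumed target deployment
  minReplicas: 1
  maxReplicas: 10           # assumed upper bound
  metrics:
  - type: Pods
    pods:
      metric:
        name: php_workers_utilization
      target:
        type: AverageValue
        averageValue: "50"  # scale out once ~50% of workers are unavailable
```

You can confirm the adapter actually exposes the metric with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/php_workers_utilization"` before wiring the HPA to it.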