
Add: exponential backoff for CAS operations on floats #1661

Open · wants to merge 4 commits into main

Conversation

@imorph commented Oct 26, 2024

What problem this PR is solving

Hi!
Issues like cockroachdb/cockroach#133306 pushed me to investigate whether the client metrics library can be optimized for CPU and latency, especially under contention. That is what this PR is mostly about.

Proposed changes

Add a form of exponential backoff to the tight CAS loops, which should reduce contention and, as a result, lead to better latencies and lower CPU consumption. This is a well-known trick used in many projects that deal with concurrency.

In addition, this logic was refactored into a single place in the code, because atomic updates of float64 values are needed in several parts of the codebase.
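
For illustration, here is a minimal sketch of what such a CAS-with-backoff helper can look like (shaped after the atomicUpdateFloat seen in the profiles below; the constants are the ones listed under Downsides, and the actual code in the PR may differ in detail):

package prometheus

import (
	"math"
	"sync/atomic"
	"time"
)

// atomicUpdateFloat applies updateFunc to the float64 stored in bits,
// retrying with exponential backoff whenever the CAS loses a race.
func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		initialBackoff = 10 * time.Millisecond  // empirical, see Downsides
		maxBackoff     = 320 * time.Millisecond // empirical, see Downsides
	)
	backoff := initialBackoff
	for {
		loadedBits := atomic.LoadUint64(bits)
		newBits := math.Float64bits(updateFunc(math.Float64frombits(loadedBits)))
		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			return
		}
		// Lost the race: sleep before retrying, doubling the delay up to a cap,
		// so contending goroutines spread out instead of hammering the same
		// cache line in a tight loop.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}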

Results

All test results are from an AWS c7i.2xlarge instance.

main vs proposed for histograms:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                           │ ../../bhOLD.txt │           ../../bhNEW.txt           │
                           │     sec/op      │    sec/op     vs base               │
HistogramWithLabelValues-8      87.81n ±  1%   90.36n ±  1%   +2.90% (p=0.002 n=6)
HistogramNoLabels-8             36.15n ±  1%   39.55n ±  0%   +9.42% (p=0.002 n=6)
HistogramObserve1-8             31.23n ±  1%   32.52n ±  1%   +4.15% (p=0.002 n=6)
HistogramObserve2-8            242.45n ± 16%   86.89n ±  0%  -64.16% (p=0.002 n=6)
HistogramObserve4-8             372.8n ± 14%   175.0n ± 19%  -53.05% (p=0.002 n=6)
HistogramObserve8-8             953.0n ±  1%   415.4n ± 32%  -56.41% (p=0.002 n=6)
HistogramWrite1-8               1.456µ ± 11%   1.454µ ± 12%        ~ (p=0.699 n=6)
HistogramWrite2-8               3.103µ ±  3%   3.125µ ±  7%        ~ (p=0.485 n=6)
HistogramWrite4-8               6.087µ ±  3%   6.041µ ±  4%        ~ (p=0.937 n=6)
HistogramWrite8-8               11.71µ ±  2%   11.90µ ±  3%   +1.61% (p=0.004 n=6)
geomean                         554.5n         434.5n        -21.64%

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │      B/op       │    B/op     vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │    allocs/op    │ allocs/op   vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The changes lead to a slight increase in latency in the single-threaded case, but decrease it significantly under contention.

main vs proposed for summaries:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                         │ ../../bsOLD.txt │           ../../bsNEW.txt           │
                         │     sec/op      │    sec/op     vs base               │
SummaryWithLabelValues-8      264.0n ±  1%   267.6n ±  3%   +1.34% (p=0.017 n=6)
SummaryNoLabels-8             186.6n ±  1%   186.6n ±  0%        ~ (p=0.920 n=6)
SummaryObserve1-8             22.57n ±  1%   23.66n ±  0%   +4.83% (p=0.002 n=6)
SummaryObserve2-8            192.75n ± 31%   52.13n ±  1%  -72.95% (p=0.002 n=6)
SummaryObserve4-8             297.9n ± 27%   126.7n ± 16%  -57.47% (p=0.002 n=6)
SummaryObserve8-8             937.5n ±  5%   319.1n ± 48%  -65.96% (p=0.002 n=6)
SummaryWrite1-8               135.1n ± 32%   134.1n ±  9%        ~ (p=0.485 n=6)
SummaryWrite2-8               319.9n ±  3%   327.4n ±  4%        ~ (p=0.240 n=6)
SummaryWrite4-8               665.7n ±  5%   664.0n ±  9%        ~ (p=0.699 n=6)
SummaryWrite8-8               1.402µ ±  9%   1.318µ ± 11%        ~ (p=0.180 n=6)
geomean                       274.3n         198.6n        -27.59%

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │      B/op       │    B/op     vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │    allocs/op    │ allocs/op   vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The same holds for summaries.

Some additional illustrations: the gap between the two implementations widens as the number of parallel/contending goroutines grows.
(charts attached to the PR)

A side-by-side comparison of the backoff and no-backoff implementations is in the atomic_update_test.go file (a sketch of the general benchmark shape follows the numbers below):

BenchmarkAtomicUpdateFloat_SingleGoroutine-8   	94284844	       12.59 ns/op
BenchmarkAtomicNoBackoff_SingleGoroutine-8     	97815223	       12.24 ns/op
BenchmarkAtomicUpdateFloat_1Goroutine-8        	91764830	       15.45 ns/op
BenchmarkAtomicNoBackoff_1Goroutine-8          	13282502	       90.47 ns/op
BenchmarkAtomicUpdateFloat_2Goroutines-8       	67590978	       19.70 ns/op
BenchmarkAtomicNoBackoff_2Goroutines-8         	13490074	       89.20 ns/op
BenchmarkAtomicUpdateFloat_4Goroutines-8       	88974640	       14.74 ns/op
BenchmarkAtomicNoBackoff_4Goroutines-8         	13480230	       89.29 ns/op
BenchmarkAtomicUpdateFloat_8Goroutines-8       	85171494	       13.94 ns/op
BenchmarkAtomicNoBackoff_8Goroutines-8         	13474435	       89.39 ns/op
BenchmarkAtomicUpdateFloat_16Goroutines-8      	80687557	       14.96 ns/op
BenchmarkAtomicNoBackoff_16Goroutines-8        	13556298	       90.21 ns/op
BenchmarkAtomicUpdateFloat_32Goroutines-8      	71862175	       14.98 ns/op
BenchmarkAtomicNoBackoff_32Goroutines-8        	13522851	       88.90 ns/op
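
For reference, a contended micro-benchmark for such a helper can be structured roughly like this (a sketch only, reusing the atomicUpdateFloat helper sketched earlier; the actual atomic_update_test.go may set things up differently):

package prometheus

import (
	"sync"
	"testing"
)

// Several goroutines repeatedly update the same float64 through the CAS
// helper, so most iterations race with each other.
func BenchmarkAtomicUpdateFloatContended(b *testing.B) {
	const goroutines = 8
	var bits uint64 // shared float64 stored as raw bits
	var wg sync.WaitGroup

	b.ResetTimer()
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < b.N/goroutines; i++ {
				atomicUpdateFloat(&bits, func(f float64) float64 { return f + 1.0 })
			}
		}()
	}
	wg.Wait()
}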

CPU profiles

for go test -bench='BenchmarkHistogramObserve*' -run=^# -count=6
old top:

      flat  flat%   sum%        cum   cum%
     0.56s  0.34%  0.34%    166.28s 99.93%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.57s  0.34%  0.68%    165.71s 99.59%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
    32.65s 19.62% 20.30%    128.03s 76.94%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
    56.26s 33.81% 54.11%     95.38s 57.32%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
    29.38s 17.66% 71.77%     39.12s 23.51%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     0.25s  0.15% 71.92%     37.11s 22.30%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
    33.38s 20.06% 91.98%     36.86s 22.15%  sort.SearchFloat64s (inline)
     9.10s  5.47% 97.45%      9.10s  5.47%  math.Float64frombits (inline)
     2.05s  1.23% 98.68%      3.46s  2.08%  sort.Search
     1.41s  0.85% 99.53%      1.41s  0.85%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

new top:

      flat  flat%   sum%        cum   cum%
     0.48s  1.94%  1.94%     24.61s 99.31%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.41s  1.65%  3.59%     24.13s 97.38%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
     7.89s 31.84% 35.43%     19.77s 79.78%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
        5s 20.18% 55.61%     11.88s 47.94%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
         0     0% 55.61%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     5.99s 24.17% 79.78%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicUpdateFloat
     0.18s  0.73% 80.51%      3.95s 15.94%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
     0.51s  2.06% 82.57%      3.77s 15.21%  sort.SearchFloat64s (inline)
     2.01s  8.11% 90.68%      3.26s 13.16%  sort.Search
     1.25s  5.04% 95.72%      1.25s  5.04%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

Downsides

Everything comes at a price; in this case:

  • the code is slightly more complex
  • some magic constants are introduced (10ms and 320ms, mostly a result of empirical testing)
  • single-threaded performance is weaker by 4%-5%
  • time.Sleep may introduce an additional syscall (not entirely sure about that)

Further improvements

  • Some random jitter could be added to avoid a thundering-herd problem (see the sketch after this list)
  • I briefly explored a hybrid approach with a staged backoff, where the first stage uses runtime.Gosched() and only later falls back to time.Sleep, but the results did not look impressive
  • better code layout?
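
For the jitter idea, a purely illustrative sketch (the jitteredBackoff helper is hypothetical and not part of this PR); inside the retry loop, time.Sleep(jitteredBackoff(backoff)) would replace the plain time.Sleep(backoff):

package prometheus

import (
	"math/rand"
	"time"
)

// jitteredBackoff returns base plus up to 50% random jitter, so goroutines
// that lost the CAS at the same moment do not all wake up and retry at
// exactly the same instant (the thundering-herd effect).
// base must be at least 2ns so the jitter range stays positive.
func jitteredBackoff(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base/2)))
}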

@vesari
@ArthurSens
@bwplotka
@kakkoyun

@imorph (Author) commented Oct 26, 2024

The TestSummaryDecay test is failing, but the failure seems unrelated to my changes (maybe I am missing something, but it looks like a different code path).

Most likely it fails because, in the summary struct, observations added via Observe are buffered in hotBuf. When hotBuf reaches its capacity or expires, an asynchronous flush is triggered by calling asyncFlush, which moves the observations to coldBuf and processes them in a separate goroutine. The Write method processes coldBuf synchronously, but does not include observations still sitting in hotBuf or those being processed by the background goroutine. The test rapidly adds observations and periodically calls sum.Write(); because of the asynchronous processing, some observations may not yet be included in the quantile calculations when Write is called, which can lead to random test failures.

But I might be wrong.
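
To illustrate the buffering pattern described above, here is a toy model (not the library code; it only mimics the hotBuf/asynchronous-flush shape) of how a synchronous Write can miss observations that are still buffered or still being flushed:

package toysummary

import "sync"

// toySummary mimics the pattern described above: Observe appends to hotBuf
// and, once the buffer fills, hands it to a background goroutine, while Write
// only sees what that goroutine has already folded into processed.
type toySummary struct {
	mtx       sync.Mutex
	hotBuf    []float64
	processed []float64
}

func (s *toySummary) Observe(v float64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.hotBuf = append(s.hotBuf, v)
	if len(s.hotBuf) >= 500 { // buffer "capacity" reached: flush asynchronously
		cold := s.hotBuf // the hot buffer becomes the cold buffer
		s.hotBuf = nil
		go func() {
			s.mtx.Lock()
			defer s.mtx.Unlock()
			s.processed = append(s.processed, cold...)
		}()
	}
}

// Write returns only processed observations; anything still in hotBuf or in
// an in-flight flush goroutine is missed, which is how a quantile check right
// after a burst of observations can come up short.
func (s *toySummary) Write() []float64 {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return append([]float64(nil), s.processed...)
}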

@sean- commented Oct 29, 2024

Prom histograms have historically been slow relative to other histogram implementations.

https://github.com/sean-/bench-go-histograms/tree/main?tab=readme-ov-file#benchmark-for-go-histogram-implementations

@kakkoyun (Member) commented

I plan to review this PR this weekend. Sorry for the tardiness.
