
Add: exponential backoff for CAS operations on floats #1661

Open · wants to merge 4 commits into main

Conversation

@imorph commented Oct 26, 2024

What problem this PR is solving

Hi!
Issues like cockroachdb/cockroach#133306 pushed me to investigate whether the client metrics library can be optimized for CPU and latency, especially under contention. That is what this PR is mostly about.

Proposed changes

Add a form of exponential backoff to the tight CAS loops, which should reduce contention and, as a result, lead to better latencies and lower CPU consumption. This is a well-known trick used in many projects that deal with concurrency.

In addition, this logic was refactored into a single place in the code, because atomic updates of float64 values are needed in several parts of the codebase.
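
For illustration, here is a minimal sketch of what such a CAS-with-backoff helper can look like (shaped after the atomicUpdateFloat seen in the profiles below; the constants are the ones listed under Downsides, and the actual code in the PR may differ in detail):

package prometheus

import (
	"math"
	"sync/atomic"
	"time"
)

// atomicUpdateFloat applies updateFunc to the float64 stored in bits,
// retrying with exponential backoff whenever the CAS loses a race.
func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		initialBackoff = 10 * time.Millisecond  // empirical, see Downsides
		maxBackoff     = 320 * time.Millisecond // empirical, see Downsides
	)
	backoff := initialBackoff
	for {
		loadedBits := atomic.LoadUint64(bits)
		newBits := math.Float64bits(updateFunc(math.Float64frombits(loadedBits)))
		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			return
		}
		// Lost the race: sleep before retrying, doubling the delay up to a cap,
		// so contending goroutines spread out instead of hammering the same
		// cache line in a tight loop.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}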

Results

All test results are from an AWS c7i.2xlarge instance.

main vs proposed for histograms:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                           │ ../../bhOLD.txt │           ../../bhNEW.txt           │
                           │     sec/op      │    sec/op     vs base               │
HistogramWithLabelValues-8      87.81n ±  1%   90.36n ±  1%   +2.90% (p=0.002 n=6)
HistogramNoLabels-8             36.15n ±  1%   39.55n ±  0%   +9.42% (p=0.002 n=6)
HistogramObserve1-8             31.23n ±  1%   32.52n ±  1%   +4.15% (p=0.002 n=6)
HistogramObserve2-8            242.45n ± 16%   86.89n ±  0%  -64.16% (p=0.002 n=6)
HistogramObserve4-8             372.8n ± 14%   175.0n ± 19%  -53.05% (p=0.002 n=6)
HistogramObserve8-8             953.0n ±  1%   415.4n ± 32%  -56.41% (p=0.002 n=6)
HistogramWrite1-8               1.456µ ± 11%   1.454µ ± 12%        ~ (p=0.699 n=6)
HistogramWrite2-8               3.103µ ±  3%   3.125µ ±  7%        ~ (p=0.485 n=6)
HistogramWrite4-8               6.087µ ±  3%   6.041µ ±  4%        ~ (p=0.937 n=6)
HistogramWrite8-8               11.71µ ±  2%   11.90µ ±  3%   +1.61% (p=0.004 n=6)
geomean                         554.5n         434.5n        -21.64%

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │      B/op       │    B/op     vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │    allocs/op    │ allocs/op   vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The changes lead to a slight increase in latency in the single-threaded case, but decrease it significantly under contention.

main vs proposed for summaries:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                         │ ../../bsOLD.txt │           ../../bsNEW.txt           │
                         │     sec/op      │    sec/op     vs base               │
SummaryWithLabelValues-8      264.0n ±  1%   267.6n ±  3%   +1.34% (p=0.017 n=6)
SummaryNoLabels-8             186.6n ±  1%   186.6n ±  0%        ~ (p=0.920 n=6)
SummaryObserve1-8             22.57n ±  1%   23.66n ±  0%   +4.83% (p=0.002 n=6)
SummaryObserve2-8            192.75n ± 31%   52.13n ±  1%  -72.95% (p=0.002 n=6)
SummaryObserve4-8             297.9n ± 27%   126.7n ± 16%  -57.47% (p=0.002 n=6)
SummaryObserve8-8             937.5n ±  5%   319.1n ± 48%  -65.96% (p=0.002 n=6)
SummaryWrite1-8               135.1n ± 32%   134.1n ±  9%        ~ (p=0.485 n=6)
SummaryWrite2-8               319.9n ±  3%   327.4n ±  4%        ~ (p=0.240 n=6)
SummaryWrite4-8               665.7n ±  5%   664.0n ±  9%        ~ (p=0.699 n=6)
SummaryWrite8-8               1.402µ ±  9%   1.318µ ± 11%        ~ (p=0.180 n=6)
geomean                       274.3n         198.6n        -27.59%

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │      B/op       │    B/op     vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │    allocs/op    │ allocs/op   vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The same holds for summaries.

Some additional illustrations: the gap between the two implementations widens as the number of parallel/contending goroutines grows.
(charts attached to the PR)

A side-by-side comparison of the backoff and no-backoff implementations is in the atomic_update_test.go file (a sketch of the general benchmark shape follows the numbers below):

BenchmarkAtomicUpdateFloat_SingleGoroutine-8   	94284844	       12.59 ns/op
BenchmarkAtomicNoBackoff_SingleGoroutine-8     	97815223	       12.24 ns/op
BenchmarkAtomicUpdateFloat_1Goroutine-8        	91764830	       15.45 ns/op
BenchmarkAtomicNoBackoff_1Goroutine-8          	13282502	       90.47 ns/op
BenchmarkAtomicUpdateFloat_2Goroutines-8       	67590978	       19.70 ns/op
BenchmarkAtomicNoBackoff_2Goroutines-8         	13490074	       89.20 ns/op
BenchmarkAtomicUpdateFloat_4Goroutines-8       	88974640	       14.74 ns/op
BenchmarkAtomicNoBackoff_4Goroutines-8         	13480230	       89.29 ns/op
BenchmarkAtomicUpdateFloat_8Goroutines-8       	85171494	       13.94 ns/op
BenchmarkAtomicNoBackoff_8Goroutines-8         	13474435	       89.39 ns/op
BenchmarkAtomicUpdateFloat_16Goroutines-8      	80687557	       14.96 ns/op
BenchmarkAtomicNoBackoff_16Goroutines-8        	13556298	       90.21 ns/op
BenchmarkAtomicUpdateFloat_32Goroutines-8      	71862175	       14.98 ns/op
BenchmarkAtomicNoBackoff_32Goroutines-8        	13522851	       88.90 ns/op
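
For reference, a contended micro-benchmark for such a helper can be structured roughly like this (a sketch only, reusing the atomicUpdateFloat helper sketched earlier; the actual atomic_update_test.go may set things up differently):

package prometheus

import (
	"sync"
	"testing"
)

// Several goroutines repeatedly update the same float64 through the CAS
// helper, so most iterations race with each other.
func BenchmarkAtomicUpdateFloatContended(b *testing.B) {
	const goroutines = 8
	var bits uint64 // shared float64 stored as raw bits
	var wg sync.WaitGroup

	b.ResetTimer()
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < b.N/goroutines; i++ {
				atomicUpdateFloat(&bits, func(f float64) float64 { return f + 1.0 })
			}
		}()
	}
	wg.Wait()
}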

CPU profiles

for go test -bench='BenchmarkHistogramObserve*' -run=^# -count=6
old top:

      flat  flat%   sum%        cum   cum%
     0.56s  0.34%  0.34%    166.28s 99.93%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.57s  0.34%  0.68%    165.71s 99.59%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
    32.65s 19.62% 20.30%    128.03s 76.94%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
    56.26s 33.81% 54.11%     95.38s 57.32%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
    29.38s 17.66% 71.77%     39.12s 23.51%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     0.25s  0.15% 71.92%     37.11s 22.30%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
    33.38s 20.06% 91.98%     36.86s 22.15%  sort.SearchFloat64s (inline)
     9.10s  5.47% 97.45%      9.10s  5.47%  math.Float64frombits (inline)
     2.05s  1.23% 98.68%      3.46s  2.08%  sort.Search
     1.41s  0.85% 99.53%      1.41s  0.85%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

new top:

      flat  flat%   sum%        cum   cum%
     0.48s  1.94%  1.94%     24.61s 99.31%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.41s  1.65%  3.59%     24.13s 97.38%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
     7.89s 31.84% 35.43%     19.77s 79.78%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
        5s 20.18% 55.61%     11.88s 47.94%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
         0     0% 55.61%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     5.99s 24.17% 79.78%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicUpdateFloat
     0.18s  0.73% 80.51%      3.95s 15.94%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
     0.51s  2.06% 82.57%      3.77s 15.21%  sort.SearchFloat64s (inline)
     2.01s  8.11% 90.68%      3.26s 13.16%  sort.Search
     1.25s  5.04% 95.72%      1.25s  5.04%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

Downsides

Everything comes at a price; in this case:

  • the code is slightly more complex
  • some magic constants are introduced (10ms and 320ms, mostly a result of empirical testing)
  • single-threaded performance is weaker by 4%-5%
  • time.Sleep may introduce an additional syscall (not entirely sure about that)

Further improvements

  • Some random jitter could be added to avoid a thundering-herd problem (see the sketch after this list)
  • I briefly explored a hybrid approach with a staged backoff, where the first stage uses runtime.Gosched() and only later falls back to time.Sleep, but the results did not look impressive
  • better code layout?
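
For the jitter idea, a purely illustrative sketch (the jitteredBackoff helper is hypothetical and not part of this PR); inside the retry loop, time.Sleep(jitteredBackoff(backoff)) would replace the plain time.Sleep(backoff):

package prometheus

import (
	"math/rand"
	"time"
)

// jitteredBackoff returns base plus up to 50% random jitter, so goroutines
// that lost the CAS at the same moment do not all wake up and retry at
// exactly the same instant (the thundering-herd effect).
// base must be at least 2ns so the jitter range stays positive.
func jitteredBackoff(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base/2)))
}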

@vesari
@ArthurSens
@bwplotka
@kakkoyun

@imorph (Author) commented Oct 26, 2024

The TestSummaryDecay test is failing, but the failure seems unrelated to my changes (maybe I am missing something, but it looks like a different code path).

Most likely it fails because, in the summary struct, observations added via Observe are buffered in hotBuf. When hotBuf reaches its capacity or expires, an asynchronous flush is triggered by calling asyncFlush, which moves the observations to coldBuf and processes them in a separate goroutine. The Write method processes coldBuf synchronously, but does not include observations still sitting in hotBuf or those being processed by the background goroutine. The test rapidly adds observations and periodically calls sum.Write(); because of the asynchronous processing, some observations may not yet be included in the quantile calculations when Write is called, which can lead to random test failures.

But I might be wrong.
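
To illustrate the buffering pattern described above, here is a toy model (not the library code; it only mimics the hotBuf/asynchronous-flush shape) of how a synchronous Write can miss observations that are still buffered or still being flushed:

package toysummary

import "sync"

// toySummary mimics the pattern described above: Observe appends to hotBuf
// and, once the buffer fills, hands it to a background goroutine, while Write
// only sees what that goroutine has already folded into processed.
type toySummary struct {
	mtx       sync.Mutex
	hotBuf    []float64
	processed []float64
}

func (s *toySummary) Observe(v float64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.hotBuf = append(s.hotBuf, v)
	if len(s.hotBuf) >= 500 { // buffer "capacity" reached: flush asynchronously
		cold := s.hotBuf // the hot buffer becomes the cold buffer
		s.hotBuf = nil
		go func() {
			s.mtx.Lock()
			defer s.mtx.Unlock()
			s.processed = append(s.processed, cold...)
		}()
	}
}

// Write returns only processed observations; anything still in hotBuf or in
// an in-flight flush goroutine is missed, which is how a quantile check right
// after a burst of observations can come up short.
func (s *toySummary) Write() []float64 {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return append([]float64(nil), s.processed...)
}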

@sean- commented Oct 29, 2024

Prom histograms have historically been slow relative to other histogram implementations.

https://github.com/sean-/bench-go-histograms/tree/main?tab=readme-ov-file#benchmark-for-go-histogram-implementations

@kakkoyun (Member) commented

I plan to review this PR this weekend. Sorry for the tardiness.
