Problem Statement
I want to horizontally scale the OTel Collector and have the SDK distribute requests (somewhat evenly) across the collector instances.
I have a headless Service for my collector that returns all instance addresses when queried via DNS:
$ dig otelcol
;; ANSWER SECTION:
otelcol. 600 IN A 172.22.0.5
otelcol. 600 IN A 172.22.0.8
However, because the Go HTTP client used by this package keeps the TCP connection alive, the SDK sticks to the first address it connected to until that address becomes unreachable.
This also applies to regular k8s Services: once the TCP connection is open, no further load balancing happens on the k8s side.
There is golang/go#34511 requesting this for the standard library, but no real progress has been made since 2019.
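To make the stickiness easy to reproduce, here is a small, self-contained sketch (plain net/http, not the SDK; the otelcol:4318 target is assumed to be the headless Service above on the default OTLP/HTTP port) that logs which backend each request actually hits. With the default transport, every request after the first reports the same remote address with reused=true.

```go
// Demonstrates the stickiness described above: with the default
// http.Transport, the idle connection to the first resolved address is
// reused, so every request reports the same remote address.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	client := &http.Client{} // default transport: keep-alive, connection pooling

	for i := 0; i < 5; i++ {
		req, err := http.NewRequest(http.MethodGet, "http://otelcol:4318/", nil)
		if err != nil {
			fmt.Println("building request failed:", err)
			return
		}

		// Record which connection the request was actually sent on.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				fmt.Printf("request %d -> %s (reused=%v)\n",
					i, info.Conn.RemoteAddr(), info.Reused)
			},
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```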
Proposed Solution
Instead of relying on the HTTP client to determine the endpoint from the DNS results, do the following (a rough sketch follows the list):
- manually keep a list of endpoints
- refresh it every n seconds (once per minute?)
- for each write request, choose an IP from that list on a round-robin / random basis
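A rough sketch of what this could look like, purely for illustration; the roundRobinResolver type, its methods, and the refresh interval are made up here and are not an existing SDK API:

```go
// Sketch of the proposed client-side resolution: keep a list of
// endpoints, refresh it periodically via DNS, and hand out addresses
// round-robin. All names here are illustrative, not a real SDK API.
package resolver

import (
	"context"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

type roundRobinResolver struct {
	host    string
	mu      sync.RWMutex
	addrs   []string
	counter atomic.Uint64
}

func newRoundRobinResolver(ctx context.Context, host string, refresh time.Duration) *roundRobinResolver {
	r := &roundRobinResolver{host: host}
	r.refresh(ctx)

	// Refresh the endpoint list in the background, e.g. once per minute.
	go func() {
		ticker := time.NewTicker(refresh)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				r.refresh(ctx)
			}
		}
	}()
	return r
}

func (r *roundRobinResolver) refresh(ctx context.Context) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, r.host)
	if err != nil || len(addrs) == 0 {
		return // keep the previous list on lookup failure
	}
	r.mu.Lock()
	r.addrs = addrs
	r.mu.Unlock()
}

// Pick returns the next address in round-robin order; each write
// request would dial the address returned here.
func (r *roundRobinResolver) Pick() (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if len(r.addrs) == 0 {
		return "", false
	}
	n := r.counter.Add(1)
	return r.addrs[int(n-1)%len(r.addrs)], true
}
```

An exporter could then call Pick for each write (or wire it into an http.Transport's DialContext) so consecutive requests land on different collector instances.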
If this is deemed acceptable, I am happy to contribute this functionality.
Alternatives
Disable Keepalive
By disabling keep-alives (connection reuse), a new connection is made for every request, which includes a DNS lookup.
I confirmed this works by tinkering with SDK internals, but it is inefficient.
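For reference, a minimal sketch of what disabling keep-alives means at the net/http level; the SDK does not currently expose a way to pass such a transport to its exporters, so this is illustrative only (the otelcol:4318 endpoint is assumed):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// No connection reuse: every request dials anew (and re-resolves DNS),
	// so requests spread across the headless Service's A records, at the
	// cost of a dial + lookup per request.
	client := &http.Client{
		Transport: &http.Transport{
			DisableKeepAlives: true,
		},
		Timeout: 10 * time.Second,
	}

	resp, err := client.Get("http://otelcol:4318/") // hypothetical collector endpoint
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
}
```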
Use custom RoundTripper
In the Go issue, the use of https://github.com/CAFxX/balancer is suggested.
This, however, leads to a DNS lookup on every request, which is undesirable.
Have users deploy server-side load balancers
Of course, this can be fixed server-side by deploying another layer of load-balancing proxies (nginx, etc.) in front of the OTel Collector.
This greatly complicates the pipeline setup, though, as one might end up with three layers (HTTP load balancing, a stateless collector tier for sticky OTLP load balancing, and a stateful collector tier for processing).
Also, if we start doing that, it's a feature we're introducing to a stable component. We won't be able to remove it when/if Go fixes this and it isn't necessary anymore.
Using a custom round tripper/transport is also not going to be possible for now. See #2632.
Disabling keep-alives could be a valid option to add to the HTTP exporters' clients.