What
Currently our JS client offers two kinds of smart fetches:
We fetch through DNS, and then if the request completely fails (after 5 seconds) we retry as a fallback using a node returned from the orchestrator.
We immediately race both DNS AND multiple nodes returned from the orchestrator and take whichever request returns a first byte first, cancelling the others (both strategies are sketched below).
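For concreteness, here is a minimal TypeScript sketch of those two existing strategies. Everything named here is an assumption for illustration: `fetchViaDns`, `fetchViaOrchestratorNode`, `getProximateNodes`, and the placeholder URLs are not the client's real API, and `fetch` resolving on response headers stands in for "first byte".

```ts
// Hypothetical helpers standing in for the client's real fetch paths.
// The URLs are placeholders, not real gateway or orchestrator endpoints.
const DNS_GATEWAY = 'https://gateway.example/ipfs/';

function fetchViaDns(cid: string, opts: { signal?: AbortSignal } = {}): Promise<Response> {
  // fetch() resolves once response headers arrive -- used here as a stand-in for "first byte".
  return fetch(`${DNS_GATEWAY}${cid}`, { signal: opts.signal });
}

function fetchViaOrchestratorNode(cid: string, node: string, opts: { signal?: AbortSignal } = {}): Promise<Response> {
  return fetch(`https://${node}/ipfs/${cid}`, { signal: opts.signal });
}

async function getProximateNodes(): Promise<string[]> {
  // Placeholder: in the real client this list comes from the orchestrator.
  const res = await fetch('https://orchestrator.example/nodes/nearby');
  return res.json();
}

// Strategy 1: DNS first; only after a hard 5-second failure do we retry
// against a node returned from the orchestrator.
async function fetchWithFallback(cid: string): Promise<Response> {
  const dns = new AbortController();
  const timeout = setTimeout(() => dns.abort(), 5000);
  try {
    return await fetchViaDns(cid, { signal: dns.signal });
  } catch {
    const [node] = await getProximateNodes();
    return fetchViaOrchestratorNode(cid, node);
  } finally {
    clearTimeout(timeout);
  }
}

// Strategy 2: race DNS against several orchestrator nodes at once and take
// whichever responds first. Every request is actually sent, which is where
// the duplicate traffic comes from.
async function fetchWithRace(cid: string): Promise<Response> {
  const nodes = await getProximateNodes();
  const attempts = [
    (signal: AbortSignal) => fetchViaDns(cid, { signal }),
    ...nodes.map((node) => (signal: AbortSignal) => fetchViaOrchestratorNode(cid, node, { signal })),
  ];
  const controllers = attempts.map(() => new AbortController());
  // Tag each attempt with its index so the losers can be cancelled afterwards.
  const { res, i } = await Promise.any(
    attempts.map((start, idx) => start(controllers[idx].signal).then((r) => ({ res: r, i: idx })))
  );
  controllers.forEach((c, idx) => { if (idx !== i) c.abort(); });
  return res;
}
```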
We propose a third "request hedging" approach:
Initiate a request with DNS
If a time equal to Saturn's P90 TTFB passes without receiving a first byte, start a second request to an orchestrator node, and take whichever returns a first byte first, cancelling the other (sketched below)
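A sketch of the proposed hedging logic, under the same assumptions as the sketch above (hypothetical `fetchViaDns`, `fetchViaOrchestratorNode`, and `getProximateNodes` helpers); `p90TtfbMs` is a hard-coded placeholder, per the Cost section below.

```ts
// Placeholder threshold: for now, a manual copy of Saturn's current P90 TTFB.
const p90TtfbMs = 500;

async function fetchWithHedging(cid: string): Promise<Response> {
  const dns = new AbortController();
  const hedge = new AbortController();

  const dnsAttempt = fetchViaDns(cid, { signal: dns.signal })
    .then((res) => ({ res, via: 'dns' as const }));

  // Start the hedge only if no first byte has arrived within the P90 TTFB window.
  const hedgeAttempt = new Promise<{ res: Response; via: 'orchestrator' }>((resolve, reject) => {
    const timer = setTimeout(async () => {
      try {
        const [node] = await getProximateNodes();
        resolve({
          res: await fetchViaOrchestratorNode(cid, node, { signal: hedge.signal }),
          via: 'orchestrator',
        });
      } catch (err) {
        reject(err);
      }
    }, p90TtfbMs);
    // If the hedge is cancelled before it starts, make sure it never fires.
    hedge.signal.addEventListener('abort', () => clearTimeout(timer));
  });

  // Take whichever returns a first byte first and cancel the other.
  const winner = await Promise.any([dnsAttempt, hedgeAttempt]);
  if (winner.via === 'dns') hedge.abort();
  else dns.abort();
  return winner.res;
}
```

By construction of the P90 threshold, only roughly the slowest ~10% of requests would ever open a second connection, which is why this should avoid the duplicate-traffic problem of the full race while still bounding the fallback delay.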
Why
Our first approach provides a poor experience -- while the fallback prevents a complete failure, it only kicks in after such a long delay (5 seconds) that the user has already had a terrible experience.
We have found our second approach generates a large amount of duplicate traffic -- we've even overloaded the log ingestor a couple of times this way.
This approach essentially aims to improve the fallback experience of our first approach without incurring the problems associated with the second approach.
Cost
To get this done we would need to:
implement the third "request hedging" approach in the JS client, available as an option on the request
for the P90 TTFB threshold, for now we might just hard-code a copy of our current measured value
a future improvement might retrieve this value from the orchestrator along with the list of proximate nodes
deploy and test as an experiment on the ARC network
deploy as the primary strategy in ARC and the service worker
I'm going to push back on something like this: we shouldn't start on it until we have defined and implemented the metric(s) that tell us whether there is a problem, how bad it is, and whether what we do fixes it.
If there's a reasonable hypothesis that this is, in fact, a problem with the service worth prioritizing, then maybe we start this work with a task to get the data to build the case.
At the end of the day this feature (likely) improves tail TTFBs. Do we think that's a high priority investment area right now?
I understand this was suggested as a countermeasure from a post-mortem on node operator error rates.
Can we therefore clarify how much this proposed item addresses reliability and production improvements versus tail performance? I see that the following production problem is being addressed: "duplicate traffic -- we've even overloaded the log ingestor a couple times this way."
The best way to evaluate this would be to assume 30 customers using the service worker (for example, if we open the portal in the near term). Will this cause more production problems and overload the log ingestor even more?