
Budgeting with a conditional check against an always ready stream blocks the condition #2542

Open
Nemo157 opened this issue May 17, 2020 · 23 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. I-hang Program never terminates, resulting from infinite loops, deadlock, livelock, etc. M-coop Module: tokio/coop


@Nemo157 (Contributor) commented May 17, 2020

Version

0.2.20

Platform

playground

Description

Taking the first correct example from the tokio::select! docs and running it in the playground results in an infinite hang (playground with the println commented out).

The version actually tested in the docs includes an additional time::delay_for(Duration::from_millis(10)).await; inside some_async_work, which is enough work to avoid this issue.

This has the same underlying problem as rust-lang/futures-rs#2157: the budgeting is applied only to the conditional and blocks it from ever becoming true, while the actual work keeps running because it is not subject to the budgeting.

@Darksonn added the A-tokio, C-bug, I-hang, and M-coop labels on May 17, 2020
@Darksonn (Contributor):
I guess select! should increment the budget to solve this?

@Nemo157 (Contributor, Author) commented May 17, 2020

That doesn't fix any generic combinators from outside the Tokio ecosystem.

@Darksonn (Contributor) commented May 17, 2020

If you put an infinite loop into the async fn, there is not really much Tokio can do? If you block the executor, the executor is blocked.

This is exactly the issue that coop was introduced to help with.

@Darksonn (Contributor):
Hmm, I guess coop does interact here, but still, this is broken for the same reason that you should not use block_on inside futures. The issues about FuturesUnordered and Shared were not like this one and would require a different fix.

@carllerche (Member):
I'm not 100% sure what the best strategy is here. Leaving select! out of budgeting was an explicit decision. The reasoning is that the sub-calls will interact with the budget themselves. The example here runs a no-op, which could be representative of some arbitrary CPU computation.

That doesn't fix any generic combinators from outside the Tokio ecosystem.

At this point, resources external to Tokio are expected to provide their own budgeting strategies.

@carllerche (Member):
For reference, the example is:

use tokio::time::{self, Duration};

async fn some_async_work() {
    // do work
}

#[tokio::main]
async fn main() {
    let mut delay = time::delay_for(Duration::from_millis(50));

    while !delay.is_elapsed() {
        tokio::select! {
            _ = &mut delay, if !delay.is_elapsed() => {
                println!("operation timed out");
            }
            _ = some_async_work() => {
                println!("operation completed");
            }
        }
    }
}

@carllerche (Member):
One option that I can think of is combinators could check to ensure that if Ready is returned, the budget has been decremented.

@jonhoo (Contributor) commented May 17, 2020

I wonder how much of a problem this is in practice? How many futures running on a tokio runtime contain no calls into tokio resources? I think I agree with the analysis above that this is really a non-future-friendly busy loop in disguise, and is a disjoint problem from the one coop was added to solve.

That said, of course, it would be nice to be able to do the right thing in as many cases as we can, and maybe coop can be used to help with this particular case. It's a weird one though; the user is selecting in a loop over a future that is always ready and polls no resource. To me, this seems like something that wouldn't happen in real code. I feel like either the future would do something (in which case the problem goes away) or they wouldn't be using select! (in which case adding budgeting to select wouldn't fix the problem).

Or, phrased differently, do we think this particular construct (select! with a non-polling future) actually happens, and that fixing select! for that particular instance adds meaningful value?

@Darksonn (Contributor) commented May 17, 2020

Well, if you want a more realistic case where this would happen, consider this:

use tokio::time::{self, Duration};
use futures::channel::mpsc::channel;
use futures::stream::StreamExt;

#[tokio::main]
async fn main() {
    let mut delay = time::delay_for(Duration::from_millis(50));

    let (send, recv) = channel::<()>(10);

    drop(send);
    let mut recv = recv.fuse();

    loop {
        tokio::select! {
            _ = &mut delay => {
                println!("operation timed out");
                break;
            }
            msg = recv.next() => {
                if let Some(()) = msg {
                    println!("operation completed");
                }
            }
        }
    }
}

playground

I just had someone make a similar mistake over on Discord, albeit that was with futures' select! macro, and the next() was on an empty FuturesUnordered.

@jonhoo (Contributor) commented May 17, 2020

Thanks, that example is helpful. I feel like it only goes to prove my point though. If the user uses futures::select (not tokio::select), then there's nothing we can do about this. If the user uses a tokio channel, then the problem already goes away because of coop. So the question here is about whether we specifically want to implement a fix for when people use tokio::select with a future that never yields and which uses no tokio resources.

@Nemo157 (Contributor, Author) commented May 17, 2020

So, the context in which I noticed this was while trying to write an example of what would happen with some utilities during starvation for rust-lang/futures-rs#2135. That resulted in me noticing that this would hang:

let stream = futures::stream::select_all(vec![
    futures::stream::repeat(0).boxed(),
    tokio::time::interval(Duration::from_millis(10)).map(|_| 1).boxed(),
    tokio::time::interval(Duration::from_millis(100)).map(|_| 2).boxed(),
]);
let mut counts = [0, 0, 0];
stream
    .take_until(tokio::time::delay_for(Duration::from_secs(1)))
    .for_each(|val| { counts[val] += 1; async {} }).await;

Which then led to rust-lang/futures-rs#2157, and eventually to this issue as I was trying to find an example using only Tokio utilities (take_until is basically the equivalent of a looped select! that breaks when the future completes).

@Matthias247 (Contributor):
These would all run endlessly with and without budgeting, right? The idea is that budgeting improves fairness in some situations, but it cannot fix all of these.

In the synchronous world we would call these things deadlocks (or livelocks), and there is no automagical mitigation against those.

The examples could all be "fixed" by running the timer on a thread other than the current executor thread. But I don't think that's desirable, since it would decrease performance. async/await is about reaching the highest performance by explicitly opting into a harder, cooperative programming model. If you want that - you have to live within the constraints of the model.

I actually don't think the described issues are too bad - you will immediately observe them during a debug run. What would be far more painful is an app running into an endless blocking situation later, during normal operation. Do we have any examples where this could happen? Obviously we can easily build a contrived example, but that isn't helpful for determining how serious the issue might be.

I also think there might be some potential for new runtime diagnostics in tokio, which could e.g. detect blocked executor threads and tasks. That could at least help to point out these issues in a live application.

@Darksonn (Contributor):
These would all run endlessly with and without budgeting, right? The idea is that budgeting improves fairness in some situations, but it cannot fix all of these.

Without budgeting, polling the timer after the timeout has expired returns Poll::Ready, so in that case it would only block the thread until the timeout. It would still block the thread until then of course.

I guess the idea is that budgeting would make it more comparable to a

while now() < timeout {
    tokio::task::yield_now().await;
}

so at least other tasks could run, whenever it yields.

@Matthias247 (Contributor):
Without budgeting, polling the timer after the timeout has expired returns Poll::Ready

That will depend on how the timer is implemented. It might only change its state if the timer handler is running inside the event loop - which won't happen if the code never yields to the loop.

It might currently be implemented to also check the time on each poll - but I would not bet on it always being that way.

@Darksonn (Contributor):
Yes of course, and in some sense that is what we are experiencing here. If you run it with pre-coop Tokio, it does complete, because it happened to have such an implementation back then.

@jonhoo (Contributor) commented May 18, 2020

I'm definitely conflicted here. I'm not opposed to finding a way to have select! avoid infinite loops, but I also feel like it's the "wrong" solution. It would only work in very specific circumstances; for example, it would not fix @Nemo157's code in #2542 (comment). In that code, I think the issue actually lies in take_until, which (presumably) keeps polling the stream while it's available before checking the until future?

@Nemo157 (Contributor, Author) commented May 18, 2020

No, take_until will check the future before each poll of the underlying stream, but there's no way for Delay to indicate to take_until the difference between its initial "I'm not ready yet" Pendings and the later "you need to yield to the executor before I can become ready" Pendings.

@jonhoo (Contributor) commented May 18, 2020

Ah, sorry, yes, you're right. I think the issue here then is actually with for_each. It is basically demonstrating a variant of the issue that FuturesUnordered had — if neither the stream nor the closure's future ever yields Pending, then it never yields to the executor, which is not okay. I think the fix here is the same as rust-lang/futures-rs#2049 — to change for_each so that it forces periodic yields if it hasn't yielded for a while. This, to me, is an indicator that rust-lang/futures-rs#2053 isn't framed quite right — it's not just about sub-executors, it's about any future that may never yield Pending.

I think it's important to stress here that I don't think we should place the onus on implementors of Stream to yield Pending if they're continuously yielding Ready - either of those yields control back to the caller. It seems most reasonable to me that the top-level future is the one that has to ensure it occasionally yields, even if all of its dependencies are continuously providing more data.

@carllerche (Member):
Is there any action to take here?

@jonhoo (Contributor) commented Mar 21, 2021

Honestly, I think the action here is to standardize coop and have every resource call into it 😅 But barring that, I think the fix is to at least ensure that any combinators that we have control over yield periodically.

@Arnavion (Contributor) commented Jun 2, 2021

I disagree that forcing periodic yields is the way this should be solved. Periodically yielding means the entire call frame has to unwind all the way to the executor and back every time.

Currently tokio::time::Sleep is lying in its Future impl - Future::poll is supposed to be the way the future makes forward progress, but Sleep doesn't actually do this and relies on another task to mark it as completed.

I hit this problem when I was using a Sleep to enforce my own equivalent of budgeting:

fn poll(self, cx) {                                // sketch; signature simplified
    loop {
        if let Ready(()) = self.timeout.poll(cx) { // self.timeout is a Sleep
            self.timeout.reset(...);               // arm the next timeout
            yield_now().poll(cx);                  // register an immediate wakeup
            return Poll::Pending;                  // yield to the executor
        }

        let work = ready!(self.stream_of_work.poll_next(cx));
        process(work);
    }
}

This was in a single-threaded executor, and under program load stream_of_work would have sufficient time to become ready again by the time it was poll()ed again.

I would like to have a way of being in control of when the cooperative yielding happens rather than letting tokio's budget system handle it. Making the budget system public doesn't help either - my code can encode budgeting strategies that suit my needs, rather than just the single number that tokio's system uses.

Alternatively, if every timer checking for elapsed-ness on every poll is a perf concern, it would be nice to have some way for the user to drive the timer wheel forward from within the user task. E.g., given some fn tokio::time::make_progress() -> (), the code above could call that every N iterations.

@Darksonn (Contributor) commented Jun 2, 2021

@Arnavion It seems like your use-case would be better served by storing a std::time::Instant as the timeout and comparing it to Instant::now() in each iteration. The Sleep type is intended for the use-case where you want to receive a wakeup at some point in the future, but you are using it exclusively for checking whether a duration has elapsed — you don't even need the wakeups that the runtime is sending you when the timeout elapses since the thing you're actually waiting for is stream_of_work.poll_next.

@Arnavion (Contributor) commented Jun 2, 2021

but you are using it exclusively for checking whether a duration has elapsed

No. I simplified the example a bit too much, perhaps, but there's more work being done when the timeout elapses than just yielding and resetting the timer. So there's a reason to have the timer register a waker when stream_of_work also returns Pending.

6 participants