Track sent items using SmallVec #1657

Draft · wants to merge 11 commits into main

Conversation

martinthomson
Member

DO NOT MERGE THIS - IT MAKES THINGS SLOWER

I thought that this was worth sharing though.

I was looking at the profiles that @KershawChang shared and noticed that write_stream_frame was eating a decent amount of CPU because there was an allocation in there. It took me a while to track it down, but it turns out that this is the first place that we add a RecoveryToken to the Vec of those that we hold in each SentPacket.
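
For illustration, here is a minimal sketch of the behaviour described above (placeholder types, not the actual neqo code): an empty Vec defers its heap allocation until the first push, so the first token added to a packet's token list is where the allocation shows up.

```rust
fn main() {
    // An empty Vec does not allocate; the first push is where the heap
    // allocation happens, which is what showed up in the profile.
    let mut tokens: Vec<u64> = Vec::new();
    assert_eq!(tokens.capacity(), 0);

    tokens.push(1); // heap buffer allocated here
    assert!(tokens.capacity() >= 1);
}
```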

The obvious thing to test in this case is SmallVec. SmallVec is designed for places where you usually add only a few items to a collection. It holds a small inline array into which values are stored; only when you add more items than the array can hold do the values spill onto the heap, just like a normal Vec. This means that if you stay within the array, you avoid a heap allocation, which can save a bunch of work. The cost is a little complexity, the extra inline storage, and the fact that the values need to move if you ever overflow the backing array. So if your prediction is wrong, it is more expensive.
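
As a rough sketch of the idea, assuming the smallvec crate (the RecoveryToken and SentPacket definitions below are simplified stand-ins, not the real neqo types):

```rust
use smallvec::SmallVec;

// Simplified stand-ins for the real types; only the collection choice matters here.
enum RecoveryToken {
    Stream(u64),
    HandshakeDone,
}

struct SentPacket {
    // Up to 4 tokens live inline inside the struct; adding a 5th spills the
    // whole collection onto the heap, just like a plain Vec from the start.
    tokens: SmallVec<[RecoveryToken; 4]>,
}

fn main() {
    let mut pkt = SentPacket { tokens: SmallVec::new() };
    pkt.tokens.push(RecoveryToken::Stream(0));
    pkt.tokens.push(RecoveryToken::HandshakeDone);
    // Still within the inline array, so no heap allocation has happened.
    assert!(!pkt.tokens.spilled());
}
```

As long as a packet carries no more tokens than the inline capacity, the collection never touches the heap; one more token moves everything into a heap buffer.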

I tried this out and it makes performance worse. With a backing array of size 1, the performance change isn't observable, but larger values progressively degrade performance: increasing the array size to 8 makes things worse than the value of 4 I started with. My theory is that the increased size of the base object is the problem. This collection is typically kept in a SentPacket, and our acknowledgement code needs extra time to move the larger struct around.
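
A quick way to see that size effect (the Token type is a made-up stand-in and the exact byte counts are platform-dependent, so this only illustrates the relative growth):

```rust
use std::mem::size_of;

use smallvec::SmallVec;

struct Token([u8; 16]); // hypothetical 16-byte token payload

fn main() {
    // A plain Vec is three words regardless of its contents; SmallVec grows
    // with its inline capacity, and so does any struct that embeds it.
    println!("Vec<Token>:           {} bytes", size_of::<Vec<Token>>());
    println!("SmallVec<[Token; 1]>: {} bytes", size_of::<SmallVec<[Token; 1]>>());
    println!("SmallVec<[Token; 4]>: {} bytes", size_of::<SmallVec<[Token; 4]>>());
    println!("SmallVec<[Token; 8]>: {} bytes", size_of::<SmallVec<[Token; 8]>>());
}
```

Every time a SentPacket moves by value, those extra inline bytes have to be copied along with it, which is where the acknowledgement path would pay the cost.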

The size of the performance regression suggests that there are some ways we could further improve how SentPacket is handled.

Benchmark results
Run multiple transfers with varying seeds
                        time:   [42.395 ms 42.547 ms 42.712 ms]
                        change: [+3.3754% +4.3870% +5.2776%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Run multiple transfers with the same seed
                        time:   [42.513 ms 42.584 ms 42.657 ms]
                        change: [+3.4443% +3.7777% +4.0976%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

@codecov-commenter

codecov-commenter commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 99.09910% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.93%. Comparing base (76630a5) to head (809926f).

❗ Current head 809926f differs from pull request most recent head 167816c. Consider uploading reports for the commit 167816c to get more accurate results.

| Files | Patch % | Lines |
|-------|---------|-------|
| neqo-transport/src/send_stream.rs | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1657      +/-   ##
==========================================
- Coverage   93.08%   92.93%   -0.15%     
==========================================
  Files         117      119       +2     
  Lines       36422    37263     +841     
==========================================
+ Hits        33903    34632     +729     
- Misses       2519     2631     +112     

☔ View full report in Codecov by Sentry.

martinthomson and others added 2 commits February 19, 2024 22:15
Co-authored-by: Lars Eggert <[email protected]>
Signed-off-by: Martin Thomson <[email protected]>
@martinthomson
Member Author

@larseggert, I'm looking at the profiles that were generated here and it looks like these are off the main branch. The flamegraph for transfer shows calls into BTreeMap, but I don't see any way that could happen.

@larseggert
Collaborator

I can't get to it this week. Wonder if the checkout action needs another parameter?

@larseggert
Collaborator

Let's see what the new flamegraphs show (also because LTO is now enabled).


Benchmark results

Performance differences relative to 76630a5.

  • drain a timer quickly time: [395.29 ns 402.55 ns 409.33 ns]
    change: [-0.6787% +0.9859% +2.7112%] (p = 0.26 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1+1 entries
    time: [193.97 ns 194.36 ns 194.79 ns]
    change: [-0.5237% -0.1538% +0.2215%] (p = 0.41 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 3+1 entries
    time: [235.93 ns 236.55 ns 237.18 ns]
    change: [-0.8063% -0.4103% -0.0496%] (p = 0.04 < 0.05)
    Change within noise threshold.

  • coalesce_acked_from_zero 10+1 entries
    time: [235.78 ns 236.59 ns 237.55 ns]
    change: [-0.4030% +3.6386% +13.911%] (p = 0.42 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries
    time: [217.27 ns 222.01 ns 232.88 ns]
    change: [+1.4742% +5.6444% +13.185%] (p = 0.06 > 0.05)
    No change in performance detected.

  • RxStreamOrderer::inbound_frame()
    time: [117.57 ms 117.65 ms 117.73 ms]
    change: [+0.0927% +0.1982% +0.2949%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds
    time: [124.35 ms 124.61 ms 124.88 ms]
    thrpt: [32.031 MiB/s 32.099 MiB/s 32.169 MiB/s]
    change:
    time: [+5.4576% +5.7824% +6.1037%] (p = 0.00 < 0.05)
    thrpt: [-5.7526% -5.4664% -5.1751%]
    💔 Performance has regressed.

  • transfer/Run multiple transfers with the same seed
    time: [124.82 ms 125.00 ms 125.18 ms]
    thrpt: [31.953 MiB/s 32.000 MiB/s 32.047 MiB/s]
    change:
    time: [+5.2953% +5.5022% +5.6981%] (p = 0.00 < 0.05)
    thrpt: [-5.3909% -5.2152% -5.0290%]
    💔 Performance has regressed.

Client/server transfer results

Transfer of 134217728 bytes over loopback.

| Client | Server | CC    | Pacing | Mean [ms]      | Min [ms] | Max [ms] | Relative |
|:-------|:-------|:------|:-------|---------------:|---------:|---------:|---------:|
| msquic | msquic |       |        | 428.3 ± 58.8   | 359.7    | 538.1    | 1.00     |
| neqo   | msquic | reno  | on     | 1921.6 ± 35.0  | 1850.3   | 1960.9   | 1.00     |
| neqo   | msquic | reno  |        | 1917.7 ± 30.3  | 1857.9   | 1959.7   | 1.00     |
| neqo   | msquic | cubic | on     | 1898.8 ± 56.6  | 1775.9   | 1971.2   | 1.00     |
| neqo   | msquic | cubic |        | 1901.9 ± 46.7  | 1811.5   | 1960.3   | 1.00     |
| msquic | neqo   | reno  | on     | 4418.4 ± 170.0 | 4222.4   | 4692.8   | 1.00     |
| msquic | neqo   | reno  |        | 4510.8 ± 232.7 | 4225.0   | 4969.3   | 1.00     |
| msquic | neqo   | cubic | on     | 4475.1 ± 154.6 | 4259.7   | 4672.4   | 1.00     |
| msquic | neqo   | cubic |        | 4434.9 ± 168.7 | 4204.7   | 4705.9   | 1.00     |
| neqo   | neqo   | reno  | on     | 3664.9 ± 172.2 | 3387.1   | 3957.8   | 1.00     |
| neqo   | neqo   | reno  |        | 3805.0 ± 275.7 | 3529.5   | 4374.1   | 1.00     |
| neqo   | neqo   | cubic | on     | 4329.7 ± 141.6 | 4114.5   | 4523.6   | 1.00     |
| neqo   | neqo   | cubic |        | 4372.4 ± 141.9 | 4184.3   | 4553.5   | 1.00     |

⬇️ Download logs
