next: Changes queued for the v2016.05 release #888

Merged
merged 142 commits from next into master on May 12, 2016

Conversation

@lukego (Member) commented Apr 20, 2016

No description provided.

petebristow and others added 30 commits December 31, 2015 12:41
Link from the top-level README.md to make the current state of
platform support more apparent to users.

Inspired by #701.
Message: master branch supports Linux/x86-64. Your branch can support
any platform you like.
Previously a warning but promoted on suggestion from @wingo.
This was written for @alexandergall's ALX appliance, but I believe he
has replaced it with other code.
This enables assembly language code to access freelists in a fairly
straightforward manner.
braindump on performance parameters
This new document explains the three core Snabb data structures:
packet, link, and app. For now this is a stand-alone markdown file but
it could be incorporated into the introduction in the manual.

This is imported from lukego/blog#11.
Perftune - performance tuning notes
Removed a stray hyphen before the filename extension.
This is a simple and permissive trademark license taken directly from
the PostgreSQL one:

    https://wiki.postgresql.org/wiki/Trademark_Policy
@eugeneia (Member) commented May 2, 2016

Seems like the performance regression is real. Source: #896

@lukego (Member Author) commented May 2, 2016

Great work, SnabbBot!!! I will investigate now.

@lukego (Member Author) commented May 2, 2016

@eugeneia I am having trouble running this benchmark in the Docker environment on Interlaken:

[luke@interlaken:~/git/snabb/src]$ scripts/dock.sh 'sudo SNABB_PCI0=01:00.0 program/snabbnfv/selftest.sh bench'                                                                               
Defaulting to SNABB_TELNET0=5000
Defaulting to SNABB_TELNET1=5001
USING program/snabbnfv/test_fixtures/nfvconfig/test_functions/other_vlan.ports
Defaulting to MAC=52:54:00:00:00:
Defaulting to IP=fe80::5054:ff:fe00:
Defaulting to GUEST_MEM=512
Defaulting to HUGETLBFS=/hugetlbfs
Defaulting to QUEUES=1
Defaulting to QEMU=/root/.test_env/qemu/obj/x86_64-softmmu/qemu-system-x86_64
failed to connect to server
failed to connect to server
Waiting for VM listening on telnet port 5000 to get ready...^C^C^C [TIMEOUT]

Any hints as to what might be wrong? I compiled Snabb inside dock.sh. This is on branch next.

@lukego (Member Author) commented May 2, 2016

Oh, hm, there seem to be two issues. First, I used the wrong PCI address. Second, it seems that I need to fully qualify the PCI address for some reason.

This now does work on interlaken:

scripts/dock.sh 'sudo SNABB_PCI0=0000:81:00.0 program/snabbnfv/selftest.sh bench'

Groovy. I have a test env for debugging the performance issue now.

@lukego (Member Author) commented May 2, 2016

@eugeneia Good catch on s/2015/2016/ in the title of the PR, but to me it feels like the PR title is owned by the submitter and it seems intrusive for somebody else to change it directly rather than point out the problem.

@eugeneia (Member) commented May 2, 2016

@lukego You are right, sorry.

@lukego (Member Author) commented May 2, 2016

Current theory is that the performance regression sneaked in on #882 (increase descriptors from 512 to 2048). This branch passed CI but only barely: the log shows loss of around 10% on the iperf test. Digging in...

@lukego (Member Author) commented May 2, 2016

@eugeneia I think our tests on interlaken and lugano-1 are colliding at the moment. I was naughty and did not use lock but am now. Can you hop onto #lab anyway so that we can coordinate?

@lukego (Member Author) commented May 2, 2016

... or maybe I am mistakenly thinking they are yours because I see eugeneia in `docker ps`, but that only shows who created the image, not who is running the container :)

@lukego (Member Author) commented May 6, 2016

Sorry about the wait. Currently I am concerned about the lower results we are seeing on the iperf-1500 benchmark on davos and I would like to resolve or at least understand that before releasing from this branch.

I see two related issues really:

  • new: This branch seems to have lower average performance than master.
  • old: Both branches have high variability in benchmark scores.

These issues interact: the variability makes it harder to compare results between branches.

I am now approaching this in the "scientific testing" style by collecting more performance samples and comparing them using R scripts. I am becoming incrementally less ad-hoc in my R scripting with support from @domenkozar and http://hydra.snabb.co/. Here is what I have so far:

  • Report written in Rmarkdown and then built and published by Hydra.
  • Source to the report that lives on a Github gist. There are two files: data.csv with the data available so far and rmd.nix that builds the report.
  • Hydra Jobset that builds and archives this report every time I update the gist. If you click your way through then you will come to the HTML report.

The immediate next step is to experiment with a few changes to next and bring them into this report to see if they restore performance.

Going forward I am very interested in having the CI measure variability and reduce it over time. I cannot account for it at the moment. I am keen to make the timeline viewer the tool of choice for post-mortem analysis of runs with unexpected results, e.g. the slowest of the samples here.

@lukego (Member Author) commented May 6, 2016

So, based on the idea that performance decreased because 382827d increased the NIC descriptor ring size from 512 packets to 2048 packets, I tried a compromise by patching next to use 1024 packets instead.

See Report for full details. Overview below.

Using 1024 packets in the descriptor ring seems to partly resolve the performance regression:

[plot omitted: benchmark results per branch]

The average is still lower, but Tukey's test tells us that this difference is not statistically significant. Specifically, with the data we have, the best we can say with 95% confidence is that next1024 performance is between 0.5 Gbps slower and 0.09 Gbps faster than master.

This is seen in the table excerpt below, which shows the average difference (diff), the lower bound of the confidence interval (lwr), and the upper bound of the confidence interval (upr):

##                       diff        lwr         upr     p adj
## next1024-master -0.2042000 -0.5054358  0.09703584 0.2466440
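
For reference, a table of this shape is what R's one-way ANOVA followed by Tukey's HSD produces. A minimal sketch, assuming a data frame with columns branch and gbps (not the actual report script):

```r
# Minimal sketch, not the actual report code: compare benchmark scores
# across branches with one-way ANOVA, then compute Tukey HSD intervals.
samples <- read.csv("data.csv", strip.white = TRUE)
samples$branch <- factor(samples$branch)  # aov() needs a factor grouping

# TukeyHSD() yields the diff/lwr/upr columns and adjusted p-values
# shown in the excerpt above.
fit <- aov(gbps ~ branch, data = samples)
TukeyHSD(fit, conf.level = 0.95)
```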

So it is easy to believe that there is still a modest performance reduction on this benchmark, but the noise from uncontrolled variation means that a lot of testing would be required to confirm it. This is the point where I start to be more interested in understanding the variation itself.

I also briefly checked packetblaster to see if it still works okay once we reduce the default descriptor ring size from 2048 to 1024, and it seems basically okay to me. Excerpt from davos with two ports:

03:00.0 TXDGPC (TX packets)     14,880,151      GOTCL (TX octets)       952,305,792
03:00.1 TXDGPC (TX packets)     14,880,119      GOTCL (TX octets)       952,353,472
03:00.0 TXDGPC (TX packets)     14,878,160      GOTCL (TX octets)       952,131,840
03:00.1 TXDGPC (TX packets)     14,877,169      GOTCL (TX octets)       952,112,704
03:00.0 TXDGPC (TX packets)     14,880,168      GOTCL (TX octets)       952,328,704
03:00.1 TXDGPC (TX packets)     14,880,166      GOTCL (TX octets)       952,314,944
03:00.0 TXDGPC (TX packets)     14,880,096      GOTCL (TX octets)       952,339,648
03:00.1 TXDGPC (TX packets)     14,880,561      GOTCL (TX octets)       952,360,768
03:00.0 TXDGPC (TX packets)     14,880,600      GOTCL (TX octets)       952,359,488
03:00.1 TXDGPC (TX packets)     14,880,186      GOTCL (TX octets)       952,363,392
03:00.0 TXDGPC (TX packets)     14,879,073      GOTCL (TX octets)       952,247,616
03:00.1 TXDGPC (TX packets)     14,879,371      GOTCL (TX octets)       952,212,800
03:00.0 TXDGPC (TX packets)     14,880,556      GOTCL (TX octets)       952,372,800
03:00.1 TXDGPC (TX packets)     14,880,554      GOTCL (TX octets)       952,375,296

So the action plan that comes to my mind is:

  1. I push the num_descriptors=1024 default value as a compromise and we ship that in v2016.05 if SnabbBot and @eugeneia are okay with that.
  2. Over the 2016.06 development cycle we start using R+Hydra to measure the variation in CI results.
  3. We make an ongoing effort to both minimize variation and increase average performance.

Though this is negotiable if other people have strong opinions.

This value is probably not perfectly optimal. See discussion:
#888 (comment)
@lukego (Member Author) commented May 6, 2016

Relatedly:

SnabbBot reports low scores for the iperf benchmark mainly for three reasons:

  • The CPU runs at a low frequency (1.8 GHz).
  • The CPU does not support AVX2, so it falls back to the SSE2 checksum.
  • The SSE checksum is slower in hardware and also not entirely optimized in our software.

Merging the new checksum routine from #899 may well provide a boost to all platforms including these pre-Haswell machines that lack AVX2. Could be something for v2016.06.

@lukego (Member Author) commented May 8, 2016

I did a little more R hacking. I am really excited to be starting to "productionize" some of this performance analysis to become part of the release process. This time it is only the iperf-1500 benchmark but this should be easy to extend to cover all benchmarks and run them all automatically.

The new report has updated visualizations, wikipedia links explaining how to interpret each plot, and my own interpretation of the results. The Hydra build publishes everything needed to reproduce the report.

TLDR pretty picture:

[plot omitted: updated benchmark visualization]

@domenkozar (Member) commented:

@lukego nice. Wouldn't it be better to remove some outliers before plotting results?

@lukego (Member Author) commented May 9, 2016

@domenkozar Good question. I would like to understand what the outliers mean first. Is it that we have uncontrolled background activities on the test server and need to improve our scheduling? Is it that we have a problem in our test scripts that fails to control something important? Is it that we have a bug in our software that causes the JIT to generate the wrong code in certain instances? It's possible that the outliers are boring but it is also possible that they are the most important points of all.

The best strategy I have in mind would be if each blue dot becomes a hyperlink to a Hydra build for that one specific test case. Then we can click on data points to zoom in on the best/worst/middle results and understand what is different between the runs.

Can we set up Hydra to publish each individual test run on its own page? R will be able to do the hyperlinking if we include some identifying information in the CSV file and use the SVG backend for the graph.

I am imagining the CSV file we want Hydra to produce for R will be something like...

BRANCH, COMMIT,  BENCHMARK,  RUN, SCORE, SERVER, URL
master, adfa4b7, iperf-1500, 1,   5.0,   davos,  https://hydra.snabb.co/build/445
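
As a rough sketch of the plotting side, assuming a CSV with the columns above (file name and styling are placeholders, not a final design):

```r
# Hypothetical sketch: read the per-run CSV from Hydra and plot scores
# per branch with R's SVG device. File name and styling are assumptions.
runs <- read.csv("runs.csv", strip.white = TRUE)

svg("iperf-1500.svg")
stripchart(SCORE ~ BRANCH, data = runs, vertical = TRUE,
           method = "jitter", pch = 19, col = "blue",
           main = "iperf-1500", ylab = "Gbps")
dev.off()
# The URL column is then available for attaching a hyperlink to each
# data point, e.g. by post-processing the SVG output.
```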

Meanwhile... I added a section to the report with some numeric statistics including median and quartiles that are not too sensitive to outliers:

## branch: master
##       branch        gbps      
##  master  :50   Min.   :2.310  
##  next    : 0   1st Qu.:4.883  
##  next1024: 0   Median :5.295  
##                Mean   :5.089  
##                3rd Qu.:5.577  
##                Max.   :6.020  
## -------------------------------------------------------- 
## branch: next
##       branch        gbps      
##  master  : 0   Min.   :2.100  
##  next    :46   1st Qu.:4.340  
##  next1024: 0   Median :4.555  
##                Mean   :4.519  
##                3rd Qu.:4.755  
##                Max.   :5.370  
## -------------------------------------------------------- 
## branch: next1024
##       branch        gbps      
##  master  : 0   Min.   :3.050  
##  next    : 0   1st Qu.:4.543  
##  next1024:50   Median :4.915  
##                Mean   :4.884  
##                3rd Qu.:5.170  
##                Max.   :5.940
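
For reference, per-branch output of that shape can be produced with something like the following minimal sketch (assuming the same branch/gbps data frame as above):

```r
# Per-branch summaries; median and quartiles are robust to outliers.
# `samples` is assumed to have columns branch and gbps.
by(samples[, c("branch", "gbps")], samples["branch"], summary)
```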

@lukego (Member Author) commented May 9, 2016

... one interesting observation from the numeric table above is that there are only 46 values available for the branch next. This is also an anomaly to explain... Did I grab the log file too early? Or were there some errors (and if so, how should we detect and present those)?

eugeneia self-assigned this on May 10, 2016
eugeneia merged commit 7fddfe8 into master on May 12, 2016
eugeneia added a commit that referenced this pull request May 12, 2016
takikawa pushed a commit to takikawa/snabb that referenced this pull request Aug 5, 2017