Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] IP checksum in AVX2 assembler (prototype rewrite) #899

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

lukego
Copy link
Member

@lukego lukego commented Apr 27, 2016

This branch rewrites the AVX2 checksum routine with DynASM assembler instead of GCC intrinsics.

My motivations and perceived benefits are a little subjective:

  • Practice writing assembler code with DynASM - a powerful secret weapon.
  • Practice using x86 SIMD instructions - untapped potential growing exponentially.
  • Reduce reliance on GCC - a boring complex external dependency.
  • Get Snabb to compile on older Redhat/CentOS - GCC lacks AVX2 support.
  • Get to know this checksum routine better - originally ported from another project.
  • Have a chance to try and squeeze more performance.

I will be satisfied if the assembler version is at least as short as the C version and also at least as fast.

Looks promising so far. The version here is basically working but seems to be missing a carry bit somewhere (off-by-one on some tests). The code is a little over half the size of the C version. The performance seems at least as good.

Just have simple microbenchmarks for now. I would like to get a more thorough performance test like #755 upstream to verify this change. Meanwhile, here is how it looks in comparison to the C code based on the microbenchmark included on the PR (1 million iterations on the same input). On case with 150 byte input:

ASM
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       45,552,172      45.552       9.718       0.304
instructions                                 84,139,687      84.140      17.950       0.561
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

C
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       64,792,114      64.792      13.822       0.432
instructions                                248,176,655     248.177      52.944       1.655
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

Here is a little table with some other values:

Code vs Bytes 31 32 100 150 500 1,500 5,000 10,000 100,000
C (cycles) 84* 85* 263* 65 96 206 498 970 10,005
ASM (cycles) 44 19 47 46 51 116 402 772 9,085

EDIT: Marked with * the small sizes where our C code actually punts to non-SIMD.

So: fun and encouraging so far but more work to be done. Could still be some big mistakes that invalidate this code and/or results for now.

cc @tonyrog and @jsnell who may also be interested.

See source code comments for implementation status/notes.
@lukego lukego changed the title [wip] lib/checksum_simd.dasl: IP checksum in AVX2 assembler (prototype) [wip] IP checksum in AVX2 assembler (prototype rewrite) Apr 27, 2016
Fixes an overflow bug where the 32-bit accumulators were summed using
a 16-bit add instruction. Checksums now seem to be correct (same as
existing routine) for up to 128KB inputs.
@lukego
Copy link
Member Author

lukego commented Apr 28, 2016

Fixed the bug with 4993c37. This code works fine up to 128KB inputs in casual testing. That limitation seems okay to me i.e. not worth writing more code to increase it because packets are not that big.

The next step is to integrate and test/benchmark more extensively both with synthetic benchmarks and end-to-end tests (offloading checksums from QEMU VMs).

From the comment:

    This routine executes a VZEROUPPER instruction before returning in
    order to flush 256-bit AVX register state and avoid potential
    expensive SSE-AVX transition penalties. This is a cheap form of
    insurance against taking ~ 75 cycle penalties when mixing SSE and
    AVX code in the same program. See
    https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties
    and particularly section 3.3.
The assembler routine now has the same function interface as the C
functions and should be able to serve as a drop-in replacement.
snabbmark can now measure the performance of built-in IP checksum
routines and presents results for apples-to-apples comparison.

The benchmark parameters are currently hard-coded. Length is randomly
chosen from a "log uniform" distribution (favoring smaller values but
drawn from a large range). Alignment is randomized.

The intention is to favor robust routines that are not sensitive to
alignment and predictable branches.

Currently the alignment is forced to be even. Initially this was to be
realistic for normal protocols but I discovered that odd addresses
actually crash the SSE implementation. Have to address that bug
separately.
@lukego
Copy link
Member Author

lukego commented Apr 30, 2016

Added snabbmark checksum benchmark:

$ taskset -c 1 sudo ./snabb snabbmark checksum
VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          0.326       1935.081
asm                   631.331          4.244        148.770
avx2                  631.331          2.743        230.180
sse2                  631.331          2.318        272.319

This is intended to be fairly harsh and realistic. The input sizes, contents, and alignments are randomized. I draw the input sizes from a log-uniform distribution which is my current favorite for packet sizes (mostly small but also including large and jumbo sizes). I also update the assembler routine to have the same interface as the others (added the initial argument).

The assembler routine shows the best results by far. The older AVX routine is likely suffering from the logic that selectively falls back on the generic routine on small inputs (both for the unpredictable branch and because the current implementation of generic checksum is beyond awful -- should fix that as a matter of principle even if we are using the SIMD one in practice.)

This is starting to look like effort well spent! IP checksum is the main hotspot for Virtio-net with client/server workloads e.g. running iperf in VM. Cycles saved here should translate directly into extra capacity for the NFV application.

Before there were three separate C checksum implementations (generic,
SSE, AVX) that are each compiled with different compiler settings. These
were fairly complex due to SIMD intrinsics. The SSE implementation was
also incorrect and would segfault with odd-numbered addresses.

Now there is one C checksum routine that is compiled with two different
compiler settings (default/SSE and AVX). The checksum routine is written
in a very simple style that GCC successfully vectorizes
automatically (tested with GCC 4.8.5, 4.9.3, 5.3.0).

I experimented with "waving a voodoo chicken" in a few different
ways (# accumulators = 1, 2, 4; accumulator size = 32bit, 64bit). This
formulation seems to work best for GCC. This does feel like hokus-pokus
that exposes us to GCC behavior that is not nailed down, but that
bothers me less than the high-brow intrinsics code.

I have retained the AVX2 assembler implementation with DynASM because I
have not been able to beat that with GCC yet.

Current scoreboard:

    VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
    base                  631.331          2.939        214.796
    asm                   631.331          4.161        151.719
    avx2                  631.331          3.416        184.825
@lukego
Copy link
Member Author

lukego commented May 1, 2016

This branch is taking a little bit of a different turn:

I found a fairly straightforward formulation of checksum in C that GCC is able to automatically vectorize when compiled with -O3. I hacked the Makefile to compile this function twice: once as cksum_avx2() using -mavx2 and once as cksum() using default settings (i.e. SSE). This is simpler and faster than the hand-coded SSE and AVX routines based on C intrinsics so I have removed those.

I have retained the AVX2 assembler variant as this is still the fastest by a significant margin.

Current scoreboard:

VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          2.921        216.117
asm                   631.331          4.172        151.310
avx2                  631.331          3.435        183.775

The next step is to eliminate either the C/AVX2 implementation of the assembler one. The open problem for the assembler one right now is the wart that it temporarily overwrites the memory trailing the input which is a different and more complex interface that may not suit all usages. The open problem for the C/AVX2 implementation is that it is slower than the assembler.

@jsnell
Copy link

jsnell commented May 1, 2016

My usual worry with auto-vectorization for critical hotspots is that at some point it'll fail to trigger for whatever reason, and you'll end up silently running the naive version of the code. It's ok in a closed environment where the compiler versions are guaranteed to be fixed and only get upgraded at very specific points in time, but more problematic when you as the author have little control over how this gets compiled.

Though the simplicity of that new C code is pretty sweet.

@lukego
Copy link
Member Author

lukego commented May 2, 2016

@jsnell Yes, I know what you mean. In Snabb we pin exact versions of LuaJIT and DynASM so that we can "geek out" on them but have tried to be conservative with gcc, glibc, etc. I am not especially comfortable with either the auto-vectorized nor the vectorized-with-C-intrinsics versions.

I imagine it is interesting in other projects where they are dedicated C hackers and pick one compiler (ICC, CLANG, or GCC) and geek out on its features for vectorization etc. at least until the distros come along and decide to use a different compiler, compile for a generic target architecture, skip the performance tests, etc :).

Questions I am struggling with now are:

  1. Is the trick of zero-padding the input good (simpler code thanks to known properties of inputs) or bad (complex interface due to lazy coding)?
  2. Is it worth writing an SSE assembler version by hand to reduce dependence on GCC?
  3. How does performance compare with theoretical limits? i.e. based on throughput and latency of the instructions involved (how busy is it keeping the ALUs and what is the limiting factor on performance?)

dpino added a commit to dpino/snabb that referenced this pull request Aug 10, 2017
…th-alarms

Extend snabb-softwire-v2 schema with alarms
@AkihiroSuda
Copy link

What is current status?

@eugeneia
Copy link
Member

@AkihiroSuda @dpino is hacking on a fast IP checksum implementation in the same spirit here: #1275

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants