[wip] core.timeline: High-resolution event log for diagnostics #849
Conversation
Missed a reference to Agner Fog's instruction tables which seems to tell us that the instruction
Awesome stuff, I suppose this would be very nice for looking at the flow of programs and seeing hot spots.
HDR Histogram is not the same thing, but possibly worthy of a see also: https://github.com/HdrHistogram/HdrHistogram
Quick braindump on latest details I am working out:
I am now aiming to quickly make this into something we can merge and then talk about enabling by default. The initial use cases and goals that I have in mind now are:
So far in simple tests it seems like the basic overhead of calling an assembler routine is extremely low but I am not sure at what rate it is still efficient to call
I added a timeline.md readme file explaining the idea and how the interface has evolved. Feedback would be welcome! I have a much-updated implementation but am still in the process of making it match up with the API.
I spotted two typos and fixed them (entires -> entries). And a question: In the example, shouldn't the string `example` appear somewhere in the resulting log ("The log now contains three entries")?
Looks good to me. I was left with a few open questions:
Typo fixes, and a question
Added category to example log output (thanks for feedback @sleinen). Added functions: priority(), save(), dump(). Fixed formatting (more word wrap).
Updated the timeline implementation to match the documented API. This is still a work in progress: the messages printed by the selftest all have zeros for arguments for unknown reasons; the overhead of logging has doubled for unknown reasons, possibly due to allocating with core.shm and putting a performance-critical cacheline at a 4096-aligned address that may trigger associativity problems (wild speculation, have to dig in).
Thanks for the feedback, @sleinen! I hope this is clearer now with the latest pushes? The current version should be basically feature-complete now. I plan to do one more round of bug-fixes and then re-introduce the branch as something that people can use and experiment with. The module now includes a really simple dumper that prints messages in reverse-chronological order with the timestamp deltas:
This is meant mostly as an example and I hope that we will evolve multiple interesting ways to process the timeline files. (I like the way even this simple dumper is sufficient to show that logging performance in this version is a bit irregular and slower than in earlier versions, i.e. a performance issue to resolve. Hopefully it will be able to illustrate bits of apps that slow down in the same way!)
I realized that when we take the timestamp with RDTSCP this is also loading the current CPU core number and NUMA node into ECX. This patch includes that register in the log dump. This seems useful so that we can account for scheduling issues from the log. Strictly speaking the value in ECX is the contents of the TSC_AUX register but it seems that we can rely on Linux to load the core ID + NUMA node there. Here is where I got the idea: http://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on
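For reference, here is a minimal LuaJIT sketch of decoding that value, assuming the Linux convention of storing (numa_node << 12) | core_id in TSC_AUX (the helper name is hypothetical, not part of this patch):

```lua
local bit = require("bit")

-- Decode the TSC_AUX value that RDTSCP returns in ECX, assuming Linux has
-- initialized it as (numa_node << 12) | core_id.
local function decode_tsc_aux (aux)
   local core = bit.band(aux, 0xfff)  -- low 12 bits: CPU core number
   local node = bit.rshift(aux, 12)   -- upper bits: NUMA node
   return core, node
end

print(decode_tsc_aux(0x1003)) --> 3   1   (core 3 on NUMA node 1)
```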
The string pattern matching for detecting arguments ($arg) was wrong and this caused all args to be logged as 0.
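For illustration, detecting "$arg" placeholders in a message string could look something like this in Lua (a hypothetical sketch; the actual pattern used in the module may differ):

```lua
-- Count "$name" argument placeholders in a message format string.
-- "%$" matches a literal '$' followed by one or more word characters.
local function count_args (message)
   local n = 0
   for _ in message:gmatch("%$[%w_]+") do n = n + 1 end
   return n
end

assert(count_args("rx packets($npackets) bytes($nbytes)") == 2)
```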
Show core number and NUMA node. Write slightly better log messages for selftest. Simplify the logging loop in the selftest to just log three messages per iteration. This way the timestamps will indicate the logging overhead. (The previous version had some funny branches that could add a few dozen cycles to the timestamps, which was a bit distracting.)
I fixed the issue with arguments being 0-valued (was a string matching bug) and too many cycles between messages (was because I had made the loop that calls the logger more complicated). I also added a nice feature: the core ID and NUMA node are stored for each log message. This information turns out to be available "for free" because the instruction that grabs the timestamp (RDTSCP) also loads it. Here is how the selftest output looks now:
I implemented the I also renamed the priorities in a hopefully more helpful way. The main ones are:
I hope this makes it easy to choose the right priority when defining messages and deciding how much logging to enable.
Fixed an off-by-one error (braino) in choosing which timestamps to calculate the delta from. Made the cycle column wider to better accommodate large values.
The selftest function now executes a few loops to see how long it takes to log a million events that are disabled by priority. Log excerpt:

```
numa core   -cycles category message
   0    3   5014856 selftest invoked many events(1000000)
   0    3   5025592 selftest invoked many events(1000000)
   0    3   5176004 selftest invoked many events(1000000)
```

So it looks like five cycles per event is the ballpark cost (including the loop that makes the individual FFI calls).
I added a few "log a million priority-disabled events in a loop" runs to the selftest to estimate the cost of calling an event. It looks like around five cycles per event in total (a 1M-event loop in ~5M cycles). See 9b659f9. It could be that you could even reasonably estimate the cost of the JIT compilation (interpretation, tracing, optimization, code generation) on the first iteration of the event logging loop: the first iteration takes around 150K cycles (~42 microseconds on this machine) longer than the next two. That strikes me as pretty fast compilation! Though some experience with timeline will be needed to know how safely such a conclusion can be drawn from the logs.
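For a rough sense of scale, here is that first-iteration overhead converted into wall-clock time (the ~3.5 GHz clock rate is an assumption, not taken from the logs):

```lua
-- 150K extra cycles on the first iteration, at an assumed 3.5 GHz clock.
local extra_cycles = 150e3
local hz = 3.5e9
print(extra_cycles / hz * 1e6)  --> ~42.9 microseconds of one-time JIT cost
```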
Idea for the future: I notice that we have room for a 40-bit performance counter in the log entry. It could be really neat to log the number of instructions in addition to the number of cycles. Then you could see the rate of instructions-per-cycle for every section of code. This could be handy as a heuristic to indicate whether or not you are really suffering from things like cache misses and branch mispredictions, e.g. if instructions-per-cycle (IPC) is 1-2 that might be fine, more than 2 might be really good, and less than 1 not so great. In other words, if you want to optimize code the IPC hints at whether you should be trying to execute fewer instructions (e.g. making the JIT happier) or making your instructions run faster (e.g. trying harder to keep things in cache or keep branches predictable).
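A sketch of how that heuristic could be applied, assuming each log entry carried both a cycle count and an instruction count (hypothetical fields; only cycles are logged today):

```lua
-- Instructions-per-cycle between two consecutive (hypothetical) log entries.
local function ipc (prev, cur)
   return (cur.instructions - prev.instructions) / (cur.cycles - prev.cycles)
end

local rate = ipc({cycles = 1000, instructions = 800},
                 {cycles = 3000, instructions = 1400})
print(rate)  --> 0.3: low IPC, hinting at cache misses or branch mispredictions
```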
Timeline
Here is a big idea: How about if we create a unified "timeline" of all the events that are happening in a collection of Snabb processes and the hardware they are running on:
The timeline would be a collection of log files that are high-performance (millions of messages per second), high-resolution (cycle precision based on a common time source), and ubiquitous (we would explicitly log events throughout Snabb and source external events e.g. from the kernel too).
This timeline could then serve as a reference for everybody involved in monitoring and troubleshooting the fine-grained behavior of a Snabb application e.g. developers, users, testers, support organizations, etc.
Here is an example (just a mock-up) of what a timeline might look like when printing the combined logs from one Snabb process and also the PMU hardware to see cache misses and branch mispredictions. Both logs are timestamped using the CPU Time Stamp Counter (TSC) and so they can be shown in parallel:
Things to notice in this example:
- The `SYSTEM` column can tell us what is going on in the LuaJIT runtime system: compiling new JIT traces, garbage collection, and so on. We could also log the addresses of the generated machine code, both for Lua traces and DynASM code, to make them easy to cross-reference.
- The `ENGINE` column shows us the breathe loop operation. Here you could look for e.g. a breath that took an unexpectedly long time to complete (perhaps flagged in your histogram; see Add core.histogram, use it to measure latency of app.breathe() #795).
- The `NIC` and `FIREWALL` columns show the individual processing steps for each app. This can be used directly like a simple profiler for application developers to divide-and-conquer performance problems by breaking processing into discrete steps (possibly an excellent thing to do for LuaJIT trace performance anyway; must talk with @alexandergall about his latest experiences and coding style).
- The `PEBS` column tells us which performance events are sampled by the CPU. These can be cross-referenced either temporally ("what else is going on in parallel on the timeline?"), by instruction pointer (which trace / DynASM code does this refer to?), or by address (what kind of memory is this cache miss on: Lua, DMA, or QEMU?). Note that PEBS hardware can log these events into memory and timestamp them with a suitable time source (TSC).

The timeline could be sliced and diced. Perhaps you only want to look at the `FIREWALL` column to see how many cycles elapse between each processing step. Perhaps you want to post-process the timeline to calculate statistics, e.g. a histogram of the durations of certain events. Perhaps you want to import the timeline into a GUI tool that lets you zoom in and out to look for patterns and anomalies.
Implementation notes

The log function is written in assembler and the selftest function is able to log one message every 36 CPU cycles. I hope that the actual cost will be less in real use due to instruction-level parallelism. I am not sure exactly how much overhead is acceptable. I believe that the `RDTSCP` instruction provides quite a good time source: on the one hand it does not interfere with out-of-order execution, and on the other hand it is not overly confused by it.
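To make the cost discussion concrete, here is one plausible cacheline-sized (64-byte) entry layout expressed with the LuaJIT FFI; it is purely illustrative and not necessarily the layout used by this branch:

```lua
local ffi = require("ffi")

-- One 64-byte log entry: timestamp, core/NUMA info, message id, arguments.
ffi.cdef[[
   struct timeline_entry {
      uint64_t tsc;       /* RDTSCP timestamp */
      uint32_t tsc_aux;   /* core id + NUMA node (TSC_AUX) */
      uint32_t msgid;     /* index into the table of message strings */
      double   arg[6];    /* message arguments ($arg values) */
   };
]]

assert(ffi.sizeof("struct timeline_entry") == 64)
```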
Related work

This idea is essentially a successor to the `lib.blackbox` (#754) idea. The trouble with the blackbox idea is that you need a GDB-like debugger to understand the log. The timeline seems simpler and more intuitive.

A Google image search turned up the QNX profiler as the GUI nearest to what I have in mind.
One more thing that we need to ubiquitously instrument the codebase with is counters. Could we do both at the same time? (Perhaps we could unify events and counters e.g. incrementing a counter logs an event?)
Intel whitepaper explaining more about the `RDTSCP` instruction as a time source: How to benchmark code execution times.

Feedback
If you have thoughts or see problems then leave a comment and I will try to reference/summarize it up here on the PR :).