silicon_sprawl_

RecordStream

2022-12-19T00:00:00+00:00

Note: this post is a historical relic, originally written in 2014.

This is a tutorial introduction to RecordStream, half-heartedly adapted from a presentation I gave at SeaGL 2013 some months ago.

Recs takes the best ideas of Microsoft’s PowerShell and applies them to the Unix environment. It’s a collection of scripts for lightweight, ad-hoc data analysis based around a common internal representation.

It’s composed of:

Input scripts that convert some input data source to newline delimited JSON
Data processing scripts that work on newline delimited JSON
Output scripts that convert newline delimited JSON to something pretty (like a table, HTML, gnuplot, etc.)

There are two major advantages over rolling your own data analysis scripts or relying solely on the traditional Unix utilities:

You spend less time shuffling loosely formatted plaintext from one utility to another
Useful data manipulation and output scripts are already written for you

This will be example driven. You’re encouraged to follow along at home. Try the commands piece by piece to get a feel for how the different commands are composed and fit together.

We’re starting with an access log in roughly common log format:

> head -1 access.log
54.243.31.205 - - [06/Oct/2013 17:10:21 +0000] "GET / HTTP/1.1" 200 3698 "-" "Amazon Route 53 Health Check Service" "0.078"

I’ll define a helper for dealing with it in recs. This is a bit nasty, but it’s something we only need to write once per log format, if at all (check the recs-from* scripts to see if your input format is already covered).

function recs-fromaccesslog() {
recs-frommultire \
--re 'ip=^(\d+\.\d+\.\d+\.\d+) ' \
--re 'date=\[([^\]]+)\]' \
--re 'method,path="(\S+) (\/.*) HTTP' \
--re 'status,bytes=" (\d+) (\d+) "' \
--re 'ua,latency="([^"]*)" "([^" ]*)"$' \
"$@"
}

With that done, we can easily shove our access log into recs’ internal format (newline delimited JSON records):

> head -1 access.log | recs-fromaccesslog
{"ua":"Amazon Route 53 Health Check Service", "bytes":"3698",
"ip":"54.243.31.205",
"date":"06/Oct/2013 17:10:21 +0000", "status":"200",
"path":"/",
"method":"GET",
"latency":"0.078"}

Given this log, we’ll try to answer a few simple questions.

1. Which of our clients are slow?

Our access log isn’t columnar and doesn’t have an easily usable delimiter, so parsing fields out is rather annoying. We have to arbitrarily choose something that’ll work as a delimiter for the fields we’re trying to pull out (user agent and server-side latency), then do some field counting. The result is none too pretty.

> head -5 access.log | cut -d'"' -f 6,8
Amazon Route 53 Health Check Service"0.078
Amazon Route 53 Health Check Service"0.003
Amazon Route 53 Health Check Service"0.163
Amazon Route 53 Health Check Service"0.204
Amazon Route 53 Health Check Service"0.031

We can sort based on our latency field; it just takes a bit more field counting…

> head -5 access.log | cut -d'"' -f 6,8 | sort -t'"' -n -k 2
Amazon Route 53 Health Check Service"0.003
Amazon Route 53 Health Check Service"0.031
Amazon Route 53 Health Check Service"0.078
Amazon Route 53 Health Check Service"0.163
Amazon Route 53 Health Check Service"0.204

Now we’ve got the p100 latency, and a collection of worst offenders.

What if we wanted our latency first so it was a bit more readable (and so we didn’t have to jump through so many hoops with sort)?

Cut doesn’t do field reordering, so for this we have to jump to Perl/AWK/Ruby/Python (pick your poison). I’m fond of Perl.

> head -5 access.log | perl -lne 'print "$2 $1" if /"([^"]*)" "([^"]*)"$/' | sort -n
0.003 Amazon Route 53 Health Check Service
0.031 Amazon Route 53 Health Check Service
0.078 Amazon Route 53 Health Check Service
0.163 Amazon Route 53 Health Check Service
0.204 Amazon Route 53 Health Check Service

The output’s much nicer, but that command isn’t getting any prettier.

What if we wanted something more complex? Say, clients by IP and UA, sorted by latency? Since we don’t have a nice delimiter, we’re heading deeper and deeper into the world of regular expressions…

> head -5 access.log | perl -lne 'print "$3 $2 $1" if /^(\S+) .*" "([^"]*)" "([^"]*)"/' | sort -n
0.003 Amazon Route 53 Health Check Service 54.241.32.109
0.031 Amazon Route 53 Health Check Service 54.245.168.45
0.078 Amazon Route 53 Health Check Service 54.243.31.205
0.163 Amazon Route 53 Health Check Service 54.228.16.13
0.204 Amazon Route 53 Health Check Service 54.251.31.173

That’s a grossly inefficient regex, but realistically, that’s about what I’d manage if I was interactively processing a log during an event.

If we wanted p90 instead of p100, it’d be a manual process based on line number:

> wc -l access.log
13563 access.log
> echo '0.9 * 13563' | bc
12206.7
> cat access.log | perl -lne 'print $1 if /"([^"]*)"$/' | sort -n | head -12206 | tail -1
0.208

With recs, this is a simple matter of converting our access log to JSON, grouping by user agent, computing arbitrary percentiles, then sorting and printing out.

> recs-fromaccesslog access.log | recs-collate --key ua --aggregator percs=percmap,'50 100',latency | recs-sort --key percs/100 -r | recs-totable -k percs/100,ua
percs/100 ua
--------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
29.155 Amazon Route 53 Health Check Service
1.210 Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
0.000 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
0.000 -
0.000 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36

Now that our data’s in this format, it’s much easier to play around and try to get interesting insights. We don’t need to do any new cut’ing, grep’ing, or manual bc - we can just change our grouping parameters. For example, grouping by both user agent and IP address:

> recs-fromaccesslog access.log | recs-collate --key ip,ua --aggregator percs=percmap,'50 100',latency | recs-sort --key percs/50,percs/100 -r | recs-totable
ip percs ua
--------------- ----------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
174.239.197.180 {"50":"0.712","100":"1.210"} Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
54.232.40.109 {"50":"0.223","100":"0.316"} Amazon Route 53 Health Check Service
54.232.40.77 {"50":"0.209","100":"0.409"} Amazon Route 53 Health Check Service
54.251.31.173 {"50":"0.185","100":"23.189"} Amazon Route 53 Health Check Service
54.252.79.141 {"50":"0.184","100":"24.183"} Amazon Route 53 Health Check Service
54.251.31.141 {"50":"0.184","100":"0.446"} Amazon Route 53 Health Check Service
54.228.16.13 {"50":"0.162","100":"26.193"} Amazon Route 53 Health Check Service
54.228.16.45 {"50":"0.159","100":"27.159"} Amazon Route 53 Health Check Service
54.252.79.173 {"50":"0.151","100":"29.155"} Amazon Route 53 Health Check Service
54.248.220.45 {"50":"0.129","100":"0.274"} Amazon Route 53 Health Check Service
54.248.220.13 {"50":"0.121","100":"25.112"} Amazon Route 53 Health Check Service
54.243.31.245 {"50":"0.077","100":"0.139"} Amazon Route 53 Health Check Service
54.243.31.205 {"50":"0.077","100":"0.085"} Amazon Route 53 Health Check Service
162.208.41.4 {"50":"0.032","100":"0.036"} Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
54.245.168.13 {"50":"0.031","100":"0.038"} Amazon Route 53 Health Check Service
54.245.168.45 {"50":"0.031","100":"0.037"} Amazon Route 53 Health Check Service
54.241.32.77 {"50":"0.004","100":"29.004"} Amazon Route 53 Health Check Service
54.241.32.109 {"50":"0.003","100":"0.027"} Amazon Route 53 Health Check Service
122.10.92.22 {"50":"0.000","100":"0.000"} Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
162.208.41.4 {"50":"0.000","100":"0.000"} -
162.208.41.4 {"50":"0.000","100":"0.000"} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36

Or grouping by user agent and URL path:

> recs-fromaccesslog access.log | recs-collate --key path,ua --aggregator percs=percmap,'50 100',latency | recs-sort --key percs/50,percs/100 -r | recs-totable
path percs ua
--------------- ------------------------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/slowpath {"50":"26.193","100":"29.155"} Amazon Route 53 Health Check Service
/nginx-logo.png {"50":"0.806","100":"1.210"} Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
/poweredby.png {"50":"0.712","100":"1.121"} Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
/ {"50":"0.166","100":"1.160"} Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A501 Safari/9537.53
/ {"50":"0.134","100":"0.553"} Amazon Route 53 Health Check Service
{"50":"0.000","100":"0.000"} Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
{"50":"0.000","100":"0.000"} -
/poweredby.png {"50":"0.000","100":"0.000"} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36
/nginx-logo.png {"50":"0.000","100":"0.000"} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36
/ {"50":"0.000","100":"0.000"} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36
/favicon.ico {"50":"0.000","100":"0.000"} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36

Moving on to another question…

2. In which time periods did we have bad latency?

We can emit a “good” or “bad” flag for some predefined latency threshold (in this case, 10s). We also do a bit of clever timestamp matching to group by hour.

> cat access.log | perl -lne 'print "$1 ",$2 > 10 ? "bad" : "good" if /(\d+\/\S+\/\d+ \d\d):\d\d:.*"([^"]*)"$/' | uniq -c
1621 06/Oct/2013 17 good
1726 06/Oct/2013 18 good
1593 06/Oct/2013 19 good
1900 06/Oct/2013 20 good
1903 06/Oct/2013 21 good
1322 06/Oct/2013 22 good
2 06/Oct/2013 22 bad
3 06/Oct/2013 22 good
1 06/Oct/2013 22 bad
5 06/Oct/2013 22 good
1 06/Oct/2013 22 bad
11 06/Oct/2013 22 good
1 06/Oct/2013 22 bad
4 06/Oct/2013 22 good
1 06/Oct/2013 22 bad
5 06/Oct/2013 22 good
1 06/Oct/2013 22 bad
540 06/Oct/2013 22 good
1915 06/Oct/2013 23 good
1008 07/Oct/2013 00 good

With recs we can easily get latency metrics batched by arbitrary time periods:

> recs-fromaccesslog access.log | recs-normalizetime --key date --threshold '1 hr' --strict | recs-collate -k n_date --aggregator percs=percmap,'50 100',latency | recs-sort -k n_date | recs-xform '{{n_date}} = localtime({{n_date}})' | recs-totable
n_date percs
------------------------ -----------------------------
Sun Oct 6 10:00:00 2013 {"50":"0.132","100":"1.210"}
Sun Oct 6 11:00:00 2013 {"50":"0.135","100":"0.258"}
Sun Oct 6 12:00:00 2013 {"50":"0.134","100":"0.277"}
Sun Oct 6 13:00:00 2013 {"50":"0.134","100":"0.351"}
Sun Oct 6 14:00:00 2013 {"50":"0.134","100":"0.446"}
Sun Oct 6 15:00:00 2013 {"50":"0.146","100":"29.155"}
Sun Oct 6 16:00:00 2013 {"50":"0.134","100":"0.274"}
Sun Oct 6 17:00:00 2013 {"50":"0.134","100":"0.376"}

Rust emit=asm Can Be Misleading

2020-11-09T00:00:00+00:00

The short version

Cargo builds like:

$ RUSTFLAGS="--emit asm" cargo build --release
$ cargo rustc --release -- --emit asm

do not always output assembly equivalent to the machine code you’d get from:

$ cargo build --release

Possibly rustc --emit=asm has some uses, like examining a single file with no external dependencies, but it’s not useful for my normal case of wanting to look at the asm for an arbitrary release build.

The long version

Previously I rewrote my ray tracer to use crossbeam::scope and crossbeam::queue instead of rayon. Internally rayon leans heavily on crossbeam::deque for its work-stealing implementation, so my expectation was that this change would be neutral or a slight improvement, depending on how good of a job the compiler had been doing to condense rayon’s abstractions.

Instead it was a ~15% regression.

Looking at the asm, pt. 1

The asm output appeared sane. I saw no expensive indirection, calls, etc. - things were getting properly inlined and optimized.

Understanding rayon

I first questioned my understanding of rayon and spent some time digging through its guts. It’s well-engineered, and it’s impressive that LLVM’s able to condense all of its abstractions down into basically no overhead - but I also didn’t see anything fundamentally novel or surprising going on that would give it a significant performance edge. The splitting/work assignment portion of the vec codepath looked like it would lead to slightly more even partitioning than my hand-built crossbeam method, but not by a lot, and definitely not by 15%. So that was bust. I did notice that crossbeam needed to heap allocate the closure I was using for my thread body, so perhaps that caused some additional overhead, but it should have been negligible.

CPU profiling

At this point I dumped both versions into Instruments and did some basic CPU profiling. rayon’s a bit annoying to poke around in because you end up with extremely deep stacks of join frames, but nothing really stood out. The crossbeam version was simply slower with no major red flags.

More in-depth CPU profiling

I’d been looking for an excuse to try Intel VTune for a while, but since it’s only supported on Windows and Linux and is best run on bare-metal, it had always been slightly too much effort to stand up for smaller projects. It seemed warranted for this one! I had an existing Windows Boot Camp partition, so figured I’d see just how much hassle it was to get everything working in that before I dusted off something to run Linux.

Sidebar: turns out Rust on Windows is… really nice. I’m not a Windows dev. There are things I admire about the ecosystem (like a good first-party debugger and some decent OS APIs), but apart from some Java way back in high school I’ve never even tried to compile software on a Windows machine. It always looked like a nightmare for C/C++ projects - I’m familiar enough with the code side of cross-platform support, but as for actually building things… I think cmake can spit out a Visual Studio project? And I keep hearing about WSL? So I went in with significant trepidation. Turns out it took all of ten minutes to install the VS C++ tools, rustup, a rust toolchain, vtune, and get everything building and working together. Pretty impressive.

VTune itself is a complex beast. Most (all?) of the data in it is stuff you could get out of perf, but the collection and workflow is streamlined - it does a good job of keeping track of previous runs, grouping them in a way so you don’t lose anything, surfacing useful information based on top-level categories (eg. “I want to look at memory access”), and providing a diff view between runs. It looks particularly useful for guiding iterative optimization and refinement. It’s a bit less useful when I’m comparing the performance of two fairly different programs, because many of the stack traces are unique to either the rayon or crossbeam version, so “you have 100% more of these rayon stack traces in this run” is not helpful. Looking through the data I saw that I was getting flagged on uarch perf, retiring instructions maybe 5% worse in the crossbeam version. Thinking that could be stalling waiting on memory, I ran a memory access profile and saw:

Crossbeam version is on the left, rayon version is on the right. Okay, 3s runtime difference - that’s commensurate with the perf regression I’m seeing. Interesting, we’re memory bound twice as frequently. That’s strange because our memory access pattern should be pretty similar. We’re doing over twice as many stores. We’re doing some additional loads. We’re…

Wait.

We’re doing over twice as many stores?! That doesn’t make sense.

Replacing crossbeam::scope

Perhaps heap allocating the closures was more expensive than I thought, or had bad knock-on effects. It’s a long shot, but the whole point of side projects is following some of those random tangents. I set about eliminating crossbeam::scope and using std::thread directly instead. This was a quick and dirty test: the entire point of scope is to create an abstraction that communicates to the borrow checker that threads we’ve spun off have been joined, otherwise it doesn’t know when a thread’s borrow is guaranteed to have ended and requires that data references from a thread’s closure are all static lifetime. In this case I’m manually joining the threads, so I can do a transmute to placate the compiler. Don’t ship code like this, it defeats the purpose of using Rust in the first place - you’d have a better experience with C++. But it can be really handy to circumvent these sorts of checks when doing quick prototyping/performance analysis to decide if it’s worth the time to build out a safe abstraction. I would welcome a “just build this without the borrow checker” mode for cases like this, though I’m probably in the minority and I don’t expect that would be an easy feature to add.

My testing code looked roughly like this:

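// Quick-and-dirty test only: extend the borrow to 'static so thread::spawn accepts it.
// Nothing enforces this; the thread must be joined before the original borrow ends.
let pixels = unsafe {
 std::mem::transmute::<&mut [V3], &'static mut [V3]>(pixels)
};
let handle = std::thread::spawn(move || {
 // code that uses &pixels
});
// Manual join: after this point the thread no longer touches `pixels`.
handle.join().unwrap();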

As expected, no significant performance gains were had.
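For comparison, here is a minimal sketch of the same shape without the transmute, using std::thread::scope. That API was stabilized in Rust 1.63, after this post was written, so it is not what the test above used. The V3 layout and the fill_on_worker name are assumptions made up for illustration.

```rust
use std::thread;

/// Stand-in for the post's `V3` pixel type (the field layout is assumed).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct V3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Fill every pixel from a worker thread while only borrowing the slice.
/// `thread::scope` joins all spawned threads before it returns, so the
/// borrow checker accepts the non-'static `&mut [V3]` with no unsafe code.
pub fn fill_on_worker(pixels: &mut [V3], value: V3) {
    thread::scope(|s| {
        s.spawn(|| {
            for p in pixels.iter_mut() {
                *p = value;
            }
        });
    });
    // The scope has ended, so the worker is guaranteed to be finished here.
}
```

The join is implicit at the end of the scope, which is the guarantee the manual join() above was providing by hand.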

Looking at the asm, pt. 2…

Something isn’t adding up so I want to look at the assembly again, but I’d like to clearly distinguish between my unchanged business logic and the rayon/crossbeam coordination code. The majority of my business logic is behind a single function named cast; adding #[inline(never)] to that single ray processing function should give me a nice seam between rayon and my business logic.

Build, run and the rayon version slows down… in fact it runs exactly as slow as the crossbeam version.

I try adding #[inline(always)] to the cast function in the crossbeam version, and lo and behold it speeds up to match the original rayon version, my regression disappears.
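The two attributes can be sketched side by side as below. The function bodies and signatures are hypothetical stand-ins, not the ray tracer's real cast; the point is only where the attribute goes and what it asks of the compiler.

```rust
/// Hypothetical stand-in for the ray tracer's `cast` function.
/// `#[inline(never)]` asks the compiler to keep it as a separate symbol,
/// so it shows up as a call target in a disassembly.
#[inline(never)]
pub fn cast_outlined(origin: f32, dir: f32, t: f32) -> f32 {
    origin + dir * t
}

/// Same body, but with a strong hint to inline at every call site.
#[inline(always)]
pub fn cast_inlined(origin: f32, dir: f32, t: f32) -> f32 {
    origin + dir * t
}

/// Sum the result over a few parameters using either variant.
/// Both paths compute the same values; only the generated code differs.
pub fn trace(use_inlined: bool, samples: u32) -> f32 {
    let mut acc = 0.0;
    for i in 0..samples {
        let t = i as f32;
        acc += if use_inlined {
            cast_inlined(1.0, 2.0, t)
        } else {
            cast_outlined(1.0, 2.0, t)
        };
    }
    acc
}
```

Both attributes are hints rather than hard guarantees, so checking the final binary is still the reliable way to see what happened.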

But, how’s that possible? The first thing I did was look at inlining. Maybe my quick once-over missed it, maybe I misread and this whole circuitous path is all my fault?

I generated assembly output for both the inlined and noninlined versions of the crossbeam ray tracer:

$ rg ecl_rt4cast inline.s
21293: .asciz "_ZN6ecl_rt4cast17hc1100eade04dff75E"
$ rg ecl_rt4cast noinline.s
21293: .asciz "_ZN6ecl_rt4cast17hc1100eade04dff75E"

I’m building release with symbols, so that string is expected. But neither version, not even the non-inlined version, is making calls to cast(). Curious.

$ wc -l inline.s
203969 inline.s
$ wc -l noinline.s
203969 noinline.s

Now I feel like I’m being gaslighted. These are the exact same length. A diff shows that the only changes are some arbitrary IDs in debug info. I have a difficult relationship with optimizing compilers, so my first thought is maybe LLVM’s being LLVM again and I should go validate the binaries instead…

$ objdump -d ecl_rt_inline | rg ecl_rt4cast
<no output>
$ objdump -d ecl_rt_noinline | rg ecl_rt4cast
0000000100003190 __ZN6ecl_rt4cast17hc1100eade04dff75E:
100003299: e9 af 01 00 00 jmp 431 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x2bd>
1000034a2: eb 1f jmp 31 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x333>
1000034c6: 74 38 je 56 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x370>
1000034e5: 0f 82 f5 00 00 00 jb 245 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
1000034ee: 72 1d jb 29 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x37d>
1000034f0: e9 eb 00 00 00 jmp 235 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
100003503: 0f 83 d7 00 00 00 jae 215 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
100003515: 0f 87 16 03 00 00 ja 790 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6a1>
10000351e: 0f 82 1f 03 00 00 jb 799 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6b3>
100003527: 0f 82 2b 03 00 00 jb 811 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6c8>
100003530: 0f 82 37 03 00 00 jb 823 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6dd>
100003539: 0f 82 40 03 00 00 jb 832 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6ef>
100003590: 0f 84 1a ff ff ff je -230 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x320>
1000035db: e9 d0 fe ff ff jmp -304 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x320>
10000360a: 0f 86 a1 01 00 00 jbe 417 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x621>
100003637: 0f 87 57 02 00 00 ja 599 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x704>
100003668: 0f 86 3d 02 00 00 jbe 573 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x71b>
100003682: 0f 84 41 01 00 00 je 321 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x639>
100003707: 0f 85 93 fb ff ff jne -1133 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x110>
10000371a: 0f 86 9f 01 00 00 jbe 415 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x72f>
100003723: 0f 86 a8 01 00 00 jbe 424 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x741>
10000372c: 0f 86 b1 01 00 00 jbe 433 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x753>
1000037ac: e9 57 fc ff ff jmp -937 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x278>
1000037c7: eb 12 jmp 18 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x64b>
100009740: e8 4b 9a ff ff callq -26037 <__ZN6ecl_rt4cast17hc1100eade04dff75E>

Bingo - note the callq. Clearly my crossbeam version wasn’t inlining as aggressively as the rayon version, possibly due to the Box::new(closure). Instructing the compiler to do so brought performance in line with expectations. It’s silly that the compiler wasn’t inlining it in the first place: this function has a single callsite, and inlining it improves both runtime performance and binary size.

That means --emit=asm does something entirely unexpected. I dug around and sure enough there are reports that running with --emit=asm will build with a different configuration due to interaction with ThinLTO and codegen units.

Fin

It’s not ideal to rely on disassemblers because they’re also fallible. In the same way that going from C to asm loses fidelity and makes decompiling from asm to C difficult, going from asm to machine code also loses fidelity and there can be inconsistencies when disassembling machine code back into asm.

The common disassemblers like objdump are linear sweep and can suffer from mistaking data for code. There’s another family of disassemblers based on recursive traversal that avoid those problems, but come with their own set of tradeoffs.

Note that the learning curve on disassemblers can be steep. These tools are often packaged into a suite and targeted towards reverse engineering and malware analysis, so they come with far more features than “give me a good disassembly and make it easy to visualize/browse.” Hopefully it’ll be easier to match the --emit=asm build config to a normal release build config in the future, but until then I’ll be getting comfortable with Ghidra.

Rust Ray Tracer, an Update (and SIMD)

2020-11-06T00:00:00+00:00

About a month ago I ported my C99 ray tracer side project to Rust. The initial port went smoothly, and I’ve now been plugging away adding features and repeatedly rewriting it in my spare hours. In parallel I’m getting up to speed on a large, production Rust codebase at work. The contrast between the two has been interesting - I have almost entirely positive things to say about Rust for large, multi-threaded codebases, but it hasn’t been as good of a fit for the ray tracer.

It’s not a bad fit, but C/C++ are almost perfectly suited for this domain. Many of Rust’s flagship features aren’t applicable and/or get in the way - for example, the borrow checker doesn’t get me anything that ASAN wouldn’t in this specific use case, though it does cause some additional headaches.

What follows are a few of the quirks I’ve come across.

Overhead of Thread Locals

There was a recent blog post about this, so I won’t get into it very much.

Suffice to say that thread locals in C already have more overhead than I’d like since they introduce a level of indirection on use, and the additional overhead of lazy initialization is significant. I found myself golfing down TLS access wherever possible (“I’ll persist this in TLS, but copy it out to/write it back from the stack”).
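The "copy it out, write it back" pattern can be sketched as follows. The RNG state, the xorshift step, and the names RNG_STATE and sum_samples are assumptions chosen for illustration; the post does not say what was actually stored in TLS.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread RNG state; lazily initialized on first access.
    static RNG_STATE: Cell<u32> = Cell::new(0x9E37_79B9);
}

/// One xorshift32 step on a plain local value (no TLS involved).
fn next(state: &mut u32) -> u32 {
    *state ^= *state << 13;
    *state ^= *state >> 17;
    *state ^= *state << 5;
    *state
}

/// Touch TLS twice per call (one read, one write) instead of once per sample.
pub fn sum_samples(n: u32) -> u64 {
    // Copy the state out of TLS onto the stack.
    let mut state = RNG_STATE.with(|s| s.get());
    let mut acc = 0u64;
    for _ in 0..n {
        // The hot loop only sees the stack copy.
        acc += u64::from(next(&mut state) & 0xff);
    }
    // Persist the advanced state back to TLS for the next call.
    RNG_STATE.with(|s| s.set(state));
    acc
}
```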

Nightly has an attribute that can be used to get a barebones thread local, but I’m trying to avoid nightly if possible.

Ultimately I got rid of TLS use entirely, but it meant moving away from rayon.

Difficulty of expressing mutable array access

At its core a ray tracer is a giant array of pixels. You read a pixel, do some math, and write it back. This is trivial to parallelize by assigning disjoint sets of indices to threads, but often ends up being a little difficult to express in Rust. In particular, non-contiguous, cross-thread write access seems impossible to model safely without doing a copy pass over the array (ie. using split to slice it up into contiguous owned chunks, then later copying/rearranging it into the required non-contiguous order).

This makes it a bit annoying to write a tile-based instead of row or pixel-based tracer.
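For contrast, the contiguous case is easy to express safely. The sketch below splits a flat row-major buffer into bands of rows with chunks_mut and hands each band to a scoped thread. It uses std::thread::scope (stable since Rust 1.63) rather than crossbeam, and shade is a hypothetical placeholder for the per-pixel work. A tile has no single contiguous slice, which is why the same trick does not carry over to tiles.

```rust
use std::thread;

/// Shade an image stored as a flat row-major buffer, one contiguous band
/// of rows per thread.
pub fn render_rows(pixels: &mut [f32], width: usize, rows_per_band: usize) {
    assert!(width > 0 && rows_per_band > 0);
    let band_len = width * rows_per_band;
    thread::scope(|s| {
        // chunks_mut hands out non-overlapping &mut sub-slices, so each
        // thread gets exclusive access to its band with no unsafe code.
        for (band, chunk) in pixels.chunks_mut(band_len).enumerate() {
            s.spawn(move || {
                let first = band * band_len;
                for (i, p) in chunk.iter_mut().enumerate() {
                    let idx = first + i;
                    *p = shade(idx % width, idx / width);
                }
            });
        }
    });
}

/// Placeholder shading: encodes the pixel coordinates so results are checkable.
fn shade(x: usize, y: usize) -> f32 {
    (y * 1000 + x) as f32
}
```

Spawning one thread per band keeps the sketch short; a real tracer would more likely feed the bands to a fixed pool.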

‘Undefined’ Undefined Behavior

I’ve found it hard to tell what is and isn’t undefined behavior in Rust. There’s the Rustonomicon, but it’s sparse in places. In particular, I don’t have a good feel for what transmutes are and aren’t safe. One route is to outsource all that concern to something like bytemuck and let Lokathor worry about it. But for this project I’ve been avoiding taking deps unless completely necessary, because…

Compilation speed

…compilation speed is atrocious. My work builds take an ungodly amount of time. I’ve been very picky about dependent libraries to keep this ray tracer’s incremental build as low as possible.

Operator overloading and numeric traits

I used to dismiss operator overloading as a frivolous feature, but it’s been valuable for floating point and SIMD math. Compilers generally aren’t going to do as much algebraic rearranging/simplification with those types, and it’s much easier to notice and tease out shared operations when operator overloading is used. That said, I would love to be able to do arbitrary overrides for <, >, etc. because SIMD types aren’t a good fit for std::cmp::PartialOrd.

As much as I like traits and bounded generics, they’ve been painful when it comes to numeric types. A core type in my ray tracer is Vec3, a struct of three f32s. I wanted to make it generic across a SIMD type to let me work with 8 Vec3s at once, so instead of three f32s it would have three 8-wide f32s in struct-of-arrays form. This proved to be… not worth the hassle. In C++ I could write the Vec3 logic (dot product, cross product, etc.) as usual, parameterize it by f32 or f32x8, then go implement whatever mathematical overloads were missing. In Rust I need a set of unified traits between f32 and f32x8. Either I need to define that unified trait myself, which is a lot of boilerplate, or I can use something like the num crate, which would require implementing more functionality than I actually use (and some of which isn’t applicable to SIMD).
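To give a sense of the boilerplate involved, here is a minimal sketch of the "define the unified trait myself" route. The Lane trait and the F32x8 wrapper are hypothetical names, and a plain array stands in for a real SIMD register. Only Add and Mul are covered; subtraction, division, sqrt, min/max and comparisons would each need the same treatment.

```rust
use std::ops::{Add, Mul};

/// The minimal "unified trait": what the Vec3 math below needs from a lane type.
pub trait Lane: Copy + Add<Output = Self> + Mul<Output = Self> {}
impl Lane for f32 {}

/// Hypothetical 8-wide float; an array stands in for a real SIMD register.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x8(pub [f32; 8]);

impl Add for F32x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..8 {
            out[i] += rhs.0[i];
        }
        F32x8(out)
    }
}

impl Mul for F32x8 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..8 {
            out[i] *= rhs.0[i];
        }
        F32x8(out)
    }
}

impl Lane for F32x8 {}

/// One Vec3 when T = f32, eight Vec3s in struct-of-arrays form when T = F32x8.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec3<T: Lane> {
    pub x: T,
    pub y: T,
    pub z: T,
}

impl<T: Lane> Vec3<T> {
    /// Dot product, written once for both the scalar and the wide case.
    pub fn dot(self, o: Self) -> T {
        self.x * o.x + self.y * o.y + self.z * o.z
    }
}
```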

Ultimately I didn’t bother.

Rayon

Rayon is a fantastic library. It was much nicer to work with than OpenMP, and par_bridge makes it dead simple to plug in anywhere.

Ultimately I ditched it for two reasons:

I couldn’t find a way to directly control thread init, which meant I couldn’t replace my thread locals with stack variables. You can mostly get around this by using the _init methods that take a closure, reading a thread local onto the stack then writing it back when the thread finishes its jobs.
It does far more than I need, which came out in a number of small ways - like making profiler output harder to read because of a large number of nested joins.

I ultimately switched to using crossbeam directly, spinning up my own thread pool reading off of a simple mpmc queue. Interestingly this is as fast as rayon with par_bridge, but is measurably slower than rayon’s custom parallel iterators for Vecs. I’m still looking at why exactly that is, but it seems like rayon is doing a better job of load-balancing work. Ray tracers have a large number of pixels that can be processed in parallel, but each pixel has a variable amount of work, so you need to strike a balance between making batches too big (then one thread finishes early and you don’t fully utilize the machine) and too small (more thread contention to grab jobs). I need to add logging to rayon’s join splitting, but my hunch is that it’s doing a better job of keeping the batch size as high as possible without causing cores to go idle.
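The batch-size tradeoff can be sketched with nothing but the standard library. This is not the crossbeam queue described above: a shared atomic counter stands in for the mpmc queue, and run_batches and pixel_cost are hypothetical names. Each worker claims batch pixels at a time and reports how many it processed, which makes it easy to see how evenly the work spread for a given batch size.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Process `total` pixels on `threads` workers, claiming `batch` pixels at a time.
/// Returns how many pixels each worker ended up processing.
pub fn run_batches(total: usize, threads: usize, batch: usize) -> Vec<usize> {
    assert!(batch > 0);
    let next = AtomicUsize::new(0);
    thread::scope(|s| {
        let workers: Vec<_> = (0..threads)
            .map(|_| {
                s.spawn(|| {
                    let mut done = 0;
                    loop {
                        // Claim the next batch; contention on this counter
                        // grows as `batch` shrinks.
                        let start = next.fetch_add(batch, Ordering::Relaxed);
                        if start >= total {
                            break;
                        }
                        let end = (start + batch).min(total);
                        for i in start..end {
                            std::hint::black_box(pixel_cost(i));
                        }
                        done += end - start;
                    }
                    done
                })
            })
            .collect();
        workers.into_iter().map(|w| w.join().unwrap()).collect()
    })
}

/// Stand-in for per-pixel ray tracing work; the cost varies with the index.
fn pixel_cost(i: usize) -> u64 {
    (0..(i % 64) as u64).sum()
}
```

With a large batch the per-worker counts tend to diverge near the end of the run; with a batch of 1 they even out, at the cost of one atomic operation per pixel.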

Update: See this post for more investigation of the performance regression.

SIMD

There are a few different places where SIMD is applicable in a ray tracer:

Do Vec3 operations in SIMD. This is a common initial idea, but it’s not particularly fruitful.
Process multiple pixels or multiple rays for the same pixel in SIMD. This is very useful, though requires writing SIMD versions of some libm functions (notably trig functions). It’s also where you start hitting ray coherency problems - if you shoot 8 rays in a batch at roughly the same area of the scene, it’s likely that they’ll behave similarly. But as soon as they hit an object and bounce they all head in different directions, and pretty quickly you end up with dead lanes. Unless your scene is very simple that’s still going to be a net win. Then coherency issues can come up again once you’ve calculated your hits and need to process materials - a ray of light hitting a lake leads to very different math from a ray of light hitting a tree. A good strategy for dealing with such things is to switch from doing a depth-first traversal of the scene to breadth-first, letting you accumulate enough state to batch likes with likes and pull, say, ‘8 tree hits’, ‘8 water hits’, etc. from the work queue all at once. The tradeoff is now you have a significant amount of additional memory use and possibly more thread synchronization, so it’s easy to accidentally make everything worse and slower (I’ve heard it’s more effective on GPUs, but know less about that). One very good paper on this style of optimized breadth-first CPU ray tracing is this one from Intel.
Perform intersection checks for a single ray in SIMD. This isn’t as big of an improvement as the former, but given the effort it has great bang for your buck. Most of the work to add SIMD was defining pass-through functions for intrinsics, with a few gnarlier ones here and there (eg. hmin). The trickier optimization work came from going back over the code and looking for any small places that I could simplify the calculations - little things like removing a negation or redundant multiply, switching to fma, etc. added up to substantial improvements.
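The "batch likes with likes" idea from the second item can be sketched as a queue that buckets hits by material and only releases a batch once it has a full set of lanes. The Material, Hit and HitQueue types are hypothetical, and the sketch ignores the memory and synchronization costs the text warns about.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Material {
    Water,
    Tree,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Hit {
    pub ray_id: u32,
    pub material: Material,
}

/// Number of hits shaded together, matching an 8-wide f32 vector.
pub const LANES: usize = 8;

/// Accumulates hits per material so shading can run on uniform batches.
#[derive(Default)]
pub struct HitQueue {
    pending: HashMap<Material, Vec<Hit>>,
}

impl HitQueue {
    /// Queue a hit; returns a full batch of LANES same-material hits when ready.
    pub fn push(&mut self, hit: Hit) -> Option<Vec<Hit>> {
        let bucket = self.pending.entry(hit.material).or_default();
        bucket.push(hit);
        if bucket.len() == LANES {
            // Hand the full bucket to the caller and leave an empty one behind.
            Some(std::mem::take(bucket))
        } else {
            None
        }
    }

    /// Hits still waiting for their bucket to fill up.
    pub fn pending(&self) -> usize {
        self.pending.values().map(Vec::len).sum()
    }
}
```

The hits left in pending at the end are where the extra memory use comes from, and they still need a partially filled pass to flush.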

This was my first time using AVX2, and I didn’t realize it’s essentially “SSE but bigger.” In particular I was surprised that most shuffles and permutes operate within each 128-bit lane; only a handful of instructions (such as vpermps) can move data across lanes.

Other surprises were that rsqrt with a refinement iteration was slower than simply calling sqrt (though the Intel optimization manual did warn me about this on Skylake - I have so much other math going on that it led to port contention). And the cost of float conversions adds up very quickly - initially I was lazy and only implemented an 8-wide f32 type, then would cast in/out if I needed some integer type instead. Adding a proper i32x8 got me a few percentage points of runtime improvement.

Rust’s current SIMD support is the absolute bare minimum. Intrinsics are exposed, all must be used in unsafe, and if you dig you can find some docs on repr(simd). There’s also a smattering of SIMD crates, some good, some bad, some seemingly unmaintained. There’s nothing as complete or useful as Agner Fog’s VCL. There is however an active working group adding portable SIMD abstractions to the core. That’s very exciting, and looks like it’s shaping up nicely.

Debugging

Debugging ray tracers is surprisingly fun; you end up with a lot of “how on earth did that happen” moments. Here are a few of my recent head scratchers:

Reference Image

This is my current reference scene. Not too exciting - I need to invest some time in building out a more complex scene and possibly adding obj/triangle support. But the performance work tends to be more fun.

Blurred

I have no idea what happened here. I found this in my output folder over the course of doing the refactor from rayon to crossbeam, so I don’t know exactly what broke - but I thought it was neat.

Ripples

This came from some bad floating point math - I think I messed up the intersection calculation in some way, but don’t remember exactly how. I thought the ripple effect was kinda fun.

Fun House Mirrors

“Maybe I don’t need to normalize my vectors here…”

tries it

“Nope, I definitely need to normalize there.”

Inside Out

This came from trying to use a fast inverse sqrt without a refinement step. A lot of my intersections were messed up, so rays ended up bouncing around inside objects and things got weird.

Porting a C99 Ray Tracer to Rust

2020-09-27T00:00:00+00:00

I needed to pick up Rust for work, so I ported my existing ray tracer to the language for a little practice. It’s now in the unimaginatively named ecl_rt.

Overall it was a pleasant experience. I particularly like Rust’s object system (am I allowed to call it that?), the bounded generics, how it handles numeric primitives (requiring explicit conversion, giving easy, explicit control of overflow behavior), the focus on expressions, and the rayon library. The ML aspects are refreshing, but easy to overdo.

The development environment is fairly good, though I did hit a bug in rust analyzer and at one point had to wipe my build directory because cargo got confused and everything started failing to link.

The initial port was ~40% slower than the equivalent C99 codebase. Replacing the rand crate with the same custom PRNG I use in ecl_rt_legacy closed the gap to 15-20%. That’s still much higher than I’d like, but I haven’t had time to dig into it in depth. I can say that it’s not related to threading and nothing in the Rust assembly looks too off - bounds checking isn’t hurting me very much, there aren’t a lot of extra function calls, etc. It seems that clang is optimizing the giant wad of floating point calculations slightly better, but I haven’t looked into what exactly it’s doing differently.

Whiteout

My initial render wasn’t too far off - at least it made an image! I forgot to average the pixel color values back down, so everything trended towards full white the longer the ray tracer ran.

Upside Down

With whiteout fixed, the next obvious problem is the image is upside down. This is common because images are represented as a giant flat buffer of pixel values, so when you go from an in memory representation to an image library or format you need to agree on how that buffer is stacked and unstacked, ie. does the first item in the buffer correspond to the top left or the bottom left pixel of the image. So I just reversed the buffer…

Mirrored Colors

Oops. In reversing the buffer I accidentally reversed my color channels, so red is blue and blue is red.

Band Artifacts

I flipped the colors, but tried to get too clever with the PRNG. In this image I tried to seed my PRNG with the thread ID - which would be fine, except rayon was calling my seed function each time it checked out work from the job pool. Instead of a thread seeding once, it would reseed with the thread ID approximately fifty times over the course of the run. This causes visible artifacting and the bands you see in the image. Rather than adjust it to only seed once, I opted to use the rand crate for reseeding (it’s so few calls that the overhead is negligible).

Finally…

And now everything is sorted and we’re comparable to the ecl_rt reference image!

DLX Redux

2020-09-08T00:00:00+00:00

Note: this post was from a college side project circa 2010. It held up fairly well, so I'm reposting it as-is. You've been warned.

I was never all that interested in working Sudokus by hand. I've known a few people who were straight up addicted to it, but I never understood the draw. To me, games become much less fun when I know that a relatively straightforward method of solving them exists. It's happened with Mastermind, Checkers, and, to a lesser extent, Chess. Playing them feels like a waste of time when I could be learning a more complex and interesting game (like Go) instead.

But while playing them isn't terribly fun, writing a solver can be a blast! In high school I tried to write one for Sudoku, but at the time I didn't have enough experience to do a proper job. I was still wrestling with teaching myself Scheme and recursion, so my program didn't make good use of backtracking. In fact, the only types of problems it could reliably solve were the most trivial of puzzles where you can definitively place a value at each stage in the solution and never have to branch. I had come across some literature on Knuth's Algorithm X, but it was over my head and I quickly got lost.

Just last Tuesday I stumbled across the same literature, namely Knuth's Dancing Links paper. I had a bit of free time from the semester wrapping up and thought it'd be fun to give it another shot. I ended up spending around two days on it, and made a pretty nice little solver. The code is located here. The source doesn't have enough comments, but it makes sense if you read and refer to the paper. Also, it needs a parser for input files to be suitable for general use. This was the first project I worked on with Eclipse and Egit, so there are a few extra workspace files in the tree.

The general class of problems that Sudoku belongs to is called exact cover. The core problem is that given a universe U and a group of subsets S, you want to find a subgroup S' such that every element in U is contained by exactly one of the subsets in S'. Basically, you want a group of subsets that don't overlap and "cover" every element in the stated universe.

As a concrete example, suppose that:
U = {A, B, C}
and our subsets are:
S1 = {A}
S2 = {A, B}
S3 = {B, C}

The only valid solution is {S1, S3} since it covers all of the elements exactly once.

A popular method of solving this style of problem is with Knuth's (somewhat menacingly named) Algorithm X. The algorithm itself isn't all that complex; it's a pretty straightforward backtracking technique.

The basic data structure is a binary matrix where your columns are the universe and your rows are the sets you can choose from.
For this simple example, the matrix would look like:

	A	B	C
S1	1	0	0
S2	1	1	0
S3	0	1	1

It's easiest to view the columns as constraints that must be satisfied. In this case, we need an A, B, and C. Our goal is to condense this into an empty matrix, showing that all constraints have been satisfied.

We proceed by eliminating a row, placing it in our temporary solution. When we eliminate a row, we remove the constraints that it satisfies (the columns where it has a 1). When we remove those constraints, we also eliminate other rows that satisfy the constraint.

For example, if we include S2 in our temporary solution then constraints A and B are satisfied. Columns A and B will be removed. Since A and B have been satisfied, we must remove all other rows that also satisfy them as otherwise we'd have overlap. Thus, S1 and S3 are also eliminated. We are left with C as an unsatisfied constraint and no potential solutions left, so S2 was an incorrect choice and we must backtrack and try again.

While the algorithm is solid, the runtime isn't particularly good if it's implemented as a multidimensional array. The problem is that it's likely to be a large sparse matrix, so we'll end up spending a lot of time just iterating over a row or column looking for the next position that has a 1.

Dancing Links is a clever implementation strategy centered around the operation of removing and reinserting a node in a circular doubly-linked list. Essentially, you can pop the node out such that it's no longer in the list, but knows where it should go if you need to shove it back in later. By using this little trick and modeling the matrix as a circular, four-direction, doubly-linked list (a torus, or donut shape) we can improve the complexity of finding the next 1 from O(N) to O(1).

So the only thing left to do is fit Sudoku onto the exact cover problem. For that we need an initial matrix that represents the standard 9x9 Sudoku game.

For determining columns, there are four constraints that we have to account for: each box, row, and column must have the numbers 1-9 exactly once, and each cell can only have one number (no cheating by writing in two and leaving another cell empty or some such). Each of these four constraints is actually going to break down into 81 individual constraints for a total of 324 columns.

For determining rows, we must list every valid position for each number. This is going to be 9 rows * 9 cols * 9 numbers for a total of 729 rows.

Once we create the necessary structure, we can remove the rows representing initially filled positions and solve it as a normal exact cover problem with DLX. As we add solutions to our temporary set we keep track of the row name, then just use that after termination to print out a solution (if one exists).

And that's about it! It really is a very cool implementation technique, and exact cover relates to a number of other interesting problems, so if you've got some time to spare I'd highly suggest flipping through Knuth's paper.

Learning About Ray Tracing

2020-08-02T00:00:00+00:00

I’ve had more time than usual for side projects since I’ve been stuck inside the past few months. I spent the majority of June digging into graphics, getting acquainted with the field by building a ray tracer. The initial version is now online as ecl_rt.

Rather than write yet another post about building a ray tracer, I’ll point to the handful of resources (from the multitude available) that were actually useful:

Ray Tracing in One Weekend is the canonical introductory tutorial. I didn’t love it - I wasn’t on board with the code structure and found it very light on explanation. I still think it’s a decent way to get something on the screen fast, so I’d recommend going through it quickly to get a prototype working and then move on.
Physically Based Rendering is dense and long, but also deep, insightful, and a pleasure to read. I wish I had picked it up earlier instead of spending so much time on various other books/tutorials. In general, I think you should do the minimum amount of work to get something on the screen and get the basic background knowledge to understand this book, then simply work through PBRT cover to cover. It’s incredible that the whole thing is available online for free (though I’d recommend picking up a physical copy if you expect you’ll be spending a lot of time with it).
Aras’ blog series on path tracers is a lot of fun. He implemented a ray tracer in every imaginable way; it makes for great reading to compare some of the paths I didn’t take (eg. Metal or other modern GPU frameworks).

There’s a lot of work left, but it’s still fun to look at how far things have come. Here’s a few images showing the evolution of my ray tracer’s output, from the very first image it rendered to the current state:

Building Pipelines with Circular Buffers, not Queues

2020-06-15T00:00:00+00:00

Structuring programs as pipelines is a nice way to separate business logic and introduce parallelism - if you do it right it gets you both clarity and performance.

Typically this is done by tying threads together with some form of concurrent queue, such as a channel in Golang, ConcurrentLinkedQueue in Java, or concurrent_queue in C++ (Intel TBB or Microsoft PPL).

Using a simple integer pipeline as an example, we’ll have an initial phase writing random integers, one phase that multiplies its input by two, one phase that increments its input, and a final phase that prints the result.

With queues, it would look something like this:

But the overhead of multiple queues can be quite high and variable, so is often unacceptable in low-latency programs. An alternative is to use a single circular buffer and have each thread hold a cursor into it. This pattern has significantly better behavior on current hardware and requires minimal synchronization. It’s variously known as event sourcing, the LMAX Disruptor, or “that giant circular buffer pattern.”

A shared circular buffer for our example would instead look like this:

One way to think about this is that we’re moving the executor to the data instead of the data to the executor.

A few of the advantages:

Extremely good data locality. The prefetcher will pull data for the next item into the cache before we need it and we’ll keep the CPU well-fed and happy.
No data needs to be copied between phases, whereas the queue needs a copy in/out of the queue. As the struct gets large the queue needs to start using a pointer indirection, which again hurts locality and puts more pressure on the gc. Since we don’t incur any expensive copies, the buffer can continue to store large structs directly. If our struct is written appropriately we also won’t need to do any expensive clean operation on struct reuse.
Low contention. Each phase coordinates with a single atomic and one sync operation can batch multiple items at once (ie. we only do one sync to take ownership of all queued items for our phase), compared to a queue which typically must synchronize on each item.
Very few pointers for the gc to scan, possibly just the pointer to the circular buffer and pointers between phases. With care we could code it to generate zero garbage when in steady state.
Performance is consistent. Where the queue has multiple buffers that need to be sized, locks that may be contended, etc. it’s much easier in the circular buffer to quantify the total amount of work in the system and the worst-case performance under full load.

A very barebones example:

package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync/atomic"
)

type data struct {
	num int
}

type phase struct {
	_        [7]int64 // padding
	cursor   int64
	_        [7]int64 // padding
	upstream *phase
}

const bufSize = 64 // must be power of 2
const bufMask int64 = bufSize - 1

var circularBuf [bufSize]data

func runPhase(p *phase, f func(int64)) {
	curr := int64(0)
	for {
		upstreamLimit := atomic.LoadInt64(&p.upstream.cursor)
		for curr != upstreamLimit {
			f(curr & bufMask)
			curr++
		}
		atomic.StoreInt64(&p.cursor, curr)
		runtime.Gosched()
	}
}

func runWriter(p *phase) {
	r := rand.New(rand.NewSource(1))
	curr := int64(0)
	for {
		upstreamLimit := atomic.LoadInt64(&p.upstream.cursor)
		if curr == upstreamLimit {
			// empty buffer
			upstreamLimit = curr + bufSize - 1
		}
		for curr&bufMask != upstreamLimit&bufMask {
			circularBuf[curr&bufMask].num = r.Intn(100)
			curr++
		}
		atomic.StoreInt64(&p.cursor, curr)
		runtime.Gosched()
	}
}

func main() {
	// writeRandInt -> multTwo -> addOne -> printResult
	var printResult, addOne, multTwo, writeRandInt phase
	writeRandInt.upstream = &printResult
	printResult.upstream = &addOne
	addOne.upstream = &multTwo
	multTwo.upstream = &writeRandInt

	go runWriter(&writeRandInt)

	go runPhase(&addOne, func(i int64) {
		circularBuf[i].num++
	})

	go runPhase(&multTwo, func(i int64) {
		circularBuf[i].num *= 2
	})

	go runPhase(&printResult, func(i int64) {
		fmt.Println(circularBuf[i])
	})

	select {} // block forever
}

This code is meant to show off the core concept in the smallest amount of code possible. Fully building this out you would hide the cursor logic behind a nice API and the final business logic would look very similar to a queue-based implementation looping on a consume function.

A few specific notes about the implementation:

The cursors are not truncated to the size of the buffer each time they’re incremented, instead they count towards integer max and wrap. This makes it easy to disambiguate completely empty buffers from completely full buffers.
The example has no backoff or wait strategy. Busy spin is what you’d want for a high-load, low-latency system, but something that trades a small amount of performance to let the CPU idle is preferable in other cases. Ideally this would be implemented with direct calls to gopark/goready, but those aren’t exposed externally by the runtime. A condvar can be used instead.
The example also has no batching strategy except “grab everything available”. This will lead to clumping, but fixing it is trivial.
On x86_64, atomic loads are compiled to mov and atomic stores are compiled to xchg. arm64 compiles these to ldar and stlr respectively. This is standard, but was my first time looking at the asm for atomics in golang, so I was happy to see solid codegen.
The conditional for the empty queue case in the writer is unfortunate. Ideally we would write that conditional as straightline code, eg.

upstream = atomic.LoadInt64(&p.upstream.cursor)
empty = curr^upstream == 0
upstreamLimit = (upstream * !empty) + ((curr + bufSize - 1) * empty)

This would generate a cmp but no jmp. Unfortunately I know of no way to express this in go, and the optimizer doesn’t do it for us. It is a common pattern in C and other systems programming languages. Since we know the numbers are positive but we’re saving them in 2s complement, in this case we do have a path to doing this with computation, but it’s silly and mostly academic.

upstream := atomic.LoadInt64(&p.upstream.cursor)
notEmpty := curr ^ upstream
upmult := (notEmpty >> 63) - (-notEmpty >> 63)
upstreamLimit := (upstream * upmult) + ((curr + bufSize - 1) * (^upmult & 1))

Update: turns out there is a way to express this. At least as of go 1.19, this generates the assembly I’m looking for - straightline code with a cmp but not jmp.

upstream := atomic.LoadInt64(&p.upstream.cursor)
empty := int64(0)
if curr^upstream == 0 {
 empty = 1
}
upstreamLimit := (upstream * (empty^1)) + ((curr + bufSize - 1) * empty)

One last note: channels in go are deeply integrated with the runtime and do things like make explicit gopark/goready calls, copy values from one goroutine’s stack directly into another’s, etc. You could do a lot worse, and should make sure they don’t fit your needs before rolling your own.

Fast Subnet Matching

2020-06-07T00:00:00+00:00

Determining if a subnet contains a given IP is a fundamental operation in networking. Router dataplanes spend all of their time looking up prefix matches to make forwarding decisions, but even higher layers of application code need to perform this operation - for example, looking up a client IP address in a geographical database or checking a client IP against an abuse blocklist.

Routers have extremely optimized implementations, but since these other uses may be one-off codepaths in a higher-level language (eg. some random Go microservice), they’re not written with the same level of care and optimization. Sometimes they’re written with no care or optimization at all and quickly become bottlenecks.

Here’s a list of basic techniques and tradeoffs to reference next time you need to implement this form of lookup; I hope it’s useful in determining a good implementation for the level of optimization you need.

Multiple Subnets

If you have multiple subnets and want to determine which of them match a given IP (eg. longest prefix match), you should be reaching for something in the trie family. I won’t cover the fundamentals here, but do recommend The Art of Computer Programming, Vol. 3 for an overview.

Be extremely skeptical of any off-the-shelf radix libraries:

Many do not do prefix compression
Many support N instead of two edges, which may lead to unnecessary memory overhead
Many will operate on some form of string type to be as generic as possible, again contributing to memory overhead
All will be difficult to adapt to different stride lengths

I would highly recommend writing your own implementation if performance is a concern at all. Most common implementations are either too generic or are optimized for exact instead of prefix match.

unibit to multibit to compressed

A radix 2 trie that does bit-by-bit comparison with compression for empty nodes is a good starting point. To further speed it up, you’ll want to compare more than one bit at a time - this is typically referred to as a multibit stride.

Multibit strides will get you significantly faster lookup time at the cost of some memory - in order to align all comparisons on the stride size, you’ll need to expand some prefixes.

As an example, let’s say you’re building a trie that contains three prefixes:

Prefix 1: 01*
Prefix 2: 110*
Prefix 3: 10*

A unibit trie would look like this:

If instead we want to use a multibit trie with a stride of two bits, then prefix 2 needs to be expanded into its two sub-prefixes, 1101* and 1100*. Our multibit trie would look like this:

Note how this trie has increased our memory usage by duplicating prefix 2, but has reduced our memory accesses and improved locality (there are far fewer pointers chased in this diagram), thus trading memory usage for lookup performance.

Most of the time a multibit trie is where you can stop. If you need to optimize further, especially if you need to start reducing memory usage, then you’ll want to explore the literature on compressed tries. The general idea with many of these is to use a longer or adaptive stride, but find clever ways to remove some of the redundancy it introduces. Starting points include LC-tries, Luleå tries, and tree bitmaps.

Modified traversals

There are some common, related problems that can be solved by small modifications to the traversal algorithm:

If instead of finding the longest prefix match you need to find all containing subnets, simply keep track of the list of all matching nodes instead of the single most recent node as you traverse and return the full set at the end.
If you need to match a containing subnet on some criteria other than most specific match, for example declaration order from a config file, express this as a numerical priority and persist it alongside the node. As you traverse, keep track of the most recently visited node and only replace it if the currently visited is a higher priority.

Sidenote on PATRICIA tries

PATRICIA tries are a radix 2 trie that saves a count of bits skipped instead of the full substring when doing compression. You don’t want this! They’re great for exact match lookup, like what you’d want in a trie of filenames, but saving only the skip count causes prefix matches to backtrack, resulting in significantly worse performance. It’s unfortunate that they’re so often associated with networking; in some cases the name is misused and people say PATRICIA when they simply mean radix 2.

Single Subnet

If you have a large number of IPs and want to check if a single subnet contains them, spend a little time looking at your assembler output to choose a good implementation. If available, you’re best off using 128-bit integer types to support IPv6. C and C++ (via the gcc/clang extension), Rust, and many other systems languages support this. Unfortunately Go and Java do not, so you’ll have to piece it together with two 64-bit integers - slightly cumbersome, and slightly more overhead as we’ll see.

In IPv4, subnet contains checking is easy since everything fits in a word, roughly:

// checking if 1.2.3.0/24 contains 1.2.3.4
uint32_t prefix = 0x01020300; // prefix address, packed big endian
uint32_t client = 0x01020304; // client address, packed big endian
uint8_t mask = 24; // netmask, range 0-32
// invert the mask to get a count of number of zeros;
// a /0 is special-cased because shifting a 32-bit value by 32 is undefined
uint32_t bitmask = (mask == 0) ? 0 : 0xFFFFFFFF << (32 - mask);
if ( (prefix & bitmask) == (client & bitmask) ) {
    // subnet contains client
}

IPv6 is when things get interesting. 128-bit long IPv6 addresses mean juggling two machine words. In computing the bitmask we need a mask for the upper and the lower portion of the address.

uint64_t upper_prefix, lower_prefix, upper_client, lower_client; // assume these are initialized
uint8_t mask; // netmask, range 0-128, assume this is initialized
uint8_t shift = 128 - mask; // number of zero bits at the bottom of the bitmask
uint64_t upper_bitmask = UINT64_MAX;
uint64_t lower_bitmask = UINT64_MAX;
if (shift < 64) {
    lower_bitmask <<= shift;
} else {
    // shifting a 64-bit value by 64 is undefined, so a /0 is handled explicitly
    upper_bitmask = (shift == 128) ? 0 : (upper_bitmask << (shift - 64));
    lower_bitmask = 0;
}

if ((upper_prefix & upper_bitmask) == (upper_client & upper_bitmask)
    && (lower_prefix & lower_bitmask) == (lower_client & lower_bitmask)) {
    // subnet contains client
}

Rewriting with gcc/clang’s int128 emulated type:

__uint128_t prefix, client; // assume these are initialized
uint8_t mask; // netmask, range 0-128, assume this is initialized
// shifting by the full 128 bits is undefined, so a /0 is handled explicitly
__uint128_t bitmask = (mask == 0) ? 0 : ~static_cast<__uint128_t>(0) << (128 - mask);

if ( (prefix & bitmask) == (client & bitmask) ) {
    // subnet contains client
}

The emulated int128s are much easier to read and work with, but how does performance compare?

Here is the source code and Godbolt link for a small test, isolating just the shift portion:

#include <cstdint>

__int128 shift128(uint8_t shift) {
    __int128 t = -1;
    t <<= shift;
    return t;
}

struct Pair {
    uint64_t first, second;
};

Pair shift64(uint8_t shift) {
    uint64_t upper = -1;
    uint64_t lower = -1;
    if (shift < 64) {
        lower <<= shift;
    } else {
        upper = lower << (shift - 64);
        lower = 0;
    }

    return Pair{upper, lower};
}

And here is the compiler’s optimized x86 assembly with comments added:

shift128(unsigned char):
 mov ecx, edi ; load mask into ecx
 mov rax, -1 ; initialize lower word
 xor esi, esi ; zero this register for use in cmov
 mov rdx, -1 ; initialize upper word
 sal rax, cl ; shift lower word by mask
 and ecx, 64 ; and our mask with 64
 cmovne rdx, rax ; move lower word into upper
 cmovne rax, rsi ; zero lower word
 ret
shift64(unsigned char):
 movzx ecx, dil ; load mask into ecx
 cmp dil, 63
 ja .L4 ; jump if mask is >= 64
 mov rdx, -1 ; initialize lower word
 mov rax, -1 ; initialize upper word
 sal rdx, cl ; shift lower word by mask
 ret
.L4:
 sub ecx, 64 ; find out how much we need to shift the upper word by
 mov rax, -1 ; initialize upper word
 xor edx, edx ; mask was >= 64, so just zero the lower word
 sal rax, cl ; shift upper word
 ret

There are a few interesting things to note:

sal will automatically mask its shift operand to the appropriate range, so while it’s undefined behavior in C to shift by more than the size of the target, this is fine at the asm level
and with 64 is using knowledge of undefined behavior - our shift is only well-defined within the range of 0-127, so the compiler assumes UB is impossible and ignores the range outside.
cmov is used instead of a jump. On modern hardware this should be strictly better, though is most noticeable when jumps are unpredictable. Our jumps should be very predictable here.

If we wanted, we could rewrite the int64 version in a way that would more closely match the int128 assembly:

Pair shift64_v2(uint8_t shift) {
    uint64_t upper = -1;
    uint64_t lower = -1;
    lower <<= (shift & 0x3F);
    if (shift > 0x3F) {
        upper = lower;
        lower = 0;
    }

    return Pair{upper, lower};
}

shift64_v2(unsigned char):
 mov ecx, edi
 mov rdx, -1
 mov rax, -1
 sal rdx, cl
 cmp dil, 63
 jbe .L4
 mov rax, rdx
 xor edx, edx
.L4:
 ret

Note how the assembly does not contain any explicit and with 0x3F; we’ve merely communicated to the compiler that we want the sal instruction’s default masking behavior. Unlike the int128 version, though, the compiler has emitted a jump here rather than a cmov.

Previously I’d hoped that I could use the 128-bit SSE registers and mm intrinsics to operate on IPv6 addresses natively. However, operations to use SSE registers as a single 128-bit value (as opposed to 2 64-bit values, 4 32-bit values, etc.) are quite limited. In particular, _mm_slli_si128 shifts by bytes instead of bits so won’t work for our use case (though SIMD instructions would be useful for performing matches against multiple client IPs at once).

Network Programming Self-Study

2020-05-10T00:00:00+00:00

Lately I’ve been getting more questions about how to start out in network programming: what books to read, what projects to do, and how to make a career of it.

I’ve been in this space ten years now, working across layers 3 to 7 on CDNs, DNS, and protocol stacks at a couple of FAANGs and a startup. If you name a piece of software that runs in an edge network, I’ve probably seen one (or three) versions of it.

Advice is tricky. It’s easy to turn things I learned into Things Everyone Should Learn. It’s also easy to fit an inaccurate narrative to a path, to recast something as a logical progression when it was really blind stumbling around. I could spin a yarn about how networking was the first thing I did with computers, how I did a CCNA in high-school and something something destiny. But I could also tell the story of how I did that CCNA primarily to get out of taking PE, and how I found my college networking class so tedious that I dropped it in the first week and never tried again (for some inexplicable reason it was an optional elective at my university).

I’ll avoid all that - I don’t think my exact career path is all that interesting. But I’ll offer one piece of advice (because I can’t entirely resist), and a handful of books that I’ve found useful and interesting (because too many are not both). If the advice doesn’t resonate, ignore it. If a book seems boring, skip it.

General Advice

Networks are not a pure, abstract technology. Honestly nothing is, but networks in particular are physical, temporal things. They exist in a certain place at a certain time, influenced by people, technology, nature, and politics.

You will be a better developer if you involve yourself in the reality of networking. Follow NANOG, OARC, or other lists where operators hang out, try to understand the discussion, their mindsets and biases. Pay attention to things, and pay attention as they are happening. If there’s an operational event or outage, follow it, attempt to debug as it unfolds, then later compare notes with whoever was working it. Working on something independently until you get stuck and then articulating exactly what you’re stuck on is one of the most useful skills you can develop, and working events in realtime is a great way to practice.

Networking Resources

High Performance Browser Networking - this is an excellent crash course on protocols and browsers. This is 90% of what most software developers need to know about networking. It’s available for free online.
Interconnections - this is my favorite resource for learning about routing protocols. Perlman is extremely accomplished in the field and has an accessible writing style. A few newer protocols are missing, but this will give you the necessary background to pick those up easily. It’s very affordable since it’s an older book.
Network Routing - I only recommend this for the chapter on hardware, and possibly the chapter on label switching. It has a great overview of how a physical router is put together and works, but most of the book is dry and nowhere near as engaging as Perlman. Unfortunately it is an expensive text; borrow it if you can.
The Internet Peering Playbook - this book is all about the people/business side of how the Internet functions. It’s a fascinating read, and even if you don’t work in the space it will help you understand the dynamics of e.g. cable companies, large Internet players, etc. The physical book is impossible to obtain, but the Kindle edition is inexpensive and much of the content is available for free on the DrPeering site.

Systems Programming Resources

Network programming is a form of systems programming. There are certain systems programming resources that I consider indispensable. These are generally not books to go buy all at once and read cover to cover (though you could!), but if there are specific topics you need to understand in more depth - say, lock-free data structures or sockets or epoll - then this is where you go first. Internet resources are woefully inaccurate or out of date on many of these topics.

Perfbook - this is the primary resource for anything related to parallel programming. CPU architecture, memory access semantics, threads, locks, atomics, RCU, hazard pointers, parallel data structures. It’s a phenomenal resource, freely available and frequently updated.
Computer Systems: A Programmer’s Perspective - this is a good first stop for any hardware or systems questions. Things like how does virtual memory work, how does a linker work, and so on. Often if you need more depth it will only serve as a jumping off point to relevant OS or CPU manuals, but I still find it valuable. Unfortunately it’s quite expensive since it’s a current textbook.
The Linux Programming Interface - in the tradition of The Unix Programming Environment and TCP/IP Illustrated, this is my preferred one-stop shop for Linux APIs. Lucid, in-depth writing, broad coverage.
Systems Performance - you will need to think about performance, it comes with the territory. This is the book to read on performance. Also, check out Brendan Gregg’s blog, talks, and more recent work on BPF.

Project Ideas

Read the DNS RFCs and implement either a stub resolver or an authoritative server in your language of choice. Start with a few record types and expand as long as you’re interested. Use Wireshark to view the traffic and debug.

You’ll eventually need to learn how to read RFCs, and the original DNS RFCs are straightforward. DNS isn’t encrypted so you’ll have an easy time sniffing your traffic during development. Best of all, it’s exciting getting a piece of software you wrote interacting with something you didn’t write - either using your stub resolver to query a public DNS server, or using dig/unbound to query your authoritative server. DNS is fun.
Play with a lab network. This doesn’t need to be a physical lab - GNS3 with VyOS, MikroTik, or any Linux distro running FRRouting makes a great environment for experimentation. You can build a complex network environment, packet sniff every single link to see how routers are communicating, and drop a container or VM running your own network software into the mix. If you need a goal, try setting up two separate ASes, one running IS-IS and one running OSPF. Model an Internet exchange and have them peer.
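For the DNS project above, the first thing a stub resolver has to do is turn a name into bytes on the wire. Here is a minimal sketch in Python of that step, following the message layout in RFC 1035 (a 12-byte header, then a question made of length-prefixed labels). The function names `encode_name` and `build_query`, and the fixed query ID, are my own choices for illustration, not part of any library. It only builds the packet; sending it and parsing the response are left as the actual exercise.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels (RFC 1035, section 3.1)."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        # Each label is 1 to 63 bytes; this sketch does not handle the bare root name.
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"  # a zero-length label terminates the name


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a single-question query (RFC 1035, section 4.1). qtype 1 is an A record."""
    flags = 0x0100  # standard query with the "recursion desired" bit set
    # Header fields: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


if __name__ == "__main__":
    print(build_query("example.com").hex())
```

To try it against a real server, send the bytes over a UDP socket to port 53 of a resolver you are allowed to query, and watch the exchange in Wireshark. A real resolver should also pick a random query ID per request rather than the fixed one assumed here.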

Tangent: Languages

I’m going to avoid languages except for one note: you’ll need to know C, even if it’s just enough to read others’ code. There are plenty of ways to learn it, but I’d recommend Modern C. I have some minor nits with the book, but it’s a high-quality, concise, freely available text that covers all the language features you need to know and points out many of the problematic areas.

C is a simple language. It doesn’t benefit from reading many books or tutorials. Most of the complexity lies in working with memory and dealing with optimizing compilers, so you must use it to understand it.

If you want a starter C project, try implementing malloc. You’ll learn about virtual memory, committed versus reserved pages, fragmentation, and how to write fast software. You’ll also gain an understanding of how even simple-looking C stdlib functions hide significant complexity (try to imagine what complexity a higher-level language hides). When you’re done, read about tcmalloc or jemalloc and compare notes. Run your code under ASan and UBSan to find bugs.

The End

Good luck and have fun!

Addenda

Most people will get the DNS knowledge they need from the books listed in the Networking Resources section. But if you want significantly more depth (e.g. if you’re starting a new job at a DNS company) then I recommend Managing Mission-Critical Domains and DNS. I like that it covers the entire ecosystem - registrars, WHOIS, DNS, DNSSEC, some major open-source implementations, and even touches on operations/DDoS.