NeoSmart Technologies’ open source (MIT-licensed), cross-platform tac (written in rust) has been updated to version 2.0 (GitHub link). This is a major release with enormous performance improvements under the hood, primarily featuring handwritten SIMD acceleration of the core logic that allows it to process input up to three times faster than its GNU Coreutils counterpart. (For those that aren’t familiar with it, tac is a command-line utility to reverse the contents of a file.)

This release has been a long time in the making. I had the initial idea of utilizing vectorized instructions to speed up the core line ending search logic during the initial development of tac in 2017, but was put off by the complete lack of stabilized SIMD support in rust at the time. In 2019, after attempting to process a few log files, each of which was over 100 GiB, I decided to revisit the subject and implemented a proof-of-concept shortly thereafter… and that’s when things stalled for a number of reasons.

In perfect honesty, the biggest reason a SIMD release was put on hold was that I had what I needed: my local copy of tac that used AVX2 acceleration worked just fine to help speed up my processing of reversed server logs. Having completed the challenging part of the puzzle (trying to vectorize every possible operation in the core logic), I was not particularly interested in the remaining 20% that needed to be done, and the state of rust’s support for intrinsics was considerably lacking at the time.

Apart from requiring compiler intrinsics that were gated behind the rust nightly toolchain, I wasn’t particularly sure how to tackle “graceful degradation” when executing on AVX2-incapable hardware. As mentioned earlier, the vectorization of tac was done entirely by hand: that is, by explicitly using the AVX2 assembly instructions to carry out vectorized operations, rather than by writing vectorization-friendly high-level rust code and then praying that the compiler would correctly vectorize the logic. Three years ago, the state of rust’s support for dynamic CPU instruction support detection looked very different than it does today, and I wasn’t looking forward to releasing a tac 2.0 that required a nightly build (aka “bleeding edge” or “unstable”) of the rust compiler.

For better or for worse, the SIMD-accelerated tac builds lay dormant (in a branch of the original code, no less) on GitHub until a couple of months ago, when a number of other rust rewrites of common greybeard utilities were making the rounds and I was inspired to revisit the codebase.

Fortunately, in 2021 it was considerably easier to dynamically enable AVX2 support in tac: first by taking advantage of the is_x86_feature_detected!() macro (available since mid-2018) to detect early on whether or not AVX2 intrinsics are supported at runtime, and then by converting a few manual invocations of nightly-only intrinsics to use SIMD-enabled equivalents exposed by the standard library instead. It helped that I had split off the core search logic into two blackbox functions, one with the original naive search logic and the other with the AVX2-accelerated search logic, so all I needed to do was call the right function.

What remained at that point was to verify that the generated dynamic-detection binary actually performed as I expected it to. Until this point, I’d been compiling my local tac builds with my local environment defaults, in particular with the RUSTFLAGS environment variable set to -C target-cpu=native, which would always result in the rustc compiler’s LLVM backend generating AVX2-compatible instructions. In addition to inspecting the generated assembly to ensure the expected AVX2 instructions were present in the AVX2 codepath (and temporarily adding a panic!("Wrong codepath!") to the non-AVX2 branch), I compiled a copy with target-cpu=native and set it aside, then compiled another copy with RUSTFLAGS unset, and proceeded to benchmark the two on some large (~40 GiB) payloads. I also benchmarked against my earlier proof-of-concept code that didn’t use dynamic detection at all.

Unfortunately, I found that although the AVX2 codepath was being correctly taken, the “dynamically accelerated” version of tac compiled without target-cpu=native consistently benchmarked up to 15% slower than the same codebase with that equivalent of -march=native enabled (which was indistinguishable in benchmarks from the “always-AVX2” proof-of-concept).
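
For readers who want to see what the runtime dispatch between the naive and AVX2 search functions might look like, here is a minimal sketch. It is not the actual tac source: the function names (`last_newline`, `last_newline_scalar`, `last_newline_avx2`) are hypothetical, and the search is simplified to "find the last line feed in a buffer". The only parts taken from the post are the use of `is_x86_feature_detected!()` and the split into two separate search functions.

```rust
// Hypothetical names throughout; a sketch, not tac's real implementation.

/// Portable fallback: index of the last b'\n' in `haystack`, if any.
fn last_newline_scalar(haystack: &[u8]) -> Option<usize> {
    haystack.iter().rposition(|&b| b == b'\n')
}

/// AVX2 path: compares 32 bytes per iteration, scanning from the end.
///
/// Safety: the caller must have verified that the CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn last_newline_avx2(haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;

    let needle = _mm256_set1_epi8(b'\n' as i8);
    let mut end = haystack.len();
    while end >= 32 {
        let start = end - 32;
        // Unaligned load of the 32 bytes in haystack[start..end].
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(start) as *const __m256i);
        // Bit i of `mask` is set when byte `start + i` is a newline.
        let mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle)) as u32;
        if mask != 0 {
            // The highest set bit marks the last newline in this chunk.
            return Some(start + 31 - mask.leading_zeros() as usize);
        }
        end = start;
    }
    // Fewer than 32 bytes remain at the front; finish with the scalar search.
    last_newline_scalar(&haystack[..end])
}

/// Picks the implementation at runtime based on detected CPU features.
pub fn last_newline(haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound because AVX2 support was confirmed on the line above.
            return unsafe { last_newline_avx2(haystack) };
        }
    }
    last_newline_scalar(haystack)
}
```

Calling `last_newline(buffer)` is all the rest of the program would need to do. Marking the accelerated function with `#[target_feature(enable = "avx2")]` lets the compiler emit AVX2 instructions for that one function even when the binary as a whole is built without `-C target-cpu=native`.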