Gift cards are used by phishers. In our institution, we routinely get personalized spam mails (in the name of the recipient's group lead, sent via GMail -- this is not low-effort) that ask whether the recipient is available and, when someone (accidentally) responds, ask for Apple gift cards.
> these are the string instructions like REP MOVSB
AArch64 nowadays has somewhat similar CPY* and SET* instructions. Does that make AArch64 CISC? :-) (Maybe REP SCASB/CMPSB/LODSB (the latter being particularly useless) is a better example.)
> LEA happens to be the unique instruction where the memory operand is not dereferenced
Not quite unique: the now-deprecated Intel MPX instructions had similar semantics, e.g. BNDCU or BNDMK. BNDLDX/BNDSTX are even weirder as they don't compute the address as specified but treat the index part of the memory operand separately.
Been there, done that during my PhD (code: [1]). Works reasonably well, except for compile times (for which I implemented a caching strategy). However, due to calling conventions, using LLVM isn't going to give the best possible performance. Some features like signal handling are extremely hard to implement with LLVM (so I didn't). Although the overall performance results have been good, it's not an approach that I could strongly recommend.
I agree: AArch64 is a nice instruction set to learn. (Source: I taught ARMv7, AArch64, x86-64 to first-year students in the past.)
> how simple instruction encoding is on arm64
Having written encoders, decoders, and compilers for AArch64 and x86-64, I disagree. While AArch64 is, in my opinion, very well designed (also better than RISC-V), it's certainly not simple. Here are some of my favorite complexities:
- Many instructions have (sometimes very) different encodings. While x86 has a more complex encoding structure, most encodings follow the same structure and are therefore remarkably similar.
- A huge number of instruction operand types: memory + register, memory + unsigned scaled offset, memory + signed offset, optionally with pre/post-increment, but every instruction supports a different subset; vector, vector element, vector table, vector table element; sometimes a general-purpose register encodes a stack pointer, sometimes a zero register; various immediate encodings; ...
- Logical immediate encoding. Clever, but also very complex. (To be sure that I implemented the decoding correctly, I brute-force test all inputs...)
- Register constraints: MUL (by element) with 16-bit integers has a register constraint on the lowest 16 registers. CASP requires an even-numbered register. LD64B requires an even-numbered register less than 24 (it writes Xt..Xt+7).
- Many more instructions: AArch64 SIMD (even excluding SVE) has more instructions than x86 including up to AVX-512. SVE/SME takes this to another level.
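To give an idea of the logical immediate point above, here is a minimal sketch of a decoder, based on my reading of the DecodeBitMasks pseudocode in the Arm ARM. This is not the code from my decoder; the names decodeLogicalImm and countValidLogicalImm64 are made up for illustration.

```cpp
#include <cstdint>
#include <optional>

// Decode the N:immr:imms fields of an AArch64 logical immediate.
// Returns std::nullopt for reserved encodings. Hypothetical helper.
constexpr std::optional<std::uint64_t> decodeLogicalImm(unsigned n, unsigned immr,
                                                        unsigned imms, bool is64) {
  if (n > 1 || immr > 63 || imms > 63) return std::nullopt;  // fields are 1/6/6 bits
  if (!is64 && n != 0) return std::nullopt;  // N=1 selects a 64-bit element
  // The highest set bit of N:NOT(imms) gives log2 of the element size.
  unsigned combined = (n << 6) | (~imms & 0x3fu);
  if (combined < 2) return std::nullopt;  // element size would be below 2 bits
  unsigned len = 6;
  while ((combined & (1u << len)) == 0) --len;
  unsigned esize = 1u << len;   // 2, 4, 8, 16, 32 or 64
  unsigned levels = esize - 1;
  unsigned s = imms & levels;   // number of set bits in the element, minus one
  unsigned r = immr & levels;   // right-rotation amount within the element
  if (s == levels) return std::nullopt;  // all-ones element is reserved
  std::uint64_t elemMask =
      esize == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << esize) - 1;
  std::uint64_t elem = (std::uint64_t{1} << (s + 1)) - 1;
  if (r != 0) elem = ((elem >> r) | (elem << (esize - r))) & elemMask;
  // Replicate the element until it fills 64 bits.
  std::uint64_t value = elem;
  for (unsigned size = esize; size < 64; size *= 2) value |= value << size;
  return is64 ? value : (value & 0xffffffffu);
}

// Brute-force walk over all N:immr:imms combinations for 64-bit operands.
constexpr unsigned countValidLogicalImm64() {
  unsigned count = 0;
  for (unsigned n = 0; n < 2; ++n)
    for (unsigned immr = 0; immr < 64; ++immr)
      for (unsigned imms = 0; imms < 64; ++imms)
        if (decodeLogicalImm(n, immr, imms, true)) ++count;
  return count;
}
```

For example, decodeLogicalImm(0, 0, 0x3c, true) yields 0x5555555555555555 (a 2-bit element with one set bit, replicated). The input space is only 2*64*64 = 8192 combinations per operand size, which is why exhaustively comparing against a reference is cheap.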
Actually, nowadays Arm describes the ISA as a load-store architecture. The RISC vs. CISC debate is, in my opinion, pretty pointless nowadays and I'd prefer if we'd just stop using these words to describe ISAs.
TPDE co-author here. Nice work, this was easier than expected; so we'll have better upstream ORC support soon [1].
The benchmark is suboptimal in multiple ways:
- Multi-threading just makes things slower. When enabling multi-threading, LLJIT clones every module into a new context before compilation, which is much more expensive than compilation. There's also no way to disable this. This causes a ~1.5x (LLVM)/~6.5x (TPDE) slowdown (very rough measurement on my laptop).
- The benchmark compares against the optimizing LLVM back-end, not the unoptimizing back-end, which would be a fairer comparison (code: JTMB.setCodeGenOptLevel(CodeGenOptLevel::None);). Additionally, enabling FastISel helps (command line -fast-isel; setting the TargetOption EnableFastISel seems to have no effect). This gives LLVM a 1.6x speedup.
- The benchmark is not really representative, as it causes FastISel fallbacks to SelectionDAG in some very large basic blocks -- i24 occurs rather rarely in real-world code. This is the reason why the speedup from the unoptimizing LLVM back-end is so low. Replacing i24 with i16 gives LLVM another 2.2x speedup. (Hint: to get information on FastISel fallbacks, enable FastISel and pass the command line options "-fast-isel-report-on-fallback -pass-remarks-missed=sdagisel" to LLVM. This is really valuable when optimizing for compile times.)
So we get ~140ms (TPDE) vs. ~730ms (LLVM -O0), i.e., a 5.2x improvement. This is nowhere near the 10-20x speedup that TPDE typically achieves. Why? The new bottleneck is JITLink, which is featureful but slow -- profiling indicates that it consumes ~55% of the TPDE "compile time" (so the net compile time speedup is ~10x). TPDE therefore ships its own JIT mapper, which has fewer features but is much faster.
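For reference, the arithmetic behind these two numbers, as a rough sketch. It assumes that JITLink costs the same absolute time in both setups, which is a simplification and not something that was measured separately; all names are made up for illustration.

```cpp
// Approximate timings from the measurement above, in milliseconds.
constexpr double tpdeTotalMs = 140.0;
constexpr double llvmTotalMs = 730.0;
// Share of the TPDE total spent in JITLink, according to profiling.
constexpr double jitLinkShare = 0.55;

// End-to-end speedup, including JITLink on both sides.
constexpr double endToEndSpeedup() { return llvmTotalMs / tpdeTotalMs; }

// Speedup of compilation alone. Assumption: JITLink takes the same
// absolute time regardless of which back-end produced the object.
constexpr double netCompileSpeedup() {
  double jitLinkMs = tpdeTotalMs * jitLinkShare;
  return (llvmTotalMs - jitLinkMs) / (tpdeTotalMs - jitLinkMs);
}
```

With these inputs, the end-to-end ratio is about 5.2 and the net ratio is a bit above 10, so the smaller TPDE's own share gets, the more the linking step dominates.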
LLVM is really powerful, but it is not particularly fast, and the JIT API makes it extremely difficult to avoid making it extra slow, even for LLVM experts.
Please note that the post didn't mention the word benchmark a single time ;) It does a "basic performance measurement" of "our csmith example". Anyway, thanks for your notes, they are very welcome and valid.
Comparing TPDE against the default optimization level in ORC is not fair (because that is -O2 indeed), but that's what we get off-the-shelf. I tested the explicit FastISel setting and it didn't help on the LLVM side, as you said. I didn't try the command-line option though, thanks for the tip! (Especially the -pass-remarks-missed will be useful.)
And yeah, csmith doesn't really generate representative code, but again that was not stated either. I didn't dive into JITLink as it would be a whole post on its own, but yes, feature-completeness prevailed over performance here as well -- seems characteristic for LLVM and isn't so surprising :)
Last but not least, yes, multi-threading isn't working as well as the post indicates. This seems related to the fix that JuliaLang did for the TaskDispatcher [1]. I will correct this in the post and see which other points can be addressed in the repo.
Template instantiation caching is likely to help -- in an unoptimized LLVM build, I found that 40-50% of the compiled code at object file level is discarded at link-time as redundant.
Another thing I'd consider interesting is parse caching from token to AST. Most headers don't change, so even when a TU needs to be recompiled, most parts of the AST could be reused. (Some kind of more clever and transparent precompiled headers.) This is likely to need some changes in the AST data structures for fast serialization and loading/inserting. And that makes me think that maybe the textbook approach of generating an AST is a bad idea if we care about fast compilation.
Tangentially, I'm astonished that they claim correctness while a large amount of IR is inadequately (if at all) captured in the hash (comdat, symbol visibility, aliases, constant exprs, block address, calling convention/attributes for indirect calls, phi nodes, fast math flags, GEP type, ....). I'm also a bit annoyed, because this is the type of research that is very sloppily implemented, only evaluates projects where compile time is not a big problem and then only achieves small absolute savings, and papers over inherent difficulties (here: capturing the IR, parse time) that make this unlikely to be used in practice.
There was a commercial fork of clang, zapcc [1], that did caching of headers and template instantiations with an in-memory client-server system [2], but idk whether they solved all the correctness issues before abandoning it.
I knew that name looked familiar, I thought about mentioning tpde here :)
That's interesting to hear that IR is missing a lot. I'm also surprised that it could provide much gain over hashing the preprocessed output - maybe my workflow is different from others, but typically a change to the preprocessed output implies a change to the IR (e.g., it's a functional change and not just a variable name change or something). Otherwise, why would I recompile it?
Parse caching does sound interesting. Also, a lot of stuff that makes its way into the preprocessed output doesn't end up getting used (perhaps related to the 40-50% figure you gave). Lazy parsing could be helpful - just search for structural chars to determine entity start/stop ranges, add the names to a set, and then do the actual parsing lazily.
> but typically a change to the preprocessed output implies a change to the IR (e.g., it's a functional change and not just a variable name change or something). Otherwise, why would I recompile it?
For C++, this could happen more often, e.g. when changing the implementation of an inline function or a non-instantiated template in a header that is not used in the compilation unit.
In AoT compilation, unoptimized code is primarily useful for debugging and short compile-test round trips. Your point on C++ is correct, but test workloads are typically small so the cost is often tolerable and TPDE also supports -O1 IR -- nothing precludes using an -O0 back-end with optimized IR, so if performance is relevant for debugging/testing, there's still a measurable compile-time improvement. (Obviously, with -O1 IR, the TPDE-generated code is ~2-3x slower than the code from the LLVM -O1 back-end; but it's still better than using unoptimized IR. It might also be possible to cut down the -O1 pass pipeline to passes that are actually important for performance.)
In JIT compilation, a fast baseline is always useful. LLVM is obviously not a great fit (the IR is slow to generate and inspect), but for projects that don't want to roll their own IR and use LLVM for optimized builds anyway, this is an easy way to drastically reduce the startup latency. (There is a JIT case study showing the overhead of LLVM-IR in Section 7/Fig. 10 in the paper.)
> And if a project is not large one then build times should not be that much of a problem.
I disagree -- I'm always annoyed when my builds take longer than a few seconds, and typically my code changes involve fewer compilation units than I have CPU cores (even when working on LLVM). There's also this study [1] from Google, which claims that even modest improvements in build times improve productivity.
I mean, my colleagues work hard to keep our build times around 3 minutes for a full build of multiple million lines of C++ code that can be rebuilt, plus the same amount or a few times more code that is prebuilt but provides tons of headers. If I were constantly annoyed by build times longer than a few seconds, I probably would have changed my career path a couple of decades ago xD.
I'm all for faster -O1 build times, though. Point taken.