Working On Optimizations

This is the code I am looking at optimizing.

The assembly is there to control where instructions land in the binary. The function is aligned to a 32-byte boundary, and the body is then padded by 16 bytes, so the hot code starts 16 bytes past a 32-byte boundary. That is much coarser than word alignment. If the goal were only word alignment, it would not matter on aarch64, because aarch64 instructions are a fixed 4 bytes and are always 4-byte aligned, unlike variable-length x86_64 instructions. I suspect the padding positions the code relative to a larger boundary, such as a cache line or page boundary, which means it might matter for aarch64.
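To make the "32-byte boundary plus 16 bytes" arithmetic concrete, here is a minimal sketch. The helper names OffsetWithin and PaddingToOffset are hypothetical and not part of the code being optimized; they only show how an address relates to a power-of-two boundary.

```cpp
#include <cstdint>

// Offset of `addr` inside its enclosing `alignment`-byte block.
// Assumes `alignment` is a power of two.
constexpr std::uintptr_t OffsetWithin(std::uintptr_t addr,
                                      std::uintptr_t alignment) {
  return addr & (alignment - 1);
}

// Bytes of padding needed so that code currently at `addr` moves to
// `target` bytes past an `alignment`-byte boundary.
// Assumes `alignment` is a power of two and `target` < `alignment`.
constexpr std::uintptr_t PaddingToOffset(std::uintptr_t addr,
                                         std::uintptr_t alignment,
                                         std::uintptr_t target) {
  // Unsigned wraparound is well defined, so the mask gives the
  // difference modulo `alignment`.
  return (target - OffsetWithin(addr, alignment)) & (alignment - 1);
}

// A function already on a 32-byte boundary needs 16 bytes of padding.
static_assert(PaddingToOffset(0x1000, 32, 16) == 16, "32-aligned + 16");
```

For a function that is already 32-byte aligned, the result is 16 bytes, which matches the two 8-byte NOPs used in the x86_64 version.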

My plan is to add some NOPs in aarch64 assembly and test the results on multiple systems.
A NOP in aarch64 is 4 bytes, like every other instruction.

My changes:


template <class Writer>
#if defined(__GNUC__) && defined(__x86_64__)
__attribute__((aligned(32)))
#endif

#if defined(__aarch64__)
__attribute__((aligned(32)))
#endif

void DecompressAllTags(Writer* writer) {
// In x86, pad the function body to start 16 bytes later. This function has
// a couple of hotspots that are highly sensitive to alignment: we have
// observed regressions by more than 20% in some metrics just by moving the
// exact same code to a different position in the benchmark binary.
//
// Putting this code on a 32-byte-aligned boundary + 16 bytes makes us hit
// the "lucky" case consistently. Unfortunately, this is a very brittle
// workaround, and future differences in code generation may reintroduce
// this regression. If you experience a big, difficult to explain, benchmark
// performance regression here, first try removing this hack.
#if defined(__GNUC__) && defined(__x86_64__)
// Two 8-byte "NOP DWORD ptr [EAX + EAX*1 + 00000000H]" instructions.
asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
#endif

#if defined(__aarch64__)
// Two 4-byte NOP instructions (8 bytes of padding).
asm("NOP; NOP");
#endif
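One way to confirm that the alignment attribute took effect is to look at the function's entry address at run time. This is a minimal sketch under assumptions: AlignedExample is a hypothetical stand-in for the real function, and EntryOffsetIn32 is a helper I made up for illustration. It only checks the entry point, not where the hot loop ends up after the prologue and padding.

```cpp
#include <cstdint>

// Hypothetical stand-in for the real function; only the attribute matters.
#if defined(__GNUC__)
__attribute__((aligned(32)))
#endif
void AlignedExample() {}

// Offset of a function's entry point within a 32-byte block.
// Casting a function pointer to an integer is only conditionally
// supported by the C++ standard; GCC and Clang accept it on x86_64
// and aarch64.
std::uintptr_t EntryOffsetIn32(void (*fn)()) {
  return reinterpret_cast<std::uintptr_t>(fn) & 31u;
}
```

With GCC or Clang, EntryOffsetIn32(&AlignedExample) should be 0. Checking the hot spots themselves still means reading the disassembly.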

Benchmark aarchie:

Benchmark ddouglas:

Results:

Comparing these benchmarks to the originals, there is no noticeable performance change between tests.

Next I tried different numbers of NOP instructions.
I tried 1 to 6 NOPs (4 to 24 bytes of padding, at 4 bytes per NOP) and tested on both the aarchie and ddouglas servers.
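Rebuilding with a different NOP count each time can be driven from the compiler command line. The sketch below is one possible way to do that, not what I actually used: PAD_NOPS, PAD_STR, EMIT_PAD_NOPS and PaddingBytes are hypothetical names, and it relies on the GNU assembler's .rept/.endr directives.

```cpp
#include <cstddef>

// Stringify helpers so a numeric macro can be pasted into the asm text.
#define PAD_STR2(x) #x
#define PAD_STR(x) PAD_STR2(x)

// Hypothetical knob: override with -DPAD_NOPS=N when compiling.
#ifndef PAD_NOPS
#define PAD_NOPS 2
#endif

// Every AArch64 (A64) instruction, including NOP, is 4 bytes.
constexpr std::size_t kA64InstructionBytes = 4;

constexpr std::size_t PaddingBytes(std::size_t nops) {
  return nops * kA64InstructionBytes;
}

// Emits PAD_NOPS NOP instructions at the point of use on aarch64 and
// expands to nothing on other targets.
#if defined(__GNUC__) && defined(__aarch64__)
#define EMIT_PAD_NOPS() \
  asm volatile(".rept " PAD_STR(PAD_NOPS) "\n\tnop\n\t.endr")
#else
#define EMIT_PAD_NOPS() ((void)0)
#endif
```

Placing EMIT_PAD_NOPS(); at the top of the function body and compiling with -DPAD_NOPS=1 through -DPAD_NOPS=6 would cover 4 to 24 bytes of padding without editing the source between runs.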

I found no noticeable change in any benchmark with any of these tests. I suspect that if this is a page boundary alignment issue, the fixed size of aarch64 instructions makes it not a factor. If it is an issue with a page boundary, that boundary likely falls in a different spot in the aarch64 binary, where it has less of a performance impact. It is also possible that on x86_64 an instruction was crossing the page boundary. In that case we would not see a performance improvement on aarch64, because its 4-byte, 4-byte-aligned instructions never cross a page boundary.
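The claim that a fixed-width instruction cannot cross a boundary can be checked with a little arithmetic. This is a minimal sketch: the function names are hypothetical, and the 4096-byte page size used in the usage note is an assumption (it is a common page size, not something I measured on these servers).

```cpp
#include <cstdint>

// True if an instruction of `size` bytes starting at `addr` spans two
// different `boundary`-sized blocks.
constexpr bool CrossesBoundary(std::uint64_t addr, std::uint64_t size,
                               std::uint64_t boundary) {
  return (addr / boundary) != ((addr + size - 1) / boundary);
}

// True if no 4-byte instruction placed at a 4-byte-aligned address in
// [0, limit) crosses `boundary`. Assumes `boundary` is a multiple of 4.
inline bool FixedWidthNeverCrosses(std::uint64_t limit,
                                   std::uint64_t boundary) {
  for (std::uint64_t addr = 0; addr < limit; addr += 4) {
    if (CrossesBoundary(addr, 4, boundary)) return false;
  }
  return true;
}
```

With a 4096-byte boundary, an 8-byte x86_64 instruction starting at offset 4092 crosses it, while a 4-byte aarch64 instruction at the same offset does not.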

Next I plan on looking at the x86_64 assembly to see if I can find any very large instructions that would be particularly sensitive to alignment.