The function I want to look at is this:
void DecompressAllTags(Writer* writer) {
  // In x86, pad the function body to start 16 bytes later. This function has
  // a couple of hotspots that are highly sensitive to alignment: we have
  // observed regressions by more than 20% in some metrics just by moving the
  // exact same code to a different position in the benchmark binary.
  //
  // Putting this code on a 32-byte-aligned boundary + 16 bytes makes us hit
  // the "lucky" case consistently. Unfortunately, this is a very brittle
  // workaround, and future differences in code generation may reintroduce
  // this regression. If you experience a big, difficult to explain, benchmark
  // performance regression here, first try removing this hack.
#if defined(__GNUC__) && defined(__x86_64__)
  // Two 8-byte "NOP DWORD ptr [EAX + EAX*1 + 00000000H]" instructions.
  asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
  asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
#endif
To test this, I want to compare performance on x86_64 with and without this code.
Here are the initial test results:
And now the perf results after moving this code:

Now the results are a little disappointing. There is little difference between the two runs, so this regression is likely not occurring right now. Another complication is that the compiler is removing the call to this function entirely, which makes it harder to see its impact.