This post is a follow-up to this post.
Why this package?
This package interested me because it is designed to be a fast compressor. It is used in massive data storage scenarios, and it is also used by common open source databases like MongoDB. This compressor is used on a wide variety of data, some of which is already compressed. Because it is used on petabytes of data, even a small optimization can make a massive difference.
This code also does a lot of low-level memory management, which is something I want more experience working with.
This package is also well commented and has a small, well-organized code base, which makes it easier to get started with.
Plan for testing and test data?
This package ships with its own test data and test suite. This is a great tool for my testing and will help me ensure any changes I make are useful.
My test setup will be:
A Fedora x86_64 VM, 2 cores, 4 GB RAM
A Fedora aarch64 Raspberry Pi
For final testing, to ensure nothing is broken, I will also test on multiple aarch64 servers and a Windows x86_64 PC.
Plan for optimization?
This package already has some optimizations that reference aarch64. They do not appear to be specific to aarch64; rather, aarch64 was simply added to the list of architectures that use existing optimizations. Looking through the code, I also see x86_64 optimizations that do not appear to have an aarch64 equivalent. I will have to do further testing to find out whether there is actually an optimization to be made.
In particular, this area of code interests me. It is a weird case, but it could also be an interesting avenue for optimization:
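To make the idea of an "x86_64 path without an aarch64 equivalent" concrete, here is a minimal sketch of how such code is usually guarded. It is not taken from the package: `SelectedArchPath` and `PrintArchPath` are hypothetical names, and the only real identifiers are the `__x86_64__` and `__aarch64__` macros that GCC and Clang predefine.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not from the package): reports which
// architecture-specific branch the preprocessor selected at compile time.
const char* SelectedArchPath() {
#if defined(__x86_64__)
  return "x86_64";   // an x86_64-only optimization would live here
#elif defined(__aarch64__)
  return "aarch64";  // a missing branch like this one is what I am looking for
#else
  return "generic";  // portable fallback
#endif
}

// Hypothetical helper: prints the selected path, e.g. from a test binary.
void PrintArchPath() {
  const char* name = SelectedArchPath();
  assert(name != nullptr);
  std::printf("compiled for the %s path\n", name);
}
```

Searching the source for `__x86_64__` guards that have no matching `__aarch64__` branch should give me a first list of candidates to benchmark on both machines.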
// In x86, pad the function body to start 16 bytes later. This function has
// a couple of hotspots that are highly sensitive to alignment: we have
// observed regressions by more than 20% in some metrics just by moving the
// exact same code to a different position in the benchmark binary.
//
// Putting this code on a 32-byte-aligned boundary + 16 bytes makes us hit
// the "lucky" case consistently. Unfortunately, this is a very brittle
// workaround, and future differences in code generation may reintroduce
// this regression. If you experience a big, difficult to explain, benchmark
// performance regression here, first try removing this hack.
#if defined(__GNUC__) && defined(__x86_64__)
// Two 8-byte "NOP DWORD ptr [EAX + EAX*1 + 00000000H]" instructions.
asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
asm(".byte 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00");
#endif
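As a rough way to see what "a 32-byte-aligned boundary + 16 bytes" means in practice, the sketch below computes where an address falls inside a 32-byte window. This is my own illustration under a few assumptions: `OffsetInWindow`, `HotFunction` and `FunctionOffset32` are hypothetical names, and turning a function pointer into an integer has an implementation-defined result (it behaves as expected with GCC and Clang on x86_64 and aarch64).

```cpp
#include <cassert>
#include <cstdint>

// Offset of an address inside an aligned window.
// The window size must be a power of two (32 in the comment above).
std::uintptr_t OffsetInWindow(std::uintptr_t address, std::uintptr_t window) {
  assert(window != 0 && (window & (window - 1)) == 0);
  return address & (window - 1);
}

// Hypothetical stand-in for a hot function; not the package's code.
int HotFunction(int x) { return x * 2 + 1; }

// Where does HotFunction's entry point sit inside its 32-byte window?
// The pointer-to-integer mapping is implementation-defined.
std::uintptr_t FunctionOffset32() {
  return OffsetInWindow(reinterpret_cast<std::uintptr_t>(&HotFunction), 32);
}
```

An address such as 48 gives an offset of 16, which is the "lucky" position the comment describes, while 64 gives 0. This only reports the function's entry address. The padding in the quoted code sits inside the function body, so finding where the hotspots actually land would still need a look at the disassembly, for example with `objdump -d`.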