Lab 6 - Inline Assembler

In this lab we will be looking at this code.

What is this code?

This code uses inline aarch64 assembly language to scale the volume of audio samples. It focuses on SIMD (single instruction, multiple data), specifically the SQDMULH (Signed Saturating Doubling Multiply returning High Half) instruction. For each pair of 16 bit lanes, SQDMULH multiplies the two signed integers together, giving a 32 bit product, doubles that product, and returns the high half of the result. Saturation ensures that the value is constrained from overflowing: the one case that would overflow, -32768 * -32768, is clamped to 32767. Because the volume factor is stored as a 16 bit fixed-point fraction, doubling the product and keeping the high half is the same as multiplying the sample by that fraction and keeping only the integer part of the result.

Comparing Results:

Here are our previous results from Lab 5:
Control: 0m4.925s
Vol 1 : 0m5.165s
Vol 2 : 0m5.391s
Vol 3 : 0m5.128s

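And our SIMD result:

SIMD: 0m5.062s
The scaling itself takes ~0.14s (the SIMD time minus the control).

This is at least 0.07s faster than every other method (0.07s faster than Vol 3, 0.10s faster than Vol 1 and 0.33s faster than Vol 2), which is a noticeable improvement for such a short run.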

Increasing the samples to 1000000000

As these runs took a very long time, and several runs are needed to get a good average, I only tested against Vol 3, which was previously our fastest result in all tests.

Vol 3: 1m8.783s
SIMD: 0m51.065s

And the difference is quite noticeable at ~18 seconds.

Alternate approach to declaring registers?

Instead of explicitly declaring the registers you are going to use:
register int16_t* in_cursor asm("r20");

you can declare it as an ordinary variable with no register:
int16_t* in_cursor;

Both versions use the same operand declaration:
: [in]"+r"(in_cursor)

In this case the compiler will assign a register on its own. Explicitly naming registers can make the code easier to read, and it will not stop the compiler from reusing those registers once their values are no longer being used. Some instructions also require a variable to be in a specific register.

In this case, if we change all the explicitly declared registers to plain variables, our output will be zero. It is hard to trace exactly why this happens, as we do not know which registers the compiler is picking for our variables.

Should we use 32767 or 32768?

vol_int = (int16_t) (0.75 * 32767.0);

We use 32767 here because the range of an int16_t is -32768 to 32767, which is 65536 distinct values including 0. If we used 32768 and our multiplication factor was 1, the resulting value of 32768 would overflow the int16_t.

We can prove this by running this bit of code:
#include <stdio.h>
#include <stdint.h>

int main() {
    int16_t i = 32768;
    printf("%d\n", i);

    i = 32767;
    printf("%d\n", i);
    return 0;
}

The output will be:
-32768
32767

If we try to put a number larger than 32767 in an int16_t, it will overflow.

What does it mean to “duplicate” values?

__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

This dup instruction will duplicate a value into all the elements of a vector. In this case the vol_int value will be duplicated into all 8 lanes of the v1 vector. This then lets us use that value on all lanes of a vector instruction.

What happens if we remove the following?

: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor));

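If we remove these lines we get an error on the in and out operands. The assembly code before them refers to the in and out operands, and it cannot find them because these lines are what define them. The first line is the output operand list: it names the operand in, gives it the access mode "+r" (a register that is both read and written), and binds it to the variable in_cursor. The second line is the input operand list: "0"(in_cursor) says that this input shares its location with operand 0 (the in operand), and [out]"r"(out_cursor) defines a second operand named out, held in a register and bound to out_cursor.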

Are the results usable? Are they correct?

In order to figure out if the results were usable I printed out some inputs and the corresponding outputs.
This lets me check that the output is scaled correctly.

In: 26304 Out: 19727
26304 * 0.75 = 19728

In: 21311 Out: 15982
21311 * 0.75 = 15,983.25

In: 30745 Out: 23057
30745 * 0.75 = 23,058.75

All of these results are within 2 of the exact value, which is a reasonable margin of error for this use case.

Conclusion

After these tests, we can see that using inline assembler can be quite complicated. However, the results show a noticeable speed increase over the previous methods. This method is also accurate enough for the use case, making it the best of the methods tested on the aarch64 architecture.