Rust Disassembly: part 1
Table of contents:What do some Rust features compile to?
Intro
I have been starting to have a look at Rust lately, mostly because WASM is growing on me and Rust has the best tool in class for it, or so I am told. I am eager to find out by myself.
Rust comes with several new idioms and structures in the language I am not used to, and being a performance enthusiast, I always get interested in what such constructs translate to.
I put together some tests that I will be discussing below, here the link to a compiler explorer compiler explorer page.
i128
One of the first things I came across was the i128 datatype, allowing to stuff up to 128 bits in an integer, I was curious to see how that was actually handled:
// testing i128
pub fn add128(a : i128, b: i128) -> i128 {
a + b
}
pub fn mul128(a : i128, b: i128) -> i128 {
a * b
}
Which translates to:
example::add128:
mov rax, rdi
add rax, rdx
adc rsi, rcx
mov rdx, rsi
ret
example::mul128:
mov r8, rdx
mov rax, rdx
mul rdi
imul rsi, r8
add rdx, rsi
imul rcx, rdi
add rdx, rcx
ret
As most of you probably expected the i128 bits are implemented in software, it becomes especially clear in the addition where
we see a first addition followed by an adc
instruction, which takes care of adding the carry flag as well in case it was set.
Pretty cool!
Destructuring
The second feature I came across is called destructuring; it is one way to access data of a tuple, mind you a tuple can have heterogeneous data types in it, that is an important detail as we will see in a second. Here an example of destructuring.
// destructuring
pub fn destructuring(a : (f32, f32,f32)) -> f32 {
let (_,_,z) = a;
z
}
pub fn destructuring2(a : (f32, f32,f32)) -> f32 {
a.2
}
Here the resulting asm:
example::destructuring:
movss xmm0, dword ptr [rdi + 8]
ret
The compiler has no problem understanding what the user wants to extract and can optimize all the destructuring that
is not needed. The compiler is merely shifting the pointer to where we wish to read: [rdi + 8]
, if we decided to access the second element we would see : [rdi + 4]
.
The destructuring2 two function generates exactly the same code. Interingly enough we can’t access an elment by an index variable, something like this is not legal:
pub fn destructuring2(a : (f32, f32,f32), b: usize) -> f32 {
a.b
}
It would be ambiguous for the language grammar to begin with, and the compiler would not be able to see what value you wish to unpack ahead of time,giving it troubles with touble heterogeneous types.
Arrays
Next on the list is arrays.
//arrays
pub fn array1() -> i32 {
let a = [1,2,3,4,5];
a[2]
}
pub fn array2( a: &[i32;5]) -> i32 {
a[4]
}
pub fn array3( a: &[i32;5], b: usize) -> i32 {
a[b]
}
Here there are some different examples of array usage. In the first and second, the compiler is able to see what the user wants to do and is able to optimize the heck out of it, here the generated assembly for the first two tests:
example::array1:
mov eax, 3
ret
example::array2:
mov eax, dword ptr [rdi + 16]
ret
As we can see, in the first case, the compiler nukes the array entirely and returns the wanted value as expected, the second examples show a simple read by an offset and return. Pretty straight forward, but what if we use an index the compiler does not know at compile time like in the 3rd example?
example::array3:
push rax
cmp rsi, 4
ja .LBB5_2
mov eax, dword ptr [rdi + 4*rsi]
pop rcx
ret
.LBB5_2:
lea rdi, [rip + .L__unnamed_1]
mov edx, 5
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
ud2
Ouch! Here we can see Rust memory safety coming into play. The compiler is not able to make sure at compile time that the memory access will be within the bounds of the array, as such we can first see a comparison with the array size (or better, to the maximum valid index which is size -1):
cmp rsi, 4
Which is not an expensive operation per se, but what it follows might be:
ja .LBB5_2
In the case our index is higher the the biggest valid index we jump straight to panic land:
.LBB5_2:
lea rdi, [rip + .L__unnamed_1]
mov edx, 5
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
ud2
First of all, let me say I love how it is literally called panic, second is the first time I meet the ud2
instruction,
very interesting! The ud2instruction generates an invalid instruction, and we have it after the panic mode. Not sure what exactly is needed for, maybe if for any reason panic mode doesn ot terminate the program such instruction will trigger a trap at hardware level?
About the ud2 instruction
After the publishing of the blog post, Phil Dennis-Jordan sparked an interesting discusission about what the source of ud2 could be. That lead me in the end to ask the compiler team in the Rust discord about it. The user "Hanna" pointed me out to the relevant behaviour and PR implementing it. The TLDR is that, the rust compiler flips an options in LLVM to insert that instruction everywhere there is unreachable code to make 100% sure such code could not fall through by using hardware traps.
At this point, a c/c++ programmer might think: “oh well we rust is too slow, goodbye!” Hold on for a second, let us not be too hasty, shall we? Sure accessing a random index might be sub-obtimal, but what if we are iterating?
Loops
Here our first loop example:
pub fn loop1( a: &[i32;5], b: usize) -> i32 {
let mut idx: usize = 0;
let mut total : i32 =0;
while idx < b
{
total = total + a[idx];
idx+=1;
}
total
}
In the above code, we are merely doing a reduce of the provided array, but we are passing the maximum iteration index to the function, this might be for several reasons like for example iterating a subset of the array.
Here the result
example::loop1:
push rax
test rsi, rsi
je .LBB6_1
mov eax, dword ptr [rdi]
cmp rsi, 1
je .LBB6_2
add eax, dword ptr [rdi + 4]
cmp rsi, 2
je .LBB6_2
add eax, dword ptr [rdi + 8]
cmp rsi, 3
je .LBB6_2
add eax, dword ptr [rdi + 12]
cmp rsi, 4
je .LBB6_2
cmp rsi, 5
jne .LBB6_9
add eax, dword ptr [rdi + 16]
pop rcx
ret
.LBB6_1:
xor eax, eax
.LBB6_2:
pop rcx
ret
.LBB6_9:
lea rdi, [rip + .L__unnamed_2]
mov esi, 5
mov edx, 5
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
ud2
That is a bit more code than I expected, but a lot of exciting things here! First of all, it seems like the compiler decided to unroll the loop, cool!
The second thing we can notice is that we don’t have checks on the size of the container until we get to the last index of the array the compiler knows to be a valid index. After that, we are going to trigger panic mode. So not as bad as I initially thought.
But this is not an idiomatic use of Rust, in reality Rust prefers you to use iterators on collections, but why would that be the case, let us find out:
pub fn loop2( a: &[i32;8]) -> i32 {
let mut total : i32 =0;
for var in a.iter()
{
total += var;
}
total
}
Here we are doing the same thing as above, but we are using an iterator, to iterate all the elements. Coming from c++ I am usually a bit wary of using this kind of loops mostly because I had some terrible experiences with iterators, but that is a different story. What sort of disassembly did we get?
example::loop2:
movdqu xmm0, xmmword ptr [rdi]
movdqu xmm1, xmmword ptr [rdi + 16]
paddd xmm1, xmm0
pshufd xmm0, xmm1, 78
paddd xmm0, xmm1
pshufd xmm1, xmm0, 229
paddd xmm1, xmm0
movd eax, xmm1
ret
The above is completely different from the code we had before! As a first, we can see the loop has been completely unrolled and also vectorized!
The first three instructions load the ints into 128bits register and do a vectorized add. After that, it will be shuffling the values down to do a 2x wide add and finally the last add with the final value (I am 90% sure about this, I did not go all the way to check the provided masks for the shuffles.
To be fair, that was the best use case since 8 is a multiple of the SIMD register width, what if is not?
pub fn loop3( a: &[i32;5]) -> i32 {
let mut total : i32 =0;
for var in a.iter()
{
total += var;
}
total
}
example::loop3:
movdqu xmm0, xmmword ptr [rdi + 4]
pshufd xmm1, xmm0, 78
paddd xmm1, xmm0
pshufd xmm0, xmm1, 229
paddd xmm0, xmm1
movd eax, xmm0
add eax, dword ptr [rdi]
ret
The generated code is quite similar, but the most important thing is there are no range checks, using an iterator the compiler can guarantee there won’t be dangerous memory accesses outside the bounds of the array. This leads me to believe you can get into some “interesting” collections composition/zipping/destructuring that you often find in “idiomatic” python.
Not sure how I feel about that yet but it seems it would help keep the compiler happy and performances up.
I had quite a bit of fun checking this assembly out and learn lots in how Rust behaves; I will make sure to make another post if I find something interesting! If you liked to feel free to share around.
UPDATES
5/29/2020: Added details on ud2 instruction