MSVC's inline assembly is easier to use than gcc's. I think it makes correct code easier to write than incorrect code. A simple add function serves as the illustration:
int add1(int a, int b)
{
    return a + b;
}
The corresponding inline version:
int add2(int a, int b)
{
    __asm {
        mov eax, a;
        add eax, b;
    }
}
The __asm keyword introduces an inline assembly block. According to MSDN, there is also an asm keyword, which is not recommended:
Visual C++ support for the Standard C++ asm keyword is limited to the fact that the compiler will not generate an error on the keyword. However, an asm block will not generate any meaningful code. Use __asm instead of asm.
Symbols from the C/C++ code can be used directly in inline assembly, which is much more convenient than in gcc. There is also no need to load parameters into registers first, as gcc requires. MSVC gets this right even when optimization is enabled.
NOTE: Inline assembly is not supported on the Itanium and x64 processors.
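Because of that restriction, code that has to build for both x86 and x64 usually needs a guard around the __asm block. Here is a minimal sketch of one way to do it. The name add_guarded is made up for this post, and the sketch assumes the predefined macros _MSC_VER and _M_IX86 are used to detect 32-bit MSVC, with a plain C fallback everywhere else:

```c
/* Hypothetical helper: uses MSVC inline assembly only where it is supported. */
#if defined(_MSC_VER) && defined(_M_IX86)
int add_guarded(int a, int b)
{
    int res;
    __asm {
        mov eax, a      ; C symbols are used directly as operands
        add eax, b
        mov res, eax    ; store into a local so the return is explicit
    }
    return res;
}
#else
/* x64, Itanium, or a non-MSVC compiler: no __asm block available. */
int add_guarded(int a, int b)
{
    return a + b;
}
#endif
```

The rest of this post sticks to the 32-bit x86 target, where the __asm block is available.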
Let’s see the generated code:
# cl /c /FA testasm_windows.c
Output:
PUBLIC _add2
_TEXT SEGMENT
_a$ = 8
_b$ = 12
_add2 PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp]
pop ebp
ret 0
_add2 ENDP
_TEXT ENDS
The function parameters are located at [ebp+8] and [ebp+12], as referred to by the symbols a and b respectively. Then, what happens if registers other than the scratch registers are used?
int add3(int a, int b)
{
    __asm {
        mov ebx, a;
        add ebx, b;
        mov eax, ebx;
    }
}
Output assembly code:
PUBLIC _add3
_TEXT SEGMENT
_a$ = 8
_b$ = 12
_add3 PROC
push ebp
mov ebp, esp
push ebx
mov ebx, DWORD PTR _a$[ebp]
add ebx, DWORD PTR _b$[ebp]
mov eax, ebx
pop ebx
pop ebp
ret 0
_add3 ENDP
_TEXT ENDS
As you see, MSVC automatically preserves ebx for us. From MSDN:
When using __asm to write assembly language in C/C++ functions, you don’t need to preserve the EAX, EBX, ECX, EDX, ESI, or EDI registers.
Let's look at the case where the stdcall calling convention is used:
int __stdcall add4(int a, int b)
{
    __asm {
        mov eax, a;
        add eax, b;
    }
}
Output:
PUBLIC _add4@8
_TEXT SEGMENT
_a$ = 8
_b$ = 12
_add4@8 PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp]
pop ebp
ret 8
_add4@8 ENDP
_TEXT ENDS
In stdcall, the stack is cleaned up by the callee, so the function returns with ret 8, which pops the 8 bytes of arguments. The function name is mangled to _add4@8.
MSVC also supports the fastcall calling convention, but MSDN does not recommend combining it with inline assembly because it can cause register conflicts. I tested it here anyway, and the code happens to work :)
int __fastcall add5(int a, int b)
{
    __asm {
        mov eax, a;
        add eax, b;
    }
}
Output:
PUBLIC @add5@8
_TEXT SEGMENT
_b$ = -8
_a$ = -4
@add5@8 PROC
push ebp
mov ebp, esp
sub esp, 8
mov DWORD PTR _b$[ebp], edx
mov DWORD PTR _a$[ebp], ecx
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp]
mov esp, ebp
pop ebp
ret 0
@add5@8 ENDP
_TEXT ENDS
With fastcall, the function parameters are passed in ecx and edx, but here they are saved to the stack first and then read back. So it seems we get no benefit from this calling convention, at least in this unoptimized build. Maybe MSVC does not implement it well for functions with inline assembly. The function name is mangled to @add5@8.
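The three mangled names seen so far (_add2, _add4@8 and @add5@8) follow a simple pattern for 32-bit x86 C functions: cdecl adds a leading underscore, stdcall adds a leading underscore plus @ and the size of the arguments in bytes, and fastcall uses a leading @ instead of the underscore. As a rough sketch, the rule can be written down like this. decorate_name and the call_conv enum are hypothetical helpers for illustration only and are not part of any MSVC API:

```c
#include <stdio.h>

/* Hypothetical names for the three conventions tested in this post. */
enum call_conv { CONV_CDECL, CONV_STDCALL, CONV_FASTCALL };

/*
 * Writes the decorated (mangled) name of a 32-bit x86 C function into buf.
 * arg_bytes is the total size of the arguments on the stack or in registers.
 * Returns the value of snprintf, i.e. the length of the decorated name.
 */
int decorate_name(char *buf, size_t size, enum call_conv conv,
                  const char *name, unsigned arg_bytes)
{
    switch (conv) {
    case CONV_STDCALL:
        return snprintf(buf, size, "_%s@%u", name, arg_bytes);
    case CONV_FASTCALL:
        return snprintf(buf, size, "@%s@%u", name, arg_bytes);
    case CONV_CDECL:
    default:
        /* cdecl ignores arg_bytes: the caller cleans up the stack. */
        return snprintf(buf, size, "_%s", name);
    }
}
```

For example, decorate_name(buf, sizeof buf, CONV_STDCALL, "add4", 8) fills buf with _add4@8, matching the PUBLIC line in the listing above.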
Lastly, we can tell MSVC that we want to write our own prolog/epilog code sequences by using the __declspec(naked) directive:
__declspec(naked) int __cdecl add6(int a, int b)
{
    __asm {
        push ebp;
        mov ebp, esp;
        mov eax, a;
        add eax, b;
        pop ebp;
        ret;
    }
}
Output:
PUBLIC _add6
_TEXT SEGMENT
_a$ = 8
_b$ = 12
_add6 PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp]
pop ebp
ret 0
_add6 ENDP
_TEXT ENDS
A normal prolog/epilog is used here. With the __declspec(naked) directive, MSVC does not generate a second copy of this code.
Inline assembly is used in the Linux kernel to optimize performance or to access hardware, so I decided to look at gcc first. Before digging deeper, you may want to read the GCC Inline Assembly HOWTO to get a general understanding. In C, a simple add function looks like:
int add1(int a, int b)
{
    return a + b;
}
Its inline assembly version may be:
int add2(int a, int b)
{
    __asm__ __volatile__("movl 12(%ebp), %eax\n\t"
                         "movl 8(%ebp), %edx\n\t"
                         "addl %edx, %eax"
                         );
}
Or simpler:
int add3(int a, int b)
{
    __asm__ __volatile__("movl 12(%ebp), %eax\n\t"
                         "addl 8(%ebp), %eax"
                         );
}
Here is the code that gcc generates:
# gcc -S testasm_linux.c -o testasm_linux.s
Output:
add2:
pushl %ebp
movl %esp, %ebp
#APP
# 21 "testasm_linux.c" 1
movl 12(%ebp), %eax
movl 8(%ebp), %edx
addl %edx, %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
add3:
pushl %ebp
movl %esp, %ebp
#APP
# 31 "testasm_linux.c" 1
movl 12(%ebp), %eax
addl 8(%ebp), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
Our inline assembly is surrounded by the #APP and #NO_APP comments. Redundant gcc directives have been removed; what remains is just the function prolog/epilog code.
add2() and add3() work fine with the default gcc flags, but not when the -O2 optimization flag is passed. From the output of gcc -S -O2 (try it yourself), I found that both functions are inlined into their caller, with no function call at all. Two issues prevent the inline assembly from working:
- It depends on %eax being the return value, but this is silently ignored at -O2.
- It depends on 12(%ebp) and 8(%ebp) holding the function parameters, but there is no guarantee that the parameters are there at -O2.
To solve issue 1, an explicit return should be used:
int add4(int a, int b)
{
    int res;
    /* note the double % */
    __asm__ __volatile__("movl 12(%%ebp), %%eax\n\t"
                         "addl 8(%%ebp), %%eax"
                         : "=a"(res)
                         );
    return res;
}
To solve issue 2, parameters are required to be loaded in registers first:
int add5(int a, int b)
{
    int res;
    __asm__ __volatile__("movl %%ecx, %%eax\n\t"
                         "addl %%edx, %%eax"
                         : "=a"(res)
                         : "c"(a), "d"(b)
                         );
    return res;
}
add5() now works at -O2. The default calling convention for gcc is cdecl, under which %eax, %ecx and %edx can be used freely inside a function; it is the caller's duty to preserve them. These are the so-called scratch registers. So what if we use registers other than the scratch registers, like %esi and %edi?
int add6(int a, int b)
{
    int res;
    __asm__ __volatile__("movl %%esi, %%eax\n\t"
                         "addl %%edi, %%eax"
                         : "=a"(res)
                         : "S"(a), "D"(b)
                         );
    return res;
}
Again with gcc -S:
add6:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $20, %esp
movl 8(%ebp), %esi
movl %esi, -32(%ebp)
movl 12(%ebp), %edx
movl -32(%ebp), %esi
movl %edx, %edi
#APP
# 65 "testasm_linux.c" 1
movl %esi, %eax
addl %edi, %eax
# 0 "" 2
#NO_APP
movl %eax, %ebx
movl %ebx, -16(%ebp)
movl -16(%ebp), %eax
addl $20, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
It seems that gcc's code generation at the default optimization level is not so efficient :) But you should notice that %esi and %edi are pushed onto the stack before they are used and popped when the function finishes. gcc generates this code automatically, since %esi ("S") and %edi ("D") are specified in the input list of the inline assembly. Actually, the code can be simpler if %eax is specified as both input and output:
int add7(int a, int b)
{
    int res;
    __asm__ __volatile__("addl %%edx, %%eax"
                         : "=a"(res)
                         : "a"(a), "d"(b)
                         );
    return res;
}
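Another way to say the same thing, sketched here as an assumption rather than something taken from the test file, is a matching constraint: "0" ties the input a to operand 0, so gcc picks one register for both and we do not have to name %eax at all. add_tied is a hypothetical name, the "cc" clobber only notes that addl changes the flags, and the #if guard with its plain C fallback is there just so the snippet still compiles on non-x86 targets:

```c
/* Hypothetical variant of add7: input a and the output share one register. */
int add_tied(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int res;
    __asm__("addl %2, %0"       /* %0 = %0 + %2 */
            : "=r"(res)         /* %0: output, any general register */
            : "0"(a),           /* %1: input, placed in the same register as %0 */
              "r"(b)            /* %2: input, any general register */
            : "cc");            /* addl modifies the condition flags */
    return res;
#else
    return a + b;               /* fallback for other compilers and targets */
#endif
}
```

Since the asm has no side effects beyond its output, this sketch leaves out __volatile__ and lets gcc drop the statement if the result is unused.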
We can also tell gcc to pick any general register ("r") that is available in the current context for the inline assembly:
int add8(int a, int b)
{
    int res;
    __asm__ __volatile__("movl %1, %%eax\n\t"
                         "addl %2, %%eax"
                         : "=a"(res)
                         : "r"(a), "r"(b)
                         );
    return res;
}
And the generated code looks wrong again:
add8:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $20, %esp
movl 8(%ebp), %eax
movl %eax, -24(%ebp)
movl 12(%ebp), %edx
movl -24(%ebp), %eax
#APP
# 88 "testasm_linux.c" 1
movl %eax, %eax
addl %edx, %eax
# 0 "" 2
#NO_APP
movl %eax, %ebx
movl %ebx, -8(%ebp)
movl -8(%ebp), %eax
addl $20, %esp
popl %ebx
popl %ebp
ret
%eax is moved to %eax? gcc picked %eax and %edx as the general registers for the inputs. The code happens to do the right thing, but this is a potential pitfall: had gcc put b in %eax, the first instruction would have overwritten it before the add. A clobber list can be used to avoid this:
int add9(int a, int b)
{
    int res;
    /*
     * The clobber list tells gcc which registers (or memory) are changed by the asm,
     * but not listed as an output.
     */
    __asm__ __volatile__("movl %1, %0\n\t"
                         "addl %2, %0\n\t"
                         "movl %0, %%eax"
                         : "=&r"(res) /* & (early clobber): %0 is written before %2 is read */
                         : "r"(a), "r"(b)
                         : "%eax"
                         );
    return res;
}
As the comment says, the clobber list tells gcc which registers (or memory) are changed by the asm but are not listed as outputs. gcc now no longer considers %eax a candidate when choosing general registers. gcc can also generate code to preserve (push onto the stack) the registers in the clobber list if necessary.
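If the asm does not need a particular register at all, the simplest option is probably not to name one. The following is a minimal sketch under that assumption; add_named is a hypothetical function that is not in the original test file. It uses named operands ([out], [lhs], [rhs]) instead of %0, %1 and %2, and marks the output with & (early clobber) because it is written before the last input is read, so gcc must not place it in the same register as an input. The only clobber left is "cc" for the flags. The #if guard and plain C fallback are there only so the snippet compiles on non-x86 targets:

```c
/* Hypothetical variant of add9 that lets gcc choose every register. */
int add_named(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int res;
    __asm__("movl %[lhs], %[out]\n\t"   /* out = a */
            "addl %[rhs], %[out]"       /* out += b */
            : [out] "=&r"(res)          /* early clobber: written before rhs is read */
            : [lhs] "r"(a), [rhs] "r"(b)
            : "cc");                    /* addl modifies the condition flags */
    return res;
#else
    return a + b;                       /* fallback for other compilers and targets */
#endif
}
```

With no fixed register in the template, there is nothing for gcc to save or avoid, and the register allocator is free to pick whatever fits the surrounding code.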