### Lecture Material from 201Recursion in C and Asm

#### Robert Dewar

```/* Example of recursion in C, with main program driver */

/* Import standard I/O headers */
#include

/* Following function prints an unsigned number left to right, using a
recursive algorithm. The first if statement deals with all but the
units digit, by making a recursive call. The putchar call outputs
the final digit */

void pdec (unsigned a) {
if (a >= 10) pdec (a / 10);
putchar ((a % 10) | '0');
}

int main () {
pdec (123);
}

Let's first see how we would translate this into the familiar Intel format
syntax for the 286 that we have been learning so far. Let's just deal with
the call to PDEC. That's easy:

PUSH   123
CALL   PDEC

Here we simply push the parameter, call the procedure, then strip the
parameter from the stack (in C, the caller always strips the stack).

The code for pdec itself looks like

PDEC:   PUSH   BP                     ; Save old BP (frame pointer)
MOV    BP, SP                 ; Set new BP (frame pointer)
CMP    WORD PTR [BP+4], 10    ; Compare A against 10
JNAE   L2

SUB    DX, DX                 ; Put argument in DX:AX
MOV    AX, WORD PTR [BP + 4]
MOV    CX, 10                 ; Divide by 10
DIV    CX

PUSH   AX                     ; push quotient from AX
CALL   PDEC                   ; recursive call
ADD    SP, 4                  ; strip parameter

L2:     SUB    DX, DX                 ; put argument in DX:AX
MOV    AX, WORD PTR [BP + 4]
MOV    CX, 10                 ; divide by 10
DIV    CX                     ; remainder goes into DX

OR     DL, '0'                ; or in the zero
PUSH   DX                     ; push parameter for putchar call
CALL   PUTCHAR                ; call to putchar
ADD    SP, 4                  ; strip parameter

MOV    SP, BP                 ; strip stack frame
POP    BP                     ; restore old bp

Now, let's take a look at the code generated for this by a "real" compiler.
We will use for this gcc. That's a C compiler available for all platforms.
If you have Linux, you have this already installed most likely. You can
DOWNLOAD GCC WINDOWS, and you wil see lots of resources for this. One
the www.cs.nyu.edu area. This is a copy of gcc that supports both Ada
and C, it will automatically compile c if the extension of the file is .c.

To compile and link recurs.c with gcc we do:

gcc -c recurs.c

By default that generates an executable called a, which we can easily run:

C:\201>a
123

As you see the result is 123, which is what we expect. Now to see what the
compiler is doing, let's have it generate an assembler file by doing:

gcc -S recurs.c

and examining the assembler file that comes out recurs.s. I am only including
the actual code for the _pdec routine here, not the header junk, and not the
code for the main procedure:

_pdec:
pushl	%ebp
movl	%esp, %ebp
subl	\$8, %esp
cmpl	\$9, 8(%ebp)
jbe	L2
subl	\$12, %esp
movl	8(%ebp), %edx
movl	\$-858993459, %eax
mull	%edx
movl	%edx, %eax
shrl	\$3, %eax
pushl	%eax
call	_pdec
L2:
subl	\$12, %esp
movl	8(%ebp), %ecx
movl	\$-858993459, %eax
mull	%ecx
shrl	\$3, %edx
movl	%edx, %eax
sall	\$2, %eax
subl	%eax, %ecx
movl	%ecx, %eax
orl	\$48, %eax
pushl	%eax
call	_putchar
leave
ret

Now, a few things to say about the above code. It is different in two
important respects from the ASM code we have looked at so far.

First, it is for the 386, not the 286, which means that the registers are
all 32 bits, and have names like %eax for the 32 bit version of ax. The
ax register is still accesible (as the low order 16 bits of eax) and the
al and ah registers are also still accessible (as part of ax).

The second difference is that the compiler outputs things in so called
AT&T syntax instead of Intel syntax. Here is a quick tutorial on the
differences (taken from www.scene.ee/~lego/prog/djgpp_asm.html)

======================================================================

AT&T x86 Asm Syntax

GCC uses AT&T asm syntax. This is a little bit different from the regular
Intel format. The main differences are:

* AT&T syntax uses the opposite order for source and destination operands,
source followed by destination.
* Register operands are preceded by the "%" character, including sections.
* Immediate operands are preceded by the "\$" character.
* The size of memory operands are specified using the last character of
the opcode. These are "b" (8-bit), "w" (16-bit), and "l" (32-bit).

Here are some examples: (Intel equivalent in parentheses)

movw %bx, %ax   (mov ax, bx)
xorl %eax, %eax (xor eax, eax)
movw \$1, %ax    (mov ax,1)
movb X, %ah     (mov ah, byte ptr X)
movw X, %ax     (mov ax, word ptr X)
movl X, %eax    (mov eax, X)

Most opcodes are identical between AT&T and Intel format, except for the
addition of the character b/w/l to indicate the operand size. The main
exceptions are:

movsSD (movsx)
movzSD (movzx)

Where S and D are the source and destination operand size suffixes,
respectively. For example, "movswl %ax, %ecx (movsx ecx, ax)".

cbtw        (cbw)
cwtl        (cwde)
cwtd        (cwd)
cltd        (cdq)
lcall \$S,\$O (call far S:O)
ljmp \$S,\$O  (jump far S:O)
lret \$V     (ret far V)

Memory references are a little bit different too. The usual Intel memory
reference of the form

SECTION:[BASE + INDEX*SCALE + DISP]

is written as

SECTION:DISP(BASE, INDEX, SCALE).

Here are some examples: (Intel equivalent in parentheses)

movl 4(%ebp), %eax            (mov eax, [ebp+4])
movb \$4, %fs:(%eax)           (mov fs:eax, 4)
movl _array(,%eax,4), %eax    (mov eax, [4*eax + array])
movw _array(%ebx,%eax,4), %cx (mov cx, [ebx + 4*eax + array])

Jump instructions always use the smallest displacements. The following
instructions always work in byte displacements only, however: "jcxz",
"jecxz", "loop", "loopz", "loope", "loopnz" and "loopne". As suggested
in the online docs, a "jcxz foo" could be expanded to work:

jcxz cx_zero
jmp cx_nonzero
cx_zero:
jmp foo
cx_nonzero:

The online docs also caution on "mul" and "imul" instructions. The
expanding multiply instructions are done using ONE operand. For example,
"imul \$ebx, \$ebx" will NOT put the result in "edx:eax". Use the single
operand form "imul %ebx" to get the expanded result.
===========================================================

For now we are not trying to learn to write in this style, just to read it
so let's take bits and pieces of the asm file and explain it:

_pdec:
pushl	%ebp
movl	%esp, %ebp

That's setting up the frame, remember the movl works "backwards", so this is
copying esp into ebp (most assembly languages have the source first, so this
is consistent with that convention).

subl	\$8, %esp

That subl seems unnecessary, but actually it is significant. For maximum
efficiency, you want to keep the stack pointer aligned on a 16-byte
boundary. The entry sequence pushed a return point (4 bytes), and one
saved register (EBP) (4 bytes), so a subtract of 8 keeps the stack aligned
properly.

cmpl	\$9, 8(%ebp)
jbe	L2

OK, that's not too hard to understand, it's the comparison against 10, and
jumps if the operand is <= 9. Not quite clear why it changed this from 10,
but it's obviously equivalent.

subl	\$12, %esp

This is a an interesting instruction. To keep this 16-byte alignment
on the stack pointer, each procedure assumes it is called with this
alignment, and then must guarantee that it is properly maintained on
a call. The call below will push one 4 byte parameter on the stack.
The extra 12 bytes from the above subl keep us properly aligned.

movl	8(%ebp), %edx
movl	\$-858993459, %eax
mull	%edx
movl	%edx, %eax
shrl	\$3, %eax

a very slow instruction, even dividing by ten. The above sequence gets the
same result as a division by 10, but using a multiplication (the constant
is indeed weird, It looks like it is doing something like multiplying by
the binary fraction 0.8 and then dividing by 8 to get a shift. Not
something you would easily manage to do by hand, but compilers are good
at this sort of thing!

pushl	%eax
call	_pdec

That's the recursive call, complete with stripping off not only the 4 bytes
pushed for the parameter, but also the extra 12 bytes used to keep the
stack pointer properly aligned

L2:
subl	\$12, %esp

That's the stack alignment instruction for the call below (same deal as the
case above)

movl	8(%ebp), %ecx
movl	\$-858993459, %eax
mull	%ecx
shrl	\$3, %edx
movl	%edx, %eax
sall	\$2, %eax
subl	%eax, %ecx
movl	%ecx, %eax
orl	\$48, %eax

Ugh! All that is to get a remainder on division by 10, again using the
same kind of clever trick of replacing a slow division by a fast multiply.
Hard to believe we are ahead with this many instructions generated, but
division is *REAL* slow.

pushl	%eax
call	_putchar

That's the call to putchar. Same deal as before on stripping not only the
pushed parameter, but also the extra 12 bytes used for alignment purposes.

leave
ret

And that's the epilog. LEAVE is a one byte instruction that is exactly
equivalent to

movl   %ebp, %esp
pop    %epb

which is the sequence we are used to for stripping the stack (remember again
the backwards direction, ebp is being copied into esp, not the other way
around.

One more interesting thing we can do is to turn on optimization. This tells
the gcc compiler "take your time and try to generate cleverer code". With
this clever code option, we get:

_pdec:
pushl	%ebp
movl	%esp, %ebp
pushl	%ebx
pushl	%eax
movl	8(%ebp), %ebx
cmpl	\$9, %ebx
ja	L4
L2:
movl	%ebx, %eax
movl	\$-858993459, %edx
mull	%edx
shrl	\$3, %edx
leal	(%edx,%edx,4), %edx
subl	%edx, %ebx
orl	\$48, %ebx
movl	%ebx, 8(%ebp)
movl	-4(%ebp), %ebx
movl	%ebp, %esp
popl	%ebp
jmp	_putchar
.p2align 4,,7
L4:
movl	%ebx, %eax
movl	\$-858993459, %edx
mull	%edx
shrl	\$3, %edx
subl	\$12, %esp
pushl	%edx
call	_pdec
jmp	L2

OK, what has happened? First, we have

_pdec:
pushl	%ebp
movl	%esp, %ebp
pushl	%ebx
pushl	%eax
movl	8(%ebp), %ebx

This shows the compiler deciding to move the parameter into EBX and keeping
it there throughout the execution of the procedure, avoiding repeated memory
references. It's not too critical in this case, but in general this can be
a big improvement. The push of EAX is to keep the stack aligned. It's a one
byte instruction, which is quicker than doing a subtract of 4 from esp.

cmpl	\$9, %ebx
ja	L4

The compiler decided to switch the condition so that the else comes first.
Not clear why it felt that this was an improvement. Sometimes it's hard to
figure out what is going on in the compiler's mind :-)

movl	%ebx, %eax
movl	\$-858993459, %edx
mull	%edx
shrl	\$3, %edx
leal	(%edx,%edx,4), %edx
subl	%edx, %ebx
orl	\$48, %ebx

That's the code for getting the remainder on division by 10. It's a little
more efficient, in particular the use of the LEA instruction here is
interesting, this instruction has the effect of

EDX = EDX + EDX * 4

so it multiplies EDX by 5 in one instruction, and allows a slight improvement
in the sequence to get the remainder.

movl	%ebx, 8(%ebp)
movl	-4(%ebp), %ebx
movl	%ebp, %esp
popl	%ebp
jmp	_putchar

Now, *this* is clever. What is going on here is that the compiler knows that
the call to putchar is the last thing to be done. Rather than make another
stack frame and call putchar, then return here to destroy our stack frame
and return, the compiler figures out that it will destroy the current
stack frame now, and just jump to putchar, sort of as if the original
caller of pdec had in fact called putchar in the first place. Let's
dismantle it one instruction at a time:

movl	%ebx, 8(%ebp)

That's taking the value in ebx, which is the parameter to be passed to
putchar, and overwriting the value we were called with (which is no longer
needed). That way, when putchar is called, it will get this value from the
stack, as though it had been pushed on by the caller.

movl	-4(%ebp), %ebx

That's restoring the saved value of EBX, so that it will not be clobbered

movl	%ebp, %esp
popl	%ebp

That's stripping the PDEC stack frame which is no longer needed

jmp	_putchar

And that is the instruction that "calls" putchar. The trick here is that we
don't use a call instruction. We don't need to, because the return point
originally stored by the call to PDEC is still there on the stack. When
PUTCHAR finally returns, it will use this return point to return directly
to the caller.

That by the way is called tail recursion removal, a nice optimization (though
it takes quite a bit of work to understand).

```