Lecture Material from 201
Recursion in C and Asm

Robert Dewar

/* Example of recursion in C, with main program driver */

/* Import standard I/O headers */
#include 

/* Following function prints an unsigned number left to right, using a
   recursive algorithm. The first if statement deals with all but the
   units digit, by making a recursive call. The putchar call outputs
   the final digit */

void pdec (unsigned a) {
  if (a >= 10) pdec (a / 10);
  putchar ((a % 10) | '0');
}

int main () {
  pdec (123);
}


Let's first see how we would translate this into the familiar Intel format
syntax for the 286 that we have been learning so far. Let's just deal with
the call to PDEC. That's easy:

   PUSH   123
   CALL   PDEC
   ADD    SP, 4

Here we simply push the parameter, call the procedure, then strip the
parameter from the stack (in C, the caller always strips the stack).

The code for pdec itself looks like

PDEC:   PUSH   BP                     ; Save old BP (frame pointer)
        MOV    BP, SP                 ; Set new BP (frame pointer)
        CMP    WORD PTR [BP+4], 10    ; Compare A against 10
        JNAE   L2

        SUB    DX, DX                 ; Put argument in DX:AX
        MOV    AX, WORD PTR [BP + 4]
        MOV    CX, 10                 ; Divide by 10
        DIV    CX

        PUSH   AX                     ; push quotient from AX
        CALL   PDEC                   ; recursive call
        ADD    SP, 4                  ; strip parameter

L2:     SUB    DX, DX                 ; put argument in DX:AX
        MOV    AX, WORD PTR [BP + 4]
        MOV    CX, 10                 ; divide by 10
        DIV    CX                     ; remainder goes into DX

        OR     DL, '0'                ; or in the zero
        PUSH   DX                     ; push parameter for putchar call
        CALL   PUTCHAR                ; call to putchar
        ADD    SP, 4                  ; strip parameter

        MOV    SP, BP                 ; strip stack frame
        POP    BP                     ; restore old bp
        RET                           ; return to caller

Now, let's take a look at the code generated for this by a "real" compiler.
We will use for this gcc. That's a C compiler available for all platforms.
If you have Linux, you have this already installed most likely. You can
easily download a copy of gcc for windows from the web. Just google on
DOWNLOAD GCC WINDOWS, and you wil see lots of resources for this. One
relatively painless way to download gcc is to get the Ada compiler from
the www.cs.nyu.edu area. This is a copy of gcc that supports both Ada
and C, it will automatically compile c if the extension of the file is .c.

To compile and link recurs.c with gcc we do:

  gcc -c recurs.c

By default that generates an executable called a, which we can easily run:

   C:\201>a
   123

As you see the result is 123, which is what we expect. Now to see what the
compiler is doing, let's have it generate an assembler file by doing:

   gcc -S recurs.c

and examining the assembler file that comes out recurs.s. I am only including
the actual code for the _pdec routine here, not the header junk, and not the
code for the main procedure:

_pdec:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	cmpl	$9, 8(%ebp)
	jbe	L2
	subl	$12, %esp
	movl	8(%ebp), %edx
	movl	$-858993459, %eax
	mull	%edx
	movl	%edx, %eax
	shrl	$3, %eax
	pushl	%eax
	call	_pdec
	addl	$16, %esp
L2:
	subl	$12, %esp
	movl	8(%ebp), %ecx
	movl	$-858993459, %eax
	mull	%ecx
	shrl	$3, %edx
	movl	%edx, %eax
	sall	$2, %eax
	addl	%edx, %eax
	addl	%eax, %eax
	subl	%eax, %ecx
	movl	%ecx, %eax
	orl	$48, %eax
	pushl	%eax
	call	_putchar
	addl	$16, %esp
	leave
	ret

Now, a few things to say about the above code. It is different in two
important respects from the ASM code we have looked at so far.

First, it is for the 386, not the 286, which means that the registers are
all 32 bits, and have names like %eax for the 32 bit version of ax. The
ax register is still accesible (as the low order 16 bits of eax) and the
al and ah registers are also still accessible (as part of ax).

The second difference is that the compiler outputs things in so called
AT&T syntax instead of Intel syntax. Here is a quick tutorial on the
differences (taken from www.scene.ee/~lego/prog/djgpp_asm.html)

======================================================================

AT&T x86 Asm Syntax

GCC uses AT&T asm syntax. This is a little bit different from the regular
Intel format. The main differences are:

    * AT&T syntax uses the opposite order for source and destination operands,
          source followed by destination.
    * Register operands are preceded by the "%" character, including sections.
    * Immediate operands are preceded by the "$" character.
    * The size of memory operands are specified using the last character of
       the opcode. These are "b" (8-bit), "w" (16-bit), and "l" (32-bit).

Here are some examples: (Intel equivalent in parentheses)

	movw %bx, %ax   (mov ax, bx)
	xorl %eax, %eax (xor eax, eax)
	movw $1, %ax    (mov ax,1)
	movb X, %ah     (mov ah, byte ptr X)
	movw X, %ax     (mov ax, word ptr X)
	movl X, %eax    (mov eax, X)

Most opcodes are identical between AT&T and Intel format, except for the
addition of the character b/w/l to indicate the operand size. The main
exceptions are:

	movsSD (movsx)
	movzSD (movzx)

Where S and D are the source and destination operand size suffixes,
respectively. For example, "movswl %ax, %ecx (movsx ecx, ax)".

	cbtw        (cbw)
	cwtl        (cwde)
	cwtd        (cwd)
	cltd        (cdq)
	lcall $S,$O (call far S:O)
	ljmp $S,$O  (jump far S:O)
	lret $V     (ret far V)

Memory references are a little bit different too. The usual Intel memory
reference of the form

SECTION:[BASE + INDEX*SCALE + DISP]

is written as

SECTION:DISP(BASE, INDEX, SCALE).

Here are some examples: (Intel equivalent in parentheses)

	movl 4(%ebp), %eax            (mov eax, [ebp+4])
	addl (%eax,%eax,4), %ecx      (add ecx, [eax + eax*4])
	movb $4, %fs:(%eax)           (mov fs:eax, 4)
	movl _array(,%eax,4), %eax    (mov eax, [4*eax + array])
	movw _array(%ebx,%eax,4), %cx (mov cx, [ebx + 4*eax + array])

Jump instructions always use the smallest displacements. The following
instructions always work in byte displacements only, however: "jcxz",
"jecxz", "loop", "loopz", "loope", "loopnz" and "loopne". As suggested
in the online docs, a "jcxz foo" could be expanded to work:

	  jcxz cx_zero
	  jmp cx_nonzero
	cx_zero:
	  jmp foo
	cx_nonzero:

The online docs also caution on "mul" and "imul" instructions. The
expanding multiply instructions are done using ONE operand. For example,
"imul $ebx, $ebx" will NOT put the result in "edx:eax". Use the single
operand form "imul %ebx" to get the expanded result.
===========================================================

For now we are not trying to learn to write in this style, just to read it
so let's take bits and pieces of the asm file and explain it:

_pdec:
	pushl	%ebp
	movl	%esp, %ebp

That's setting up the frame, remember the movl works "backwards", so this is
copying esp into ebp (most assembly languages have the source first, so this
is consistent with that convention).

	subl	$8, %esp

That subl seems unnecessary, but actually it is significant. For maximum
efficiency, you want to keep the stack pointer aligned on a 16-byte
boundary. The entry sequence pushed a return point (4 bytes), and one
saved register (EBP) (4 bytes), so a subtract of 8 keeps the stack aligned
properly.

	cmpl	$9, 8(%ebp)
	jbe	L2

OK, that's not too hard to understand, it's the comparison against 10, and
jumps if the operand is <= 9. Not quite clear why it changed this from 10,
but it's obviously equivalent.

	subl	$12, %esp

This is a an interesting instruction. To keep this 16-byte alignment
on the stack pointer, each procedure assumes it is called with this
alignment, and then must guarantee that it is properly maintained on
a call. The call below will push one 4 byte parameter on the stack.
The extra 12 bytes from the above subl keep us properly aligned.

	movl	8(%ebp), %edx
	movl	$-858993459, %eax
	mull	%edx
	movl	%edx, %eax
	shrl	$3, %eax

Gadzooks! What's that about? Well the answer is interesting. Division is
a very slow instruction, even dividing by ten. The above sequence gets the
same result as a division by 10, but using a multiplication (the constant
is indeed weird, It looks like it is doing something like multiplying by
the binary fraction 0.8 and then dividing by 8 to get a shift. Not
something you would easily manage to do by hand, but compilers are good
at this sort of thing!

	pushl	%eax
	call	_pdec
	addl	$16, %esp

That's the recursive call, complete with stripping off not only the 4 bytes
pushed for the parameter, but also the extra 12 bytes used to keep the
stack pointer properly aligned

L2:
	subl	$12, %esp

That's the stack alignment instruction for the call below (same deal as the
case above)

	movl	8(%ebp), %ecx
	movl	$-858993459, %eax
	mull	%ecx
	shrl	$3, %edx
	movl	%edx, %eax
	sall	$2, %eax
	addl	%edx, %eax
	addl	%eax, %eax
	subl	%eax, %ecx
	movl	%ecx, %eax
	orl	$48, %eax

Ugh! All that is to get a remainder on division by 10, again using the
same kind of clever trick of replacing a slow division by a fast multiply.
Hard to believe we are ahead with this many instructions generated, but
division is *REAL* slow.

	pushl	%eax
	call	_putchar
	addl	$16, %esp

That's the call to putchar. Same deal as before on stripping not only the
pushed parameter, but also the extra 12 bytes used for alignment purposes.

	leave
	ret

And that's the epilog. LEAVE is a one byte instruction that is exactly
equivalent to

        movl   %ebp, %esp
        pop    %epb

which is the sequence we are used to for stripping the stack (remember again
the backwards direction, ebp is being copied into esp, not the other way
around.

One more interesting thing we can do is to turn on optimization. This tells
the gcc compiler "take your time and try to generate cleverer code". With
this clever code option, we get:

_pdec:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	pushl	%eax
	movl	8(%ebp), %ebx
	cmpl	$9, %ebx
	ja	L4
L2:
	movl	%ebx, %eax
	movl	$-858993459, %edx
	mull	%edx
	shrl	$3, %edx
	leal	(%edx,%edx,4), %edx
	addl	%edx, %edx
	subl	%edx, %ebx
	orl	$48, %ebx
	movl	%ebx, 8(%ebp)
	movl	-4(%ebp), %ebx
	movl	%ebp, %esp
	popl	%ebp
	jmp	_putchar
	.p2align 4,,7
L4:
	movl	%ebx, %eax
	movl	$-858993459, %edx
	mull	%edx
	shrl	$3, %edx
	subl	$12, %esp
	pushl	%edx
	call	_pdec
	addl	$16, %esp
	jmp	L2


OK, what has happened? First, we have

_pdec:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	pushl	%eax
	movl	8(%ebp), %ebx

This shows the compiler deciding to move the parameter into EBX and keeping
it there throughout the execution of the procedure, avoiding repeated memory
references. It's not too critical in this case, but in general this can be
a big improvement. The push of EAX is to keep the stack aligned. It's a one
byte instruction, which is quicker than doing a subtract of 4 from esp.

	cmpl	$9, %ebx
	ja	L4

The compiler decided to switch the condition so that the else comes first.
Not clear why it felt that this was an improvement. Sometimes it's hard to
figure out what is going on in the compiler's mind :-)

	movl	%ebx, %eax
	movl	$-858993459, %edx
	mull	%edx
	shrl	$3, %edx
	leal	(%edx,%edx,4), %edx
	addl	%edx, %edx
	subl	%edx, %ebx
	orl	$48, %ebx

That's the code for getting the remainder on division by 10. It's a little
more efficient, in particular the use of the LEA instruction here is
interesting, this instruction has the effect of

       EDX = EDX + EDX * 4

so it multiplies EDX by 5 in one instruction, and allows a slight improvement
in the sequence to get the remainder.

	movl	%ebx, 8(%ebp)
	movl	-4(%ebp), %ebx
	movl	%ebp, %esp
	popl	%ebp
	jmp	_putchar

Now, *this* is clever. What is going on here is that the compiler knows that
the call to putchar is the last thing to be done. Rather than make another
stack frame and call putchar, then return here to destroy our stack frame
and return, the compiler figures out that it will destroy the current
stack frame now, and just jump to putchar, sort of as if the original
caller of pdec had in fact called putchar in the first place. Let's
dismantle it one instruction at a time:

	movl	%ebx, 8(%ebp)

That's taking the value in ebx, which is the parameter to be passed to
putchar, and overwriting the value we were called with (which is no longer
needed). That way, when putchar is called, it will get this value from the
stack, as though it had been pushed on by the caller.

	movl	-4(%ebp), %ebx

That's restoring the saved value of EBX, so that it will not be clobbered
on eventual return to the PDEC caller.

	movl	%ebp, %esp
	popl	%ebp

That's stripping the PDEC stack frame which is no longer needed

	jmp	_putchar

And that is the instruction that "calls" putchar. The trick here is that we
don't use a call instruction. We don't need to, because the return point
originally stored by the call to PDEC is still there on the stack. When
PUTCHAR finally returns, it will use this return point to return directly
to the caller.

That by the way is called tail recursion removal, a nice optimization (though
it takes quite a bit of work to understand).