Lecture Material from 201

Robert Dewar

#include <stdio.h>

/* A C file showing indexing through arrays */

/* We are going to write a function that is given an array of int
   values (actually a pointer to the first of these values), and
   a count of the number of elements in the array, and returns the
sum of the elements of the array */

/* In this first example, we use array indexing, and we spell things out
   in a rather full non-C style way */

int count1 (int v[], int c) {
  int j;
  int sum = 0;
  for (j = 0; j < c; j = j + 1) {
    sum = sum + v[j];
  }
  return sum;
}

/* Things to observe about the above program. The for loop is really not a
   for loop at all, but rather a while. The first expression in the for (in
   this case "j = 0" is initialization done once at the start of the loop.
   The second expression is a test done at the start of the loop, just as
   in a while loop. The third expression is the step statement (it almost
   certainly must have a side effect, and will typically increment the
   loop variable). It goes at the end of the loop. Thus the FOR here is
   simply a shorthand for:

         j = 0;
         while (j < c) {
            sum = sum + v[j];
            j = j + 1;
         }

    The declarer int v[] says that v is a vector of int's. In fact this is
    virtually the same as int *v in C, i.e. simply a pointer to the first
    element of the array, and the expression v[j] means exactly *(v + j).
    Note here that we are not simply adding the value of j to the pointer.
    Each element of the array is (typically) four bytes long, the space
    needed to hold an int, and so the pointer value we actually compute
    is v + 4*j. The C compiler knows that the elements are four bytes
    long and automatically adds the multiplication by 4) */

/* In this second example, we simply rewrite the above function in a manner
   that is a little more practical C style. We still use array indexing. */

int count2 (int *v, int c) {
  int j;
  int sum = 0;
  for (j = 0; j < c; j++)
    sum += v[j];
  return sum;
}

/* We really didn't chnage much. The *v is in fact much more typical of the
   way that a C programmer regards arrays, they really are seen as pointers.
   The use of ++ and += are very typical uses of these compact C notations */

/* Our third example differs from the second only at one small point */

int count3 (int *v, int c) {
  int j;
  int sum = 0;
  for (j = 0; j < c; j++)
    sum += j[v];
  return sum;
}

/* You have to have sharp eyes to see the difference. We have rewritten v[j]
   as j[v]. What's going on? Indexing an int value using an array as the
   subscript??? Well this just helps to emphasize that what may look like
   a conventional array reference is really just a pointer operation.
   The expression j[v] means *(j + v) which has exactly the same value
   as *(v + j) since addition does not care which way round its operands
   are. Of course this is a bit of a perverse way of writing things, but
   it works just fine */

/* Our fourth example avoids array indexing completely, and really thinks
   of things as pointers, and is most likely the way a C programmer would
   in practice write this function. */

int count4 (int *v, int c) {
  int j;
  int sum = 0;
  for (j = 0; j < c; j++)
    sum += *v++;
  return sum;
}

/* Two things to mention here. First the notation *v++ is one we have seen
   before. It gets the value at *v (the next array element we want to read),
   and then increments it by 1, but again the compiler automatically knows
   that since this is a pointer to type int, it is appropriate to add the
   length of int, to get to the next int, rather than actually adding 1.
   Secondly, this function clobbers the value of v, but that's OK, because
   the caller passed a copy of this value, not the variable itself, and it
   is this copy that is being clobbered (at the assembly level, it is the
   parameter location on the stack, or a copy of it, that is being changed.

/* A main program to test these out */

int main () {
  int x[10] = {1,2,3,4,5,5,4,3,2,1};
  printf ("sum = %d, %d, %d, %d\n",
	  count1(x, 10), count2(x,10), count3(x,10), count4(x,10));
}


OK, let's see if that works

   >gcc -c indexing.c
   >gcc indexing.o -oindexing.exe
   >indexing
   sum = 30, 30, 30, 30

Here the first command compiled indexing.c into the object file indexing.o.
This object file is almost machine code, except that it has blanks left
for calling printf, since printf is in some other file.

Generally speaking, names are gone from object files. For instance, let's
look for the name sum, one of our local variables in our count functions.
The way we do this is to use the nm command if we have that around (in a
windows system, you can get this by obtaining the cygwin tools, in a unix
evironment it will be there).

   >nm indexing.o
   00000000 b .bss
   00000000 d .data
   00000000 t .text
            U ___main
            U __alloca
   00000000 T _count1
   0000003f T _count2
   0000007e T _count3
   000000bd T _count4
   0000010b T _main
            U _printf

What we see here is the ASCII names that are present in the generated object
module, along with some information about them. The first three are internal
sections used to generate data. We can ignore those for now.

The entries for ___main, __alloca and _printf are the unresolved (U for
unresolved) external references, they represent places where this object
code makes calls to routines in other units. For now, we have no idea where
these units are. The T entries for _count1,2,3,4 and _main represent routines
in this object file which could be called from other files. Note that the
locations within the object file are also given, so the machine code for
_count3 for example starts at offset 0000007e from the start of the code
generated for this file.

There is no trace of sum, since this symbol is used only internally within
the indexing.c file, it cannot be referenced from outside, so all references
to sum can be completely converted to machine language by the first gcc
command, and there is no need to include the name in the object file.

The second gcc command is the "link" step. The linker takes the object
file you give it (indexing.o) in this case (in large programs, you will
give it more than one file). Then it looks at the undefined references
like _printf, and goes and finds the units containing these routines in
the library. It then glues together all the units of the program, and now
that all the routines are present, it can fill in the addresses of the
CALL instructions that could not be completed before, e.g. the call to
printf.

Finally our command indexing (we had specified that the executable name be
indexing.o) will run the program, and lo and behold, we get what we expect!

Now what ASM is being generated, well the interesting thing is that if we
get the assembler file:

   gcc -S indexing.c

then looking at the resulting assembler file, we find that the sequence of
assembler instructions is identical for count1, count2, count3. The compiler
understood that all of these mean exactly the same. Here is the code for
count1 (count2 and count3 are the same):

_count1:
	pushl	%ebp                ; establish frame
	movl	%esp, %ebp
	subl	$8, %esp            ; space for two locals, j and sum
	movl	$0, -8(%ebp)        ; initialize sum to zero
	movl	$0, -4(%ebp)        ; initialize j to zoer
L2:                                 ; start of the loop
	movl	-4(%ebp), %eax      ; get value of j
	cmpl	12(%ebp), %eax      ; exit from loop if j >= c
	jl	L5
	jmp	L3
L5:
	movl	-4(%ebp), %eax      ; value of j
	leal	0(,%eax,4), %edx    ; 4*j into edx
	movl	8(%ebp), %eax       ; value of pointer v
	movl	(%eax,%edx), %edx   ; get v[j] (double indexing here)
	leal	-8(%ebp), %eax      ; point to sum
	addl	%edx, (%eax)        ; add v[j] value into sum
	leal	-4(%ebp), %eax      ; point to j
	incl	(%eax)              ; increment it
	jmp	L2                  ; back to top of loop
L3:
	movl	-8(%ebp), %eax      ; load sum to return it
	leave                       ; destroy frame
	ret                         ; return to caller

The code for count4 is a little different

_count4:
	pushl	%ebp                ; establish frame
	movl	%esp, %ebp
	subl	$8, %esp            ; space for two locals, j and sum
	movl	$0, -8(%ebp)        ; initialize sum to zero
	movl	$0, -4(%ebp)        ; initialize j to zero
L17:                                ; start of loop
	movl	-4(%ebp), %eax      ; get value of j
	cmpl	12(%ebp), %eax      ; exit from loop if j >= c
	jl	L20
	jmp	L18                 ; [all the same up to here]

;;;;;;;;;;;;;;;;;; here is different ;;;;;;;;;;;;;;;
L20:
	movl	8(%ebp), %eax       ; copy pointer v to eax
	movl	(%eax), %edx        ; get value of referenced element in edx
	leal	-8(%ebp), %eax      ; point to sum
	addl	%edx, (%eax)        ; add value to sum
	addl	$4, 8(%ebp)         ; add 4 to pointer v value
;;;;;;;;;;;;;;;;;; end of difference ;;;;;;;;;;;;;;;;

	leal	-4(%ebp), %eax      ; point to j
	incl	(%eax)              ; increment it
	jmp	L17                 ; back to top of loop
L18:
	movl	-8(%ebp), %eax      ; load sum to return it
	leave                       ; destroy frame
	ret                         ; return to caller

So it seems that indeed the pointer approach is a little more efficient
(5 instructions instead of 6).

But that was unoptimized code, let's try optimizing the code (or in other
words tell the compiler to spend more time trying to make the code more
efficient):

   gcc -S -O2 -fomit-frame-pointer indexing.c

Here the -O2 says to optimize, and -fomit-frame-pointer says "don't worry
about debugging, if you can avoid using ebp as a frame pointer and improve
the code, go ahead!"

Let's look at the code for count4 first:

_count4:
	movl	8(%esp), %edx       ; c will be held in edx throughout
	xorl	%eax, %eax          ; count will be in eax, set to zero
	movl	4(%esp), %ecx       ; v will be in ecx throughout
	cmpl	%edx, %eax          ; check special case of edx zero
	jge	L32                 ; immediate exit if so
	.p2align 4,,7               ; make sure loop is aligned
L30:                                ; start of loop
	addl	(%ecx), %eax        ; add next value into count
	addl	$4, %ecx            ; bump pointer
	decl	%edx                ; decrement count
	jne	L30                 ; loop till count exhausted
L32:
	ret                         ; return with result in eax

That's really quite impressive. A lot has gone on here. First indeed there
is no need to mess with ebp, since everything can be indexed off esp.
Second, the parameters and the local variables are all held in registers.
Third, the loop works by counting c down to zero, instead of up. That's
equivalent, and allows the compiler to get rid of j completely.

The loop now is only four instructions, and it is hard to see how it could
be improved beyond this.

The whole function is only 10 instructions, instead of 20. But there is a
cost, namely in ability to debug. A programmer stopping the program on a
breakpoint in the middle of the loop and asking for the value of j will
not get much joy, how can the debugger tell you the value of j if the
procedure has been essentially transformed from:

   int count4 (int *v, int c) {
     int j;
     int sum = 0;
     for (j = 0; j < c; j++)
       sum += *v++;
     return sum;

to:

   int count4 (int *v, int c) {
      int sum = 0;
      if (c != 0) {
        loop: sum += *v++;
              if (--c) goto loop;
      }
   }

and there is no j in sight? Answer it probably will simply say that it knows
nothing about j. But of course the result is still correct.

What about count1/2/3. Well we might expect the amazing optimizer to generate
exactly the same code as count4, since after all the meaning is identical, but
we don't quite get that, instead we get (for all three):

_count1:
	pushl	%ebx                ; save ebx, we are going to use it
	xorl	%eax, %eax          ; count will be in eax, set it to zero
	movl	12(%esp), %ecx      ; get count c in ecx
	xorl	%edx, %edx          ; edx will hold j, set it to zero
	movl	8(%esp), %ebx       ; get array pointer in ebx throughout
	cmpl	%ecx, %eax          ; exit if special case of zero
	jge	L8
	.p2align 4,,7
L6:
	addl	(%ebx,%edx,4), %eax ; add next element to sum
	incl	%edx                ; bump index
	cmpl	%ecx, %edx          ; check if index at limit
	jl	L6                  ; loop back if not
L8:
	popl	%ebx                ; restore ebx
	ret                         ; return to caller

That's 13 instructions instead of 10, but the all important loop is still
only four instructions. However, the four instructions have to do quite a
bit more, look at the addl instruction:

	addl	(%ebx,%edx,4), %eax

That addressing mode computes the address ebx + 4*edx, which is exactly what
is wanted. However, this may be a little slower on some pentium chips than
the simple addressing of the count4 example, and certainly the setup
instructions are less efficient. Oh well, optimizers can't do everything,
they try their best. Sometimes they will beat even an experienced asm
programmer, but sometimes an expert asm programmer will be able to beat
out the compiler :-)