V22.0201 Computer Systems Organization
2021-22 Fall—Allan Gottlieb
Tuesdays and Thursdays 9:30-10:45

Start Lecture #01

Chapter 0 Administrivia

I start at 0 so that when we get to chapter 1, the numbering will agree with the text.

0.1 Contact Information

email: my-last-name AT nyu DOT edu (best method)
web: cs.nyu.edu/~gottlieb
office: 60 Fifth Ave, Room 316
office phone: 212 998 3344

0.2 Course Web Page

There is a web site for the course. You can find it from my home page, which is listed above, or from the department's home page.

You can find these lecture notes on the course home page. Please let me know if you can't find them.
The notes are updated as bugs are found or improvements made. As a result, I do not recommend printing the notes now (if at all).
I will place markers at the end of each lecture after the lecture is given. For example, the Start Lecture #01 marker above can be thought of as End Lecture #00.

0.3 Textbooks

The course has several texts.

We will first study the C programming language so you will need a book on C. I own, like, and will use Kernighan and Ritchie, The C Programming Language, 2nd ed. (Ritchie is the creator of C). I suggest you buy it. However, if you already own another C book, that is probably good enough.
Matthews, Newhall, and Webb, Dive Into Systems, Online book: https://diveintosystems.cs.swarthmore.edu/x86_64/antora/diveintosystems/beta The material on C is standard, but the order of presentation is not. A difference between this book and K&R is that the latter starts with low-level I/O (read and print one character) whereas this book starts with a higher level approach.
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective. This will be the main text for the Computer Organization portion of the course. It is required.

0.4 Email, and the Brightspace Mailing List

You should have all been automatically added to NYU Brightspace for this course. I occasionally send broadcast announcements to this list.
If you want to send mail to me, use my-last-name AT nyu DOT edu.
I will respond to all questions.

0.5 Grades

Grades are based on the labs and exams; the weighting will be approximately
20%*LabAverage + 35%*MidtermExam + 45%*FinalExam (but see homeworks below).

0.7 Homeworks and Labs

I make a distinction between homeworks and labs.

Labs are

Required.
Due several lectures later (date given on assignment).
Announced in the notes, during class, and appear in Brightspace.
Penalized for lateness.
- The penalty is 2 points per day up to 5 days.
- After 5 days the penalty is 5 points per day.

Homeworks are

Optional.
Due one week after being entered on Brightspace.
Not accepted late.
The assignment is written in the notes and appears in Brightspace.
Your solution is submitted via Brightspace.
Checked for completeness and graded 0/1/2.
Able to help, but not hurt, your final grade.

0.7.1 Homework Numbering

Homeworks are numbered by the class in which they are assigned. So any homework given today becomes part of homework #1. Even if I do not give homework today, any homework assigned next class would become homework #2. In general the homework present in the notes for lecture #n is homework #n.

0.7.2 Doing Labs on non-NYU Systems

You may develop (i.e., write and test) lab assignments on any system you wish, e.g., your laptop. However, ...

You are responsible for any non-nyu machine. I extend deadlines if the nyu machines are down, not if yours are. So you should back up on an nyu server any work done on your personal computers.
You should test your assignments on the nyu systems; for this class I believe that means linserv1.cims.nyu.edu. More on how to do this will be in the recitations.
If some confusion arises, I can (and do) believe dates on linserv1 and friends. I can not believe dates on your laptop since you can change them backwards in time.
In an ideal world, a program written in a high level language such as Python, Java, C, or C++ that works on one system would work identically on any other system. Sadly, this ideal is not always achieved, despite marketing claims to the contrary. So, although you may develop your labs on any system, you must ensure that they run on linserv1, the computer the TAs will use when grading your labs.
You submit your labs using NYU Brightspace.

0.7.2.1 Testing Your Labs on linserv1.cims.nyu.edu

This will be covered in the recitations.

I feel it is important for CS students to be familiar with basic client-server computing (related to cloud computing) in which one develops software on a client machine (for us, most likely one's personal laptop), but runs it on a remote server (for us, linserv1.cims.nyu.edu). This requires three steps.

Obtaining an account on linserv1 (and access.cims.nyu.edu).
Copying files (your lab) from your system to linserv1.
Logging into linserv1 and running the lab.

I have supposedly given you each an account on linserv1 (and access), which takes care of step 1. Accessing linserv1 and access is different for different client (laptop) operating systems.

If you have a Unix based system (e.g., linux) you are ready to try it. From a terminal, type
ssh username@access.cims.nyu.edu
where username is your username on home.nyu.edu (i.e., your netid). It should ask for your password. You should have received an email from the systems group with your password. You should now be logged into a Unix machine named access.cims.nyu.edu. Try the command ls -lt.
While on access.cims.nyu.edu, type ssh linserv1. Now you are on another Unix machine (linserv1). You use scp (secure copy) to copy files from one Unix machine to another.
Why access and linserv1?
Most NYU systems (including linserv1) are accessible only from inside the NYU network. If you are on the NYU network, you can skip access and ssh to linserv1 directly. If you are outside the NYU network, you cannot see linserv1, but can see access. So from outside (you play the Duo/MFA game and) log into access; then you are inside.
If you have MacOS, you use the same commands as for Unix (the core of MacOS is Unix). However, some versions of the MacOS terminal emulator default to rich text (instead of plain text). Once you convert to (or are lucky enough to have) a plain text terminal, you proceed just as for a Unix machine.
If you have MS Windows, you need to get two programs: PuTTY and WinSCP. Both are readily available for no cost (I think nyu/its has one of them). Please get them right away.

If you receive a message from linserv1 about an authentication failure, please follow the advice below from the systems group.

The first line of defense in all cases of authentication failure is to attempt a password reset. Please visit https://cims.nyu.edu/webapps/password/reset to do so. Within 15 minutes of a password reset submission, instructions to retrieve the new password will be sent to xyz123@nyu.edu. Please e-mail helpdesk@cims.nyu.edu in the event that the password reset either fails, or that the new password does not work (be sure to preface your ssh command with your username, e.g. ssh xyz123@access.cims.nyu.edu).

0.7.3 Obtaining Help with the Labs

Good methods for obtaining help include

Asking me during office hours, which are right after class.
Asking another student.
But ...
Your lab must be your own.
That is, each student must submit a unique lab. Naturally, simply changing comments, variable names, etc. does not produce a unique lab.

0.7.4 Computer Language Used for Labs

This course uses (and teaches) the C programming language. You must write your labs in C (or C++, but we will not teach the latter). Moreover C, but not C++, will appear on exams.

0.8 A Grade of Incomplete

The rules for incompletes and grade changes are set by the school and not the department or individual faculty member.

The rules set by CAS can be found in <http://cas.nyu.edu/object/bulletin0608.ug.academicpolicies.html>, which states:

The grade of I (Incomplete) is a temporary grade that indicates that the student has, for good reason, not completed all of the course work but that there is the possibility that the student will eventually pass the course when all of the requirements have been completed. A student must ask the instructor for a grade of I, present documented evidence of illness or the equivalent, and clarify the remaining course requirements with the instructor.

The incomplete grade is not awarded automatically. It is not used when there is no possibility that the student will eventually pass the course. If the course work is not completed after the statutory time for making up incompletes has elapsed, the temporary grade of I shall become an F and will be computed in the student's grade point average.

All work missed in the fall term must be made up by the end of the following spring term. All work missed in the spring term or in a summer session must be made up by the end of the following fall term. Students who are out of attendance in the semester following the one in which the course was taken have one year to complete the work. Students should contact the College Advising Center for an Extension of Incomplete Form, which must be approved by the instructor. Extensions of these time limits are rarely granted.

Once a final (i.e., non-incomplete) grade has been submitted by the instructor and recorded on the transcript, the final grade cannot be changed by turning in additional course work.

0.9 Academic Integrity Policy

This email from the assistant director describes the policy.

  Dear faculty,

  The vast majority of our students comply with the
  department's academic integrity policies; see

  www.cs.nyu.edu/web/Academic/Undergrad/academic_integrity.html
  www.cs.nyu.edu/web/Academic/Graduate/academic_integrity.html

  Unfortunately, every semester we discover incidents in
  which students copy programming assignments from those of
  other students, making minor modifications so that the
  submitted programs are extremely similar but not identical.

  To help in identifying inappropriate similarities, we
  suggest that you and your TAs consider using Moss, a
  system that automatically determines similarities between
  programs in several languages, including C, C++, and Java.
  For more information about Moss, see:

  http://theory.stanford.edu/~aiken/moss/

  Feel free to tell your students in advance that you will be
  using this software or any other system.  And please emphasize,
  preferably in class, the importance of academic integrity.

  Rosemary Amico
  Assistant Director, Computer Science
  Courant Institute of Mathematical Sciences

0.10 Tutoring

Yifeng Ko <yk1962@nyu.edu> is the official tutor for this course. He will announce his schedule.

The tutoring will be via zoom. I will give more details about the tutoring zoom when I get them.

Remark: The chapter/section numbers for the material on C agree with Kernighan and Ritchie. However, the material is quite standard so, as mentioned before, if you already own a C book that you like, it should be fine.

Chapter K&R-1 A Tutorial Introduction

Since Java includes much of C, my treatment can be very brief for the parts in common (e.g., control structures).

You should be reading the first few chapters of K&R or Dive into Systems for the next few lectures.

K&R-1.1 Getting Started

C programs consist of functions, which contain statements, and variables, which store values.

The Hello World Function

  #include <stdio.h>
  main() {
    printf("Hello, world\n");
  }

All complete programs must have a main() function. The program begins execution there.
# introduces preprocessor directives; #include is the most common.
#include <stdio.h> tells the preprocessor to look in the standard place (due to the <>) for a file named stdio.h and include it right here. That file contains (among other things) the declaration of function printf().
printf() produces formatted output. The easiest case is simply a character string as shown here (\n signifies a newline).

Although this program works, the second line should really be
int main(int argc, char *argv[]) {
I know this looks weird for now but remember how long it took you to really understand public static void main (String[] args)

K&R-1.2 Variables and Arithmetic Expressions

Like Java.

K&R-1.3 The For Statement

Like Java

1.A lvalues and rvalues

The program on the right is trivial. However, I wish to use it to introduce lvalues and rvalues. Each variable (in this program x and y) has two values associated with it: its address and the contents of that address. The latter is often called the value of the variable.

  main() {
    int x=5, y=8;
    y = x+2;
  }

Consider the program's assignment statement y = x+2;. To evaluate the right hand side (RHS) we need to know that the value of x is 5; we are not interested in knowing the address in which this 5 is stored. This value, 5, is called the rvalue of x because it is what is needed when x occurs on the RHS. In contrast the fact that 8 is the rvalue of y is not relevant since y does not occur on the RHS.

The LHS contains just y. But the fact that y has the value (specifically the rvalue) 8, is not relevant. What is relevant is the address of y since that is where the system must store the 7 that results from the addition. The address of y is called its lvalue since it is what is needed when y occurs on the LHS.

  #include <stdio.h>
  main() {
    int n = 0, *pn;
    pn = &n;
    *pn = 33;
    printf("n = %d\n", n);
  }

Remark

This idea of addresses is a central theme of CSO because it is one key in understanding how Computer Systems are Organized.

The program on the far right is actually correct and prints "n = 33".

The beginnings of an explanation is the diagram on the near right: pn is a pointer to n, the (r)value of pn is the lvalue (aka the address) of n.

K&R-1.4 Symbolic Constants

Fahrenheit-Celsius

  #include <stdio.h>
  main() {
    int F, C;
    int lo=0, hi=300, incr=20;

    for (F=lo; F<=hi; F+=incr) {
      C = 5 * (F-32) / 9;
      printf("%d\t%d\n", F, C);
    }
  }

C has char, short, int, long, double, float. The first four contain integer values; the last two contain reals.
%d tells printf() to treat the next argument as an int; convert it from the internal form (two's complement binary, which we will learn after C) to printable form (ascii, unicode, ...).
%d uses the right amount of space to print the corresponding argument.
\t is a tab.
printf() accepts a variable number of arguments. Note that the value of the first argument determines the number of additional arguments. This is not an accident. Why?
This program would be better if the output numbers were right justified in a column, as done in the next example.
Should really use floating point.
Should really input lo, hi, and incr.
printf() is declared in stdio.h

  #include <stdio.h>
  #define LO 0
  #define HI 300
  #define INCR 20
  main() {
    int F;
    for (F=LO; F<=HI; F+=INCR)
      printf("%3d\t%5.1f\n", F, 
             (F-32)*(5.0/9.0));
  }

Floating Point Fahrenheit-Celsius

Note 5.0/9.0 to get floating point divide.
We must know how much space to use; in this case I know 3 digits are enough. Note %3d, which right justifies using 3 digits.
Note %5.1f. This means right justified using 5 columns, with 1 of those 5 after the decimal point. Since the decimal point also uses a column, we have 5-1-1=3 columns before the decimal point.
The call to printf() now contains an expression instead of just simple variables.
C uses #define to introduce symbolic constants. By convention these are all capital letters.

K&R-1.5 Character Input and Output

getchar() / putchar()

The simplest (i.e., most primitive) form of character I/O is getchar() and putchar(), which read and print a single character.

Both getchar() and putchar() are declared in stdio.h.

K&R-1.5.1 File Copying

  #include <stdio.h>
  main() {
    int c;
    while ((c = getchar()) != EOF)
      putchar(c);
  }

File copy is conceptually trivial: getchar() a char and then putchar() this char until eof. The code is on the right and does require some comment despite its brevity.

The program is basically just the while statement, which has just one semicolon and in that sense is a one-liner.
getchar() returns an int not a char! That is done so that getchar() can return EOF, which is not a char (and cannot be a char). It is an int (in fact it is -1).
Question: Why can't EOF be a char?
Answer: Because all chars are legal (non-EOF) values that getchar() can return.
C is an expression language: an assignment is itself an expression and yields the value that was assigned. This explains the condition part of the while statement, once you notice the extra parens, which are definitely not extra.
getchar() reads from stdin and putchar() writes to stdout.
Illustrate in class how to use stdin/stdout and redirection.

Homework: (1-7) Write a (C-language) program to print the value of EOF. (This is 1-7 in the book but I realize not everyone will have the book so I will type the problems into the notes.)

Homework: Write a program to copy its input to its output, replacing each string of one or more blanks by a single blank.

  while (getchar() != EOF)  ++numChars;
  
  for (numChars = 0; getchar() != EOF; ++numChars);

K&R-1.5.2 Character Counting

This is essentially a one-liner, which I have written in two different ways: once with a while loop and once with a for loop.

K&R-1.5.3 Line Counting

Now we need two tests: end-of-line and end-of-input. Perhaps the following is really a two-liner, but it does have only one semicolon.

  while ((c = getchar()) != EOF)  if (c == '\n')  ++numLines;

So if a file has no newlines, it has no lines. Demo this with echo -n >noEOF "hello"

K&R-1.5.4 Word Counting

The Unix wc Program

The Unix wc program prints the number of characters, words, and lines in the input. It is clear what the number of characters means. The number of lines is the number of newlines (so if the last line doesn't end in a newline, it doesn't count). The number of words is less clear. In particular, what should be the word separators?

  #include <stdio.h>
  #define WITHIN   1
  #define OUTSIDE  0
  main() {
    int c, num_lines, num_words, num_chars;
    int within_or_outside = OUTSIDE;
    num_lines = num_words = num_chars = 0;
    while ((c = getchar()) != EOF) {
      ++num_chars;
      if (c == '\n')
        ++num_lines;
      if (c == ' ' || c == '\n' || c == '\t')
        within_or_outside = OUTSIDE;
      else if (within_or_outside == OUTSIDE) {
        // starting a word
        ++num_words;
        within_or_outside = WITHIN;
      }
    }
    printf("%d %d %d\n", num_lines, num_words, num_chars);
  }

This program assumes blank, newline, and tab are the only word separators. Where is that assumption used?
C doesn't have a real Boolean type. Instead int is used; 0 is false; everything else is true.
The key idea in the program, which is independent of the programming language used, is to keep track of when we are within a word and to bump the word counter at the start of a new word.
- The program begins outside a word.
- Whenever it encounters a separator, it becomes (or stays) outside.
- Whenever it encounters a non-separator (i.e., a word constituent) and was outside, then it has found the start of a new word. This puts the program within a word and is when it bumps the word counter.
The C if-then-else is the same as Java.
Same for while, do/while, and for.
Same for switch/case.
Same for continue, break, and return.

Homework: (1-12) Write a program that prints its input one word per line.

Remark: Class accounts on linserv1.

First round of your class accounts for 201-003 and 202-002 were created tonight, and students will receive a welcome message if they are getting a new account. Any student who previously had a CIMS account in the past (whether or not it is active) will not get an email but their account will be adjusted as necessary for use for your class. The password reset link may be useful especially to those students: https://cims.nyu.edu/webapps/password/reset We will re-run the class account creation scripts daily until the drop deadline. Thanks,
Shirley

Remark: The tutor has revised the hours. The hours listed in section 0.10 of these notes have been updated.

K&R-1.6 Arrays

We are hindered in our examples because we don't yet know how to input anything other than characters and haven't yet written the program to convert a string of characters into an integer (easy) or (significantly harder) a floating point number.

Mean and Standard Deviation

  #include <stdio.h>
  #define N  10   // imagine you read in N
  main() {
    int i;
    float x, sum=0, mu;
    for (i=0; i<N; i++) {
      x = i;  // imagine you read in x
      sum += x;
    }
    mu = sum / N;
    printf("The mean is %f\n", mu);
  }

  #include <stdio.h>
  #define N  10   // imagine you read in N
  #define MAXN  1000
  main() {
    int i;
    float x[MAXN], sum=0, mu;
    for (i=0; i<N; i++) {
      x[i] = i;  // imagine you read in x[i]
    }
    for (i=0; i<N; i++) {
      sum += x[i];
    }
    mu = sum / N;
    printf("The mean is %f\n", mu);
  }

  #include <stdio.h>
  #include <math.h>
  #define N  5   // imagine you read in N
  #define MAXN  1000
  main() {
    int i;
    double x[MAXN], sum=0, mu, sigma;
    for (i=0; i<N; i++) {
      x[i] = i;  // imagine you read in x[i]
      sum += x[i];
    }
    mu = sum / N;
    printf("The mean is %f\n", mu);
    sum = 0;
    for (i=0; i<N; i++) {
        sum += pow(x[i]-mu,2);
    }
    sigma = sqrt(sum/N);
    printf("The std dev is %f\n", sigma);
  }

I am sure you know the formula for the mean (average) of N numbers: Add the numbers and divide by N. The mean is normally written μ. The standard deviation is the RMS (root mean square) of the deviations-from-the-mean; it is normally written σ. Symbolically, we write μ = ΣX_i/N and σ = √(Σ((X_i-μ)²)/N). (When computing σ we sometimes divide by N-1 not N. Ignore the previous sentence.)

The First Version (Just the Mean; No Array Used)

The first program on the right naturally reads N, then reads N numbers, and finally computes the mean of the latter. There is a problem; we don't know how to read numbers.

So I faked it by making N a symbolic constant and setting x to i.

The Second Version (Just the Mean: Array Used)

I do not like the second version with its gratuitous array. It is (a little) longer, slower, and more complicated. Much worse, it takes space (i.e., requires memory) proportional to N, for no reason. Hence it might not run at all for large N and small machines. However, I have seen students write such programs. Apparently, there is an instinct to use a three step procedure for all programming assignments:

Read everything in.
Do all the computation.
Print all the answers.

But that is silly if, as in this example, you no longer need each value after you have read the next one.

The Third Version (Mean and Standard Deviation)

The last example is a good use of arrays for computing the standard deviation using the RMS formula above. We do need to keep the values around after computing the mean so that we can compute all the deviations from the mean and, using these deviations, compute the standard deviation.

Note that, unlike Java, no use of new (or the C analogue malloc()) appears.

Arrays declared as in this program have a lifetime of the routine in which they are declared. Specifically sum and x are both allocated when main is called and are both freed when main is finished.

Non-primitive Declarations

Note the declaration double x[MAXN] in the third version. In C, to declare a complicated variable (i.e., one that is not a primitive type like int or char), you write what has to be done to the variable to get one of the primitive types.

For example the declaration int *(x[]) says that x is something that, if you take an element of it and then dereference that element, then you will get an integer. So x is an array of pointers to integers.
int (*y)[] says that if you first dereference y and take an element of the result, you get an integer. So y is a pointer to an array of integers.
Thus int *(x[]) is not the same as int (*x)[].
This declaration style is a controversial feature of C. Some like it, some don't (I didn't, but am used to it now). But that is the way it is.

In C if we have int X[10]; then writing X in an expression is (in almost all contexts) the same as writing &X[0]. & is the address-of operator. More on this later when we discuss pointers.

K&R-1.7 Functions

There is of course no limit to the useful functions one can write. Indeed, the main() programs we have written above are all functions.

  #include <stdio.h>
  // Determine letter grade from score
  // Demonstration of functions
  char letter_grade (int score) {
    if      (score >= 90) return 'A';
    else if (score >= 80) return 'B';
    else if (score >= 70) return 'C';
    else if (score >= 60) return 'D';
    else                  return 'F';
  }  // end function letter_grade
  main() {
    short quiz;
    char grade;
    quiz = 75;   // should read in quiz
    grade = letter_grade(quiz);
    printf("For a score of %3d the grade is %c\n",
           quiz, grade);
  } // end main

  cc -o grades grades.c; ./grades
  For a score of  75 the grade is C

A C program is a collection of functions (and global variables). Exactly one of these functions must be called main and that is the function at which execution begins.

One important issue is type matching. If a function f takes one int argument and f is called with a short, then the short must be converted to an int. Since this conversion is widening, the compiler will automatically coerce the short into an int, providing it knows that an int is required.

It is fairly easy for the compiler to know all this providing f() is defined before it is used, as in the code on the right.

Computing Letter Grades

We see on the right a function letter_grade defined. It has one int argument and returns a char.

Finally, we see the main program that calls the function.

The main program uses a short to hold the numerical grade and then calls the function with this short as the argument. The C compiler generates code to coerce this short value to the int required by the function.

Averages and Sorting

  // Average and sort array of random numbers
  #include <stdio.h>
  #include <stdlib.h>   // declares rand()
  #define NUMELEMENTS 50
  // Sort A[0..n-1] into descending order
  void sort(int A[], int n) {
    int temp;
    for (int x=0; x<n-1; x++)
      for (int y=x+1; y<n; y++)
        if (A[x] < A[y]) {
          temp  = A[y];
          A[y]  = A[x];
          A[x]  = temp;
        }
  }
  double avg(int A[], int n) {
    double sum = 0;   // double: avoids int overflow and integer division
    for (int x=0; x<n; x++)
      sum = sum + A[x];
    return (sum / n);
  }
  int main() {
    int table[NUMELEMENTS];
    double average;
    for (int x=0; x<NUMELEMENTS; x++) {
      table[x] = rand();
      printf("The elt in pos %d is %d\n",
             x, table[x]);
    }
    average = avg(table, NUMELEMENTS );
    printf("The average is %5.1f\n", average);
    sort(table, NUMELEMENTS );
    for (int x=0; x<NUMELEMENTS; x++)
      printf("The element in position %3d is %3d \n",
             x, table[x]);
  }

The next example illustrates a function that has an array argument.

Remember that in a C declaration you decorate the item being declared with enough stuff (e.g., [], *) so that the result is a primitive type such as int, double, or char.

The function sort has two parameters, the second one n is simply an int. The parameter A, however, is more complicated. It is the kind of thing that when you take an element of it, you get an int.

That is, A is an array of ints.

Unlike the array example in section 1.6, A does not have an explicit upper bound on its index. This is because the function can be called with arrays of different sizes. Since the function needs to know the size of the array (look at the for loops), a second parameter n is used for this purpose.

This example has two function calls: main calls both avg and sort. Looking at the call from main to sort we see that table is assigned to A and NUMELEMENTS is assigned to n. Looking at the code in main itself, we see that indeed NUMELEMENTS is the size of the array table and thus in sort, n is the size of A.

All seems well provided the called function appears before the function that calls it. Our examples have followed this convention.

So far so good; but if f calls g and (recursively) g calls f, we are in trouble. How can we have f before g, and also have g before f?

This will be answered very soon.

1.8 Arguments—Call by Value

  #include <stdio.h>
  int f(int a, int b) {
    a = a+b;
    return a;
  }
  main() {
    int x = 10;
    int y = 20;
    int ans;
    ans = f(x, y);
  }

Arguments in C are passed by value (the same as Java does for arguments that are not objects).

Terminology: Arguments vs Parameters

The simple example on the right illustrates a few points. First, some terminology. The variables a and b in f() are called parameters of f(), whereas x and y are called arguments of the call f(x, y).

Copy-in BUT NOT Copy-out Semantics

When main() calls f() the values in the arguments are copied into the corresponding parameters. However, when f() returns, the values now in the parameters are NOT copied back to the arguments. This explains why the value in ans differs from the final value in x.

Try to avoid the fairly common error of assuming Copy-in AND Copy-out semantics.

Start Lecture #02

Remark: Homework #1 was entered on Brightspace 3 September; it is due 10 September.

Remark: The tutor sent a msg giving hours.

Weekdays: 11:00 am - 12:00 pm
Weekends: 7:30 pm - 9:00 pm

1.9 Character Arrays

Unlike Java, C does not have a string datatype. A string in C is an array of chars. String operations like concatenate and copy (assignment) become functions in C. Indeed there are a number of standard library routines that act on strings.

Strings in C are null terminated. That is, a string of length 5 actually contains 6 characters, the 5 characters of the string itself and a sixth character = '\0' (called null) indicating the end of the string. This is a big deal.

Print Longest Line

Our goal is a program that reads lines from the terminal, converts them to C strings by appending '\0', and prints the longest line found. Pseudo code would be

  while (more lines)
    read line
    if (line longer than previous longest)
      save line and its length
  print the saved line

Thus we need the ability to read in a line and the ability to save a line. We write two functions getLine() and copy() for these tasks (the book uses getline (all lower case), but that doesn't compile for me since there is a library routine in stdio.h with the same name and different signature).

  #include <stdio.h>
  #define MAXLINE 1000
  int getLine(char line[], int maxline);
  void copy(char to[], char from[]);
  int main() {
    int len, max;
    char line[MAXLINE], longest[MAXLINE];
    max = 0;
    while ((len=getLine(line,MAXLINE))>0)
      if (len > max) {
        max = len;
        copy(longest,line);
      }
    if (max>0)
      printf("%s", longest);
    return 0;
  }
  int getLine(char s[], int lim) {
    int c, i;
    for (i=0; i<lim-1 &&
              (c=getchar())!=EOF &&
              c!='\n';
         ++i)
      s[i] = c;
    if (c=='\n') {
      s[i]= c;
      ++i;
    }
    s[i] = '\0';
    return i;
  }
  void copy(char to[], char from[]) {
    int i;
    i=0;
    while ((to[i] = from[i]) != '\0')
      ++i;
  }

The `main()` Function

Given the two supporting routines, main is fairly simple, needing only a few small comments.

Note that getLine and copy are declared before main. They are defined later in the file but C requires declare (or define) before use so either main would have to come last or the declarations are needed. Since only main uses the routines, the declarations could have been in main but it is common practice to put them outside as shown. Although these routines are not recursive (and hence we could have placed the called routine before the caller), declarations like the one shown are needed for recursive routines.
Note %s inside printf. This is used for (null-terminated) strings.
Note that main is declared to return an integer and it does return 0. In Unix at least, this is the indication of a successful run.
Note the while loop structure. Having a parenthesized assignment as part of the condition being tested is a fairly common idiom.

The `getLine()` Helper

This function is discussed further in recitation. The line is returned in the parameter s[]; the function itself returns the length. The for continuation condition in getLine is rather complex. (Note that the body of the for loop is just the assignment s[i] = c; reading the character and all the testing occur in the for statement itself.)

The condition part of the for tests for 3 situations.

We have filled up s[].
We get to the end of the input.
We find an end-of-line.

Perhaps it would be clearer if the test were simply i<lim-1 and the rest were done with if-break statements inside the loop.

A Subtlety in The Three Tests

In C, if you write f(x)+g(y)+h(z) you have no guarantee of the order the functions will be invoked. (Thus the program would be non-deterministic if g() modified something used by f().) However, the && and || operators do guarantee left-to-right ordering to enforce short-circuit condition evaluation. This ordering is important here since the test for '\n' must be performed after the getchar() has assigned its value to c.
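The short-circuit guarantee can be observed directly. The following is a minimal sketch that is not part of the program above; the names calls, touch(), and both() are made up for illustration. Since touch() counts how often it runs, the counter reveals whether the right operand of && was evaluated.

```c
/* Hypothetical helpers illustrating left-to-right, short-circuit evaluation of &&. */
static int calls = 0;          /* number of times touch() has run */

static int touch(int v) {
  calls++;                     /* side effect lets us observe evaluation */
  return v;
}

/* The right operand is evaluated only if the left operand is nonzero. */
int both(int a, int b) {
  return touch(a) && touch(b);
}
```

Calling both(0, 1) runs touch() once, since the left operand already decides the result; calling both(1, 0) runs it twice.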

The `copy()` Helper

The copy() function is declared and defined to return void.

This means that it does not return a value.
Similarly, a function taking no arguments should be declared and defined to have a void parameter list. Leaving the parameter list blank (i.e., writing fun_name()) actually means that you are not specifying the argument signature (which unfortunately limits the compiler's ability to error check). This weird rule is for compatibility with older versions of C that played a little fast and loose with such issues.

Homework: Simplify the for condition in getLine() as just indicated.

1.10 External Variables and Scope

Solving Quadratic Equations

  #include <stdio.h>
  #include <math.h>
  #define A +1.0   // should read
  #define B -3.0   // A,B,C
  #define C +2.0   // using scanf()
  void solve (float a, float b, float c);
  int main() {
    solve(A,B,C);
    return 0;
  }
  void solve (float a, float b, float c) {
    float d;
    d = b*b - 4*a*c;
    if (d < 0)
      printf("No real roots\n");
    else if (d == 0)
      printf("Double root is %f\n", -b/(2*a));
    else
      printf("Roots are %f and %f\n",
             ((-b)+sqrt(d))/(2*a),
             ((-b)-sqrt(d))/(2*a));
  }

  #include <stdio.h>
  #include <math.h>
  #define A +1.0     // main() should
  #define B -3.0     // read A,B,C
  #define C +2.0     // using scanf()
  void solve(void);  // declaration of solve()
  float a, b, c;     // definitions
  int main() {       // definition of main()
    extern float a, b, c; // declarations
    a=A;
    b=B;
    c=C;
    solve();
    return 0;
  }
  void solve(void) { // definition of solve()
    extern float a, b, c; // declarations
    float d;
    d = b*b - 4*a*c;
    if (d < 0)
      printf("No real roots\n");
    else if (d == 0)
      printf("Double root is %f\n", -b/(2*a));
    else
      printf("Roots are %f and %f\n",
             ((-b)+sqrt(d))/(2*a),
             ((-b)-sqrt(d))/(2*a));
  }

The two programs above find the real roots (no imaginary numbers) of the quadratic equation

  ax²+bx+c = 0

They proceed by using the standard technique of first calculating the discriminant

  d = b²-4ac

Since these programs deal only with real roots, they punt when d<0.

The programs themselves are not of much interest. Indeed a Java version would be too easy to be a midterm exam question in 101. Our interest is confined to the way in which the coefficients a, b, and c are passed from the main() function to the helper routine solve().

Method 1 Arguments and Parameters

The main() function calls a function solve() passing it as arguments the three coefficients, A,B,C.

There is little to say. Method 1 is a simple program and uses nothing new.

Method 2 External (a.k.a. Global) Variables

The second main() program communicates with solve() using external variables rather than arguments/parameters.

Note the single external definition of a, b, and c. This definition is before and hence outside any function. The definition causes space to be set aside for these variables and they are visible inside main() and solve().
Also note the multiple internal declarations, one inside each function. They include the keyword extern and indicate that an external definition is provided elsewhere (externally).
As noted in the preceding section, both main() and solve() are defined in this file and solve is in addition declared.
To repeat, in C you must declare (or define) before use. If you define before using, you don't need to also declare. But if you have recursion (f() calls g() and g() calls f()), you can't have both definitions before the corresponding uses so you need a declaration.
Within a single .c file, the definition of a, b, c is enough since it is before any function that uses the variables. Normally the definitions come before all functions, as done in the example, and hence the declarations are not needed.
For larger programs consisting of many functions, the functions are normally spread across multiple .c files. In this case each file must contain a declaration or definition prior to any use of the variable. Exactly one of the .c files must contain a definition.
Often, each function definition is contained in its own .c file and all the declarations are placed in a single .h (header) file that is included in all the .c files (so each .c has both a declaration and a definition for its function).

Chapter K&R-2 Types, Operators, and Expressions

K&R-2.1 Variable Names

Similar to Java: A variable name must begin with a letter and then can use letters and digits. An underscore counts as a letter, but you shouldn't begin a variable name with one since that is conventionally reserved for library routines. Keywords such as if, while, etc. are reserved and cannot be used as variable names.

K&R-2.2 Data Types and Sizes

C has very few primitive types.

char: One byte in size; can hold a character. C will coerce a char to an int if needed.
int: The natural size of an integer on the host machine.
float: Single precision floating point.
double: Double precision floating point.

There are qualifiers that can be added. One pair is long/short, which are used with int. Typically short int is abbreviated short and long int is abbreviated long.

long is guaranteed to be at least as big as int, which is guaranteed to be at least as big as short.

There is no short float, short double, or long float. The type long double specifies extended precision.

The qualifiers signed or unsigned can be applied to char or any integer type. They basically determine how the sign bit is interpreted. An unsigned char uses all 8 bits for the integer value and thus has a range of 0–255; whereas, a signed char has an integer range of -128–127.
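The sizes and ranges just described can be checked on whatever machine you are using. The following is a small sketch; the function names are made up for illustration, while UCHAR_MAX and SCHAR_MIN are standard macros from <limits.h>.

```c
#include <limits.h>

/* Returns 1 if, on this machine, short is no bigger than int and int no bigger than long. */
int sizes_are_ordered(void) {
  return sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long);
}

/* Largest value an unsigned char can hold (255 when a char has 8 bits). */
int unsigned_char_max(void) {
  return UCHAR_MAX;
}

/* Smallest value a signed char can hold (-128 on the usual 8-bit, two's complement char). */
int signed_char_min(void) {
  return SCHAR_MIN;
}
```

The exact values depend on the machine, which is why the library supplies them as macros rather than having programs hardwire 255 and -128.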

Note: We will have much more to say about data types, e.g., signed and unsigned, next month after we finish our treatment of C.

K&R-2.3 Constants

Integer and Char Constants

A normal integer constant such as 123 is an int, unless it is too big in which case it is a long. But there are other possibilities.

123 is an int
1234567 is an int if int's are 32-bits; it is a long if long's are 32-bits and int's are only 16-bits.
123u is an unsigned int.
1223ul is an unsigned long.
A character constant is written inside single quotes, e.g. '0'. These constants have an integer value. For '0' the value happens to be 48. Some single characters are written as two characters. For example '\0' is the ascii null character, which is used to terminate C strings. Its integer value is 0. Also important are '\n' and '\t'. There are others.

String Constants

There are no string variables in C. Although there are no string variables, there are string constants, written as zero or more characters surrounded by double quotes. A null character '\0' is automatically appended.

'x' is a single character constant, not a string; 'x'+1 is a valid integer expression.
"x" contains two characters, 'x' followed by '\0'. It is a string constant, not an integer.
"xy" "yz" is combined (at compile time) into "xyyz". This is called concatenation.
Note that we concatenated two strings each with 3 characters and got one string with 5 characters. The null ending the first string is not in the concatenation (otherwise it would end the concatenated string).
"" is the empty string consisting of one character, '\0'.
strlen() returns the length of a string, excluding the terminating '\0'.
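The character counts in the list above can be confirmed with sizeof (which counts the terminating '\0') and strlen() (which does not). This is a sketch only; the three function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* "x" occupies two bytes: 'x' and the terminating '\0'. */
size_t bytes_in_x(void) {
  return sizeof("x");
}

/* Adjacent string constants are concatenated at compile time into "xyyz". */
size_t length_of_concat(void) {
  return strlen("xy" "yz");
}

/* The empty string still stores one character, the '\0'. */
size_t bytes_in_empty(void) {
  return sizeof("");
}
```

Here bytes_in_x() is 2, length_of_concat() is 4 (the stored string has 5 characters counting the null), and bytes_in_empty() is 1.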

Enum Constants

Enumerations are an alternative method of assigning integer values to symbolic names.

  enum Boolean {false, true}; // false is zero, true is 1
  enum Month {Jan=1, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};
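Since enum constants are just integers, they can be used in arithmetic. The sketch below repeats the Month declaration and adds a helper, months_left(), whose name and purpose are invented for illustration.

```c
enum Month {Jan=1, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};

/* Hypothetical helper: number of months from m through December, inclusive. */
int months_left(enum Month m) {
  return Dec - m + 1;   /* Dec is 12 because counting started at Jan=1 */
}
```

For example, months_left(Oct) is 3 (October, November, December).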

2.4 Declarations

Simple examples.

  int x, y;
  char c;
  double q1, q2;

(Stack allocated, i.e., local) arrays are simple since the entire array is allocated not just a reference (no new/malloc required).

  int x[10];

Initializations may be given.

  int x=5, y[2]={44,6}, z[]={1,2,3};
  char str[]="hello, world\n";

The qualifier const makes the variable read only so it must be initialized in the declaration.

2.5 Arithmetic Operators

Mostly the same as Java.

Please do not call % the mod (or modulo) operator, unless you know that the operands are positive. The correct name for % is the remainder operator.
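The difference shows up as soon as the left operand is negative. The sketch below assumes C99 or later, where integer division truncates toward zero so the result of % takes the sign of the left operand. The function names rem() and mod() are made up for illustration, and mod() assumes b is positive.

```c
/* C's % operator: with C99 and later, the result takes the sign of a. */
int rem(int a, int b) {
  return a % b;
}

/* Hypothetical mathematical mod, assuming b > 0: the result is always in 0..b-1. */
int mod(int a, int b) {
  int r = a % b;
  return (r < 0) ? r + b : r;
}
```

With these definitions rem(-7, 3) is -1 whereas mod(-7, 3) is 2; for positive operands the two agree.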

2.6 Relational and Logical Operators

Again very little difference from Java.

Please remember that && and || are required to be short-circuit operators. That is, they evaluate the right operand only if needed.

2.7 Type Conversion

There are two kinds of conversions: automatic conversions, called coercions, and explicit conversions, called casts.

Automatic Conversions

C coerces narrow arithmetic types to wide ones.

  {char, short} → int → long
  float → double → long double
  long → float   // precision can be lost

  int atoi(char s[]) {
    int i, n=0;
    for (i=0; s[i]>='0' && s[i]<='9'; i++)
      n = 10*n + (s[i]-'0');  // assumes ascii
    return n;
  }

The function above (ascii to integer) converts a character string representing an integer to the integral value.

This program assumes ascii (or some other system where the character form of the digits are consecutive and in the correct order).
The program stops at first non digit. For example, it will stop at the terminating '\0'.

Unsigned coercions are more complicated; you can read about them in the book or wait a few weeks when we will cover them.
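The remark that long → float can lose precision is easy to demonstrate. The sketch below assumes that float is IEEE 754 single precision with a 24-bit significand; that is true of the machines we use but is an assumption about the hardware, not something C guarantees. The function name through_float() is made up for illustration.

```c
/* Converts n to float and back again; the float may not represent n exactly. */
long through_float(long n) {
  float f = n;      /* coercion long -> float, may round */
  long back = f;    /* coercion float -> long on assignment */
  return back;
}
```

Under the stated assumption, through_float(1000) returns 1000, but through_float(16777217) returns 16777216 because 2^24+1 needs 25 significant bits.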

Explicit Casts

The syntax

  (type-name) expression

converts the value to the type specified. Note that e.g., (double) x converts the value of x; it does not change x itself.
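A common use of a cast is to force floating point division of two integers. The following sketch uses made-up function names; note that the cast applies to the value of sum before the division, after which count is coerced to double automatically.

```c
/* Integer division: the fractional part is discarded. */
int int_average(int sum, int count) {
  return sum / count;
}

/* The cast converts the value of sum to double, so the division is done in floating point. */
double real_average(int sum, int count) {
  return (double) sum / count;
}
```

For example, int_average(7, 2) is 3 whereas real_average(7, 2) is 3.5. In neither case are the variables sum and count themselves changed.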

Homework: (2.3) Write the function htoi(s), which converts a string of hexadecimal digits (including an optional 0x or 0X) into its equivalent integer value. The allowable digits are 0 through 9, a through f, and A through F.

2.8 Increment and Decrement Operators

The same as Java.

Remember that neither x++ nor ++x is the same as x=x+1 because, with the operators, x is evaluated only once, which becomes important when x is itself an expression with side effects.

  x[i++]++ // increments some (which?) element of an array
  x[i++] = x[i++]+1 // puts incremented value in ANOTHER slot

In fact the last line is not legal C: its behavior is undefined (in what order are the two increments of i performed?).
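The first line, in contrast, is well defined since i is evaluated only once. The sketch below wraps it in a function (the name bump_and_advance() is made up for illustration) so that the effect on both the array and the index can be seen.

```c
/* Increments one element of x and returns the advanced index. */
int bump_and_advance(int x[], int i) {
  x[i++]++;   /* i is evaluated once: x[i] is incremented, then i becomes i+1 */
  return i;
}
```

If a holds {10, 20, 30}, then bump_and_advance(a, 1) changes a[1] to 21, leaves the other elements alone, and returns 2.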

Homework: (2-4). Write an alternate version of squeeze(s1,s2) (defined in the text) that deletes each character in the string s1 that matches any character in the string s2.

  #include <stdio.h>
  int main(void) {
    int x = 4, y = 1;
    printf("x&y=%i  x&&y=%i\n", x&y, x&&y);
    return 0;
  }
  x&y=0  x&&y=1

K&R-2.9 Bitwise Operators

The same as Java

& bitwise AND
| bitwise OR
^ bitwise XOR (exclusive or)
<< left shift
>> right shift
~ bitwise complement

Note that x&&y is different from x&y.
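The bitwise operators are most often used with masks that select individual bits. The helpers below are a sketch with made-up names; n is a bit position, with 0 denoting the low order bit, and is assumed to be smaller than the number of bits in an unsigned.

```c
/* Hypothetical helpers; n is a bit position, with 0 the low order bit. */
unsigned set_bit(unsigned x, int n)   { return x | (1u << n); }   /* force bit n to 1 */
unsigned clear_bit(unsigned x, int n) { return x & ~(1u << n); }  /* force bit n to 0 */
unsigned flip_bit(unsigned x, int n)  { return x ^ (1u << n); }   /* invert bit n */
int      test_bit(unsigned x, int n)  { return (x >> n) & 1u; }   /* 1 if bit n is set */
```

For example, set_bit(4, 0) is 5, clear_bit(5, 2) is 1, and flip_bit(5, 1) is 7. Also test_bit(4, 0) is 0 even though 4 is "true", which is the same reason 4&1 is 0 while 4&&1 is 1.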

2.10 Assignment Operators and Expressions

  int bitcount (unsigned x) {
    int b;
    for (b=0; x!=0; x>>= 1)
      if (x&01) // octal (not needed)
        b++;
    return b;
  }

The same as Java: += -= *= /= %= <<= >>= &= ^= |=

The function above counts how many bits of its argument are 1. Right shifting the unsigned x causes it to be zero-filled. ANDing with 1 gives the LOB (low order bit). Writing 01 indicates an octal constant (any integer constant beginning with 0 is octal; similarly, one beginning with 0x is hexadecimal). Both are convenient for specifying specific bits (because both 8 and 16 are powers of 2). Since the constant in this case has value 1, the leading 0 has no effect.

2.11 Conditional Expressions

The same as Java:

  printf("You enrolled in %d course%s.\n", n, (n==1) ? "" : "s");

Homework: (2-10). Rewrite the function lower(), which converts upper case letters to lower case with a conditional expression instead of if-else.

Precedence and Associativity of C Operators
Operators                              Associativity

() [] -> .                             left to right
! ~ ++ -- + - * & (type) sizeof        right to left
* / %                                  left to right
+ -                                    left to right
<< >>                                  left to right
< <= > >=                              left to right
== !=                                  left to right
&                                      left to right
^                                      left to right
|                                      left to right
&&                                     left to right
||                                     left to right
?:                                     right to left
= += -= *= /= %= &= ^= |= <<= >>=      right to left
,                                      left to right

2.12 Precedence and Order of Evaluation

The table above is copied (hopefully correctly) from the book. It includes all operators, even those we haven't learned yet. I certainly don't expect you to memorize the table. Indeed one of the reasons I typed it in was to have an online reference I could refer to since I do not know all the precedences.

Homework: Check the table above for typos and report any you find.

Not everything is specified. For example if a function takes two arguments, the order in which the arguments are evaluated is not specified. Consider f(x++,x++);

Also the order in which operands of a binary operator like + are evaluated is not specified. So f() could be evaluated before or after g() in the expression f()+g(). This becomes important if, for example, f() alters a global variable that g() reads.
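When the order does matter, the standard remedy is to perform the calls in separate statements, since statements are executed in order. The following is a sketch with made-up names: next_value() alters a global variable, so the two calls are given their own statements.

```c
static int counter = 0;     /* global state altered by every call */

static int next_value(void) {
  return ++counter;
}

/* Forces the first call to happen before the second by using separate statements. */
int ordered_difference(void) {
  int first = next_value();
  int second = next_value();
  return first - second;
}
```

Starting with counter equal to 0, ordered_difference() always returns 1-2 = -1. Had we instead written next_value() - next_value(), the compiler would be free to produce either -1 or 1.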

  #include <stdio.h>
  int main (void) {
    int x=3, y;
    y = + + + + + x;
    y = - + - + + - x;
    y = - ++x;
    y = ++ -x;
    y = ++ x ++;
    y = ++ ++ x;
  }

Question: Which of the expressions above are illegal?
Answer: The last three. They apply ++ to values not variables (i.e., to rvalues not lvalues).

I mention this because at this point in a previous semester there was some discussion about ++ ++. The distinction between lvalues and rvalues will become very relevant when we discuss pointers.

Since pointers have presented difficulties for students in the past, I use every opportunity that arises to give ways of looking at the problem.

Since ++ does an assignment (as well as an addition) it needs a place to put the result, i.e., an lvalue.

Start Lecture #03

Chapter K&R-3: Control Flow

3.1 Statements and Blocks

  int t[]={1,2};
  int main() {
    22;
    return 0;
  }

C is an expression language, so the constant 22 and the assignment x=33 have values (i.e., rvalues). One simple statement is an expression followed by a semicolon. For example, the program above is legal.

As in Java, a group of statements can be enclosed in braces to form a compound statement or block. We will have more to say about blocks later in the course.

3.2 If-Else

Same as Java.

3.3 Else-If

Same as Java.

Dangling Else

Same as Java.

3.4 Switch

Same as Java.

3.5 Loops—While and For

  #include <ctype.h>
  int atoi(char s[]) {
    int i, n, sign;
    for (i=0; isspace(s[i]); i++) ;
    sign = (s[i]=='-') ? -1 : 1;
    if (s[i]=='+' || s[i]=='-')
      i++;
    for (n=0; isdigit(s[i]); i++)
      n = 10*n + (s[i]-'0');
    return sign * n;
  }

Same as Java. As we shall see, the loops in the book show the hand of a master.

The function above (ascii to integer) illustrates several points, as well as being extremely useful in its own right.

The C library contains a number of string/character routines. In this program two simple ones are used.
The first for loop skips over leading spaces. The loop has an empty body; this is not strange. The work is done in the termination test.
The conditional expression is used to determine the sign.
The program depends on the ascii property that the digits are consecutive and in numerical order.

The Comma Operator

  for (i=0, j=0; i+j<n; i++,j+=3)
    printf ("i=%d and j=%d\n", i, j);

If two expressions are separated by a comma, they are evaluated left to right and the final value is the value of the one on the right. This operator often proves convenient in for statements when two variables are to be incremented.
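A typical case is a loop with one index moving up and another moving down. The sketch below reverses a string in place; it uses strlen() from <string.h>, and the cast to int is there so that the empty string gives j = -1 rather than a huge unsigned value.

```c
#include <string.h>

/* Reverses s in place; i moves up from the front while j moves down from the back. */
void reverse(char s[]) {
  int i, j;
  char tmp;
  for (i = 0, j = (int) strlen(s) - 1; i < j; i++, j--) {
    tmp = s[i];      /* swap s[i] and s[j] */
    s[i] = s[j];
    s[j] = tmp;
  }
}
```

Applied to an array holding "hello", reverse() leaves it holding "olleh". Note that the commas separating the arguments of a function call are not comma operators.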

3.6 Loops—Do-While

Same as Java.

3.7 Break and Continue

Same as Java.

3.8 Goto and Labels

The syntax is

  goto label;

  for (...) {
    for (...) {
      while (...) {
        if (...) goto out;
      }
    }
  }
  out: printf("Left 3 loops\n");

The label has the form of a variable name. A label followed by a colon can be attached to any statement in the same function as the goto. The goto transfers control to that statement.

Note that a break in C (or Java) only leaves one level of looping so would not suffice for the example above.

The goto statement was deliberately omitted from Java. Poor use of goto can result in code that is hard to understand and hence goto is rarely used in modern practice.

The goto statement was much more commonly used in the past.
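Here is a complete, small instance of leaving nested loops with a goto. It is a sketch only: the function name and its parameters are made up for illustration. It reports whether two integer arrays have a value in common.

```c
/* Returns 1 if arrays a (n elements) and b (m elements) share a value, else 0. */
int have_common_element(const int a[], int n, const int b[], int m) {
  int i, j;
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      if (a[i] == b[j])
        goto found;       /* leaves both loops at once */
  return 0;               /* both loops finished without a match */
found:
  return 1;
}
```

Because the search sits in its own function, replacing goto found; by return 1; gives the same behavior with no goto at all, which is one reason goto is seldom needed.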

Homework: Write a C function escape(char s[], char t[]) that converts the characters newline and tab into two character sequences \n and \t as it copies the string t to the string s. Use the C switch statement. Also write the reverse function unescape(char s[], char t[]).

Chapter K&R-4 Functions and Program Structure

4.1 Basics of Functions

Very Simplified Unix grep

  #include <stdio.h>
  #define MAXLINE 100
  int getLine(char line[], int max);  // not getline: stdio.h has its own
  int strindex(char source[], char searchfor[]);
  char pattern[]="x y"; // "should" be input
  int main() {
    char line[MAXLINE];
    int found=0;
    while (getLine(line,MAXLINE) > 0)
      if (strindex(line, pattern) >= 0) {
        printf("%s", line);
        found++;
      }
    return found;
  }
  int getLine(char s[], int lim) {
    int c, i;
    i = 0;
    while (--lim>0 && (c=getchar())!=EOF
                   && c!='\n')
      s[i++] = c;
    if (c == '\n')
      s[i++] = c;
    s[i] = '\0';
    return i;
  }
  int strindex(char s[], char t[]) {
    int i, j, k;
    for(i=0; s[i]!='\0'; i++) {
      for (j=i,k=0; t[k]!='\0' && s[j]==t[k];
                    j++,k++) ;
      if (k>0 && t[k]=='\0')
        return i;
    }
    return -1;
  }

The Unix utility grep (Global Regular Expression Print) prints every line of its input that contains a given string (or, more generally, matches a regular expression). A very simplified version, which reads standard input, is above.

The basic program is

  while there is another line
    if the line contains the string
      print the line

Getting a line and seeing if there are more is getLine(); a slightly revised version is above. Note that a length of 0 means EOF was reached; an "empty" line still has a newline char '\n' and hence has length 1.

Printing the line is printf().

Checking to see if the string is present in the line is the new code. The choice made was to define a function strindex() that is given two strings s and t and returns the position in s (i.e., the index in the array) where t occurs. strindex() returns -1 if t does not occur in s.

The program is above; further comments follow.

The string to look for is hardwired into the program in the variable pattern. We do this to avoid including a routine to read a string.
If you think of found as a Boolean, you would expect to see found=1; and not found++; Actually, found counts the number of occurrences of pattern in the input.
Note the declarations of getLine() and strindex() near the top of the program. In particular, see how their parameters are declared C-style, i.e., the declarations specify what you do to each parameter in order to get a char or int. These are not definitions of getLine() and strindex(), which are given later. The declarations include only the header information and not the body; they describe only how to use the functions, not what the functions do.
The while inside getLine() is quite nice and replaces a for in the previous version that looked like a while in disguise.
Note that the assignment to c is guaranteed to be done before c is used in the comparison.
The strindex outer for loops over where to start in s; the inner for loops over matching successive characters.
The inner for uses the comma operator to initialize and increment two variables. Cute, but I believe j is always i+k so this usage is not needed.
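To illustrate the last remark, here is a sketch of the same search written with i and k only, using s[i+k] wherever the original uses s[j]. The name strindex_k() is made up to distinguish it from the version above, and const has been added to the parameters since neither string is modified.

```c
/* Returns the index in s where t first occurs, or -1 if t does not occur in s. */
int strindex_k(const char s[], const char t[]) {
  int i, k;
  for (i = 0; s[i] != '\0'; i++) {
    /* match successive characters of t against s starting at position i */
    for (k = 0; t[k] != '\0' && s[i+k] == t[k]; k++)
      ;
    if (k > 0 && t[k] == '\0')   /* reached the end of a nonempty t: match */
      return i;
  }
  return -1;
}
```

The inner loop cannot run past the end of s: when s[i+k] is the terminating '\0' and t[k] is not, the comparison fails and the loop stops.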

Form of a Function Definition

Note that a function definition is of the form

  return-type function-name(parameters) {
    declarations and statements
  }

The default return type is int, but I recommend not utilizing this fact and instead always declaring the return type.

The return statement is like Java.

K&R-4.2 Functions Returning Non-integers

The book correctly gives all the defaults and explains why they are what they are (compatibility with previous versions of C). I find it much simpler to always

Use no defaults when defining a function.
- If it returns int say so even though that is the default.
- If it has no parameters, write void as the parameter list.
Declare all functions that are used in a file. Have these declarations early, before any function definitions.
Caveat: We are not declaring main() correctly; we will correct this omission after we learn more about pointers.

K&R-4.3 External Variables

A C program consists of external objects, which are either variables or functions.

External vs. Internal

Variables and functions defined outside any function are called external.

Variables defined inside a function are called internal.

Functions defined inside another function would also be called internal; however standard C does not have internal functions. That is, you cannot in C define a function inside another function. In this sense C is not a fully block-structured language (see block structure below).

Defining External Variables

As stated, a variable defined outside functions is external. All subsequent functions in that file will see the definition (unless it is overridden by an internal definition).

External variables can be used, instead of parameters/arguments to pass information between functions. It is sometimes convenient not to repeat a long list of arguments common to several functions. However, using external variables also has problems: It makes the exact information flow harder to deduce when reading the program.

When we solved quadratic equations in section 1.10 our second method used external variables.

K&R-4.4 Scope Rules

Scope rules determine the visibility of names in a program. In C the scope rules are fairly simple.

Internal Names (Variables)

Since C does not have internal functions, all internal names are variables. Internal variables can be automatic or static. We have seen only automatic internal variables, and this section will discuss only them. Static internal variables are discussed in section 4.6 below.

An automatic variable defined in a function is visible from the definition until the end of the function (but see block structure, below).

If the same variable name is defined internal to two functions, the variables are unrelated.

Parameters of a function are the same as local variables in these respects.

External Names

  int main(...) {...}
  int value;
  float joe(...) {...}
  float sam;
  int bob(...) {...}

An external name (function or variable) is visible from the point of its definition (or declaration as we shall see below) until the end of that file. In the example above

main() cannot call joe() or bob(), and cannot use either value or sam.
joe() can call main() and can use value but cannot use sam or call bob().
bob() can call main() or joe() and can use value and sam.

Definitions and Declarations

There can be only one definition of a given external name in the entire program (even if the program includes many files). However, there can be multiple declarations of the same name.

A declaration describes a variable (gives its type) but does not allocate space for it. A definition both describes the variable and allocates space for it.

  extern int X;
  extern double z[];
  extern float f(double y);

Thus we can put declarations of a variable X, an array z[], and a function f() at the top of every file and then X and z are visible in every function in the entire program. Declarations of z[] do not give its size since space is not allocated; the size is specified in the definition.

If declarations of joe() and bob() were added at the top of the previous example, then main() would be able to call them.

If an external variable is to be initialized, the initialization must be put with the definition, not with a declaration.

How to Tell Declarations and Definitions Apart

If a function is written inside another function, the program is not written in C.
If a function is written outside all other functions, it is a definition if it has {} with code inside.
If a function is written outside all other functions, it is a declaration if it does not have {}.
If a variable is declared/defined inside a function without the C keyword extern, we have a definition and the variable is internal to the function. (With extern, as in the second quadratic program of section 1.10, we have a declaration of an external variable.)
If a variable is declared/defined outside every function, and includes the C keyword extern, we have a declaration of an external variable.
If a variable is declared/defined outside every function, and does not include the C keyword extern, we have a definition of an external variable.
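These rules can all be seen in one small file. The following is a sketch; the names shared and read_shared() are made up for illustration. The declaration at the top is what lets read_shared() use a variable that is not defined until later in the file.

```c
extern int shared;        /* declaration: gives the type, allocates no space */

int read_shared(void) {   /* definition of a function: it has a body */
  return shared;          /* legal because shared was declared above */
}

int shared = 42;          /* definition: allocates space; the initialization goes here */
```

Without the extern line, the use of shared inside read_shared() would be an error, since the definition comes later in the file.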

K&R-4.5 Header Files

  #include <stdio.h>
  double f(double x);
  int main() {
    float y;
    int x = 10;
    printf("x in main is %i\n", x);
    printf("f(x) is %f\n", f(x));
    return 0;
  }
  double f(double x) {
    printf("x in f is %f\n", x);
    return x;
  }
  x in main is 10
  x in f is 10.000000
  f(x) is 10.000000

The code above shows how valuable having the types declared can be. The function f() is the identity function. However, main() knows that f() takes a double so the system automatically converts x to a double when calling f().

It would be awkward to have to change every file in a big programming project when a new function was added or had a change of signature (types of arguments and return value). What is done instead is that all the declarations are included in a single header file. The definitions remain scattered over many files. (Each function is naturally defined only once).

For now assume the entire program is in one directory. Create a file with a name like functions.h containing the declarations of all the functions. Then early in every .c file write the line

    #include  "functions.h"

Note the quotes not angle brackets, which indicates that functions.h is located in the current directory, rather than in the standard place that is used for <>.

4.6: Static Variables

Lifetime and Visibility

We need to distinguish the lifetime of the value in a variable from the visibility of the variable.

Consider the variable x in the trivial example

  void f(void) { int x = 5; printf("%d\n", x++); }

No matter how many times f() is called, the value printed will always be 5. This is because each call re-initializes x to 5. We say that the lifetime of x's value is one execution of the function. In contrast an external variable maintains values assigned to it; its lifetime is permanent.

In addition, x, a local variable, is not visible in any other function. That is, the visibility of x is local to the function in which it is defined.

Static Variables

The adjective static has very different meanings when applied to internal and external variables.

For external variables, static decreases the visibility of the variable (but retains the lifetime of its value).
For internal variables, static increases the lifetime of its value (but retains the local visibility of the variable).

  int main(...){...}
  static int b16;
  void sam(...){...}
  double beth(...){...}

If an external variable is defined with the static attribute, its visibility is limited to the current file. In the example above b16 is naturally visible in sam() and beth(), but not main(). The addition of static means that if another file has a definition or declaration of b16, with or without static, the two b16 variables are not related.

If an internal variable is declared static, its lifetime is the entire execution of the program. This means that if the function containing the variable is called twice, the value of the variable at the start of the second call is the final value of that variable at the end of the first call.
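A call counter is the classic use of a static internal variable. The sketch below uses made-up names and contrasts a static internal variable with an automatic one.

```c
/* Hypothetical counter: calls_so_far keeps its value between invocations. */
int times_called(void) {
  static int calls_so_far = 0;   /* initialized once, not on every call */
  calls_so_far++;
  return calls_so_far;
}

/* Contrast: an automatic variable is re-initialized on every call. */
int always_one(void) {
  int n = 0;
  n++;
  return n;
}
```

Successive calls of times_called() return 1, 2, 3, and so on, whereas always_one() returns 1 every time. In both functions the variable is visible only inside its own function.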

Static Functions

As we know, there are no internal functions in standard C. If an (external) function is defined to be static, its visibility is limited to the current file (as for static external variables).

K&R-4.7 Register Variables

Ignore this section. Register variables were useful when compilers were primitive. Today, compilers can generally decide, better than programmers, which variables should be put in registers.

K&R-4.8: Block Structure

Standard C does not have internal functions, that is you cannot in C define a function inside another function. In this sense C is not a fully block-structured language.

Of course C does have internal variables; we have used them in almost every example. That is, most functions we have written (and will write) have variables defined inside them.

  #include <stdio.h>
  int main(void) {
    int x = 5;
    printf ("The value of outer x is %d\n", x);
    {
      int x = 10;
      printf ("The value of inner x is %d\n", x);
    }
    printf ("The value of outer x is %d\n", x);
    return 0;
  }
  The value of outer x is 5
  The value of inner x is 10
  The value of outer x is 5

Also C does have block structure with respect to variables. This means that inside a block (remember that a block is a bunch of statements surrounded by {}) you can define a new variable with the same name as the old one. These two variables are unrelated. The lifetime of the new variable is just the lifetime of the execution of the block in which it is defined.

For example, the program above produces the output shown.

Remark: The gcc compiler for C does permit one to define a function inside another function. These are called nested functions. Some consider this gcc extension to be evil; we will not use it.

Note that we have used nested blocks many times without calling them out. Specifically, when you use {} to group the body of a for loop or the then portion of an if-then-else these also are blocks since they are enclosed by {}.

Homework: Write a C function int odd (int x) that returns 1 if x is odd and returns 0 if x is even. Can you do it without an if statement?

K&R-4.9 Initialization

Default Initialization

Static and external variables are, by default, initialized to zero. Automatic i.e., non-static, internal variables (the only kind left) are not initialized by default.

Initializing Scalar Variables

As in Java, you can write int X=5-2;. For external or static scalars, that is all you can do: the initializer must be a constant expression.

  int x=4;
  int y=x-1;

For automatic, internal scalars the initialization expression can involve previously defined values as shown above (even function calls are permitted).

Initializing Arrays

  int BB[8] = {4,9,2};
  int AA[] = {3,5,12,7};
  char str[] = "hello";
  char str[] = {'h','e','l','l','o','\0'};  // same as the line above

You can initialize an array by giving a list of initializers as shown on the right.

The last 5 elements of BB are initialized to zero, since the list has fewer initializers than the array has elements.
The size of AA is automatically 4. This is very convenient.
The last two are the same; the size of str is 6.

K&R-4.10 Recursion

The same as Java.

K&R-4.11 The C Preprocessor

Normally, before the compiler proper sees your program, a utility called the C preprocessor is invoked to include files and perform macro substitutions.

K&R-4.11.1 File Inclusion

  #include <filename>
  #include "filename"

We have already discussed both forms of file inclusion. In both cases the file mentioned is textually inserted at the point of inclusion. The difference between the two is that the first form looks for filename in a system-defined standard place; whereas, the second form first looks in the current directory.

K&R-4.11.2 Macro Substitution

  #define MAXLINE 20
  #define MULT(A, B) ((A) * (B))
  #define MAX(X, Y)  (((X) > (Y)) ? (X) : (Y))
  #undef getchar

We have already used examples of macro substitution similar to the first line on the right. The second line, which illustrates a macro with arguments, is more interesting.

Without all the parentheses on the RHS, the MULT macro would still be legal, but would (sometimes) give the wrong answers.
Question: Why?
Answer: Consider MULT(x+4, y+3)

Note that macro substitution is not the same as a function call (with standard call-by-value or call-by-reference semantics). Even with all the parentheses in the third example you can get into trouble since MAX(a++,5) can increment a twice. If you know call-by-name from algol 60 fame, this will seem familiar.

We probably will not use the fourth form. It is used to un-define a macro from a library so that you can write another version.

There is some fancy stuff involving # in the RHS of the macro definition. See the book for details; I do not intend to use it.

  #if integer-expr
  ...
  #elif integer-expr
  ...
  #else
  ...
  #endif

K&R-4.11.3 Conditional Inclusion

The C-preprocessor has a very limited set of control flow items. On the right we see how the C

  if (cond1)
  ...
  else if (cond2)
  ...
  else
  ...
  end if

construct is written. The individual conditions are simple integer expressions consisting of integers, some basic operators and little else. Perhaps the most useful additions are the preprocessor function defined(name), which evaluates to 1 (true) if name has been #define'd, and the ! operator, which converts true to false and vice versa.

  #if !defined(HEADER22)
  #define HEADER22
  // The contents of header22.h
  // goes here
  #endif

We can use defined(name) as shown on the right to ensure that a header file, in this case header22.h, is included only once.

Question: How could a header file be included twice unless a programmer foolishly wrote the same #include twice?
Answer: One possibility is that a user might include two systems headers h1.h and h2.h each of which includes h3.h.

Two other directives, #ifdef and #ifndef, test whether a name has been defined. Thus the first line of the previous example could have been written #ifndef HEADER22.

  #if SYSTEM == MACOS
    #define HDR "macos.h"
  #elif SYSTEM == WINDOWS
    #define HDR "windows.h"
  #elif SYSTEM == LINUX
    #define HDR "linux.h"
  #else
    #define HDR "empty.h"
  #endif
  #include HDR

On the right we see a slightly longer example of the use of preprocessor directives. Assume that the name SYSTEM has been set to the name of the system on which the current program is to be run (not compiled). Assume also that individual header files have been written for macos, windows, and linux systems. Then the code shown will include the appropriate header file.

Note: The quotes used in the various #defines for HDR are not required by #define, but instead are needed by the final #include.

Chapter K&R-5 Pointers and Arrays

  public class X {
    int a;
    public static void main(String args[]) {
      int i1;
      int i2;
      i1 = 1;
      i2 = i1;
      i1 = 3;
      System.out.println("i2 is " + i2);
      X x1 = new X();
      X x2 = new X();
      x1.a = 1;
      x2 = x1;     // NOT x2.a = x1.a
      x1.a = 3;
      System.out.println("x2.a is " + x2.a);
    }
  }

Pointers are a big difference between Java and C. You can read chapter 2 of Dive into Systems for another account of C pointers.

Much of the material on pointers has no explicit analogue in Java; it is there kept under the covers. If in Java you have an Object obj, then obj is actually what C would call a pointer. The technical term is that Java has reference semantics for all objects. In C this will all be quite explicit.

To give a Java example, look at the snippet on the right. The first part works with integers. We define 2 integer variables; initialize the first; set the second to the first; change the first; and print the second. Naturally, the second has the initial value of the first, namely 1.

The second part deals with X, a trivial class, whose objects have just one data component, an integer called a. We mimic the above algorithm. We define two X's and work with their integer field (a). We then proceed as above: initialize the first integer field; set the second to the first; change the first; and print the second. The result is different from the above! In this case the second has the altered value of the first, namely 3.

The key difference between the two parts is that (in Java) simple scalars like i1 have value semantics; whereas objects like x1 have reference semantics. But enough Java, we are interested in C.

5.1 Pointers and Addresses

You will learn later this semester and again in 202, that the OS finagles memory in ways that would make Bernie Madoff smile. But, in large part thanks to those shenanigans, user programs can have a simple view of memory. For us C programmers, memory is just a large array of consecutively numbered locations.

The machine model we will use in this course is that the fundamental unit of addressing is a byte and a character (a char) exactly fits in a byte. Other types like short, int, double, float, long normally take more than one byte, but always a consecutive range of bytes.

lvalues and rvalues

One consequence of our memory model is that associated with int z=5; are two numbers. The first number is the address of the location in which z is stored. The second number is the value stored in that location; in this case that value is 5. The first number, the address, is often called the lvalue; the second number, the contents, is often called the rvalue. Why l and r? I know we did this already; I think it is worth repeating.

Consider
z = z + 1;
To evaluate the right hand side (RHS) we need to add 5 to 1. In particular, we need the value contained in the memory location assigned to z, i.e., we need 5. Since this value is what is needed to evaluate the RHS of an assignment statement it is called an rvalue.

Then we compute 6=5+1. Where should we put the 6? We look at the LHS and see that we put the 6 into z; that is, into the memory location assigned to z. Since it is the location that is needed when evaluating a LHS, the address is called an lvalue.

Start Lecture #04

The Unary Operators & and *

As we have just seen, when a variable appears on the LHS, its lvalue or address is used. What if we want the address of a variable that appears on the RHS; how do we get it?

In a language like Java the answer is simple; we don't.

In C we use the unary operator & and write p=&x; to assign the address of x to p. After executing this statement we say that p points to x or p is a pointer to x. That is, after execution, the rvalue of p is the lvalue of x. In other words the value of p is the address of x.

  int x=13;

Look at the declaration on the right. x is familiar; it is an integer variable initially containing 13. Specifically, the rvalue of x is (initially) 13. What about the lvalue of x, i.e., the location in which the 13 is stored? It is not an int; it is an address into which an int can be stored. Alternately said, it is a pointer to an int.

The unary prefix operator & produces the address of a variable, i.e., &x gives the lvalue of x, i.e. it gives a pointer to x.

The unary operator * does the reverse action. When * is applied to a pointer, it gives the object (object is used in the English not OO sense) pointed to. The * operator is called the dereferencing or indirection operator.

  int x=13;
  int *p = &x;

Now look at the declaration of p on the right. It says that p is the kind of thing, that when you apply * to it you get an int, i.e., p is a pointer to an int. That is why we can initialize p to &x.

Note: Try to avoid the common error of thinking the second line on the right initializes *p to &x. It doesn't. It declares and initializes p not *p.

How Does it Look in Memory?

On the right we show how p and x might be stored in memory. After we finish with C we will study the memory model in more detail. Here I just give enough to understand that pointers like p are also variables that are stored just like ints, floats, and chars.

The basic storage unit on modern computers is a byte. We shall assume that a char fits perfectly in a byte. However, ints, floats, and pointers are bigger. Each requires several bytes. For today assume each is 4 bytes.

In the diagram on the right x happens to be stored in locations 5000-5003 (i.e., each box is 4 bytes). x has value 13; more precisely its rvalue is 13. Since the address of x is 5000, the lvalue of x is 5000.

The integer pointer p happens to be stored in 8040-8043; i.e., its address or lvalue happens to be 8040. Since p points to x, the rvalue of p equals the lvalue of x, which is 5000.

An Example Code Sequence (Part One)

  // part one of three
  int x=1;
  int y=2;
  int z[10];
  int *ip;
  int *jp;
  ip = &x;

Consider the code sequence on the right (part one). The first 3 lines we have seen many times before; the next three are new. Recall that in a C declaration, all the doodads around a variable name tell you what you must do to the variable to get the base type at the beginning of the line. Thus the fourth line says that if you dereference ip (i.e., if you look at the contents of the address in ip, not of ip), you get an integer. Common parlance is to call ip an integer pointer (which is why I named it ip). Similarly, jp is another integer pointer.

At this point both ip and jp are uninitialized. The last line sets ip to the address of x. Note that the types match: both ip and &x are pointers to an int.

An Example Code Sequence (Part Two)

  // part two of three
  y = *ip;     // L1
  *ip = 0;     // L2
  ip = &z[0];  // L3
  *ip = 0;     // L4
  jp = ip;     // L5
  *jp = 1;     // L6

In part two, L1 sets y=1 as follows: ip now points to x, * does the dereference so *ip is x. Since we are evaluating the RHS, we take the contents not the address of x and get 1.

L2 sets x=0;. The RHS is clearly 0. Where do we put this zero? Look at the LHS: ip currently points to x, * does a dereference so *ip is x. Since we are on the LHS, we take the address and not the contents of x and hence we put 0 into x.

L3 changes ip; it now points to z[0]. So L4 sets z[0]=0;

Pointers can be used without the dereferencing operator. L5 sets jp to ip. Since ip currently points to z[0], jp now does as well. Hence L6 sets z[0]=1;

An Example Code Sequence (Part Three)

  // part three of three
  ip = &x;         // L1
  *ip = *ip + 10;  // L2
  y = *ip + 1;     // L3
  *ip += 1;        // L4
  ++*ip;           // L5
  (*ip)++;         // L6
  *ip++;           // L7

Part three begins by re-establishing ip as a pointer to x so L2 increments x by 10 and L3 sets y=x+1;.

L4 increments x by 1 as does L5 (because the unary operators ++ and * are right associative).

L6 also increments x, but L7 does not. By right associativity we see that the increment precedes the dereference, so the pointer is incremented (not the pointee). The full story awaits section 5.4 below.

5.2 Pointers and Function Arguments

  void bad_swap(int x, int y) {
    int temp;
    temp = x;
    x = y;
    y = temp;
  }

The program on the right is what a novice programmer just learning C (or Java) would write. It is supposed to swap the two arguments it is called with. However, it fails due to call by value semantics for function calls in C.

When another function calls bad_swap(a,b) the values of the arguments a and b are transmitted to the parameters x and y and then bad_swap() interchanges the values in x and y. But when bad_swap() returns, the final values in x and y are NOT transmitted back to the arguments: a and b are unchanged.

But functions that change the values of their arguments are useful! We won't give them up without a fight.

Actually, what is needed is to be able to change the value of variables used in the caller (even if some related variables become the arguments) and that distinction is the key. Just because we want to swap the values of a and b, doesn't mean the arguments have to be literally a and b.

  void swap(int *px, int *py) {
    int temp;
    temp = *px;
    *px = *py;
    *py = temp;
  }

The program on the right has two parameters px and py each of which is a pointer to an integer (*px and *py are the integers). Since C is a call-by-value language, changes to the parameters, which are the pointers px and py would not result in changes to the corresponding arguments. But the program on the right doesn't change the pointers at all, instead it changes the values they point to.

Since the parameters are pointers to integers, so must be the arguments. A typical call to this function would be
int A=10,B=20; swap(&A,&B);

It is crucial to understand, how this call results in A becoming 20, the value previously in B, and B becoming 10, the value previously in A.

On the right is a pictorial explanation. A has a certain address. &A equals that address (more precisely the rvalue of &A = the lvalue of A). Similarly for &B and B. These are shown by the solid arrows in the diagram.

The call swap(&A,&B) copies (the rvalue of) &A into (the rvalue of) the first parameter, which is px. Similarly for &B and the second parameter, py. These are shown by the dotted arrows. Thus the value of px is the address of A, which is indicated by the arrow. Again, to be pedantic, the rvalue of px equals the rvalue of &A, which equals the lvalue of A. Similarly for py, &B, and B.

Swapping px with py would change the dotted arrows, but would not change anything in the caller. However, we don't swap px with py; instead we swap *px with *py. That is, we dereference the pointers and swap the things pointed to! This subtlety is the key to understanding the effect of many C functions. It is crucial.

Homework: Write rotate3(A,B,C) that sets A to the old value of B, sets B to old C, and C to old A.

Homework: Write plusminus(x,y) that sets x to old x + old y and sets y to old x - old y.

Start Lecture #05

A Larger Example—getch(), ungetch(), and getint()

  #include <stdio.h>
  #define BUFSIZE 100
  char buf[BUFSIZE];
  int  bufp = 0;
  int  getch(void);
  void ungetch(int);
  int getint(int *pn);
  int getch(void) {
    return (bufp>0) ? buf[--bufp] : getchar();
  }
  void ungetch(int c) {
    if (bufp >= BUFSIZE)
      printf("ungetch: too many chars\n");
    else
      buf[bufp++] = c;
  }
  #include <stdio.h>
  #include <ctype.h>
  int getint(int *pn) {
    int c, sign;
    while (isspace(c=getch())) ;
    if (!isdigit(c) && c!=EOF && c!='+' && c!='-') {
      ungetch(c);
      return 0;
    }
    sign = (c=='-') ? -1 : 1;
    if (c=='+' || c=='-')
      c = getch();
    for (*pn = 0; isdigit(c); c=getch())
      *pn = 10 * *pn + (c-'0');
    *pn *= sign;
    if (c != EOF)
      ungetch(c);
    return c;
  }

The function pair getch() and ungetch() generalizes getchar() by supporting the notion of unreading characters, i.e., of pushing already-read characters back onto the input so that they will be read again.

Note that ungetch() is careful not to exceed the size of the buffer used to store the pushed back characters. Remember that the C compiler does not generate run-time checks that prevent you from accessing an array beyond its bound. As mentioned previously, a number of break-ins had been enabled by the lack of such checks in library programs.

Also shown is getint(), which reads an integer from standard input (stdin) using getch() and ungetch().

getint() returns the integer read via a parameter. As we have seen the new value of a parameter is not passed back to the caller. Hence, getint() uses the pointer/address business we just saw with swap().

Specifically any change made to pn by getint() would be invisible to the caller. However, getint() changes *pn; a change the caller does see.

The value returned by the function itself is the status: zero means the next characters do not form an integer, EOF (which is negative) means we are at the end of file, positive means an integer has been found.

Briefly the program works as follows.

  Skip blanks
  Check for legality
  Determine sign
  Evaluate number
    one digit at a time

Although short, the program is not trivial. Indeed, there are some details to note.

A + or - followed by a non digit is treated as zero. See homework 5-1 below.
Care is needed to understand the result when the file being read does not end in a newline. For example, if getint() is invoked on a file containing just three characters 123 (no newline at the end), it will set *pn=123 as desired but will return EOF. I suspect that most programs using getint() will, in this case, ignore *pn and just treat it as EOF.
If this, and other, examples from the book seem clever and subtle, remember that Ritchie, one of the authors, invented C.

If you were asked to produce a getint() function you would have three tasks.

Write, in precise English, a detailed specification of what is to happen in all cases.
Write a C program implementing this specification.
Get the C syntax right.

The third is clearly the easiest task. I suspect that the first is the hardest.

Homework: 5-1. As written, getint() treats a + or - not followed by a digit as a valid representation of zero. Fix it to push such a character back on the input.

  > Hi,
  > 
  > Many students have submitted CIMS acount requests because they are 
  > enrolled in UA.0201-005.  This is just a heads up that I am rejecting 
  > all of these in the request system since the class accounts should be 
  > made directly by the systems staff with use of the class roster.  This 
  > is Allan Gottlieb's class so I am copying in case he hasn't yet formally 
  > requested class accounts for all his students.
  > 
  > Best,
  > Stephanie

  Hi Stephanie,

  Thanks for the heads up. Indeed, Allan has requested class accounts
  for his students in this course, and they have been created by us
  based on the roster.

  Allan, in case you don't have it, you may point any student who is
  unsure of their account status to this link, where they may view
  their current status and reset their password if desired: 

  https://cims.nyu.edu/webapps/password/reset

  Thanks,
  Aric

5.3 Pointers and Arrays

In C pointers and arrays are closely related. As the book says

Any operation that can be achieved by array subscripting can also be done with pointers.

The authors go on to say

The pointer version will in general be faster but, at least to the uninitiated, somewhat harder to understand.

The second clause is doubtless correct; but perhaps not the first. Remember that the 2e was written in 1988 (1e in 1978). Compilers have improved considerably in the past 30+ years and, I suspect, would turn out nearly as fast code for many of the array versions.

The next few sections present some simple examples using pointers.

Some Tiny, But Subtle Examples.

  int a[5], *pa;
  pa = &a[0];
  int x = *pa;
  x = *(pa+1);
  x = a[0];
  x = *a;
  int i;
  x = a[i];
  x = *(a+i);

On the far right we see some code involving pointers and arrays. After the first two lines are executed we get the diagram shown on the near right. pa is a pointer to the first element of the array a. Remember that, as in Java, the first element of a C array a is a[0]. Similarly, pa+3 is a pointer to the fourth element of the array.

But note that pa+3 is just an expression and not a container (no lvalue): you can't put another pointer into pa+3 just like you can't put another integer into i+3.

The next line sets x (which is a container) equal to (the rvalue of) a[0]; the line after that sets x=a[1].

Then we explicitly set x=a[0].

The line after that has the same effect! That is because in C the value of an array name equals the address of its first element. (The rvalue of a = the rvalue of &a[0] = the address of a[0] = the lvalue of a[0].) Again note that a (i.e., &a[0]) is an expression, not a variable, and hence is not a container.

Said yet another way a and pa have the same value (rvalue) but are not the same thing!

Similarly, the last two lines each have the same effect, this time for a general element of the array a[i].

A Subtle Difference Between Array Names and Pointers (one more time)

  int a[5], *pa;
  pa = &a[0];
  pa = a;
  a = pa;        // illegal
  &a[0] = pa;    // illegal

Both pa and a are pointers to ints. In particular a is defined to be &a[0]. Although pa and a have much in common, there is an important difference: pa is a variable, its value can be changed; whereas &a[0] (and hence a) is an expression and not a variable. In particular the last two lines on the right are illegal.

Another way to say this is that &a[0] is not an lvalue. This is similar to the legality of x=y+5; versus the illegality of y+5=x;

Calculating the Length of a String: `mystrlen()`

  int mystrlen(char *s) {
    int n;
    for (n=0; *s!='\0'; s++,n++) ;
    return n;
  }

The Program Itself

The code on the right illustrates how well C pointers, arrays, and strings work together. What a tiny program to find the length of an arbitrary string!

Note that the body of the for loop is null; all the work is done in the for statement itself.

Calling `mystrlen()`

  char str[50], *pc;
  // calculate str and pc
  mystrlen(pc);
  mystrlen(str);
  mystrlen("Hello, world.");

Note the various ways in which mystrlen() can be called.

The first call shown exactly matches the function declaration. That is, both the argument (in the function call) and the parameter (in the function definition) have the same type, a pointer to a char. Don't forget in a C declaration you decorate a variable with enough stuff to obtain one of the primitive types.
The types in the second call match as well, once we remember that an array name has the same value as a pointer to the first element.
Finally, the third is like the second since a string constant is a character array.

One More Time: Value Versus Address (or rvalue Versus lvalue)

  #include <stdio.h>
  int x, *p;
  int main () {
    p = &x;
    x = 12;
    printf("p = %p\n", p);
    printf("*p = %d\n", *p);
    p++;
    printf("p = %p\n", p);
    printf("*p = %d\n", *p);
  }

The example on the right below illustrates well the difference between a variable, in this case x, and its address &x. The first value printed is the address of x. This is not 12. Instead, it is some (probably large) number that happens to be the address of x.

In fact when run on my laptop the program produced the following output.

  p = 0x7fc41fc78040
  *p = 12
  p = 0x7fc41fc78044
  *p = 0

Let's go over this 7-line main() function line by line.

The value of p is set to the address of x (or the rvalue of p is set to the lvalue of x).
The (r)value of x becomes 12. (You can't change lvalues with simple assignment statements.)
Just as printf() uses %d to print integers, it uses %p to print pointers. The 0x means that the number is base 16, which we will learn about later. For now we just need to note that it is a big number and is definitely not 12. So the address (lvalue) of x is a big number.
The next line prints the value (rvalue) of x, which of course is 12.
The next line increments p so that it points to one integer beyond x. Since, on my system, integers use 4 memory locations, p is incremented by 4, as shown in the next line.
The incremented p is printed.
The value of the next integer after x is printed. But there is no integer after x. Hence the program is erroneous! Its output is unpredictable!

Note: Incrementing p does not increment x. Instead, the result is that p points to the next integer after x. In this program there is no further integer after x, so the result is unpredictable and the program is erroneous. Specifically, the value of *p is now unpredictable. On my system the value of *p was 0, but that can NOT be counted on. If, instead of pointing to x, we had p point to A[7] for some large int array A, then the last line would have printed the value of A[8] and the penultimate line would have printed the address of A[8].

Remarks:

In a previous semester I was asked whether accessing variables through a pointer takes more time than accessing variables directly, since you have to look into two addresses (the pointer and then the value).
Compilers! See O'H-3.4.1.
I will stay on zoom for a few minutes after each class to answer questions.

String Length with Arrays and Pointers

  #include <stdio.h>
  int mystrlen (char *s);
  int main () {
    char stg[] = "hello";
    printf ("The string %s has %d characters\n",
            stg, mystrlen(stg));
  }
  // version one: array notation
  int mystrlen (char s[]) {
    int i;
    for (i = 0; s[i] != '\0'; i++) ;
    return i;
  }
  // version two: pointer notation (use one version or the other)
  int mystrlen (char *s) {
    int i = 0;
    while (*s++ != '\0')
      i++;
    return i;
  }

On the right we show two versions of a string length function. The first version uses array notation for the string; the second uses pointer notation. The main() program is identical for the two versions so is shown only once.

Note how very close the two string length functions are. This is another illustration of the similarity of arrays and pointers in C.

Note the two declarations

  int mystrlen (char *s);
  int mystrlen (char s[]);

They are used 3 times in the code on the right. In C these two declarations are equivalent. Changing any or all of them to the other form does not change the meaning of the program.

I realize an array does not at first seem the same as a pointer. Remember that the array name itself is equal to a pointer to the first element of the array. Hence declaring

  float a[5], *b;

results in a and b having the same type (pointer to float). But the array a has additionally been defined; that is, space for 5 floats has been allocated. Hence a[3] = 5; is legal. b[3] = 5 is syntactically legal, but may be semantically invalid and abort at runtime, unless b has previously been set to point to sufficient space.

In the pointer version of mystrlen() we encounter a common C idiom *s++. First note that the precedence of the operators is such that *s++ means *(s++). That is, we are moving (incrementing) the pointer and examining what it used to point at. We are not incrementing a part of the string. Specifically, we are not executing (*s)++;

Simple Substitution

  void changeltox (char *s) {
    while (*s != '\0') {
      if (*s == 'l')
        *s = 'x';
      s++;
    }
  }

The program on the right loops through the input string and replaces each occurrence of l with x.

The while loop and increment of s could have been combined into a for loop.

This version is written in pointer style.

Homework: Rewrite changeltox() to use array style and a for loop.

String Copy (again)

  void mystrcpy (char *t, char *s) {
    while ((*t++ = *s++) != '\0') ;
  }

Check out the ONE-liner on the right. Note especially the use of standard idioms for marching through strings and for finding the end of the string.

Slick, very slick!

Even slicker is to note that '\0' has value 0 and that testing an expression != 0 is the same as just testing the expression, so the while statement is equivalent to
while (*t++ = *s++);

But the program is scary, very scary!

Question: Why is it scary?
Answer: Because there is no length check.

If the character array t (or equivalently the block of characters t points to) is smaller than the character array s, then the copy will overwrite whatever happens to be located right after the array t.

The lack of such length checks has permitted a number of security breaches.

Using `int *A` vs `int A[]` as a Parameter

  double f(int *a);
  double f(int a[]);

The two lines on the right are equivalent when used as a function declaration (or, with the semicolon replaced by a {, as the head line of a function definition). The authors say they prefer the first. For me it is not so clear cut. In mystrlen() above I would indeed prefer char *s as written since I think of a string as a block of chars with a pointer to the beginning.

  double dotprod(double A[], double B[]);

However, if I were writing an inner product routine (a.k.a. dot product), I would prefer the array form as on the right since I think of dot product as operating on vectors.

But of course, more important than which one I prefer or the authors prefer, is the fact that they are equivalent in C.

Note: The definition int a[10]; reserves space for 10 ints and no pointers; whereas the definition int *a; reserves space for no ints and 1 pointer.

Passing Part of an Array

  #include <stdio.h>
  void f(int *p);
  int main() {
    int A[20];
    // initialize all of A
    f(A+6);
    return 0;
  }
  void f(int *p) {
    printf("legal? %d\n", p[-2]);
    printf("legal? %d\n", *(p-2));
  }

In the code on the right, main() first declares an integer array A[] of size 20 and initializes all its members (how the initialization is done is not important). Then main(), in an effort to protect the beginning of A[], passes only part of the array to f(). Remembering that A+6 means (&A[0])+6, which is &A[6], we see that f() receives a pointer to the 7th element of the array A.

The author of main() mistakenly believed that A[0],..,A[5] are hidden from f(). Let's hope this author is not on the security team for the board of elections.

Since C uses call by value, we know that f() cannot change the value of the pointer A+6 in main(). But f() can use its copy of this pointer to reference or change all the values of A, including those before A[6]. On the right, f() successfully references A[4].

It naturally would be illegal for f() to reference (or worse change) p[-9].

Start Lecture #06

K&R-5.4: Address Arithmetic

An important point is that, given the declaration int *pa; the increment pa+=3 does not simply add three to the address stored in pa. Instead, it increments pa so that it points 3 integers further forward (since pa is a pointer to an integer). If pc is a pointer to a double, then pc+=3 increments pc so that it points 3 doubles forward.

  #include <stdio.h>
  int main (void) {
    int q[] = {11, 13, 15, 19};
    int *p = q; // initializes p NOT *p
    printf("*p = %d\n", *p);
    printf("*p++ = %d\n", *p++);
    printf("*p = %d\n", *p);
    printf("*++p = %d\n", *++p);
    printf("*p = %d\n", *p);
    printf("++*p = %d\n", ++*p);
    return 0;
  }

To better understand pointers, arrays, ++, and *, let's go over the code on the right line by line. For reference, the precedence table is in K&R section 2.12. The output produced is

  *p = 11
  *p++ = 11
  *p = 13
  *++p = 15
  *p = 15
  ++*p = 16

`alloc()/afree()`: A Simple Storage Allocator

  #define ALLOCSIZE 15000
  static char allocbuf[ALLOCSIZE];
  static char *allocp = allocbuf;
  char *alloc(int n) {
    if (allocp+n <= allocbuf+ALLOCSIZE) {
      allocp += n;
      return allocp-n;   // previous value
    } else               // not enough space
      return 0;
  }
  void afree (char *p) {
    if (p>=allocbuf && p<allocbuf+ALLOCSIZE)
      allocp = p;
  }

On the right is a primitive storage allocator and freer, alloc() and afree(). This pair of routines distributes and reclaims memory from a buffer allocbuf. The internal pointer allocp points to the boundary between already allocated memory (on the left of allocp in the diagrams) and memory still available for allocation (on the right).

The top picture shows the initial state: nothing is allocated; everything is free.

Looking at the middle (before) diagram we see four blocks that have been allocated and a large free region on the right. The routines alloc() and afree() control the internal pointer allocp.

When alloc(n) is called, with a non-negative integer argument, it returns a pointer to a block of n characters and then moves allocp to the right, indicating that these n characters are no longer available.

When afree(p) is called with the pointer returned by alloc(), it resets the state of alloc()/afree() to what it was before the call to alloc().

A very strong assumption is being made that calls to alloc()/afree() are executed in a stack-like manner, i.e., the routines assume that a block being freed is the last block that was allocated.

These routines would be useful for managing storage for C automatic, local variables. They are far from general. The standard library routines malloc()/free() do not make this assumption and as a result are considerably more complicated.

Since pointers, not array positions, are communicated to users of alloc()/afree(), these users do not need to know the name of the array, which is kept under the covers via static.

Notes:

The initialization of allocp is the same as setting it to &allocbuf[0].
Normally, the only reasonable initial values for a pointer are zero or some expression involving the addresses of previously declared objects.
C guarantees that no valid pointer has value zero. That is, &x is never zero. Thus setting a pointer to zero is a way of saying it points to no object. Although a literal 0 is permitted, most programmers use NULL.
Question: What happens if alloc() is called with n<0?
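Answer: alloc() moves allocp backward and returns the old value of allocp, so the next allocation can overlap both earlier blocks and the memory just handed out. It is even worse if n is very negative, since allocp can then point before allocbuf. Scary!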

Homework: What is wrong with the following calls to alloc() and afree()? Assume that ALLOCSIZE is big enough.

  char *p1, *p2, *p3;
  p1 = alloc(10);
  p2 = alloc(20);
  p3 = alloc(15);
  afree(p3);
  afree(p1);
  afree(p2);

Pointer Comparison

If pointers p and q point to elements of the same array (or string), then comparisons between the pointers using <, <=, ==, !=, >, and >= all work as expected.

If pointers p and q do not point to members of the same array, the result of the relational comparisons (<, <=, >, >=) is undefined, with one exception: p pointing to an element of an array and q pointing to the first element past the array.

Any pointer can be compared to 0 via == and !=. Normally, NULL is used, but an actual literal 0 is permitted.

Pointer Subtraction

Again we need p and q pointing to elements of the same array. In that case, if p<=q, then q-p+1 equals the number of elements from p to q (including the elements pointed to by p and q).

Using the Allocator

  #include <stdio.h>
  void changeltox(char *z);
  void mystrcpy (char *s, char *t);
  char *alloc(int n);
  int main() {
    char string[] = "hello";
    char *string2 = alloc(6);
    mystrcpy(string2, string);
    changeltox(string);
    printf ("String is now %s\n", string);
    printf ("String2 is now %s\n", string2);
  }

These examples are interesting in their own right, beyond showing how to use the allocator.

Making Changes in a New String

We have already written a program changeltox() that changes one character to another in a given string.

After initializing the string to "hello", the code on the right first copies it (using mystrcpy(), a one-liner presented above) and then makes changes in the original. Thus, at the end, we have two versions of the string: the before and the after.

As expected the output is

  String is now hexxo
  String2 is now hello

So far, so good. Let's try something fancier.

Mismatched Sizes and the Resulting Mess

Recall the danger warning given with the code for mystrcpy(char *x, char *y): The code copies all the characters in y (i.e., up to and including '\0') to x ignoring the current length of x. Thus, if y is longer than the space allocated for x, the copy will overwrite whatever happens to be stored right after x.

  #include <stdio.h>
  void changeltox (char*);
  void mystrcpy (char *s, char *t);
  char *alloc(int n);
  int main () {
    char string[] = "hello";
    char *string2 = alloc(2);
    char *string3 = alloc(6);
    mystrcpy (string2, string);
    printf ("String2 is now %s\n", string2);
    printf ("String3 is now %s\n", string3);
    mystrcpy (string3, string);
    changeltox (string);
    printf ("The string is now %s\n", string);
    printf ("String2 is now %s\n", string2);
    printf ("String3 is now %s\n", string3);
  }

The example on the right illustrates the danger. When the code on the right is compiled with the code for changeltox(), mystrcpy(), and alloc(), the following output occurs.

  String2 is now hello
  String3 is now llo
  The string is now hexxo
  String2 is now hehello
  String3 is now hello

What happened?

The string in string contains the 5 characters in the word hello plus the ascii null '\0' to end the string. (The array string has 6 elements so the string fits perfectly.)

The major problem occurs with the first execution of mystrcpy() because we are copying 6 characters into a string that has room for only 2 characters (including the ascii null). This executes flawlessly copying the 6 characters to an area of size 6 starting where string2 points. These 6 locations include the 2 slots allocated to string2 and then the next four locations. Normally it is very hard to tell what has been overwritten, and the resulting bugs can be very difficult to find.

In this situation it is not hard to see what was overwritten since we know how alloc() works. The excess 6-2=4 characters are written into the first 4 slots of string3.

When we print string2 the first time we see no problem! A string pointer just tells where the string starts; the string continues up to the ascii null. So string2 does have all of hello (and the terminating null). Since string3 points 2 characters after string2, the string string3 is just the substring of string2 starting at the third character.

The second mystrcpy copies the six(!) characters in the string hello to the 6 bytes starting at the location pointed to by string3. Since the string string2 includes the location pointed to by string3, both string2 and string3 are changed.

The changeltox() execution works as expected.

K&R-5.5 Character Pointers and Functions

As we know, C does not have string variables, but does have string constants. This arrangement sometimes requires care to avoid errors.

`char amsg[]="hello";` vs `char *msgp="hello";`

  char amsg[] = "hello";
  char *msgp = "hello";
  int main () {...}

Let's see if we can understand the following rules, which can appear strange at first glance.

amsg (which means &amsg[0], a character pointer) cannot be changed (amsg is an rvalue not an lvalue). However, both amsg[1] (an 'e') and *(amsg+1) (the same 'e') can be changed.
msgp (a character pointer) can be changed, but both *(msgp+2) (an 'l') and msgp[2] (the same 'l') cannot be changed.

Perhaps the following will help.

amsg is defined to be &amsg[0], which is an expression not a variable; it is not a container. So it is only an rvalue, not an lvalue. But amsg (like other arrays) does point to a block of 6 containers (amsg[0]...amsg[5]). Hence de-referencing amsg (or amsg+1) does yield a container. Remember that *(amsg+1) is just amsg[1].
msgp is a variable. Hence msgp is an lvalue (a container) and can be changed. However it points to a bunch of character constants, none of which are variables, and hence none can be changed.

The C Idiom for String Copy (Without Sleepless Nights)

  void mystrcpy (char *s, char *t) {
    while (*s++ = *t++) ;
  }

Our first version of this program tested if the assignment did not return the character '\0', which has the value 0 (a fact about ascii null). However checking if something is not 0 is the same (in C) as asking if it is true. Finally, testing if something is true is the same as just testing the something. The C rules can seem cryptic, but they are consistent.

If you have been trembling with fright over this scary function, rest assured and see the following homework problem.

Homework: 5-5 (first part). Write a version of the library function

    char *strncpy(char *s, char *t, int n)

This copies at most n characters from t to s. This code is not scary like other copies since a user of the routine can simply declare s to have space for n characters.

Slick String Length Using Pointer Subtraction

  int mystrlen(char *s) {
    char *p = s;
    while (*p)
      p++;
    return p-s;
  }

The code on the right applies the technique used to get the slick string copy to the related function string length. In addition it uses pointer subtraction. Note that when the return is executed, p points just after the string (i.e., to the terminating null) and s points to its beginning. Thus the difference gives the length.

Normally, pointer subtraction is defined only when both pointers point to the same array or string (or some other objects we haven't studied yet). The point is that you cannot meaningfully subtract two pointers pointing to different objects (say both point to different integer variables). One exception is that subtraction is guaranteed to work if one points to an element of an array and the other points one element past that same array. The function mystrlen() does not utilize this exception since the terminating null is part of the string.

String Comparison

  int mystrcmp(char *s, char *t) {
    for (; *s == *t; s++,t++)
      if (*s == '\0')
        return 0;
    return *s - *t;
  }

We next produce a string comparison routine that returns a negative integer if the string s is lexicographically before t, zero if they are equal, and a positive integer if s is lexicographically after t.

The loop takes care of equal characters. The function returns 0 if we reached the end of the equal strings.

If the loop concludes early, we have found the first difference.

A key is that if exactly one string has ended, its character ('\0') is smaller than the other string's character. This is another ascii fact (ascii null is zero; the rest are positive).

I tried to produce a version using while(*s++ == *t++), but I failed since the loop body and the post loop code were dealing with the subsequent character. I suppose it could have been forced to work if I used a bunch of constructions like *(s-1), but that would have been ugly.

5.6: Pointer Arrays; Pointers to Pointers

For the moment forget that C treats pointers and arrays almost the same. For now just think of a character pointer as another data type.

So we can have an array of 9 character pointers, e.g., char *A[9]. (Note that it is not a pointer to an array of 9 characters.) We shall see fairly soon that this is exactly how some systems (e.g. Unix) transmit command-line arguments to the main() function.

  #include <stdio.h>
  int main() {
    char *STG[3] = { "Goodbye", "cruel", "world" };
    printf ("%s %s %s.\n", STG[0], STG[1], STG[2]);
    STG[1] = STG[2] = STG[0];
    printf ("%s %s %s.\n", STG[0], STG[1], STG[2]);
    return 0;
  }
  Goodbye cruel world.
  Goodbye Goodbye Goodbye.

The code on the right defines an array of 3 character pointers, each of which is initialized to (point to) a string. The first printf() has no surprises. But the assignment statement should fail since we allocated space for three strings of sizes 8, 6, and 6 and now want to wind up with three strings each of size 8 and we didn't allocate any additional space.

However, it works perfectly and the resulting output is shown as well.

Question: What happened? How can space for 8+6+6 characters be enough for 8+8+8?
Answer: We do not have three strings of size 8. Instead, we have one string of size 8, with three character pointers pointing to it.

The picture on the right shows a before and after view of the array and the strings.

This suggests an interesting possibility. Imagine we wanted to sort long strings alphabetically (really lexicographically). Let's not get bogged down in the sort itself and assume it is a simple interchange sort that loops and, if a pair is out of order, executes a swap, which is something like

  temp = x;
  x = y;
  y = temp;

If x, y, and temp are (long but varying size) strings then we have some issues to deal with.

It is expensive to do the three assignments if the strings are very long.
If one of the strings is longer than the space allocated for another, we either overwrite something else (and potentially end the world) or refuse the copy and hence not complete the sort.

Both of these issues go away if we maintain an array of pointers to the strings. If the string pointed to by A[i] is out of order with respect to the string pointed to by A[j], we swap the (fixed size, short) pointers not the strings that they point to.

This idea is illustrated on the right.

Putting the Pieces Together: Sorting Strings

The code on the right below, plus the mystrcmp() function above, produces the output on the left.

  #include <stdio.h>
  int mystrcmp(char *s, char *t);   // defined above
  void sort(int n, char *C[]) {
    int i,j;
    char *temp;
    for (i=0; i<n-1; i++)
      for (j=i+1; j<n; j++)
        if (mystrcmp(C[i],C[j]) > 0) {
          temp = C[i];
          C[i] = C[j];
          C[j] = temp;
        }
  }
  int main() {
    char *STG[] = {"Hello","99","3","zz","best"};
    int i;
    for (i=0; i<5; i++)
      printf ("STG[%i] = \"%s\"\n", i, STG[i]);
    sort(5,STG);
    for (i=0; i<5; i++)
      printf ("STG[%i] = \"%s\"\n", i, STG[i]);
    return 0;
  }

  STG[0] = "Hello"
  STG[1] = "99"
  STG[2] = "3"
  STG[3] = "zz"
  STG[4] = "best"

  STG[0] = "3"
  STG[1] = "99"
  STG[2] = "Hello"
  STG[3] = "best"
  STG[4] = "zz"

You might feel that the sort fails due to call-by-value the same way bad_swap failed previously. Since call-by-value initially copies the arguments into the parameters, but does not, at the end, copy the parameters back to the arguments, swapping C[i] with C[j] has no effect since the parameters C[i] and C[j] are not copied back. But no, C[i] is not a parameter, the array C is the parameter and C[i] is pointed to by C. Yes, this is subtle; but it is also crucial!

You might question if the output is indeed sorted. For example, we remember that ascii '3' is less than ascii '9', and we know that in ascii 'b'<'h'<'z', but why is '9'<'b' and why is 'H'<'b'?

Well, I don't know why they are, but they are. That is, in ascii the digits come before the capital letters, which in turn come before the lower-case letters.

Another Example

  #include <stdio.h>
  int main(int argc, char *argv[]) {
    char c1 = '1', c2 = '2';
    char ac[10] = "wxyXYZ";   // ac = Array of Chars
    ac[1] = c1;
    ac[2] = c2;
    printf("ac[1]=%c ac[2]=%c\n", ac[1], ac[2]);
    char *pc1, *pc2;          // pc = Pointer to Char
    pc1 = &ac[3];
    pc2 = pc1+1;
    printf("*pc1=%c *pc2=%c\n", *pc1, *pc2);
    char *apc[10];            // Array of Pointers to Char
    apc[3] = pc1;             // Points at ac[3]
    apc[4] = pc2-2;           // Points at ac[2]
    printf("*apc[3]=%c *apc[4]=%c\n", *apc[3], *apc[4]);
    return 0;
  }

The program on the right includes several types of variables. In particular we find chars, an array of chars, pointers to chars, and an array of pointers to chars.

The program, when run, produces the following output.

  ac[1]=1 ac[2]=2
  *pc1=X *pc2=Y
  *apc[3]=X *apc[4]=2

You should first confirm that the types are correct. For example, is * always applied to a pointer? Since all the prints use %c for the values printed, all those values must be chars. Are they?

Then confirm that you agree with the values produced.

At one point the program adds 1 to the char pointer pc1. At another point it subtracts 2 from another char pointer. This is valid only if the final value of the pointer is pointing inside the same array as the initial value (or to one location past the end of the array). Is this the case?

Start Lecture #07

K&R-5.7: Multi-dimensional Arrays

  void matmul(int n, int k, int m, double A[n][k],
       double B[k][m], double C[n][m]) {
    int i,j,l;
    for (i=0; i<n; i++)
      for (j=0; j<m; j++) {
        C[i][j] = 0.0;
        for (l=0; l< k; l++)
          C[i][j] += A[i][l]*B[l][j];
      }
  }

C does have normal multidimensional arrays. For example, the code on the right multiplies two matrices. Matrices (the simple 2-dimensional type in linear algebra) are rectangular: all rows have the same number of columns.

In some sense C, like Java, has only one-dimensional arrays. However, a one-dimensional array of one-dimensional arrays of doubles is close to a two-dimensional array of doubles. One difference is the notation: C/Java uses A[][], indicating a 1D array of 1D arrays, rather than A[,] of algebra. Another difference is that, in the example on the right, A[j] is a legal (one-dimensional) array (of size k) if 0≤j<n.

The biggest difference is that a C/Java 2D array need not be rectangular, that is the rows need not be the same length. This will become clear in the next few sections.

  int A[2][3] = { {5,4,3}, {4,4,4} };
  int B[2][3][2] = { { {1,2}, {2,2}, {4,1} },
                     { {5,5}, {2,3}, {3,1} } };

Initializing Multidimensional Arrays

Multidimensional arrays can be initialized. Once you remember that a two-dimensional array is a one-dimensional array of one-dimensional arrays, the syntax for initialization exemplified on the right is not surprising.

(C, like most modern languages, uses row-major ordering, so the last subscript varies the most rapidly.)

  #include <stdio.h>
  int main(int argc, char *argv[]) {
    int A[3][3] = { {1,2,3}, {4,5,6}, {7,8,9} };
    printf("*(A[1]+1)=%d\n", *(A[1]+1) );
    return 0;
  }

2D Arrays Are 1D Arrays (of 1D Arrays)

Looking at the code on the right we see that A[1], written without the second subscript, can be thought of as a 1D array. Thus,

A[1] points to the first component (i.e., component 0) of the second row (i.e., row 1) of A,
A[1]+1 points to the second component of that row,
and *(A[1]+1) is that component, namely 5.
Hence 5 is the value printed.

  char amsg[] = "hello";
  int main(int argc, char *argv[]) {
    printf("%c\n", amsg[100]);
  }

Note: An array of size 1 is similar to an array of size 10, and a pointer to X is very similar to an array of X. For example, the code on the right compiles and runs (it is illegal but not caught by the compiler) in part because the types match.

A related comment is that a pointer to a character is the same as a pointer to 10 characters as far as the C compiler is concerned.

K&R-5.8 Initialization of Pointer Arrays

  char *monthName(int n) {
    static char *name[] = {"Illegal",
      "Jan", "Feb", "Mar", "Apr",
      "May", "Jun", "Jul", "Aug",
      "Sep", "Oct", "Nov", "Dec"};
    return (n<1 || n>12) ? name[0] : name[n];
  }

The initialization syntax for an array of pointers follows the general rule for initializing an array: Enclose the initial values inside braces.

Looking at the code on the right we see this principle in action. I believe the most common usage of pointer arrays is for an array of character pointers as in this example.

Question: How are those initializers pointers; they look like constant strings?
Answer: A string is a pointer to the first character.

K&R-5.9 Pointers vs. Multi-dimensional Arrays

  int  A[3][4];
  int *B[3];

Consider the two definitions on the right. They look different, but both A[2][3] and B[2][3] are legal (at least syntactically). The real story is that the two definitions most definitely are different. (In fact Java arrays have a great deal in common with the 2nd form in C.)

The declaration int A[3][4]; allocates space for 12 integers (really 12 containers each of which can hold an integer), which are stored consecutively so that A[i][j] is (a container holding) the (4*i+j)^th integer stored (counting from zero). With the simple declaration written, none of the integers is initialized, but we have seen how to initialize them.

The declaration int *B[3]; allocates space for NO integers. It does allocate space for 3 pointers (to integers). The pointers are not initialized so they currently point to junk. The program must somehow arrange for each of them to point to a group of integers (and must figure out when the group ends). An important point is that the groups may have different lengths. The technical jargon is that we can have a ragged array as shown in the bottom of the picture.

Comparing a 2D Array of Integers to a 1D Array of Integer Pointers

#include <stdio.h>
int main(int argc, char *argv[]) {
    int A[3][3] = { {1,2,3}, {4,5,6}, {7,8,9} };
    int B0[1] = {2}, B1[3]={5,14,5}, B2[2] = {11,4};
    int *B[3] = {B0, B1, B2};
    printf("*(A[1]+1)=%d\n", *(A[1]+1) );
    printf("A[1][1]=%d\n", A[1][1] );
    printf("*(B[1]+1)=%d\n", *(B[1]+1) );
    printf("B[1][1]=%d\n", B[1][1]);
    return 0;
}

The code sequence on the right shows the comparison between initializing a 2-D array of integers and initializing a 1-D array of pointers to integers. Note how B is initialized to a 1D array of integer pointers. The example also illustrates that C supports ragged arrays. When the program is run the output produced is

*(A[1]+1)=5
A[1][1]=5
*(B[1]+1)=14
B[1][1]=14

Arrays of Strings

Although ragged arrays of Integers (and Floats) are used in C, you are more likely to see a ragged array of chars, that is a 1-D array of pointers to (varying length) strings.

We have already seen two examples of this: The monthName program just above and the Goodbye Cruel World diagrams in section 5.6. We next illustrate that every C main() program on Unix (e.g., on Linux) also uses a (ragged) array of strings, i.e. an array of character pointers.

K&R-5.10 Command-line Arguments

On the right is a picture of how arguments are passed to a (Unix) command. In this case the command executed was

  ./cmdline xx y;

The arguments generated by the system are shown on the left. The green arrows show those arguments being copied into the parameters of the main program. (The black arrows are simply pointers, as before.) Each main() program has two parameters: an integer, normally called argc for argument count, and an array of character pointers, normally called argv for argument vector.

As always, a naked array name is a pointer to the first element. argv in the main() program is best thought of as a pointer that has been initialized to point to the first element of the array of pointers. The diagram makes clear that both arguments, argc and argv, are containers. In particular they have lvalues, i.e., they can appear on the LHS of an assignment statement. (Of course, with call-by-value any changes to the parameters are not passed back to the arguments.)

Since the same program can have multiple names (more on that later), argv[0], the first element of the argument vector, is a pointer to a character string containing the name by which the command was invoked. Subsequent elements of argv point to character strings containing the arguments given to the command. Finally, there is a NULL pointer to indicate the end of the pointer array.

The integer argc gives the total number of pointers, including the pointer to the name of the command. Thus, the smallest possible value for argc is 1 and argc=3 for the picture above.

Accessing `argc` and `argv` in `main()`

  #include <stdio.h>
  int main(int argc, char *argv[argc]) {
    int i;
    printf("My name is %s; ", argv[0]);
    printf("I was called with %d argument%s.\n",
           argc-1, (argc==2) ? "" : "s");
    for (i=1; i<argc; i++)
      printf("Argument #%d is %s.\n", i, argv[i]);
   }
  sh-4.0$ cc -o cmdline cmdline.c
  sh-4.0$ ./cmdline
  My name is ./cmdline; I was called with 0 arguments.
  sh-4.0$ ./cmdline x
  My name is ./cmdline; I was called with 1 argument.
  Argument #1 is x.
  sh-4.0$ ./cmdline xx y
  My name is ./cmdline; I was called with 2 arguments.
  Argument #1 is xx.
  Argument #2 is y.
  sh-4.0$ ./cmdline -o cmdline cmdline.c
  My name is ./cmdline; I was called with 3 arguments.
  Argument #1 is -o.
  Argument #2 is cmdline.
  Argument #3 is cmdline.c.
  sh-4.0$ cp cmdline mary-joe
  sh-4.0$ ./mary-joe -o cmdline cmdline.c
  My name is ./mary-joe; I was called with 3 arguments.
  Argument #1 is -o.
  Argument #2 is cmdline.
  Argument #3 is cmdline.c.

The code on the right shows how a program can access its name and any arguments it was called with.

Having both a count (argc) and a trailing NULL pointer (argv[argc]==NULL) is redundant, but convenient. The code on the right treats argv as an array. It loops through the array using the count argc as an upper bound, but does not make use of the trailing NULL. Another style (using NULL but not argc) would look something like

  while (*argv)
    printf("%s\n", *argv++);

which treats argv as a pointer and terminates when argv points to NULL.

The second frame on the right shows a session using the code directly above. We assume the first frame is stored in the file cmdline.c.

First, we show how to compile a C program and have the result called something other than a.out. This is a feature of the C compiler cc.
Next we run the resulting program with differing numbers of arguments. Note the use of the conditional expression to get singular and plural correct. Remember that argc is 2 when there is 1 argument.
Finally, we show one way (via copying the executable) that the same program can have more than one name. We also could have renamed/moved the executable instead of copying it.
Another way to obtain multiple names for the same executable is to recompile the program giving a different file name after -o (or not using -o at all and getting a.out as the executable).
In 202 you will learn yet another way to have multiple names for the same file (hard links).

Using Command-line Arguments Instead of Symbolic Constants

Now we can get rid of some symbolic constants that should have been specified at run time.

Here are two before and after examples. The code on the left uses symbolic constants; on the right we use command-line arguments.

Fahrenheit to Celsius

                                    #include <stdlib.h>
  #include <stdio.h>                #include <stdio.h>
  #define LO 0                      
  #define HI 300                    
  #define INCR 20                   
  main() {                          int main (int argc, char *argv[argc]) {
    int F;                            int F;
    for (F=LO; F<=HI; F+=INCR)        for (F=atoi(argv[1]); F<=atoi(argv[2]);
                                           F+=atoi(argv[3]))
      printf("%3d\t%5.1f\n", F,         printf("%3d\t%5.1f\n", F,
             (F-32)*(5.0/9.0));                (F-32)*(5.0/9.0));
                                      return 0;
  }                                 }

Notes.

Now main() is specified correctly; it returns an integer and has the complicated parameter structure we just described. As written on the left, the program terminates abnormally (it doesn't return 0).
We use atoi() (declared in stdlib.h) to convert the ascii (character) form of the numerical inputs into integers.

Solving Quadratic Equations

                                            #include <stdlib.h>
#include <stdio.h>                          #include <stdio.h>
#include <math.h>                           #include <math.h>
#define A +1.0   // should read                  
#define B -3.0   // A,B,C                        
#define C +2.0   // using scanf()           
void solve (float a, float b, float c);     void solve (float a, float b, float c);
int main() {                                int main(int argc, char *argv[argc]) {
  solve(A,B,C);                               solve(atof(argv[1]), atof(argv[2]),
                                                    atof(argv[3]));
  return 0;                                   return 0;
}                                           }
void solve (float a, float b, float c){     void solve (float a, float b, float c){
  float d;                                    float d;
  d = b*b - 4*a*c;                            d = b*b - 4*a*c;
  if (d < 0)                                  if (d < 0)
    printf("No real roots\n");                  printf("No real roots\n");
  else if (d == 0)                            else if (d == 0)
    printf("Double root is %f\n",               printf("Double root is %f\n",
           -b/(2*a));                                  -b/(2*a)); 
  else                                        else
    printf("Roots are %f and %f\n",             printf("Roots are %f and %f\n",
           ((-b)+sqrt(d))/(2*a),                       ((-b)+sqrt(d))/(2*a),
           ((-b)-sqrt(d))/(2*a));                      ((-b)-sqrt(d))/(2*a));
}                                           }

Notes.

Again main() is now specified correctly. When we had main() we said don't check the arguments. Now we specify them correctly.
This time we need atof() since the arguments are floating point.

Optional Command-line Arguments

  #include <string.h>
  #include <stdio.h>
  #include <ctype.h>
  int main (int argc, char *argv[argc]) {
    int c, makeUpper=0;
    if (argc > 2)
      return -argc;   // error return
    if (argc == 2)
      if (strcmp(argv[1], "-toupper")) {
        printf("Arg %s illegal.\n", argv[1]);
        return -1;
      }
      else   // -toupper was arg
        makeUpper=1;
    while ((c = getchar()) != EOF)
      if (!isdigit(c)) {
        if (isalpha(c) && makeUpper)
          c = toupper(c);
        putchar(c);
      }
     return 0;
  }

Often a leading minus sign (-) is used for command-line arguments that are optional.

The program on the right removes all digits from the input. If it is given the optional argument -toupper it also converts all letters to upper case using the toupper() library routine.

Notes

This example has only 1 (optional) argument so the only legal values for argc are 1 and 2.
If there is an argument (argc==2) then we check to be sure it is -toupper and if so set the Boolean makeUpper.
Note the use of library routines, isdigit(), isalpha(), toupper().

Demo this function on my laptop. It is the file c-progs/rem-digit.c.

Homework: At the very end of chapter 3 you wrote escape() that converted a tab character into the two characters \t (it also converted newlines but ignore that). Call this function detab() and call the reverse function entab(). Combine the entab() and detab() functions by writing a function tab that has one command-line argument.

  tab -en   # performs like entab()
  tab -de   # performs like detab()

Optional Arguments—Finding Matching Lines

  #include <stdio.h>
  #include <string.h>
  #define MAXLINE 1000
  int getline(char *line, int max);
  // find: print lines matching argv[1]
  int main(int argc, char *argv[]) {
    char line[MAXLINE];
    int found = 0;
    if (argc != 2)
      printf("Usage: find pattern\n");
    else
      while (getline(line, MAXLINE) > 0)
        if (strstr(line, argv[1]) != NULL) {
          printf("%s", line);
          found++;
        }
    return found;
  }

Each of the programs in this section accepts a command-line argument (call it pattern) and when executed the program echoes all input lines that contain the pattern. These programs are useful in their own right. However, our main interest is the pointer/character/string/array manipulations that occur.

No Optional Command-line Arguments

This first version, which is shown on the right, simply echoes those input lines that contain the command-line argument. This version is fairly simple thanks to the library routine strstr(s1, s2), which checks whether string s2 occurs in s1. The declaration of strstr(s1,s2) is found in string.h.

In fact strstr(s1,s2) indicates the location in s1 where s2 occurs, but we do not use this information as we want to know only if the pattern occurs in the line, not where.

The pattern we are looking for is the first command-line argument so the routine checks each input line to see if argv[1] occurs. If it does occur, the line is printed.

Two Optional Command-line Arguments

Now we permit two optional command-line arguments.

The first optional command-line argument, named except, indicates that we are to reverse the sense of the comparison and print those lines that do not contain the pattern.
The second optional command-line argument, number, specifies that the line number is printed for all matching lines.

A common convention, which is followed for this example, is to use a single letter (preceded by -) for optional command-line arguments. In this case we use -x for except and -n for number.

We also follow the convention of allowing these single letter options to be combined. Hence the single argument -nx (or -xn) can be used instead of -n -x (or -x -n). In all four cases, we print lines not matching the string given in the required argument and for each such line we also print its line number.

In summary we want to process all arguments that start with - and for each one check every character after the -.

  #include <stdio.h>
  #include <string.h>
  #define MAXLINE 1000

  int getline(char *line, int max);

  // find: print lines matching pattern
  int main(int argc, char *argv[]) {
    char line[MAXLINE];
    long lineno = 0;
    int c, except = 0, number = 0, found = 0;
  
    while (--argc > 0 && (*++argv)[0] == '-')
      while (c = *++argv[0])
        switch (c) {
        case 'x':
          except = 1;
          break;
        case 'n':
          number = 1;
          break;
        default:
          printf("find: illegal option %c\n", c);
          argc = 0;
          found = -1;
          break;
        }
    if (argc != 1)
      printf("Usage: find -x -n pattern\n");
    else
      while (getline(line, MAXLINE) > 0) {
        lineno++;
        if ((strstr(line, *argv) != NULL) != except) {
          if (number)
            printf("%ld:", lineno);
          printf("%s", line);
        }
      }
    return found;
  }

The entire program is quite clever and well done, especially the part that handles the variable number of optional arguments. I strongly suggest you give it careful study. In class we will concentrate on how the program processes the variable number of arguments. In particular we will study the distinction between the pink (*++argv)[0] and the yellow *++argv[0].

The Pink vs. the Yellow

In class I want to discuss the pink and yellow highlighted regions, both of which contain *, ++, argv, and [0] in that order when read left to right. The difference between them is a pair of parentheses, which determines the order in which the operations are applied. Let's start with the pink.

Recall that, when execution begins, argv points to an array of char pointers. Specifically, it initially points at the first entry of the array, argv[0], which itself points at the name of the executable. Hence ++argv initially points at a pointer to the first command-line argument, which is a string (during subsequent iterations it points at subsequent arguments). Hence, *++argv initially points to the first argument and (*++argv)[0] (which can also be written as **++argv) is the first character of the first argument. This character is what would be a '-', if we have an optional argument. Subsequent iterations of this while loop increment argv to point to subsequent arguments.

The () are needed since [] has higher precedence than *. Indeed, it is these () that distinguish the pink from the yellow, which we look at next.

When the yellow is executed, argv points at an argument that begins with a '-'. More precisely argv points at the pointer to a character string that begins with a '-'. Hence argv[0] is the character pointer, and ++argv[0] (initially) points at the character after the '-', and *++argv[0] is (initially) the character after the '-'.

Since we can have multiple options, each specified by a single character (in this example the max is 2, but the code is more general), the (inner) while loop moves character by character across the argument.

The outer while moves from argument to argument executing the inner loop for each one until it reaches an argument not beginning with a '-' (or runs out of arguments, which is an error).

Start Lecture #08

Remark: Lab1 is assigned. It is on Brightspace and is due 12 October (two weeks from today). Lateness penalty is 2pts/day for the first 5 days then 5 pts/day. Good luck!

K&R-5.11 Pointers to Functions

  #include <ctype.h>
  #include <string.h>
  #include <stdio.h>
  // Program to illustrate function pointers
  int digitToStar(int c);   // Cvt digit to *
  int letterToStar(int c);  // Cvt letter to *
  int main (int argc, char *argv[argc]) {
    int c;
    int (*funptr)(int c);
    if (argc != 2)
      return argc;
    if (strcmp(argv[1],"digits") == 0)
      funptr = &digitToStar;
    else if (strcmp(argv[1],"letters") == 0)
      funptr = &letterToStar;
    else
      return -1;
    while ((c=getchar()) != EOF)
      putchar((*funptr)(c));
    return 0;
  }
  int digitToStar(int c) {
    if (isdigit(c))
      return '*';
    return c;
  }
  int letterToStar(int c) {
    if (isalpha(c))
      return '*';
    return c;
  }

In C you can do very little with functions, mostly define them and call them (and take their address, see what follows).

However, pointers to functions (called function pointers) are real values. You can do a lot with function pointers.

A function can return a function pointer.
You can declare variables (i.e., containers) that hold function pointers.
You can have an array of function pointers.
You can have a structure with function pointer components.
A function can take a function pointer argument.
etc.

One reason the system can do more with function pointers than with functions is that all function pointers (indeed all pointers) are the same length.

The program on the right is a simple demonstration of function pointers. Two very simple functions are defined.

The first function, digitToStar(), accepts an integer (representing a character) and returns an integer. If the argument is a digit, the value returned is (the integer version of) '*'. Otherwise the value returned is just the unchanged value of the argument.

Similarly letterToStar() converts a letter to '*' and leaves all other characters unchanged.

The star of the show is funptr. Read its declaration carefully: The variable funptr is the kind of thing that, once de-referenced, is the kind of thing that, given an integer, produces an integer.

So it is a pointer to something. That something is a function from integers to integers.

The main program checks the (mandatory) argument. If the argument is "digits", funptr is set to the address of digitToStar(). If the argument is "letters", funptr is set to the address of letterToStar(). So funptr is a pointer to one of two functions.

Then we have a standard getchar()/putchar() loop with a slight twist. The character (I know it is an integer) sent to putchar() is not the naked input character, but is instead the input character processed by whatever function funptr points to. Note the "*" in the call to putchar().

Note: C permits abbreviating &function-name to function-name. So in the program above we could write

  funptr = digitToStar;
  funptr = letterToStar;

instead of

  funptr = &digitToStar;
  funptr = &letterToStar;

I don't like that abbreviation so I don't use it. Others do like it and you may use it if you wish.

#include <stdio.h>
#include <stdlib.h>
int funA(int x) {printf("A x=%d\n", x); return x+10; }
int funB(int x) {printf("B x=%d\n", x); return x+20; }
int funC(int x) {printf("C x=%d\n", x); return x+30; }
int (*funPtrArr[])(int x) = {&funA, &funB, &funC};

int main(int argc, char *argv[]) {
    int x = atoi(argv[1]);
    int y = atoi(argv[2]);
    printf("x=%d\n", x);
    int z;
    z = (*funPtrArr[0])(x);
    printf ("z=%d\n", z);
    z = (*funPtrArr[1])(y);
    printf ("z=%d\n", z);
    z = (*funPtrArr[2])(100);
    printf ("z=%d\n", z);
}

Function pointers are especially useful when there are many functions involved and you have a function pointer array.

On the right is a simple, but rather silly example.

When run with input 4 5, it produces the following output.

  x=4
  A x=4
  z=14
  B x=5
  z=25
  C x=100
  z=130

5.12: Complicated Declarations

We are basically skipping this section. It gives some examples of declarations more complicated than those we have seen (but they are just more of the same; one example is below). The main part of the section presents a program that converts C declarations to and from more-or-less English equivalents.

Here is one example of a complicated declaration. It is basically the last one in the book with function arguments added.

  char (*(*f[3])(int x))[5]

Remembering that *f[3] (like *argv[argc]) is an array of 3 pointers to something, not a pointer to an array of 3 somethings, we can unwind the above to the following.

The variable f is an array of size three of pointers.

Remembering that *(g)(int x) = *g(int x) is a function returning a pointer and not a pointer to a function, we can further unwind the monster to the following.

The variable f is an array of size three of pointers to functions taking an integer and returning a pointer to an array of size five of characters.

One more (the penultimate from the book).

  char (*(*f(int x))[5])(float y)

The function f takes an integer and returns a pointer to an array of five pointers to functions, each taking a float and returning a character.

Chapter K&R-6: Structures

For a start, a Java programmer can think of structures as basically classes and objects without methods.

K&R-6.1: Structure Basics

On the right we see some simple structure declarations for use in a geometry application. They should be familiar from your experience with Java classes in CS101 and CS102.

  #include <math.h>
  struct point {
    double x;
    double y;
  };
  struct rectangle {
    struct point ll;
    struct point ur;
  } rect1;
  double f(struct point pt);
  struct point mkPoint(double x, double y);
  struct point midPoint(struct point pt1,
                        struct point pt2);
  int main(int argc, char *argv[]) {
    struct point pt1={40.,20.}, pt2;
    struct rectangle rect1;
    pt2 = pt1;
    rect1.ll = pt2;
    pt1.x += 1.0;
    pt1.y += 1.0;
    rect1.ur = pt1;
    rect1.ur.x += 2.;
    return 0;
  }

The top declaration defines the struct point type. This is similar to defining a class without methods.

As with Java classes, structures in C help organize data by permitting you to treat related data as a unit. In the case of a geometric point, the x and y coordinates are closely related mathematically and, as components of the struct point type, they become closely related in the program's data organization.

The next definition defines both a new type struct rectangle and a variable rect1 of this type. Note that we can use struct point, a previously defined struct, in the declaration of struct rectangle.

Recall from plane geometry that a rectangle (we assume its sides are parallel to the axes) is determined by its lower left ll and upper right ur corners.

The next group declares a function f() having a structure parameter, then a function mkPoint() with a structure result, and finally declares midPoint() having both structure parameters and a structure result.

The definition in main() of pt1 illustrates an initialization. C does not support structure constants. Hence you could not in main() have the assignment statement

  pt1 = {40., 20.};

We see in the executable statements of main() that one can assign a point to a point as well as assigning to each component.

Since the rectangle rect1 is composed of points, which are in turn composed of doubles, we can assign a point to a point component of a rectangle and can assign a double to a double component of a point component of a rectangle.

If you wrote Java programs for geometry (we did when I last taught 201/202), they probably had classes like rectangle and point and had objects like pt1, pt2, and rect1. Given these classes, the assignment statements in our C-language main() function would have been more or less legal Java statements as well.

K&R-6.2: Structures and Functions

The only legal operations on a structure are copying it, assigning to it as a unit, taking its address with &, and accessing its members. Note that copying a structure includes passing one as a parameter to a function or returning the value of a function.

  double dist (struct point pt) {
    return sqrt(pt.x*pt.x + pt.y*pt.y);
  }
  struct point mkPoint(double x, double y) {
    // return {x, y};   invalid in C
    struct point pt;
    pt.x = x;
    pt.y = y;
    return pt;
  }
  struct point midpoint(struct point pt1,
                        struct point pt2){
    // return (pt1 + pt2) / 2;  not C
    struct point pt;
    pt.x = (pt1.x+pt2.x) / 2;
    pt.y = (pt1.y+pt2.y) / 2;
    return pt;
  }
  void mvToOrigin(struct rectangle *r){
    (*r).ur.x = (*r).ur.x - (*r).ll.x;
    r->ur.y = r->ur.y - r->ll.y;
    r->ll.y = 0;
    r->ll.x = 0;
  }

On the right we see four geometry functions. Although all four deal with structs, they do so differently. A function can receive and return structures, but you may prefer to specify the constituent native types instead. A third alternative is to utilize a pointer to a struct.

The first function dist() takes a struct point as an argument and returns a double corresponding to the distance from the point to the origin. Taking a point (rather than its two components) seems appropriate since geometrically we think of the distance from a point to the origin not from 2 numbers to the origin.

The second function mkPoint() in contrast has double arguments and returns a struct point. It would have been illegal for mkPoint() to have included a
return {x, y}
statement since C does not support structure constants.
mkPoint() has the flavor of a Java constructor.

Next we have midpoint(), which has struct point parameters and struct point return type. Geometrically, we think of the midpoint as determined by two points and not by four numbers. The parameters reflect our understanding.

It would be nice for midpoint() to be basically a 1-liner
return {(pt1 + pt2) / 2}
but again that is not legal in C.

The last function mvToOrigin(), which moves a rectangle so that its lower left corner is at the origin, takes one parameter, a pointer to a struct rectangle (often called a structure pointer).

I say more about this function and structure pointers just below.

As we have seen, functions can take structures as parameters, but is that a good idea? Should we instead use the components as parameters or perhaps pass a pointer to the structure? For example, if main() wishes to pass pt1 (of type struct point) to a function f(), should we write one of the following?

f(pt1)
f(pt1.x, pt1.y)
f(&pt1)

Naturally, the declaration of f() will be different in the three cases. When would each case be appropriate?

f(pt1) This form is the most natural where the parameter is viewed as a point, not as two real numbers. Our example was distance from the origin, double dist(struct point pt).
f(pt1.x, pt1.y) A common example is a Java-constructor-like function that produces a structure from its constituents, for example mkPoint(pt1.x, pt2.y) above would produce a new point having coordinates a mixture of pt1 and pt2.
Another possibility is viewing midpoint() above as a function that returns the center of mass of the rectangle, in which case the arguments to midpoint() would be rect1.ll and rect1.ur, the components of the rectangle rect1.
f(&pt1) A simple reason for using the address is that it might be significantly shorter than the structure itself and thus it would be faster and take less memory to pass the address.
A perhaps more interesting reason to pass the address is so that the receiving function can modify the argument. This is shown in mvToOrigin().
The first assignment statement in mvToOrigin() uses the standard dereferencing operator * followed by the standard component selection operator .. Due to precedence, the parentheses are needed.
The remaining three lines use the abbreviation ->.

Note: The -> abbreviation is employed almost universally. Constructs like ptr1->elt5 are very common; the long form (*ptr1).elt5 is much less common.

Homework: Write two versions of mkRectangle, one that accepts two points, and one that accepts 4 real numbers.

K&R-6.3 Arrays of Structures (and Structures of Arrays)

int f(int x) {
  if (x&1)
    return 3*x+1;
  return x >> 1;
}
  #define MAXVAL 10000
  #define ARRAYBOUND (MAXVAL+1)
  int G[ARRAYBOUND];
  int P[ARRAYBOUND];
  struct gameValType {
    int G[ARRAYBOUND];
    int P[ARRAYBOUND];
  } gameVal;
  struct gameValType {
    int G;
    int P;
  } gameVal[ARRAYBOUND];
  #define NUMEMPLOYEES 2
  struct employeeType {
    int id;
    char gender;
    double salary;
  } employee[NUMEMPLOYEES] = {
      { 32, 'M', 1234. },
      { 18, 'F', 1500. }
    };

Consider the following game. (The code on the right does one step.)

Start with a positive integer N
If N is 1, stop.
If N is even, set N = N/2.
If N is odd, set N = 3N+1.

So, starting with N=5, you get 5 16 8 4 2 1;
starting with N=7, you get 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1;
and starting with N=27, you get 27 82 41 ... 9232 ... 160 80 40 20 10 5 16 8 4 2 1.

It is a deep open problem in number theory whether every positive integer eventually gets to 1. This has been checked for MANY numbers. Let G[i] be the number of rounds of the game needed to get from i to 1. G[1]=0, G[2]=1, G[7]=16, G[27]=111. (Define G[0]=-1)

Factoring into primes is fun too. So let P[N] be the number of distinct prime factors of N. P[2]=1, P[16]=1, P[12]=2 (define P[0]=P[1]=0).

This leads to two arrays as shown on the right in the second frame.

We might want to group the two arrays into a structure as in the third frame. This version of gameVal is a structure of arrays. In this frame the number of distinct prime factors of 763 would be stored in gameVal.P[763]

In the fourth frame we grouped together the values of G[n] and P[n]. This version of gameVal is an array of structures. In this frame the number of distinct prime factors of 763 would be stored in gameVal[763].P

If we had a database with employeeID, gender, and salary, we might use the array of structures in the fifth frame. Note the initialization. The inner {} are not needed, but I believe they make the code clearer.

The sizeof and sizeof() Operators

How big is the employee array of structures? How big is employeeType?

C provides two versions of the sizeof unary operator to answer these questions.

sizeof object gives the size of any object (in bytes).
sizeof (type name) gives the size of any type (in bytes).

These computations are not trivial and indeed the answers are system dependent ... for two reasons.

Certain primitive types (e.g., int) may have different sizes in different systems.
The alignment requirements may be different.

Example: Assume char requires 1 byte, int requires 4, and double requires 8. Let us also assume that each type must be aligned on an address that is a multiple of its size and that a struct must be aligned on an address that is a multiple of 8.

So the data in struct employeeType requires 4+1+8=13 bytes. But three bytes of padding are needed between gender and salary so the size of the type is 16.

Homework: How big is each version of sizeof(struct gameValType)? How big is sizeof employee?

Calculating the Number of Elements in an Array

  #include <stdio.h>
  int main (int argc, char *argv[argc]) {
    struct howBig {
      int n;
      double y;
    } howBigAmI[] = { {26, 18.}, {33, 99.} };
    printf ("howBigAmI has %zu entries.\n",
      sizeof howBigAmI / sizeof(struct howBig));
  }

In the example above it is easy to look at the initialization and count the array bound for employee. An annoyance is that you need to change the #define for NUMEMPLOYEES if you add or remove an employee from the initialization list.

A more serious problem occurs if the list is long in which case manually counting the number of entries is tedious and, much worse, error prone.

Instead we can use sizeof and sizeof() to have the compiler compute the number of entries in the array. The code is shown on the right. The output produced is

  howBigAmI has 2 entries.

The Very Useful Function getword(char *word, int limit)

  int getword(char *word, int lim) {
    int c, getch(void);
    void ungetch(int);
    char *w = word;
    while (isspace(c = getch())) ;
    if (c != EOF)
      *w++ = c;
    if (!isalpha(c)) {
      *w = '\0';
      return c;
    }
    for ( ; --lim > 0; w++)
      if (!isalnum(*w = getch())) {
        ungetch(*w);
        break;
      }
    *w = '\0';
    return word[0];
  }

As its name suggests the purpose of getword() is to get (i.e., read) the next word from the input. Its first parameter is a buffer into which getword() will place the word found. Although declared as a char *, the parameter is viewed as pointing to many characters, not just one. The second parameter throttles getword(), restricting the number of characters it will read. Thus getword() is not scary; the caller need only ensure that the first parameter points to a buffer at least as big as the second parameter specifies.

The definition of a word is technical. (It is chosen to enable programs like the keyword counting example in the next section.) A word is either a string of letters and digits beginning with a letter, or a single non-whitespace character. The return value of the function itself is the first character of the word, or EOF for end of file, or the character itself if it is not alphabetic.

The program has a number of points to note.

The declaration of w initializes w, not *w.

Question: Changing word in getword() would not affect the caller, so why do we need w?
Answer: w is created and used so that the original value of word is available for word[0] at the end.

The expression *w++ is a common C idiom; make sure you understand what it does. In particular, the assignment 2 lines below *w++ does not overwrite the value stored.

The library routine isalnum() returns true for a letter or digit. You can find this information on any Unix machine (e.g., access.cims.nyu.edu) by typing man isalnum.

Note that getword() above (which is from the text) requires the use of getch() and ungetch() from the text (and notes). The versions of these two routines in the standard library are slightly different and getword() fails if you use them.

K&R-6.4: Pointers to Structures

  #include <stdio.h>
  #include <ctype.h>
  #include <string.h>
  #define MAXWORDLENGTH 50
  struct keytblType {
    char *keyword;
    int  count;
  } keytbl[] = {
    { "break", 0 },
    { "case", 0 },
    { "char", 0 },
    { "continue", 0 },
    // others
    { "while", 0 }
  };
  #define NUMKEYS (sizeof keytbl / sizeof keytbl[0])
  int getword(char *, int); // no var names given
  struct keytblType *binsearch(char *);
  int main (int argc, char *argv[argc]) {
    char word[MAXWORDLENGTH];
    struct keytblType *p;
    while (getword(word,MAXWORDLENGTH) != EOF)
      if (isalpha(word[0]) &&
          ((p=binsearch(word)) != NULL))
        p->count++;
    for (p=keytbl; p<keytbl+NUMKEYS; p++)
      if (p->count > 0)
        printf("%4d %s\n", p->count, p->keyword);
    return 0;
  }
  struct keytblType *binsearch(char *word) {
    int cond;
    struct keytblType *low  = &keytbl[0];
    struct keytblType *high = &keytbl[NUMKEYS];
    struct keytblType *mid;
    while (low < high) {
      mid = low + (high-low) / 2;
      if ((cond = strcmp(word, mid->keyword)) < 0)
        high = mid;
      else if (cond > 0)
        low = mid+1;
      else
        return mid;
    }
    return NULL;
  }

The program on the right illustrates well the use of pointers to structures and also serves as a good review of many C concepts. The overall goal is to read text from the console and count the occurrences of C keywords (such as break, if, int, and others). After reading the input, the program prints a list of all the keywords that were present and how many times each occurred.

Let's examine the code on the right.

The first interesting item is keytbl, the table of keywords. It is an array of struct keytblType; each entry of the array contains a string and an integer. Note that the strings are in alphabetical order.

The initialization of keytbl serves two purposes. Each string is set to a C keyword and each count is initialized to zero. The size of the array is cleverly determined by the initialization and the #define on the next line. The entries are initialized in alphabetical order, which permits the use of a binary search to find an entry.

The main program contains two loops: The first computes the counts and the second outputs the results.
1. The first loop calls getword() and terminates on receiving EOF. How clever it is to have getword() return both the status and (in a pointer argument) the actual word found. The word is looked up in the keytbl using binsearch. The value returned by binsearch is either a pointer to the table entry found or NULL if the word is not a keyword (i.e., is not in the table). If the word is found, the corresponding count is incremented.
  I don't believe the isalpha() test is needed since, if the character is not a letter, binsearch will return NULL; it is presumably there to save a useless search.
2. The second loop traverses the table and prints out all entries with non-zero counts. Note the test used in the for statement and remember that the increment p++ increments p by enough so that it points to the next entry.

As I suspect you know, a binary search is quite efficient (its running time is logarithmic in the size of the table) and notoriously easy to get wrong (< vs. ≤, mid vs. mid-1 vs. mid+1, etc.). The only real difference between this one and the one I hope you saw in 102, is that the code on the right is pointer based not array based. This explains the mysterious code to set mid to the midpoint between high and low. But, other than that oddity, I find it striking how array-like the code looks. That is, the manipulations of the pointers could just as well be manipulating indices.

If you have the K&R text as well as your 101-102 material, it would be very useful for you to compare the following three versions of binary search.
1. The Java, array-based version from 101-102.
2. The C, array-based version from section 6.3 of our text (not covered in these notes).
3. The C, pointer-based version on the right (and section 6.4 of our text).

Note/Suggestion: The code just above won't compile and run by itself. It needs getword(), which needs getch() and ungetch(), which are further back. Some of these are in standard libraries, but the library versions are slightly different and will not work with the program above.

I believe it would be instructive for you to put the pieces all together into a single .c file which you then compile and run. As data you can type in (or cut and paste in) any C program and it should work. At least it worked for me.

6.5 Self-referential Structures

Consider a basic binary tree. A small example is shown on the near right; one cell is detailed on the far right. Looking at the diagram on the far right suggests a structure with three components: left, right, and value. The first two refer to other tree nodes and the third is an integer.

I am fairly sure you did trees in 101-102 but I will describe the C version as though it is completely new. I will say that in both Java and C the key is the use of pointers. In C this will be made very explicit by the use of *. In Java it was somewhat under the covers.

  struct bad {
    struct bad left;
    int value;
    struct bad right;
  };

  struct treenode_t {
    struct treenode_t *left;
    int value;
    struct treenode_t *right;
  };

Since trees are recursive data structures you might expect some sort of recursive structure in the C or Java declaration. Consider struct bad defined on the right. (You might be fancier and have a struct tree, which contains a struct root, which has in turn an int value and two struct tree's).

But struct bad and its fancy friends are infinite data structures: The left and right components are the same type as the entire structure. So the size of a struct bad is the size of an int plus the size of two struct bad's. Since the size of an int exceeds zero, the total size must be infinite. Some languages permit infinite structures providing you never try to materialize more than a finite piece. But C is not one of those languages. So for us, struct bad is bad!

Instead, we use struct treenode_t as shown on the right (names like treenode_t are a shorter and very commonly used alternative to names like treenodeType).

The key is that a struct treenode_t does not contain an internal struct treenode_t. Instead it contains pointers to two other struct treenode_t's.

Be sure you understand why struct treenode_t is finite and corresponds exactly to the tree picture above.

Mutually Referential/Recursive Structures

  struct s {
    int val;
    struct t *pt;
  };
  struct t {
    double weight;
    struct s *ps;
  };

What if you have two structure types that need to reference each other? You cannot have a struct s contain a struct t if struct t contains a struct s. If you did try that, then each struct s would contain a struct t, which would in turn contain a struct s, which would contain ... .

Once again pointers come to the rescue as illustrated on the right. Neither structure is infinite. A struct s contains one integer and one pointer. A struct t contains one double and one pointer. Neither is a subset of the other; instead each references (points at) the other.

Start Lecture #09

Remark: getword() from last time is quite a clever program. Imagine if it was run with itself as data. The first time called it returns int. The next time it is called yields getword, then (, then char, then * ... .

Linked Lists: An Unbounded 1D Data Structure

  struct llnode_t {
    long data;
    struct llnode_t *next;
  };

Probably the most familiar 1D unbounded data structure (beyond the 1D array) is the linked list, which is well studied in 101-102. On the near right we have a diagram of a small linked list and on the far right we show the C declaration of a structure corresponding to one node in the diagram. Again we note that a struct llnode_t does not contain a struct llnode_t. Instead, it contains a pointer to such a node.

With one pointer in each node the structure has a natural 1D geometric layout. Trees, in contrast, have two pointers per node and have a natural 2D geometric layout.

Nested Linked Lists: Another 2D Data Structure

Instead of trees, we will investigate a different 2-dimensional structure, a linked list of linked lists. This structure (or something similar) will likely become the subject of the future lab 2.

Although all the actual data are strings (i.e., char *), there are two different types of structures present, the vertical list of struct node2d's (2D nodes) and the many horizontal lists of struct node1d's (1D nodes).

Actually it is a little more complicated. Each horizontal list has a list head that is a node2d and there must be somewhere (not shown in the diagram) a pointer to the first node2d (i.e., the node with data joe).

The three decreasing length horizontal lines indicate that the pointer in question is null. (I borrow that symbol from electrical engineering, where it is used to represent ground.)

The form of the individual nodes

  struct node1d {
    struct node1d *next;
    char *name;
  };
  struct node2d {
    struct node1d *first;
    char *name;
    struct node2d *down;
  };

The structure declarations are on the right. Perhaps I should have called them struct node1d_t and struct node2d_t.

Be sure you understand why the picture above agrees with the C declarations on the right.

The diagram (and the code) suggests a hierarchy: the nodes in the left hand column are higher level than the others. You can think of the struct node1d's on a single row belonging to a list headed by the struct node2d on the left of that same row.

Note that every struct node1d is the same (rather small) size independent of the length of the name. Similarly, all the struct node2d's are the same size (but bigger than the struct node1d's). In that sense the figure is misleading since it suggests that alice is larger than joe. The confusion is that the node does not contain the actual 6 characters in alice ('a', 'l', 'i', 'c', 'e', '\0') but rather a (fixed size) pointer to the name.

Said using C terminology the name component is a fixed size pointer. The possibly large string is the object pointed to by name, i.e., it is *name. But *name is a char, which is even smaller than a pointer. It would be more precise to say that name points to the first character of the string; you must look at the string itself to see where it ends.

  2d node name=joe
    1d node name=xy2
    1d node name=sally
    1d node name=e342
  2d node name=alice
  2d node name=R2D2
    1d node name=cso
    1d node name=c3pO

Printing the 2D Structure

How should we print the above structure?
I suggest, and probably lab2 will require, that you use the style shown on the right. (You might put quotes around the strings.)

The idea behind this style is the following.

The vertical 2D list of the diagram is printed left justified.
Each 1D horizontal list (which is headed by a 2D node in the vertical list) is indented under its 2D head node.

From this printout one can see immediately, for example, that the 2D list has three entries and that the middle 2D node has an empty sublist of 1D nodes.

A remaining question

One question remains. The string itself can be big. If the length is a constant, then the compiler can be asked to leave space for it.
Question: What if the string is generated at runtime?
Answer: malloc().

`malloc() and free()`

As you know, in Java, objects (including arrays) have to be created via the new operator. We have seen that in C this is not always needed: you can declare a struct rectangle and then declare several rectangles.

However, this doesn't work if you want to generate the rectangles during run time. When you are writing a program to process 2D lists, you won't know how many 2d nodes or 1d nodes will be needed. That number will be determined by the data read when the program is run.

In addition the size of the strings that name each node will also not be known until runtime.

So we need a way to create an object during run time. In C this uses the library function malloc(), which takes one argument, the amount of space to be allocated. The function malloc() allocates the requested space and returns a pointer to it. The companion function free() takes as argument a pointer that was obtained from malloc and makes the corresponding space available for future malloc()s.

`malloc()/free()` vs. `alloc()/afree()`

These two new functions should remind you of the similar pair we studied a few lectures ago. The new functions are considerably more sophisticated than the old ones.

The previous pair of functions required that the next item freed was the last item allocated, i.e., the blocks of memory were required to be allocated and freed in a stack-like LIFO order. The new functions have no such requirement. This is a big deal.
The total amount of memory available to alloc() was a compile time constant. In contrast malloc() gets (large) chunks of memory from the operating system whenever its own supply is exhausted.
Later this semester, we will study such storage allocators in some detail.

Since malloc() is not part of C, but is instead just a library routine, the compiler does not treat it specially (unlike the situation with new, which is part of Java). Since malloc() is just an ordinary function, and we want it to work for dynamically created objects of any type (e.g., an int, a char *, a struct treenode, etc), and there is no way to pass the name of a type to a function, two questions arise.

How do we arrange that the space returned by malloc() meets the alignment requirements of the object we desire?
How do we arrange that the pointer returned by malloc() is a pointer to the correct type?

The alignment question is easy and can be essentially ignored at this time. This is fortunate since we haven't studied (or even defined) alignment yet, but will do so soon after we finish with C.

The answer to the alignment question (which will become clear when we study alignment) is that we simply have malloc() return space aligned on the most stringent requirement. So, on a system where long doubles and all structures require 16-byte alignment and all other data types require 8-byte, 4-byte, 2-byte, or 1-byte alignment, malloc() always returns space aligned on a 16-byte boundary (i.e., the address is a multiple of 16).

Ensuring type correctness is not automatic, but not hard. Specifically, malloc() returns a void *, a generic pointer that must be converted to the correct pointer type. In C the conversion happens implicitly on assignment, but it is common (and is the style used in these notes) to write the cast explicitly. For example, lab 2 might contain code like

  struct node2d *p2d;
  p2d = (struct node2d *) malloc(sizeof(struct node2d));

An application calls the library routine free(void *p) to return memory obtained by malloc(). Indeed p must be a pointer returned by a previous call to malloc(). Note, as mentioned above, that the order in which chunks of memory are freed need not match the order in which they were obtained.

It is clearly an error to continue using memory you already freed. Such errors often lead to crashes with very little useful diagnostic information available.

Advice: Try very hard not to make that error.

Note: See, in addition, section 7.8.5 below.

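  #include <stdio.h>
  #include <stdlib.h>
  int main(int argc, char *argv[]) {
    int n;
    char *As;
    if (scanf("%d", &n) != 1 || n < 0)
      return 1;                      // missing or bad input
    As = (char *) malloc((1+n) * sizeof(char));
    if (As == NULL)
      return 1;                      // malloc() could not get the space
    for (int i = 0; i<n; i++)
      As[i] = 'A';
    As[n] = '\0';                    // mark the end of the string
    printf("As is: %s\n", As);
    free(As);
    return 0;
  }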

Creating Unbounded Strings at Runtime

The program on the right reads an integer n and then produces a string containing n As (plus, of course, the trailing '\0' to mark the end of the string). Note that n is unbounded and the only limit is the (virtual and physical) memory on your system. malloc() is used to obtain the memory needed for all n As.

Note that malloc() is declared in the system library stdlib.h, which explains the second #include.

Also note that As, which is a character pointer, is quite comfortable in its dual role as a character array. Of course the declaration char *As does not allocate any space for the characters. That job is handled by malloc().

Using `malloc()` in Lab2.

At various points in lab2, you may need to create a node, either a struct node2d or a struct node1d. These individual nodes cannot be simply declared since we don't know until runtime how many there will be of each type and what will be the individual names.

The situation will be that a user of your lab has entered a command such as:

  append2d name2

A first call to getword() yields append2d so you know you are creating a new struct node2d and placing it at the end of the existing vertical list.

A second call to getword() yields name2 which is the string you are to place in the newly-created struct node2d. Note that the lab provides an upper bound on the length of name2.

Since the node and the string must be created, TWO calls to malloc() are used. My code has the following comments.

  // create a 2D node with the given name and null 1D sublist
  // first malloc space for the node
  // now malloc() space for the name (i.e., the real string)

6.6: Table Lookup

Skipped

6.7: Typedef

Instead of declaring pointers to trees via

    struct treenode *ptree;

we can write

    typedef struct treenode *Treeptr;
    Treeptr ptree;

Thus Treeptr is a new name for the type struct treenode *. As another example, instead of

    char *str1, *str2;

we could write

    typedef char *String;
    String str1, str2;

Note that this does not give you a new type; it just gives you a new name for an existing type. In particular str1 and str2 are still pointers to characters even if declared as a String above.

A common convention is to capitalize a typedef'ed name.

6.8: Unions

Saving Space by Sharing Memory between 2 or More Variables

    struct something {
      int x;
      union {
        double y;
        int z;
      };
    };

Traditionally union was used to save space when memory was expensive. Perhaps with the recent emphasis on very low power devices, this usage will again become popular. Looking at the example on the right, y and z would be assigned to the same memory locations. Since the size allocated is the larger of what is needed, the union takes space max(sizeof(double),sizeof(int)) rather than the sizeof(double)+sizeof(int) that would be needed if a union were not used.

It is up to the programmer to know what is the actual variable stored. The union shown cannot be used if y and z are both needed at the same time.

It is risky since there is no checking done by the language.

Meeting Alignment Constraints

A union is aligned on the most severe alignment of its constituents. This can be used in a rather clever way to meet a requirement of malloc().

As we mentioned above when discussing malloc(), it is sometimes necessary to force an object to meet the most severe alignment constraint of any type in the system. How can we do this so that if we move to another system where a different type has the most severe constraint, we only have to change one line?

    struct something {
      int x;
      struct something *p;
      // others
    } obj;

    // assume long most severely aligned
    typedef long Align;
    union something {
      struct dummyname {
        int x;
        union something *p;
        // others
      } s;
      Align dummy;
    };
    typedef union something Something;

Say struct something, as shown in the top frame on the right, is the type we want to make most severely aligned.

Assume that on this system the type long has the most severe alignment requirement and look at the bottom frame on the right.

The first typedef captures the assumption that long has the most severe alignment requirement on the system. If we move to a system where double has the most severe alignment requirement, we need change only this one line. The name Align was chosen to remind us of the purpose of this type. It is capitalized since one common convention is to capitalize all typedefs.

The variable dummy is not to be used in the program. Its purpose is just to force the union, and hence s, to be most severely aligned.

In the program we declare an object say obj to be of type Something (with a capital S) and use obj.s.x instead of obj.x as in the top frame. The result is that we know the structure containing x is most severely aligned.

See section 8.7 if you are interested.

6.9: Bit Fields

Skipped

Chapter K&R-7: Input and Output

7.1: Standard Input and Output

`getchar()` and `putchar()`

This pair form the simplest I/O routines.

  #include <stdio.h>
  int main (int argc, char *argv[argc]) {
    int c;
    while ((c = getchar()) != EOF)
      if (putchar(c) == EOF)
        return EOF;
    return 0;
  }

The function getchar() takes no parameters and returns an integer. This integer is the integer value of the character read from stdin or is the value of the symbolic constant EOF (normally -1), which is guaranteed not to be the integer value of any character.

The function putchar() takes one integer parameter, the integer value of a character. The character is sent to stdout and is returned as the function value (unless there is an error in which case EOF is returned).

The code on the right copies the standard input (stdin), which is usually the keyboard, to the standard output (stdout), which is usually the screen.

We built getch() / ungetch() from getchar().

Homework: 7.1. Write a program that converts upper case to lower or lower case to upper, depending on the name it is invoked with, as found in argv[0].

Formatted Output—`printf`

We have already seen printf(). A surprising characteristic of this function is that it has a variable number of arguments. The first argument, called the format string, is required. The number of remaining arguments depends on the value of the first argument. The function returns the number of characters printed, but the return value is rarely used. Technically the declaration of printf() is

  int printf(char *format, ...);

The format string contains regular characters, which are just sent to stdout unchanged and conversion specifications, each of which determines how the value of the next argument is to be printed.

Each conversion specification begins with a %, which is optionally followed by some modifiers, and ends with a conversion character.

We have not yet seen any modifiers but have seen a few conversion characters, specifically d for an integer (i is also permitted), c for a single character, s for a string, and f for a real number.

There are other conversion characters that can be used, for example, to get real numbers printed using scientific notation. The book gives a full table.

There are a number of modifiers to make the output line up and look better. For example, %12.3f means that the real number will be printed using 12 columns (or more if the number is too big to fit in 12 columns) with 3 digits after the decimal point. So, if the number was 36.3 it would be printed as ||||||36.300 where I used | to represent a blank. Similarly -1000. would be printed as |||-1000.000. These two would line up nicely if printed via

  printf("%12.3f\n%12.3f\n\n", 36.3, -1000.);

`sprintf()`: A Relative of `printf()`

The function

  int sprintf(char *string, char *format, ...);

is very similar to printf(). The only difference is that, instead of sending the output to stdout (normally the screen), sprintf() assigns it to the first argument specified.

  char outString[50];
  int d = 14;
  sprintf(outString, "The value of d is %d\n", d);

For example, the code snippet on the right sets the first 22 characters of outString to The value of d is 14\n\0 while the remaining 28 characters of outString continue to be uninitialized.

Since the system cannot in general check that the first argument is big enough, care is needed by the programmer, for example checking that the returned value is no bigger than the size of the first argument. In summary, sprintf() is scary. A good defense is to use instead snprintf(), which, like strncpy(), guarantees that no more than n bytes will be assigned (n is an additional parameter to snprintf()).

7.3 Variable-length Argument Lists

As we mentioned, printf() takes a variable number of arguments. But remember that printf() is not special, it is just a library function, not an object defined by the language or known specially to the compiler. That is, anyone can write a C program with declaration

  int myfunction(int x, float y, char *z, ...)

and it will have three named arguments and zero or more unnamed arguments.

There is some magic needed to get the unnamed arguments. However, the magic is needed only by the author of the function; not by a user of the function.

`scanf()`: The `printf()` Companion

Related to the Java Scanner class is the C function scanf().

The function scanf() is to printf() as getchar() is to putchar(). As with printf(), scanf() accepts one required argument (a format string) and a variable number of additional arguments. Since this is an input function, the additional arguments give the variables into which input data is to be placed.

Consider the code fragment shown on the top frame to the right and assume that the user enters on the console the lines shown on the bottom frame.

  int n;
  double x;
  char str[50];
  scanf("%d  %lf  %s", &n, &x, str);  // %lf, not %f, since x is a double

  22 37.5
  no-blanks-here

Perhaps the first point to notice is that the non-arrays n and x are each preceded with &. This is because C is a call-by-value language and hence there would be no way for scanf() to assign a value to x if the argument was simply x.
It is a common error (at least it was for me) to forget the &, which can lead to disastrous results since you would then be giving some number (the current rvalue of x) to scanf, which it would treat as an address into which it would try to store the input.
22 is assigned to n.
37.5 is assigned to x.
The next assignment is to str. Since this variable is an array and a naked array name is short for a pointer to the first element, it is already an address (so no & is used).
scanf() skips over white space so newlines in the input are skipped. It also considers an input item to end whenever white space is encountered, which is why blanks cannot occur for a string input.

A Relative of `scanf()`: `sscanf()`

The function

  int sscanf(char *string, char *fmt, ...);

is very similar to scanf(). The only difference is that, instead of getting the input from stdin (normally the keyboard), sscanf() gets it from the first argument specified.

Start Lecture #10

Remarks:

Midterm is thurs 14 Oct on Brightspace.
- I will (try to) permit Brightspace re-submissions up to the deadline.
- I STRONGLY recommend saving your Brightspace work as you proceed.
- If you are submitting diagrams you can either use computer software (providing you can convert the result to a pdf) or you can use paper and take a picture (again converting to pdf, which you submit on Brightspace).
- If you are a MOSES student, you should see the same exam but with more time.
Mention malloc use at end of last lecture

7.5 File Access

So far all our input has been from stdin and all our output has been to stdout (or from/to a string for sscanf()/sprintf()).

What if we want to read or write a file?
In Unix you can use the redirection operators of the command interpreter (the shell), namely < and >, to have stdin and/or stdout refer to a file.

But what if you want input from 2 or more files?

Opening and Closing Files; File Pointers

Before we can specify files in our C programs, we need to learn a (very) little about the file pointer.

Before a file can be read or written, it must be opened. The library function fopen() is given two arguments, the name of the file and the mode; it returns a file pointer.

Consider the code snippet on the right. The type FILE is defined in <stdio.h>. We need not worry about how it is defined.

  FILE *fp1, *fp2, *fp3, *fp4;
  FILE *fopen(char *name, char *mode);
  fp1 = fopen("cat.c", "r");
  fp2 = fopen("../x", "a");
  fp3 = fopen("/tmp/z", "w");
  fp4 = fopen("/tmp/q", "r+");

The file cat.c in the current directory is opened for reading and some information about this file is recorded in *fp1.
The file x in the parent directory is opened for appending; x is created if it doesn't exist.
The file z in /tmp is opened for writing. Previous contents of z are lost.
The file q in /tmp is opened for reading and/or writing. (When mixing reads and writes, care is needed.)

After the file is opened, the file name is no longer used; subsequent commands (reading, writing, closing) use the file pointer.

The function fclose(FILE *fp) breaks the connection established by fopen().

`getc()/putc()`: The File Versions of `getchar()/putchar()`

Just as getchar()/putchar() are the basic one-character-at-a-time functions for reading and writing stdin/stdout, getc()/putc() perform the analogous operations for files (really for file pointers). These new functions naturally require an extra argument, a pointer to the file to read from or write to.

Since stdin/stdout are actually file pointers (they are constants not variables) we have the definitions

  #define getchar()    getc(stdin)
  #define putchar(c)   putc((c), stdout)

I think this will be clearer when we do an example, which is our next task.

An Example `cat.c`

  #include <stdio.h>
  int main (int argc, char *argv[argc]) {
    FILE *fp;
    void filecopy(FILE *, FILE *);
    if (argc == 1) // NO files specified
      filecopy(stdin, stdout);
    else
      while(--argc > 0)  // argc-1 files
        if((fp=fopen(*++argv, "r")) == NULL) {
          printf ("cat: can't open %s\n", *argv);
          return 1;
        } else {
          filecopy(fp, stdout);
          fclose(fp);
        }
    return 0;
  }
  void filecopy (FILE *ifp, FILE *ofp) {
    int c;
    while ((c = getc(ifp)) != EOF)
      putc(c, ofp);
  }

The name cat is short for catenate, which is a synonym of concatenate.

If cat is given no command-line arguments (i.e., if argc=1), then it just copies stdin to stdout. This is not useless: for one thing remember < and >.

If there are command-line arguments, they must all be the names of existing files. In this case, cat concatenates the files and writes the result to stdout. The method used is simply to copy each file to stdout one after the other.

The filecopy() function uses the standard getc()/putc() loop to copy the file specified by its first argument ifp (input file pointer) to the file specified by its second argument. In this application, the second argument is always stdout so filecopy() could have been simplified to take only one argument and to use putchar().

Note the check that the call to fopen() succeeded; a very good idea.

Note also that cat uses very little memory, even if concatenating 100GB files. It would be an unimaginably awful design for cat to read all the files into some ENORMOUS character array and then write the result to stdout.

7.6: Error Handling—`Stderr` and `Exit` (and `Ferror()`)

`Stderr`

A problem with cat is that error messages are written to the same place as the normal output. If stdout is the screen, the situation would not be too bad since the error message would occur at the end. But if stdout were redirected to a file via >, we might not notice the message.

Since this situation is common, there are actually three standard file pointers defined: In addition to stdin and stdout, the system defines stderr.

Although the name suggests that it is for errors (and that is indeed its primary application), stderr is really just another file pointer, which (like stdout) defaults to the screen.

Even if stdout is redirected by the standard > redirection operator, stderr will still appear on the screen.

There is also syntax to redirect stderr, which can be used if desired.

`Exit()`

As mentioned previously a command should return zero if successful and non-zero if not. This is quite easy to do if the error is detected in the main() routine itself.

What should we do if main() has called joe(), which has called f(), which has called g(), and g() detects an error (say fopen() returned NULL)?

It is easy to print an error message (sent to stderr, now that we know about file pointers). But it is a pain to communicate this failure all the way back to main() so that main() can return a non-zero status.

Exit() to the rescue. If the library routine exit(n) is called, the effect is the same as if the main() function executed return n. So executing exit(0) terminates the command normally and executing exit(n) with n>0 terminates the command and gives a status value indicating an error.

`Ferror()`

The library function

  int ferror(FILE *fp);

returns non-zero if an error occurred on the stream fp. For example, if you opened a file for writing and sometime during execution the file system became full and a write was unsuccessful, the corresponding call to ferror() would return non-zero.

7.7 Line Input and Output (`fgets()` and `fputs()`)

The standard library routine

  char *fgets(char *line, int maxchars, FILE *fp)

reads characters from the file fp and stores them plus a trailing '\0' in the string line. Reading stops when a newline is encountered (it is read and stored) or when maxchars-1 characters have been read (hence, counting the trailing '\0', at most maxchars will be stored).

The value returned by fgets is normally line. If an end of file or error occurs, NULL is returned instead.

The standard library routine

  int fputs(char *line, FILE *fp)

writes the string line to the file fp. The trailing '\0' is not written and line need not contain a newline. The return value is zero unless an error occurs in which case EOF is returned.

7.8 Miscellaneous Functions

A laundry list. I typed them all in to act as convenient reference. Let me know if you find any errors.

The integer type `size_t`

This subsection represents a technical point; for this class you can replace size_t by int.

Consider the return type of strlen(), which is the length of the string parameter. It is surely some kind of integral type but should it be short int, int, long int or one of the unsigned flavors of those three?

Since lengths cannot be negative, the unsigned versions are better since the maximum possible value is twice as large. (On the machines we are using int is at least 32-bits long so even the signed version permits values exceeding two billion, which is good enough for us).

The two main contenders for the type of the return value from strlen() are unsigned int and unsigned long int. Note that long int can be, and usually is, abbreviated as long.

If you make the type too small, there are strings whose length you cannot represent. If you make the type bigger than ever needed, some space is wasted and, in some cases, the code runs slower.

Hence the introduction of size_t, which is defined in stddef.h and is also made available by stdlib.h, stdio.h, and string.h.
Each system specifies whether size_t is unsigned int or unsigned long (or something else).

For the same reason that the system-dependent type size_t is used for the return value of strlen, size_t is also used as the return type of the sizeof operator and is used several places below.

7.8.1 String Operations

These are from string.h, which must be #include'd. The versions with n added to the name limit the operation to n characters. In the following table n is of type size_t and c is an int containing a character; src and dest are strings (i.e., character pointers, char *); and cs and ct are constant strings (const char *).

The inputs that may be modified are the ones named dest. Remember that a string in C is represented by a character pointer.

Call	Meaning
`strcat(dest,src)`	Concatenate `src` on to the end of `dest` (changing `dest`) and return `dest`.
`strncat(dest,src,n)`	The same but concatenates no more than `n` characters.
`strcmp(cs,ct)`	Compare `cs` and `ct` lexicographically. Returns a negative, zero, or positive `int` if `cs` is respectively `<`, `=`, or `> ct`.
`strncmp(cs,ct,n)`	The same but compares no more than `n` characters.
`strcpy(dest,ct)`	Copy `ct` to `dest` and return `dest`.
`strncpy(dest,ct,n)`	Similar but copies no more than `n` characters and pads with '\0' if `ct` has fewer than `n` characters. The result might NOT be `'\0'` terminated.
`strlen(cs)`	Returns the length of `cs` (not including the terminating '\0') as a `size_t` value.
`strchr(cs,c)`	Returns a pointer to the first `c` in `cs` or `NULL` if `c` is not in `cs`.
`strrchr(cs,c)`	Returns a pointer to the last `c` in `cs` or `NULL` if `c` is not in `cs`.
`strstr(cs,ct)`	Returns a pointer to the first occurrence of `ct` in `cs` or `NULL` if `ct` is not in `cs`.

7.8.2 Character Testing and Conversion

These functions are from ctype.h, which must be #include'd. Each of them takes an integer argument (representing a character or the value EOF) and returns an integer.

Call	Meaning
`isalpha(c)`	Returns true (non-zero) if (and only if) `c` is alphabetic. In our locale this means a letter.
`isupper(c)`	Returns true if `c` is upper case.
`islower(c)`	Returns true if `c` is lower case.
`isdigit(c)`	Returns true if `c` is a digit.
`isalnum(c)`	Returns true if `isalpha(c)` or `isdigit(c)`.
`toupper(c)`	Returns `c` converted to upper case if `c` is a letter; otherwise returns `c`.
`tolower(c)`	Returns `c` converted to lower case if `c` is a letter; otherwise returns `c`.

7.8.3 Ungetc

int ungetc(int c, FILE *fp) pushes back to the input stream the character c. It returns c or EOF if an error was encountered.

This function is from stdio.h, which must be #include'd.

Only one character can be pushed back, i.e., it is not safe to call ungetc() twice without a call in between that consumes the first pushed back character. The function ungetch() found in the book and these notes does not have this restriction.

7.8.4 Command Execution

  #include <stdio.h>
  #include <stdlib.h>
  int main (int argc, char *argv[argc]) {
    int status;
    printf("Hello.\n");
    status = system("dir; date");
    printf("Goodbye: status %d\n", status);
    return 0;
  }

The function system(char *s) runs the command contained in the string s and returns an integer status.

The contents of s and the value of the status are system dependent.

On my system, the program on the right when run in a directory containing only two files x and y produced the following output.

Hello.
x  y
Sun Mar  7 16:05:03 EST 2010
Goodbye: status 0

This function is in stdlib.h, which must be #include'd.

K&R-7.8.5 Storage Management

`Malloc()`

We have already seen

  void *malloc(size_t n)

which returns a pointer to n bytes of uninitialized storage. If the request cannot be satisfied, malloc() returns NULL.

`Calloc()`

The related function

    void *calloc(size_t n, size_t size)

returns a pointer to a block of storage adequate to hold an array of n objects each of size size. The storage is initialized to all zeros.

`Free()`

The function

  void free (void *p)

is used to return storage obtained from malloc() or calloc().

Remarks

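  // top: buggy, p->next is read after p has been freed
  for (p = head; p != NULL; p = p->next)
    free(p);

  // bottom: correct, the next pointer is saved before the free
  for (p = head; p != NULL; p = q) {
    q = p->next;
    free(p);
  }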

It is crucial that the pointer argument to free() was obtained by a call to malloc() or calloc() (or realloc(), which we shall not use).
Equally bad is to reference space after free'ing it. For example the code on the top right is buggy since p->next uses the p that has just been free'd. Instead the bottom loop should be used.
The two arguments to calloc() are not silly. The space needed is n*size bytes (sizeof already includes any padding), but calloc() can notice when that product is too large to represent and return NULL, whereas a product you compute yourself and pass to malloc() could silently overflow.
The pointers returned by malloc() and calloc() are properly aligned for any type. They have type void * and are converted (implicitly on assignment, or with an explicit cast) to the appropriate pointer type.

K&R-7.8.6 Mathematical Functions

Call	Meaning
`sin(x)`	sine
`cos(x)`	cosine
`atan(x)`	arctangent
`exp(x)`	exponential `e^x`
`log(x)`	natural logarithm `log_e(x)`
`log10(x)`	common logarithm `log₁₀(x)`
`pow(x,y)`	`x^y`
`sqrt(x)`	square root, `x≥0`
`fabs(x)`	absolute value

These functions are from math.h, which must be #include'd. In addition (at least on my system and linserv1.nyu.edu) you must specify a linker option to have the math library linked. If your mathematical program consists of A.c and B.c and the executable is to be named prog1, you would write

  cc -o prog1 A.c B.c -lm

All the functions in this section have double's as arguments and as result type. The trigonometric functions express their arguments in radians and the inverse trigonometric functions express their results in radians.

K&R-7.8.7 Random Number Generation

Random number generation (actually pseudo-random number generation) is a complex subject. The function rand() given in the book is an early and not wonderful generator; it dates from when integers were 16 bits. I recommend instead (at least on linux and linserv.nyu.edu)

  long int random(void)
  void srandom(unsigned int seed)

The random() function returns an integer between 0 and RAND_MAX. You can get different pseudo-random sequences by starting with a call to srandom() using a different seed. Both functions are in stdlib.h, which must be #include'd.

On my linux system RAND_MAX (also in stdlib.h) is defined as 2³¹-1, which is also INT_MAX, the largest value of an int. It looks like linserv.nyu.edu doesn't define RAND_MAX, but does use the same pseudo-random number generator.

Remark: Let's write some programs/functions.

Write a program (most-vowels) that reads lines and prints the one with the most vowels together with a count of how many vowels it contains.
Write a function (repstr) that accepts a count and a string and returns a newly allocated string containing the concatenation of the original string the specified number of times.
Write a function (mergeInt) that merges two sorted arrays of integers A and B into C. The signature is
void mergeInt(int n, int m, int A[n], int B[m], int C[n+m]);
Think about how to do this for arrays of (varying length) strings.
The 5 -> 16 -> 8 -> 4 -> 2 -> 1 game.
1. Write a function f giving the next number: f(5)=16; f(14)=7. What should the signature be?
2. Write a program that repeatedly reads an integer and prints out the generated sequence up to 1.
3. Modify the previous program to instead accept three arguments, which act like the three parts of a for. The program plays the game for all integers starting at the first argument, incrementing by the third, up to the second.
4. Modify the previous to accept an optional -b or --brief argument. If present, don't print the sequence, instead just print its length.
5. Modify the previous to accept an optional -s or --summary argument that prints only the number having the longest sequence and the sequence (OK to re-calculate). If the -b is also present, just print the number and the length of the sequence (not OK to recalculate).

Chapter O'H-1 A Tour of Computer Systems

O'H-1.1 Information is Bits + Context

O'H-1.2 Programs are Translated by Other Programs into Different Forms

At the lowest level of abstraction each of these forms of code is just a sequence of bits.

Chapter O'H-2 Representing and Manipulating Information

O'H-2.1 Information Storage

Computers are Inherently Binary Machines

Modern electronics can quickly distinguish 2 states of an electric signal: low voltage and high voltage. Low has always been around 0 volts; high was 5 volts for a long while and is now below 3.5 volts.

Since this is not an EE course we will abstract the situation and say that a signal is in one of two states, low (a.k.a. 0) and high (a.k.a. 1).

On the right we see plots of voltage (vertical) vs time (horizontal).

The top plot is the ideal we shall assume: the voltage is always at the high or low level and transitions in zero time from one to the other.
The middle is more realistic showing a sine wave. It is important for engineers to arrange not to test the voltage when it is in the middle.
The third is a nightmare. The voltage goes above the max and below the min. I got carried away and showed the signal occasionally going backwards in time, which can't happen.

It is fine if you ignore the middle and bottom pictures.

Binary Notation

decimal base 10	binary base 2	base 4	octal base 8	hex base 16
0	0	0	0	0
1	1	1	1	1
2	10	2	2	2
3	11	3	3	3
4	100	10	4	4
5	101	11	5	5
6	110	12	6	6
7	111	13	7	7
8	1000	20	10	8
9	1001	21	11	9
10	1010	22	12	A
11	1011	23	13	B
12	1100	30	14	C
13	1101	31	15	D
14	1110	32	16	E
15	1111	33	17	F
16	10000	100	20	10

Since (for us) a signal can be in one of two states, it is convenient to use binary (a.k.a. base 2) notation. That way if we have three signals with the first and third high and the middle one low, we can represent the situation using 3 binary digits, specifically 101.

Recall that to calculate the numeric value of an ordinary (base 10, i.e., decimal) number the rightmost digit is multiplied by 10⁰=1, the next digit to the left by 10¹=10, the next digit by 10²=100, etc.

For example 6205 = 6*10³ + 2*10² + 0*10¹ + 5*10⁰ = 6*1000 + 2*100 + 0*10 +5*1.

Binary numbers work the same way so, for example, the binary number 11001 has value (written in decimal)
1*2⁴ + 1*2³ + 0*2² + 0*2¹ + 1*2⁰ = 1*16 + 1*8 + 0*4 + 0*2 + 1*1 = 16+8+1 = 25.

We normally use decimal (i.e., base 10) notation where each digit is conceptually multiplied by a power of 10. The use of 10 digits is strongly related to our having 10 fingers (aka digits).

We all know about the ten's place, hundred's place, etc. The feature that the same digit is valued 1/10 as much if it is one place further to the right continues to hold to the right of the decimal point.

Computer hardware uses binary (i.e., base 2) arithmetic so to understand hardware features we could write our numbers in binary. The only problem with this is that binary numbers are long. For example, the number of US senators would be written 1100100 and the number of miles to the sun (about 93 million) would need 27 bits (binary digits).

This suggests that decimal notation is more convenient. The problem with relying on decimal notation is that we need binary notation to express multiple electrical signals and it is difficult to convert between decimal and binary because ten is not an integral power of 2.

The table on the right (for now only look at the first two columns) shows how we write the numbers from 0 to 16 in both base 10 and base 2.

Start Lecture #11

Remarks:

A practice midterm (and soln) are posted in Brightspace.
Describe midterm timing (via Brightspace);
- The midterm will appear on Brightspace at 9:30 in the assignment tab of Brightspace. You may need to refresh the tab.
- I will (try to) enable unlimited resubmissions, so save your work as you go.
- Brightspace will block any submission after the end time.
- You may bring a sheet with the library functions.
The difficulty will be comparable to lab one except:
- The midterm will have fewer questions.
- The material covered on the midterm will be all we did on C; whereas, the lab did not include the last few chapters.

O'H-2.1.1 Hexadecimal Notation

A Base 4 Compromise?

Base 10 is familiar to us, which is certainly an enormous advantage, but it is hard to convert base 10 numbers to/from base 2 and we need base 2 to express hardware operation. Base 2 corresponds well to the hardware but is verbose for large numbers.

Let's try a compromise, base 4.

To convert between base four and base two is easy since the four base 4 digits (I hate that expression, for me digit means base 10) correspond exactly to the four possible pairs of bits.

  base 4   bits
    0       00
    1       01
    2       10
    3       11

Look again at the table above but now concentrate on columns two and three.

We see that it is easy to convert back and forth between base 2 and base 4. But base 4 numbers are still a little long for comfort: a number needing n bits would use ⌈n/2⌉ base four digits.

A base 8 number would need ⌈n/3⌉ digits for an n-bit base 2 number because 8=2³ and a base 16 number would need ⌈n/4⌉. Base 8 (called octal) would be good, and was used when I learned about computers. The C language dates from this time and C has support for octal. Base 16 (called hexadecimal) is used now and C supports it.

Question: Why the switch from base 8 to base 16?
Answer: Words in a 1960s computer had 36 bits and 36 is divisible by 3 so a word consisted of exactly 12 octal digits. Words in modern computers have 32 bits and 32 is divisible by 4 (but not by 3) so a 32-bit word consists of exactly 8 base-16 digits. (Recently the word size has increased to 64 bits, but 64 is also divisible by 4 and a 64-bit word consists of exactly 16 base-16 digits.)

Question: Why were there 36-bit words?
Answer: Characters then were 6 bits so a 36-bit word held six characters.

Base 16 is called hexadecimal.

We need 16 symbols for the 16 possible digits; the first 10 are obvious 0,1,...,9. We need 6 more to represent ten, eleven, ..., fifteen.

We use A, B, C, D, E, F to represent the extra 6 digits. When we write a hexadecimal number we precede it with 0x.

So 0x1234 = 1*(16)³ + 2*(16)² + 3*(16)¹ + 4*(16)⁰ is quite a bit bigger than 1234 = 1*(10)³ + 2*(10)² + 3*(10)¹ + 4*(10)⁰

The Big Advantage of Base 16 Over Base 10

You convert a base-16 number to/from binary one hexadecimal digit (4 bits) at a time. For example

  1011000100101111 = 1011 0001 0010 1111 = B 1 2 F = 0xB12F

Look again at the table above right and notice that groups of four bits do match one hex digit.

The Downside

You need to learn (or figure out) that 0xA3 + 0x3B = 0xDE and worse 0xFF + 0xBB = 0x1BA and much worse 0xFA * 0xAF = 0xAAE6.

O'H-2.1.2 Data Sizes

Although fundamentally hardware is based on bits, we will normally think of computers as byte oriented. A byte (aka octet) consists of 8 bits (or two hex characters). As we learned, the primitive types in C (char, int, double, etc) are each a multiple of bytes in size. In fact, the multiples are powers of 2 so individual data items are 1, 2, 4, 8, or 16-bytes long.

Two Examples

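  #include <string.h>
  #include <stdio.h>
  // Print the address, hex value, and character form of each of len bytes.
  void showBytes (unsigned char *start, int len) {
    int i;
    for (i=0; i<len; i++)
      printf("%p %5x %c\n", (void *)(start+i), *(start+i), start[i]);
  }
  int main(int argc, char *argv[]) {
    if (argc < 2) {   // argv[1] is required
      fprintf(stderr, "usage: %s string\n", argv[0]);
      return 1;
    }
    showBytes((unsigned char *)argv[1], (int)strlen(argv[1]));
    return 0;
  }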

The simple program on the right prints its first argument in hex. Actually it does a little more, it prints the address of each character of the first argument, and then prints the character twice, first as a hex number and then as a character. Remember that in C, a character is an integer type.

./a.out jB4k
0x7ffd3892789d    6a  j
0x7ffd3892789e    42  B
0x7ffd3892789f    34  4
0x7ffd389278a0    6b  k

Several points to note.

The relationship between pointers and arrays in the printf().
Consecutive characters are stored in consecutive addresses. This is because each character is one byte in size.
Capital letters have smaller int values than lower case letters and digits have smaller values than letters. Those are properties of ASCII and (I believe) unicode.

  #include <stdio.h>
  int main(int argc, char *argv[]) {
    int idx;
    char   c[3];
    short  s[3];
    int    i[3];
    long   l[3];
    float  f[3];
    double d[3];
    // %p expects a void *, so each address is cast
    for (idx=0; idx<3; idx++) {
      printf("%p  %p  %p  %p\n", (void *)&c[idx], (void *)&s[idx],
              (void *)&i[idx], (void *)&l[idx]);
    }
    printf("\n");
    for (idx=0; idx<3; idx++) {
      printf("%p  %p\n", (void *)&f[idx], (void *)&d[idx]);
    }
    return 0;
  }

Alignment

The program on the right produces the following output.

 0x7fff73546565 0x7fff73546502 0x7fff73546508 0x7fff73546520
 0x7fff73546566 0x7fff73546504 0x7fff7354650c 0x7fff73546528
 0x7fff73546567 0x7fff73546506 0x7fff73546510 0x7fff73546530

 0x7fff73546514 0x7fff73546540
 0x7fff73546518 0x7fff73546548
 0x7fff7354651c 0x7fff73546550

Note that the chars are one byte apart, shorts are two bytes apart, ints and floats are four bytes apart, and longs and doubles are eight bytes apart.

Also note that chars (which are of size 1) can start on any byte, shorts (which are of size 2) can start only on even-numbered bytes, ints and floats (which are of size 4) can start only on addresses that are a multiple of 4, and longs and doubles (which are of size 8) can start only on addresses that are a multiple of 8.

In general data items of size n must be aligned on addresses that are a multiple of n.

This answers a question we posed concerning malloc(), namely malloc() returns addresses that are a multiple of the most severe alignment restriction on the system. Normally this is 16.

O'H-2.1.3 Addressing and Byte Ordering

We think of memory as composed of 8-bit bytes and the bytes in memory are numbered. So if you could find a memory as small as 1KB (kilobyte) you could address the individual bytes as byte 0, byte 1, ... byte 1023. If you numbered them in hexadecimal it would be byte 0 ... byte 3FF.

As we learned a C-language char takes one byte of storage so its address would be one number.

A 32-bit integer requires 4 bytes. I guess one could imagine storing the 4 bytes spread out in memory, but that isn't done. Instead the integer is stored in 4 consecutive bytes; the lowest of the four byte addresses is the address of the integer.

Normally, integers are aligned, i.e., the lowest address is a multiple of 4. On many systems a C-language double occupies 8 consecutive bytes, the lowest numbered of which is a multiple of 8.

Little Endian vs. Big Endian

Let's consider a 4-byte (i.e., 32-bit) integer N that is stored in the four bytes having address 0x100-0x103. The address of N is therefore 0x100, which is a multiple of 4 and hence N is considered aligned.

Let's say the value of N in binary is
0010|1111|1010|0101|0000|1110|0001|1010
which in hex (short for hexadecimal) is 0x2FA50E1A. So the four bytes numbered 0x100, 0x101, 0x102, and 0x103 will contain 2F A5 0E 1A in some order. However, a question still remains: Which byte contains which pair of hex digits?

Unfortunately two different schemes are used. In little endian order the least significant byte is put in the lowest address; whereas in big endian order the most significant byte is put in the lowest address.

Consider storing in address 0x1120 our 32-bit (aligned) integer, which contains the value 0x2FA50E1A. A little endian machine would store it this way.

  byte address  0x1120 0x1121 0x1122 0x1123
      contents   0x1A   0x0E   0xA5   0x2F

In contrast a big endian machine would store it this way.

  byte address  0x1120 0x1121 0x1122 0x1123
      contents   0x2F   0xA5   0x0E   0x1A

  int main(int argc, char *argv[]) {
    int a = 54321;
    // showBytes() takes an unsigned char *, so cast accordingly
    showBytes((unsigned char *)&a, sizeof(int));
    return 0;
  }

An Example Using `showBytes()`

On the right is an example using the showBytes() routine defined just above that gives (in hex) the four bytes in the integer 54321. The output produced is (ignoring the third column)

  0x7ffd0a0ed8f4    31   
  0x7ffd0a0ed8f5    d4   
  0x7ffd0a0ed8f6     0   
  0x7ffd0a0ed8f7     0

So the four bytes are 0x31, 0xD4, 0x0, and 0x0. If the number in hex is 31 D4 00 00 it would be much bigger than 54321 decimal. Instead the number is 00 00 D4 31 hex which does equal 54321 decimal.

So the processor in my laptop is little endian (as are all x86 processors).

Homework: 2.58.

Remark: Now imagine connecting a little endian machine to a big endian machine and sending an int from one to the other one byte at a time.

O'H-2.1.4 Representing Strings

As we know a string is a null terminated array of chars; each char occupies one byte. Given the string "tom", the char 't' will occupy one byte, 'o' will occupy the next (higher) byte, 'm' will occupy the next byte and '\0' the next (last) byte.

There is no issue of byte ordering (endian) since each character is stored in one byte and consecutive characters are stored in consecutive bytes.

O'H-2.1.5 Representing Code

Compiled code is stored in the same memory as data. However, unlike data, the format of code is not standardized. That is, the same C program when compiled on different systems will result in different bit patterns.

We will see many examples later.

Start Lecture #12

MIDTERM EXAM

Start Lecture #13

O'H-2.1.6 Introduction to Boolean Algebra

Now we know how to represent integers and characters in terms of bits and how to write each using hexadecimal notation. But what about operations like add, subtract, multiply, and divide?

We will approach this slowly and start with operations on individual bits, operations like AND and OR.

To define addition for integers you need to give a procedure for adding 2 numbers. You can't simply list all the possible addition problems since there are an infinite number of integers. However, there are only 2 possible bits and hence for a binary (i.e., two operand) operation on bits there are only four possible examples and we simply list all four possible questions and the corresponding answers. This list is often called a truth table.

The following diagram does this for six basic bit-level operations. Just below each truth table is the symbol used for that operation when drawing a diagram of an electronic circuit (a circuit diagram).

NOT
A	~A
0	1
1	0

AND
A	B	A&B
0	0	0
0	1	0
1	0	0
1	1	1

OR
A	B	A|B
0	0	0
0	1	1
1	0	1
1	1	1

XOR
A	B	A^B
0	0	0
0	1	1
1	0	1
1	1	0

NAND
A	B	NAND
0	0	1
0	1	1
1	0	1
1	1	0

NOR
A	B	NOR
0	0	1
0	1	0
1	0	0
1	1	0

Extending Boolean Operations to Bit Vectors

Once you know how to compute A|B for A and B each a single bit, you can define A|B for A and B equal length bit vectors. You just apply the operator to corresponding bits.

The same applies to &, ~, and ^.

For example 0101 | 0010 = 0111 and 1100 ^ 1010 = 0110.

Universal Boolean Operators

It turns out that if you have enough chips that compute only NAND, you are able to wire them together to support any Boolean function. We call NAND universal for this reason. This is also true of NOR but it is not true of any other two input primitive.

O'H-2.1.7 Bit-Level Operations in C

~A	NOT A
A&B	A AND B
A|B	A OR B
A^B	A XOR B

C directly supports NOT, AND, OR, and XOR as shown on the table to the left and mentioned previously in section 2.9. Note that these operations are bit-wise. That is, bit zero of the result depends only on bit zero of the operand(s), bit one of the result depends only on bit one of the operands, etc.

C does not have explicit support for NAND or for NOR.

O'H-2.1.8 Logical Operations in C

Done previously. Be careful not to confuse bit-level AND (&) with logical AND (&&). The logical operators (&&, ||, and !) treat any nonzero value as TRUE, and zero as FALSE. Also the value returned is always 0 or 1.

Note, for example, that for an 8-bit value !0x00 = 0x01, whereas ~0x00 = 0xFF.

Also remember that C guarantees short-circuit evaluation of && and ||. In particular ptr&&*ptr cannot generate a null pointer exception since, when ptr is null, *ptr is not evaluated.

Boolean in C

This was introduced in C99 so is not in the text. You may use it, but it is not required for the course.

O'H-2.1.9 Shift Operations in C

In C, the expression x<<b shifts x b bits to the left. The leftmost b bits of x are lost and the rightmost b bits of the result become 0.

There is a corresponding right shift >> but there is a question on what to do with the high order (sign) bit.

In a logical right shift all the bits move right and the new HOB becomes a zero. The >> operator is always a logical right shift for unsigned values.

In an arithmetic right shift, again all the bits shift right, but the new HOB becomes a copy of the old high order bit.

Most (perhaps all) systems perform arithmetic right shifts when the values are signed.

Homework: 2.61, 2.64.

Start Lecture #14

O'H-2.2 Integer Representations

O'H-2.2.1 Integral Data Types

Integers in C come in several sizes and two flavors. A char is a 1-byte integer; a short is a 2-byte integer; and an int is a 4-byte integer. The size of a long is system dependent. It is 4 bytes (32 bits) on a 32-bit system and 8 bytes (64 bits) on a 64-bit system.

What about the two flavors? That comes next.

O'H-2.2.2 Unsigned Encodings

The first flavor of C integers is unsigned.

We illustrate only unsigned short; the other sizes are essentially the same (but with a different number of bits). So we have 16 bits in each short integer, representing from right to left 2⁰ to 2¹⁵. If all these 16 bits are 1s, the value is
2¹⁵+2¹⁴+2¹³+2¹²+ 2¹¹+2¹⁰+2⁹+2⁸+ 2⁷+2⁶+2⁵+2⁴+ 2³+2²+2¹+2⁰ = 2¹⁶-1 = 65,535.
Question: Why?
Answer: If the number were one bigger, it would be a 1 followed by 16 zero bits so its value would be 2¹⁶.

In a sense these encodings are the most natural. They are used and they are well supported in the C language. Naturally the sum of two very big 16-bit unsigned numbers would need 17 bits; this is called overflow. Nonetheless, the situation is good for unsigned arithmetic:

Just add and it works, except for overflow.
Just subtract the smaller from the bigger and it works (overflow is impossible).
Just multiply and it works, except for overflow.
Just divide (not by 0) and it works (overflow is impossible)

But there is a problem. Unsigned encodings have no negative numbers. That is why I didn't mention subtracting the bigger from the smaller.

O'H-2.2.3 Two's-Complement Encoding

To include negative numbers there must be a way to indicate the sign of the number. Also, since some shorts will be negative and we have the same number of shorts as unsigned shorts (because we still have 16 bits), there will be fewer positive shorts than we had for unsigned shorts.

Before specifying how to represent negative numbers, let's do the easy case of non-negative numbers (i.e., positive and zero). For non-negative numbers set the leftmost bit (called the sign bit) to zero and use the remaining bits as above. Since the left bit (the high order bit or HOB) is for the sign we have one fewer for the number itself so the largest short has a zero HOB and 15 one bits, which equals 2¹⁵-1 = 32,767.

We could do the analogous technique for negative numbers: set the HOB to 1 and use the remaining 15 bits for the magnitude (the absolute value in mathematics). This technique is called the sign-magnitude representation and was used in the past, but is not common now. One annoyance is that you have two representations of zero 0000000000000000 and 1000000000000000. We will not use this encoding.

Instead of just flipping the leftmost (or sign) bit as above we form the so-called 2s-complement. For simplicity I will do 4-bit two's complement and just talk about the 16-bit analogue (and 32- and 64-bit analogues), which are essentially the same.

4-bit Two's Complement Numbers

With 4 bits, there are 16 possible numbers. Since two's complement notation has only one representation for each number (including 0), there are 15 nonzero values. Since there are an odd number of nonzero values, there cannot be the same number of positive and negative values. In fact 4-bit two's complement notation has 8 negative values (-8..-1), and 7 positive values (1..7). (In sign magnitude notation there are the same number of positive and negative values, which is convenient; but there are two representations for zero, which is inconvenient.)

The high order bit (hob) i.e., the leftmost bit is called the sign bit. The sign bit is zero for positive numbers and for the number zero; the sign bit is one for negative numbers.

Zero is written simply 0000.

1-7 are written 0001, 0010, 0011, 0100, 0101, 0110, 0111. That is, you set the sign bit to zero and write 1-7 using the remaining three lob's (low order bits). This last statement is also true for zero.

-1, -2, ..., -7 are written by taking the two's complement of the corresponding positive number. The two's complement of a (binary) number is computed in two steps.

Take the (ordinary) complement, i.e. change each one to a zero and each zero to a one. This is sometimes called the one's complement.
For example, the (4-bit) one's complement of 3 is 1100.
Add 1.
For example, the (4-bit) two's complement of 3 is 1101.

If you take the two's complement of -1, -2, ..., -7, you get back the corresponding positive number. Try it.

If you take the two's complement of zero you get zero. Try it.

What about the 8th negative number?
-8 is written 1000.
But if you take its (4-bit) two's complement, you must get the wrong number because the correct number (+8) cannot be expressed in 4-bit two's complement notation.

Remarks:

MOSES Students: I have used the information from MOSES to determine MOSES 1.5 and MOSES 2.
Everyone takes the midterm exam on NYU Brightspace.
It will be during class time 9:30-10:45 Thursday 14 Oct.
You may bring to the midterm the part of chapter 7 that lists the library routines (section 7.8).
The following is repeated from the remarks at the beginning of the last class.
The midterm will appear on Brightspace at 9:30 in the assignment tab of Brightspace. You may need to refresh the assignment tab in Brightspace.
The end time will be 10:45.
I will (try to) enable unlimited resubmissions, so save your work as you go.
Brightspace will block any submission after the end time so I strongly recommend saving your work as you go.
The difficulty will be comparable to lab one except:
- The midterm will have fewer questions.
- The material covered on the midterm will be all we did on C; whereas, the lab did not include the last few chapters.

The Sum of `x` and the Two's Complement of `x` Is Zero

Recall that the two's complement of x is ~x + 1. We want the two's complement to be the additive inverse. Let's see if it is. Remember that ~x is x complement and -x is the twos complement which equals ~x+1.

  Two'sComp(x) = ~x + 1
  x + Two'sComp(x) = x + (~x + 1)
                   = (x + ~x) + 1
                   = (111...111) + 1
                   = (-1) + 1
                   = 0

Success!

Two's Complement Addition and Subtraction

Amazingly easy (if you ignore overflows).

Addition: Just add the two 4-bit numbers, do NOT treat the sign bit in a special way, and discard any final carry-out.
Subtraction: Take the two's complement of the subtrahend (the second number) and add as above.

Comments on Two's Complement

You could reasonably ask what does this funny notation have to do with negative numbers. Let me make a few comments.

Question: What does -1 mean mathematically?
Answer: It is the unique number that, when added to 1, gives zero.

Our representation of -1 does do this (using regular binary addition and discarding the final carry-out) so we do have -1 correct.

Question: What does negative n mean, for n>0?
Answer: It is the unique number that, when added to n, gives zero.

The 1s complement of n when added to n gives all 1s, which is -1.
Thus the 2s complement, which is one larger, will give zero, as desired.

Size Ranges for 16-bit Numbers

	Decimal	Hex	Binary
Unsigned Max	65535	FF FF	11111111 11111111
Unsigned Min	0	00 00	00000000 00000000
Signed Max	32767	7F FF	01111111 11111111
Signed Min	-32768	80 00	10000000 00000000
-1	-1	FF FF	11111111 11111111

The table on the right shows the extreme values for both unsigned and signed 16-bit integers. In the signed case we also show the representation of -1 (there is no unsigned -1).

Note that the signed values all use the twos-complement representation. In fact I doubt we will use sign/magnitude (or ones'-complement) for integers any further.

	Width (bits)
	8	16	32	64
Unsigned Max	255	65,535	4,294,967,295	18,446,744,073,709,551,615
Signed Max	127	32,767	2,147,483,647	9,223,372,036,854,775,807
Signed Min	-128	-32,768	-2,147,483,648	-9,223,372,036,854,775,808

The second table on the right shows the max and min values for various sizes of integers (1, 2, 4, and 8 bytes).

O'H-2.2.4 Conversions between Signed and Unsigned

General rule: Be Careful!

O'H-2.2.5 Signed versus Unsigned in C

  #include <stdio.h>
  int main(int argc, char *argv[]) {
    int i1=-1, i2=-2;
    unsigned int u1, u2=2;
    u1 = i1; // implicit cast (unsigned)
    printf("u1=%u\n", u1);
    printf( "%s\n", (i2>u2) ? "yes" : "no");
    return 0;
  }

The code on the right illustrates why we must be careful when mixing unsigned and signed values. The fundamental rule that is applied in C when doing such conversions (actually called casts) is that the bit pattern remains the same even though this sometimes means that the value changes.

When I ran the code on the right, the output was

  u1=4294967295
  yes

When the code executes u1=i1, the bits in i1 are all ones and this bit pattern remains the same when the value is cast to unsigned and placed in u1. So u1 becomes all 1s which is a huge number, as we see in the output.

When we compare i2>u2, either the -2 in i2 must be converted to unsigned or the 2 in u2 must be converted to signed. The rule in C is that the conversion goes from signed to unsigned so the -2 bit pattern in i2 is reinterpreted as an unsigned value. With that interpretation i2 is indeed much bigger than the 2 in u2.

O'H-2.2.6 Expanding the Bit Representation of a Number

We have just seen signed/unsigned conversions. How about short to int or int to long?
How about unsigned int to unsigned long? I.e., converting when the sizes are different but the signedness is the same.

Converting one unsigned integer to a longer one: Pad the shorter value on the left with zeros.
Converting one signed integer to a longer one: sign extend the shorter. That is, propagate the sign bit of the shorter to the left to give the length of the longer.

Summary of Conversion Ordering

In summary C converts in the following order. That is, types on the left are converted to types on the right.

int → unsigned int → long → unsigned long → float → double → long double.

O'H-2.2.7 Truncating Numbers

What if you want to put an int into a short or put a long into an int?

Bits are simply dropped from the left, which can alter both the value and the sign.

Advice: Don't do it.

O'H-2.2.8 Advice on Signed versus Unsigned

Be careful!!

O'H-2.3 Integer Arithmetic

O'H-2.3.1 Unsigned Addition

Binary addition (i.e., addition of binary numbers) is performed the same as decimal addition. You can add a column of numbers in binary as with decimal, but we will be content to just add two binary numbers.

You proceed right to left and may have to carry a "1".

The only problem is overflow, i.e., where the sum requires more bits than are available. That means there is a carry out of the HOB. For example if you were using 3-digit decimals, the sum 834+645 does not fit in 3 digits (there is a carry out of the hundreds place into what would be the thousands place). Similarly using 4-bit binary numbers, the sum 0111+1001 does not fit in 4 bits.

When there is no overflow, (computer, i.e., binary) addition is conceptually done right to left one bit at a time with carries just like we do for base 10.

In reality very clever tricks are used to enable multiple bits to be added at once. You could google ripple carry and carry lookahead or see my lecture notes for computer architecture.

O'H-2.3.2 Two's-Complement Addition

The news is very good—you just add as though it were unsigned addition and throw away any carry-out from the HOB (high order bit).

Only overflow is a problem (as it was for unsigned). However, detecting overflow is not the same as for unsigned. Consider 4-bit 2s complement addition; specifically (-1) + (-1). 1111 + 1111 = 11110, which becomes 1110 after dropping the carry-out. But overflow did not occur: 1110 is the correct sum of 1111 + 1111!

The correct rule is that overflow occurs when and only when the carry into the HOB does not equal the carry out of the HOB.

Summary of Two's Complement Addition (Includes Subtraction)

Assume we are adding two n-bit numbers.
Just add the bits R to L as normal.
If the carry-in to the HOB = the carry-out from the HOB, the answer is correct.
If the carry-in != the carry-out, an overflow occurred and the correct answer cannot be expressed in the number of bits available.

O'H-2.3.3 Two's Complement Negation

Recall that with two's complement there is one more negative number than positive number. In particular, the most-negative number has no positive counterpart. Specifically, for n-bit twos complement numbers, the range of values is

      most neg = -2^(n-1) ... 2^(n-1)-1 = most pos

For every value except the most neg, the negation is obtained by simply taking the two's complement, independent of whether the original number was positive, negative, or zero.

O'H-2.3.4 Unsigned Multiplication

Multiply the two n-bit numbers, which gives up to 2n bits, and discard the n HOBs. Again, the only problem is overflow.

O'H-2.3.5 Two's Complement Multiplication

  3 * (-4):
        11100   (-4)
      x 00011   ( 3)
      -------
        11100
       11100
      -------
      1010100

A surprise occurs. You just multiply the two's complement numbers, truncate the HOBs, and ... it works, except for overflow.

On the board do 3 * (-4) using 5 bits.

3 = 00011; 4 = 00100; (~4) = 11011; (-4) = 11100

The multiplication is on the right. Now truncate 1010100 to 5 bits. You get y = 10100. Is this y = -12?
~y = 01011. -y = 01100 = 8+4 = 12
It works!
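
The same check can be done in C. The sketch below is an illustration only (the name multiply_truncate is invented here, and unsigned is assumed to be at least 32 bits): it multiplies the two bit patterns as if they were unsigned and keeps the low n bits.

```c
#include <assert.h>

/* Multiply two n-bit patterns and keep only the low n bits (1 <= n <= 16). */
unsigned multiply_truncate(unsigned a, unsigned b, int n) {
    assert(n >= 1 && n <= 16);      /* keeps the 2n-bit product within 32 bits */
    unsigned mask = (1u << n) - 1;
    return ((a & mask) * (b & mask)) & mask;
}
```

With 5 bits, multiply_truncate(0x03, 0x1C, 5) multiplies 00011 by 11100 and yields 10100, the pattern for -12 found above.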

O'H-2.3.6 Multiplying by Constants

You can multiply x*2^k (k≥0) by just shifting x<<k. This is reasonably clear for x≥0, but works for 2s complement as well.

Note that compilers are clever and utilize identities like

  x * 24  =  x * (32-8)  =  x*32 - x*8  =  (x<<5) - (x<<3)

The reason for doing this is that shift/add/sub are faster than multiplication.
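
A sketch of the x*24 identity in C. The function name times24 is made up. Note the parentheses: in C, subtraction binds tighter than <<. The shifts are done on unsigned long because left-shifting a negative signed value is undefined in C; converting the result back to long is implementation-defined, and the sketch assumes the usual wrap-around behavior (as with gcc).

```c
/* Compute x*24 as (x<<5) - (x<<3), i.e., x*32 - x*8. */
long times24(long x) {
    unsigned long ux = (unsigned long)x;
    return (long)((ux << 5) - (ux << 3));
}
```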

O'H-2.3.7 Dividing by Powers of 2

Division is even slower than multiplication; so we note that right shifting by k gives the same result as dividing by 2^k. Actually it gives the floor of the division.

If the value being divided is unsigned, use logical right shift; if it is signed, use arithmetic right shift.
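
The two kinds of right shift can be sketched in C as follows. The function names are invented for illustration. Since C leaves >> on a negative value implementation-defined, the arithmetic version below is written so that it only ever shifts non-negative values. Note the difference from C's own division, which truncates toward zero: -7/2 is -3 in C, but shifting -7 right by 1 gives the floor, -4.

```c
#include <assert.h>
#include <limits.h>

/* Arithmetic right shift: copies of the sign bit enter from the left.
   For negative x, ~x is non-negative, so the inner shift is well defined. */
long shift_right_arithmetic(long x, int k) {
    assert(k >= 0 && k < (int)(sizeof(long) * CHAR_BIT));
    return x < 0 ? ~(~x >> k) : x >> k;
}

/* Logical right shift: zeros enter from the left. */
unsigned long shift_right_logical(unsigned long x, int k) {
    assert(k >= 0 && k < (int)(sizeof(unsigned long) * CHAR_BIT));
    return x >> k;
}
```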

O'H-2.3.8 Final Thoughts on Integer Arithmetic

Unsigned

Addition and multiplication work unless there is an overflow.

Adding two n-bit unsigned numbers gives (up to) an (n+1)-bit result, which we fit into n bits by dropping the HOB. So you get an overflow if the HOB of the (n+1)-bit result is 1.

Multiplying two n-bit unsigned numbers gives (up to) a 2n-bit result, which we fit into n bits by dropping the n HOBs. So you get an overflow if any of the n HOBs of the result are 1.
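
For n = 32 the two unsigned rules can be checked in C by computing the full-width result in a 64-bit variable and looking at the bits that would be dropped. This is a sketch with made-up function names, not the way hardware or the book does it.

```c
#include <stdint.h>

/* Add in 64 bits; overflow if the extra high-order bit of the true sum is 1. */
int add_overflows_u32(uint32_t a, uint32_t b) {
    uint64_t wide = (uint64_t)a + b;        /* the full 33-bit sum */
    return (wide >> 32) != 0;
}

/* Multiply in 64 bits; overflow if any of the 32 HOBs of the product is 1. */
int mul_overflows_u32(uint32_t a, uint32_t b) {
    uint64_t wide = (uint64_t)a * b;        /* the full 64-bit product */
    return (wide >> 32) != 0;
}
```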

Two's Complement.

Same idea but detecting overflow is more complicated. For addition of n-bit numbers, which includes subtraction, the non-obvious rule is that an overflow occurs if the carry into the HOB (bit n-1) != the carry-out from that bit.

Homework:

Convert 40 decimal to 7-bit binary and take the 2s complement (which represents -40).
Convert 28 decimal to 7-bit binary and take the 2s complement (which represents -28).
Add the two binary values (40 + (-28)).
Did you get 12?
Add the two binary values ((-40) + 28).
Take the 2s complement.
Did you again get 12?

O'H-2.4 Floating Point

Recitation topic.

2.4.1 Fractional Binary Numbers

Exactly analogous to decimal numbers with a decimal point. Just as 0.01 in decimal is one-hundredth, 0.01 in binary is one-quarter and 0x0.01 is one-twohundredfiftysixth.

Examples

  1 1/2    = 1.1
  3 5/8    = 11.101
  9 3/16   = 1001.0011
  abcd.efg = a*2³ + b*2² + c*2¹ + d + e*2^-1 + f*2^-2 + g*2^-3

If instead of powers of 2, we used powers of 10, the above would be how we write numbers with decimal points.
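
The positional rule above is easy to turn into code. The following C sketch (the function name binary_fraction_value is made up for these notes) evaluates a string of binary digits with an optional binary point; it stops at the first character that is not part of the number.

```c
/* Evaluate a string of binary digits with an optional binary point,
   e.g. "11.101", using the positional weights shown above. */
double binary_fraction_value(const char *s) {
    double value = 0.0;
    /* Integer part: each new bit doubles what came before. */
    while (*s == '0' || *s == '1') {
        value = 2.0 * value + (*s - '0');
        s++;
    }
    /* Fractional part: weights 2^-1, 2^-2, ... */
    if (*s == '.') {
        double weight = 0.5;
        s++;
        while (*s == '0' || *s == '1') {
            value += weight * (*s - '0');
            weight /= 2.0;
            s++;
        }
    }
    return value;
}
```

For instance, binary_fraction_value("11.101") gives 3.625, i.e., 3 5/8.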

Why Fractional Binary is Not Used

Fractional binary notation requires considerable space for numbers that are very large in magnitude or very near zero.

  5 * 2¹⁰⁰ = 101000000...0
               | 100 0s |
  2^-100 = 0.0000000...01
           | 100 0s |

(The second example above uses sign-magnitude.)

But numbers like these come up in science all the time and the solution used is often called scientific notation.
Avogadro's number ~ 6.02 * 10²³
Light year ~ 5.88 * 10¹² miles

The coefficient is called the mantissa or significand.

In computing we use IEEE floating point, which is basically the same solution but with an exponent base of 2 not 10. As we shall see there are some technical differences.

2.4.2 IEEE Floating-Point Representation (Done in Recitation)

Represent a floating number as

  (-1)^s × M × 2^E

Where

s (for sign) determines if the number is positive or negative.
M (possibly for mantissa) is called the significand. It is a fractional value related to fractional binary numbers above.
E (for exponent) weights the value by a (possibly negative) power of 2.

Naturally, s is stored in one bit.

For single precision (float in C) E is stored in 8 bits and M is stored in 23. Thus, a float in C requires 1+8+23 = 32 bits.

For double precision (double in C) E is stored in 11 bits and M in 52. Thus, a double in C requires 1+11+52 = 64 bits.

Now it gets a little complicated; the values stored are not simply E and M and there are 3 classes of values.

The Exponent as Stored

Let's just do single precision; double precision is the same idea, just with more bits. The number of bits used for the exponent is 8.

Although the exponent E itself can be positive, negative, or zero the value stored exp is unsigned. This is accomplished by biasing the E (i.e., adding a constant so the result is never negative).

With 8 bits of exponent, there are 256 possible unsigned values for exp, namely 0...255. We let E = exp-127 so the possible values for E are -127...128.

Stated the other way around, the value stored for the exponent is the true exponent +127.

The Significand as Stored

With scientific notation we write numbers as, for example, 9.4534×10¹². An analogous base 2 example would be 1.1100111×2¹⁰.

Note that in 9.4534 the four digits after the decimal point each distinguish between 10 possibilities whereas the digit before the decimal point only distinguishes between 9 possibilities, so is not fully used.

Note also that in 1.1100111 the 1 to the left distinguishes between one possibility, i.e. is useless.

IEEE floating point does not store the bit to the left of the binary point because it is always 1 (actually see below for the other two classes of values).

An Example (Single Precision, 32 Bits)

  Let F        = 15213.0₁₀
               = 11101101101101₂
               = 1.1101101101101₂×2¹³
  fract stored = 11011011011010000000000₂
  exp stored   = 13+127 = 140 = 10001100₂
  sign stored  = 0
  value stored = 0 10001100 11011011011010000000000
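
You can check this example on a real machine. The C sketch below copies the bytes of a float into a 32-bit integer and splits out the three stored fields. The names float_fields, float_bits, and split_float are made up for these notes, and the code assumes that float is the 32-bit IEEE single-precision format (true on the Intel machines we use).

```c
#include <stdint.h>
#include <string.h>

/* The three stored fields of a single-precision value. */
struct float_fields {
    uint32_t sign;   /* 1 bit */
    uint32_t exp;    /* 8 bits, the biased exponent as stored */
    uint32_t frac;   /* 23 bits, the significand without its leading 1 */
};

/* Copy the bytes of f into an integer so the bits can be examined. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

struct float_fields split_float(float f) {
    uint32_t bits = float_bits(f);
    struct float_fields r;
    r.sign = bits >> 31;
    r.exp  = (bits >> 23) & 0xFF;
    r.frac = bits & 0x7FFFFF;
    return r;
}
```

For 15213.0f the stored pattern 0 10001100 11011011011010000000000 is 0x466DB400 in hex, so split_float gives sign 0, exp 140, and frac 0x6DB400.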

Denormalized (Subnormal) Encoding

Used when the stored exponent is all zeros, i.e., when the exponent is as negative as possible, i.e., when the number is very close to 0.0.

The value of the significand and exponent in terms of the stored value is slightly different.

Note there are two zeros since IEEE floating point is basically sign magnitude.

Special Values Encoding

Used when the stored exponent is all ones, i.e., when the exponent is as large as possible.

If the significand stored is all zeros, the value represents infinity (positive or negative), for example overflow when doing 1.0/0.0.

If the significand is not all zero, the value is called NaN for not-a-number. It is used in cases like sqrt(-1.0), infinity - infinity, infinity × 0.
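
The rules for choosing among the classes depend only on the stored exponent and significand, so they can be sketched as a small C function that works directly on a 32-bit pattern. The enum and function names are invented for illustration; here the special-values class is split into infinity and NaN.

```c
#include <stdint.h>

enum float_class { FC_NORMALIZED, FC_DENORMALIZED, FC_INFINITY, FC_NAN };

/* Decide which encoding a 32-bit single-precision pattern uses. */
enum float_class classify_bits(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFF;   /* stored exponent */
    uint32_t frac = bits & 0x7FFFFF;       /* stored significand */

    if (exp == 0)                          /* all zeros: denormalized, includes +0.0 and -0.0 */
        return FC_DENORMALIZED;
    if (exp == 0xFF)                       /* all ones: special values */
        return frac == 0 ? FC_INFINITY : FC_NAN;
    return FC_NORMALIZED;
}
```

The sign bit plays no role in the classification, which is why both 0x00000000 (+0.0) and 0x80000000 (-0.0) land in the denormalized class.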

Summary

IEEE floating point represents numbers as (-1)^s × M × 2^E. There are extra complications to store the most information in a fixed number of bits.

Chapter O'H-3 Machine Level Representation of Programs

3.1 Historical Perspective

The book covers the Intel architecture, which dominates laptops, desktops, and data centers. Some of the fastest supercomputers also have (many) Intel CPUs.

It is not used in cell phones and tablets.

The Intel architecture has been remarkably successful from commercial and longevity standpoints.

Modern systems are backwards compatible with the 8086 version introduced in 1978 (more than 40! years ago). In addition to the commercial advantages of backwards compatibility, its implementation is a technological tour-de-force, which has won awards for its engineering.

However ...

It has a horrendously complicated instruction set. It is called a CISC (Complex Instruction Set Computer) design. Architectures designed in the last few decades tend to have RISC (Reduced Instruction Set Computer) designs and current implementations of the Intel architecture actually (during execution!) translate many of the complex instructions to a simpler core set.

Our Usage

The book (wisely) only covers a small subset of the possible instructions. For example, we limit arithmetic to operations on 64-bit data, ignoring the 32-, 16-, and 8-bit arithmetic supplied for backwards compatibility. If you use gcc (or cc) on your laptop (or access, or linserv1) you will see these instructions.

3.2 Program Encodings

gcc -Og

We have normally compiled C programs with a simple cc command. Different C compilers can produce different assembly for the same C program. Also, normally the goal of the compilation is to generate high-performance output. We instead are interested in simple assembler output. As a result, to compile the program joe.c we will use the command

    gcc -Og -S joe.c

On many computers, in particular on linserv1, gcc and cc are the same, but the -Og is needed to generate simple (vs. high performance) assembly code. The -S tells the compiler to make available the generated assembly language code.

3.2.1 Machine Level Code

The CPU State

The machine state of any processor has details that are under the covers in C or Java or Python or ... . The state for the Intel architecture includes:

Registers. The fastest memory in the system. The Intel architecture defines 16 registers, each containing 64-bits. Much of the action occurs in these registers. Each register has a fixed name used to reference it in assembly programming.
In addition, the low-order 32 bits of each register have a separate name as do the low-order 16 bits, as do the low-order 8-bits. We will make little use of the subset registers, which are provided primarily for the backwards compatibility mentioned previously.
The instruction pointer, %rip, is an additional 64-bit register (it is not one of the 16 registers above). It contains the address of the next instruction to execute.
Condition code registers. These (additional, one-bit registers) store status information about the most recent arithmetic or logical test. Used in conditional branching.

Memory

Memory is simply a huge array of bytes. Compiled instructions as well as data reside here. A portion of memory is used as a stack to support procedure calls and returns.

Interaction Between the CPU and Memory

Since the data and the program instructions are stored in memory, the CPU needs to fetch both during execution. The CPU sends an address to memory which responds with the contents of that address.

When the CPU needs to store a computed result into memory, it again sends the address in addition to sending the new value.

In summary

Addresses flow from the CPU to the Memory.
Instructions flow from the Memory to the CPU.
Data flows both ways (CPU to/from Memory).

Remarks: Mostly repeated from last class.

MOSES Students: If you have 1.5x or 2x time for the exam from MOSES, please send me an email with
1. your name
2. course = 201
3. either 1.5x MOSES or 2x MOSES
Everyone takes the midterm exam on NYU Brightspace.
It will be during class time 9:30-10:45 Thursday 18 March.
You may bring to the midterm the part of chapter 7 that lists the library routines (section 7.8).
The following is repeated from the remarks at the beginning of the last class.
The midterm will appear on Brightspace at 9:30 in the assignment tab of Brightspace. You may need to refresh the assignment tab in Brightspace.
The end time will be 10:45. The accept until time will be ~10:50.
I will enable unlimited resubmissions, so save your work as you go.
Brightspace will block any submission after the accept until time so I strongly recommend saving your work as you go.
The difficulty will be comparable to lab one except:
- The midterm will have fewer questions.
- The material covered on the midterm will be all we did on C; whereas, the lab did not include the last few chapters.

3.2.2 Code Examples

Read. We will soon learn what many of these instructions do.

On access.cims.nyu.edu I have written mstore.c from the book.

  long mult2(long, long);

  void multstore(long x, long y, long *dest) {
    long t = mult2(x,y);
    *dest = t;
  }

Register Name	Conventional Use	Register Name	Conventional Use
%rax	Return Value	%r8	5th argument
%rbx	callee saved	%r9	6th argument
%rcx	4th argument	%r10	caller saved
%rdx	3rd argument	%r11	caller saved
%rsi	2nd argument	%r12	callee saved
%rdi	1st argument	%r13	callee saved
%rbp	callee saved	%r14	callee saved
%rsp	stack pointer	%r15	callee saved

3.2.3 Notes on Formatting

I compiled mstore.c on crackle2 with gcc -Og -S , which generates mstore.s. I then removed the lines in mstore.s beginning with a dot. This gives the same code as the book. Here it is with line numbers and comments added (as in the book).

  // void multstore(long x, long y, long *dest)
  // x in %rdi, y in %rsi, dest in %rdx
  1  multstore:
  2      pushq	%rbx           // save %rbx on stack
  3      movq	%rdx, %rbx     // copy dest to %rbx
  4      call	mult2          // call mult2(x, y)
  5      movq	%rax, (%rbx)   // store result at *dest
  6      popq	%rbx           // restore %rbx
  7      ret                   // return

3.3 Data Formats

C declaration	Intel data type	Suffix	Size in bytes
char	Byte	b	1
short	Word	w	2
int	Double word	l	4
long	Quad word	q	8
char *	Quad word	q	8
float	Single precision	s	4
double	Double precision	d	8

We see from the table above that integers (which include pointers, i.e., addresses) can be (on a 64-bit machine) either 1, 2, 4, or 8 bytes in length.
char * is used to refer to any pointer; perhaps void * would have been a better name.
Floating point can be 4 or 8 bytes. (There is an Intel-only 10-byte float; we won't use it.)
There are no arrays or structures.
Instructions are stored as sequences of bytes.
The suffix is interesting: you don't declare a datum to be a short, you just access it with an instruction having a w suffix.

3.4 Accessing Information

Most operations are performed on the 16 registers (the fastest memory in the system). But memory can be accessed directly as well. Typically, data is moved from memory to registers, then operated on (add, sub, etc) and then put back in memory.

For historical reasons concerning backward compatibility the registers have funny names.

We will look at 3 types of assembly instructions.

Arithmetic and logical operations (add, or, etc) mostly on registers but one operand can be in memory.
Transfer data between memory and registers.
- Load data from memory into a register.
- Store data from a register into memory.
Transfer control
- Unconditional jumps to/from procedures.
- Conditional branches.

O'H-3.4.1 Operand Specifiers

As we shall see, most operations have one or two operands. There are three types of operands.

Immediate: written as $ followed by C notation. For example, $67, $-3, $0x3D.
Register: % followed by the register name. %rax, %rbx, %rcx, %rdx, %rsi, %rdi, %rbp, %rsp, %r8, %r9, %r10, %r11, %r12, %r13, %r14, %r15
Memory: Many possibilities. See Figure 3.3 in the book for the whole story. We will show many examples and it will be discussed in recitation.

The register names above are for the full 64-bit registers. For each of these registers there are other names for the low-order 32-bit subset, the low-order 16-bit subset, and the low-order 8-bit subset. We will use only the names for the full 64-bit operands.

Examples

Type	Example	Value if Source (r-value)	Value if Destination (l-value)
Immediate	$-32	-32	Not possible
Immediate	$0x10F	271	Not possible
Register	%rax	Contents of register rax	The register itself
Memory	12345	Contents of memory location 12345	The location itself
Memory	(%rax)	Contents of memory at location specified in register rax	The memory location specified in register rax
Memory	0xF(%rax)	Contents of mem loc(15 + contents rax)	The mem location (15 + contents rax)
Memory	(%rax,%rbx)	Contents of loc (contents rax + contents rbx)	The location (contents rax + contents rbx)
Memory	8(%rax,%rbx)	Contents of loc (8 + contents rax + contents rbx)	The location (8 + contents rax + contents rbx)
Memory	(,%rbx,4)	Contents of loc 4*(contents rbx)	The location 4*(contents rbx)
Memory	15(,%rbx,8)	Contents of loc (15 + 8*contents rbx)	The location (15 + 8*contents rbx)
Memory	(%rax,%rbx,2)	Contents of loc (contents rax + 2*contents rbx)	The location (contents rax + 2*contents rbx)
Memory	23(%rax,%rbx,4)	Contents of loc (23 + contents rax + 4*contents rbx)	The location (23 + contents rax + 4*contents rbx)

3.4.2 Data Movement Instructions

The basic data movement instruction is called move and is written mov with a suffix to indicate the size of the data item moved. It is somewhat misnamed; it is really a copy not a move.

movb means move byte (1 byte).
movw means move word (2 bytes).
movl means move double word (4 bytes).
movq means move quad word (8 bytes).

The src is given first then the destination (the reverse of C). For example the C statement *dest = t; might become movq %rax, (%rbx)

A move instruction cannot have both operands in memory, at least one must be a register (or the source an immediate).

O'H-3.4.3 Data Movement Example

  long plus(long x, long y);

  void sumstore (long x, long y,
                 long *dest) {
    long t = plus(x, y);
    *dest = t;
  }

  sumstore:
    pushq   %rbx
    movq    %rdx, %rbx
    call    plus
    movq    %rax, (%rbx)
    popq    %rbx
    ret

The code above is just representative.
The caller of any function (e.g., sumstore()) must put the first argument in %rdi, the second in %rsi, and the third in %rdx.
The result of plus() will be in %rax.
These rules apply to all function calls and returns. We follow them when we call plus().
Since %rbx contains dest, (%rbx) is *dest!!
Assume pointers and longs are 8-bytes in C.

Operand Combinations

The size of a specified register must match the size of the move itself.

There are variants of move that sign extend or zero extend. These are used when you move a value to a longer format.

For example, movzbl moves from a byte to a doubleword (32-bits) by zero extending (on the left).

Similarly, movsbw moves from a byte to a word (16-bits) by sign extending (replicating the sign bit).

There are other special cases as well; see the book for details. One oddity is that, when the target is a register, movzbl acts the same as movzbq: it moves the byte and then zeros the high-order 7 bytes of the quadword even though its name suggests it would only zero the 3 high-order bytes of the double word.

Simple (Memory) Addressing Modes

(registerName): A very easy case is when the desired memory address is already in a register, in which case you just put the register name inside parentheses.
movq (%rcx),%rax
This views the contents of %rcx as an address and moves (i.e., copies) the contents of that address (8 bytes) to register %rax.
Disp(registerName): The memory address is Disp + the contents of the register.
movq 0x80(%rcx),%rax
The same as the above but the memory address is 0x80 + the contents of %rcx.

Interlude: Swap in Assembler Language (Slides 1:23-29)

  void swap(int *xp, int *yp) {
      int t0, t1;
      t0 = *xp;
      t1 = *yp;
      *xp = t1;
      *yp = t0;
  }

Note: This swap uses two temporaries. The one we did a month ago used only one. However, that algorithm uses memory-to-memory moves, so it needs to be modified for the Intel machine language.

General Memory Addressing

The most general form is Disp(Rb,Ri,S)

Rb and Ri are register names, b and i abbreviate base and index
Disp (often called D) is a displacement.
S is a scale factor and must be 1, 2, 4, or 8.
The address referenced is the sum of Disp, the contents of Rb, and S times the contents of Ri.
Formally, the address accessed is Mem[Reg[Rb]+S*Reg[Ri]+D]

Explain why this (complicated) address is useful for stepping through a C array.

Special Cases

(Rb,Ri): Mem[Reg[Rb]+Reg[Ri]]
D(Rb,Ri): Mem[D+Reg[Rb]+Reg[Ri]]
(Rb,Ri,S): Mem[Reg[Rb]+S*Reg[Ri]]

Examples

Expression	Address Computation	Address
0x8(%rdx)	0xf000 + 0x8	0xf008
(%rdx,%rcx)	0xf000 + 0x100	0xf100
(%rdx,%rcx,4)	0xf000 + 4*0x100	0xf400
0x80(,%rdx,2)	2*0xf000 + 0x80	0x1e080

Assume these two registers have been set in advance.

%rdx = 0xf000
%rcx = 0x0100

Then the table above shows various addresses that can be composed using these two registers and some subsets of the general addressing mode above.
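
The address arithmetic is simple enough to write as a one-line C function. This is just a sketch for checking the table by hand; the name effective_address is made up, and a register that is omitted from the operand is passed as 0 (with scale 1 if there is no index).

```c
#include <assert.h>
#include <stdint.h>

/* Address named by Disp(Rb,Ri,S): Disp + Reg[Rb] + S*Reg[Ri]. */
uint64_t effective_address(uint64_t disp, uint64_t rb, uint64_t ri, unsigned s) {
    assert(s == 1 || s == 2 || s == 4 || s == 8);
    return disp + rb + (uint64_t)s * ri;
}
```

With %rdx = 0xf000 and %rcx = 0x100, the call effective_address(0, 0xf000, 0x100, 4) reproduces the (%rdx,%rcx,4) row. This also shows why the general form suits C arrays: if Rb holds the address of a[0], Ri holds i, and S is the size of one element, the result is the address of a[i].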

3.4.4 Pushing and Popping the Stack

The Intel architecture has support for a stack maintained in memory. The three key components are the two instructions
pushq Src and popq Dest
and the dedicated register %rsp

Although the Src operand of pushq can be fairly general we will use only the case where Src is simply a register, for example %rbp. Then the instruction
pushq %rbp
has the same effect as the two instruction sequence.
subq $8,%rsp
movq %rbp,(%rsp)

Analogously,
popq %rax
has the same effect as the two instruction sequence.
movq (%rsp),%rax
addq $8,%rsp

Show how this corresponds to an (upside down) stack.
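
A toy simulation in C may help. It is only a sketch under simplifying assumptions: memory is a 64-byte array, rsp holds an offset into that array rather than a real address, and the struct and function names are made up. The two-step bodies follow the two-instruction sequences given above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* A toy machine: a small byte-addressed memory and a stack pointer.
   The stack grows toward lower addresses. */
struct machine {
    unsigned char mem[STACK_BYTES];
    uint64_t rsp;
};

void machine_init(struct machine *m) {
    memset(m->mem, 0, sizeof m->mem);
    m->rsp = STACK_BYTES;                 /* empty stack: rsp just past the end */
}

/* pushq Src:  subq $8,%rsp  then  movq Src,(%rsp) */
void pushq(struct machine *m, uint64_t src) {
    assert(m->rsp >= 8);
    m->rsp -= 8;
    memcpy(&m->mem[m->rsp], &src, 8);
}

/* popq Dest:  movq (%rsp),Dest  then  addq $8,%rsp */
uint64_t popq(struct machine *m) {
    uint64_t dest;
    assert(m->rsp + 8 <= STACK_BYTES);
    memcpy(&dest, &m->mem[m->rsp], 8);
    m->rsp += 8;
    return dest;
}
```

Pushing 0x1111 and then 0x2222 lowers rsp by 16; the first popq returns 0x2222 (last in, first out) and the second returns 0x1111, leaving rsp where it started.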

3.5 Arithmetic and Logical Operations

3.5.1 Load Effective Address

Recall that a memory address can be complicated: it can involve two registers, a scale factor, and an additive constant. Sometimes you want to do that arithmetic on some registers but don't want to reference memory at all. There is an instruction to do just that. It is called load effective address: leaq.

leaq Src, Dest

Src is typically an expression of the form permitted by the general addressing mode above
Dest is the register that is to get the result of the computation.

Typical uses

Translating the C statement p = &x[i];
If x is an array of 4-byte ints, &x[0] is in %rdx, and i is in %rdi, then leaq (%rdx,%rdi,4),%rax puts &x[i] in %rax.

Fun for compilers. For example x*12 becomes

       leaq (%rdi,%rdi,2), %rax   # t <-- x+x*2 = 3x
       salq $2, %rax              # t <-- t<<2 = 4t = 12x

Limited 3-operand instructions.

Start Lecture #16

3.5.2 Unary and Binary Operations

For now we shall ignore possible overflows. So by the miracle of 2s complement signed and unsigned are the same!

Unary Operations

  Instruction        Effect         C Equivalent
   incq Dest    Dest =  Dest + 1    Dest++;
   decq Dest    Dest =  Dest - 1    Dest--;
   negq Dest    Dest = -Dest        Dest = -Dest;
   notq Dest    Dest = ~Dest        Dest = ~Dest;

Binary Operators

  Instruction           Effect         C Equivalent
  addq  Src,Dest   Dest = Dest + Src   Dest += Src;
  subq  Src,Dest   Dest = Dest - Src   Dest -= Src;
  imulq Src,Dest   Dest = Dest * Src   Dest *= Src;
  xorq  Src,Dest   Dest = Dest ^ Src   Dest ^= Src;
  orq   Src,Dest   Dest = Dest | Src   Dest |= Src;
  andq  Src,Dest   Dest = Dest & Src   Dest &= Src;

The Shifts

There are of course left and right shifts, but remember that there are two kinds of right shift: arithmetic right shift, which sign extends, and logical, which just adds zeros on the left.

For consistency the assembler also has logical and arithmetic left shift commands but they are just synonyms for the same operation, which adds zeros on the right.

  Instruction         Effect         C Equivalent
  salq k,Dest    Dest = Dest << k    Dest <<= k;
  shlq k,Dest    Dest = Dest << k    Dest <<= k;
  sarq k,Dest    Dest = Dest >> k    Dest >>= k;
  shrq k,Dest    Dest = Dest >> k    Dest >>= k;

Although I wrote the two right shifts as having the same C equivalent, they are different. The book writes >>_A and >>_L to distinguish them. In C they are written the same, but most, if not all, C compilers use arithmetic right shift for signed values and logical right shift for unsigned.

A cheat sheet is on Brightspace/contents

  void swap3(long *xp, long *yp, long *zp) {
    long t = *xp;
    *xp = *yp;
    *yp = *zp;
    *zp = t;
  }

Homework: Write an assembly language version of swap3().

A version in C is above.
Assume that, when your assembler program begins, registers %rdi, %rsi, %rdx contain xp, yp, zp, respectively.
I suggest looking again at the swap example from the slides (you can find it on the home page).
You might want to rewrite the C version of swap3() in the assembly language style used in swap() above.
Questions like this are appropriate for the final exam.

Notes:

In C (or in math) we refer to A = B + C as a binary operation. However, it does not fit the examples of binary operators above because, counting the destination, there are three operands.
In more modern (less ancient) CPU designs, arithmetic is done on values in registers and the add command takes three operands. For example, you might see something like
add r3, r4, r2 // reg 3 = reg 4 + reg 2;
The closest analogue among the instructions above is leaq, for example
leaq (%r8,%r9), %r10 // %r10 = %r8 + %r9

An Example

Below is the assembler version, which assumes that initially x is in %rdi, y is in %rsi, and z is in %rdx.

It is followed by the register usage and then by the C program (written to look a little like assembler).

  arith:
    leaq  (%rdi,%rsi), %rax   // t1
    addq  %rdx, %rax          // t2
    leaq  (%rsi,%rsi,2), %rdx
    salq  $4, %rdx            // t4
    leaq  4(%rdi,%rdx), %rcx  // t5
    imulq %rcx, %rax          // ans
    ret

Register	Use
%rdi	x
%rsi	y
%rdx	z,t4
%rax	t1,t2,ans
%rcx	t5

  long arith (long x, long y,
              long z) {
    long t1 = x+y;
    long t2 = z+t1;
    long t3 = x+4;
    long t4 = y*48;
    long t5 = t3 + t4;
    long ans = t2 * t5;
    return ans;
  }

Note that

Assembler instructions can be in a different order from the C code.
Some expressions need multiple assembler instructions.
Some assembler instructions cover multiple expressions.
A register can contain only one value at a time.
A register can contain different values at different times.
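
To see the correspondence line by line, here is a sketch in C in which each statement stands for one assembler instruction and each variable is named for the register it imitates. The function arith_by_register is made up for illustration (it is not compiler output); arith is repeated so the two can be compared.

```c
/* The C function from the example. */
long arith(long x, long y, long z) {
    long t1 = x + y;
    long t2 = z + t1;
    long t3 = x + 4;
    long t4 = y * 48;
    long t5 = t3 + t4;
    long ans = t2 * t5;
    return ans;
}

/* The same computation, one C statement per assembler instruction. */
long arith_by_register(long rdi, long rsi, long rdx) {
    long rax, rcx;
    rax = rdi + rsi;        /* leaq  (%rdi,%rsi), %rax    t1 = x + y      */
    rax = rax + rdx;        /* addq  %rdx, %rax           t2 = z + t1     */
    rdx = rsi + rsi * 2;    /* leaq  (%rsi,%rsi,2), %rdx  3*y             */
    rdx = rdx * 16;         /* salq  $4, %rdx             t4 = 48*y       */
    rcx = 4 + rdi + rdx;    /* leaq  4(%rdi,%rdx), %rcx   t5 = x + 4 + t4 */
    rax = rax * rcx;        /* imulq %rcx, %rax           ans = t2 * t5   */
    return rax;
}
```

Notice that t3 never gets a register of its own: the third leaq folds x+4 and the addition of t4 into one instruction. Also, %rdx first holds z and is then reused for t4 once z is no longer needed.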

Some Comments on Multiplication and Division

Normal Multiplication: `imulq Src,Dest`

The normal (integer) multiply imulq Src,Dest is the analogue of addq. That is, the 64-bit (quadword) Src is multiplied by the 64-bit Dest and the (low-order 64 bits of the) product becomes the new contents of Dest.

Thanks to the miracle of 2s complement, this one instruction works for both unsigned and signed operands.

As with addq and subq, overflow is possible. Indeed, the true product can require 128 bits. There is a special operation (indeed two operations) that preserve all 128 bits of the product.

128-bit Multiplication: `mulq Src` and `imulq Src`

The 128-bit multiplies (one for signed and one for unsigned) do not fall into the pattern of the previous operations. Instead only one of the operands is specified in the instruction; the other operand must be %rax.

In addition, the location of the 128-bit result is not given in the instruction. Instead, the high-order 64 bits always go into %rdx and the low-order 64 bits go into %rax.
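
Standard C has no 128-bit integer type, but the unsigned full product can be sketched with 64-bit arithmetic by splitting each operand into 32-bit halves (the same method as multiplying two 2-digit numbers by hand). The struct and function names below are made up; the fields are named for the registers in which mulq leaves the two halves.

```c
#include <stdint.h>

/* The two halves of a 128-bit product. */
struct product128 {
    uint64_t rdx;   /* high-order 64 bits */
    uint64_t rax;   /* low-order 64 bits */
};

/* Unsigned 64x64 -> 128-bit multiply built from 32-bit pieces. */
struct product128 mul_u64_full(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Sum everything that lands in bits 32..63; its carries go to the high half. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    struct product128 r;
    r.rax = (mid << 32) | (p0 & 0xFFFFFFFFu);
    r.rdx = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

The low half r.rax is exactly what the ordinary 64-bit multiply produces; r.rdx holds the bits that the ordinary multiply throws away.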

Division (and Modulus)

There is no normal divide or modulus instruction. Instead you first place the 128-bit dividend in %rdx (high order) and %rax (low order) and then issue divq Src (for unsigned division) or idivq Src (for signed division). In either case the quotient is put into %rax and the remainder into %rdx.

64-bit Dividend

If the dividend is only 64 bits, it naturally is placed in the low-order register %rax, and %rdx should be either all zeros or all ones to act as the sign extension of %rax. The cqto instruction (convert quad word to oct word) does exactly this.

3.6 Control

So far we can do assignment statements and arithmetic. What about if/then/else or while?

3.6.1 Condition Codes

The idea is that some arithmetic (or other) operation (e.g., an add) generates a condition (e.g., a negative value) and some subsequent operation (e.g., a conditional jump) needs to know the condition from the add. The solution employed is to have every add (and other operations) set certain 1-bit condition codes that can be used by a subsequent jump instruction to decide whether to actually jump.

We will consider four condition codes; each 1-bit in size.

CF: Carry flag. Set if the most recent operation generated a carry out of the most significant bit. This is used to detect overflow for unsigned operations.
ZF: Zero flag. Set if the most recent operation had a zero result.
SF: Sign flag. Set if the most recent operation had a negative result.
OF: Overflow flag. Set if the most recent operation had a 2s complement overflow.

In fact we will mostly ignore overflow and will emphasize only ZF and SF.

Remember that the arithmetic instructions (like add and sub) can not tell if the operands are signed or unsigned so they might set either (or both) the CF and OF flags.

The condition for setting OF for an addition t=a+b is

  (a>0 && b>0 && t<0) || (a<0 && b<0 && t>=0)
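
Here is a sketch in C that computes all four condition codes for an 8-bit addition (8 bits just to keep the examples small). The struct and function names are invented. The OF test is written as "the operands have the same sign and the result has a different sign", which is the same idea as the condition above; note that when two negative values overflow the result can be zero, as in 0x80 + 0x80.

```c
#include <stdint.h>

/* The four condition codes, each one bit. */
struct flags {
    int cf, zf, sf, of;
};

/* Add two 8-bit patterns and report the condition codes the way an
   8-bit add would. The adder does not know whether the operands are
   signed or unsigned, so it computes both CF and OF. */
struct flags add8_flags(uint8_t a, uint8_t b) {
    unsigned wide = (unsigned)a + b;     /* 9-bit true sum */
    uint8_t t = (uint8_t)wide;           /* 8-bit result */
    int a_neg = a >> 7, b_neg = b >> 7, t_neg = t >> 7;

    struct flags f;
    f.cf = (wide >> 8) & 1;              /* carry out of the HOB */
    f.zf = (t == 0);
    f.sf = t_neg;
    /* Overflow: operands have the same sign and the result's sign differs. */
    f.of = (a_neg == b_neg) && (t_neg != a_neg);
    return f;
}
```

For 0x7F + 0x01 we get SF and OF set (127 + 1 does not fit in signed 8 bits) but not CF. For 0xFF + 0x01 we get CF and ZF set but not OF: viewed as unsigned, 255 + 1 overflows; viewed as signed, (-1) + 1 = 0 is fine.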

Setting Condition Codes Implicitly by the Operations of Sections 3.5.1 and 3.5.2

The lea instruction does not affect the condition codes.

Logical operations set carry and overflow to zero.

Shifts set carry to the last bit shifted out and set overflow to zero.

Inc and Dec set OF and ZF but leave CF unchanged.

Setting Condition Codes Explicitly

cmpq S1,S2 sets the condition codes the same as sub S2-S1 would set them (note S2-S1), but does not store the arithmetic result anywhere.

Similarly testq S1,S2 sets the condition codes the same as andq but does not store the result. So testq %rdx,%rdx sets ZF if %rdx is zero.

3.6.2 Accessing the Condition Codes

You can set a single byte to 0 or 1 based on certain combinations of condition codes. This enables you to save the value of the flags, which change frequently during execution. This uses the so-called setX operation, where X is replaced by an abbreviated name of the comparison desired (e.g., Less-or-Equal). For example:

sete D (set if equal): sets D (a byte) to the value in ZF
setge D (set if greater or equal, i.e. set if not less): sets D to ~(SF^OF).
See figure 3.14 in the book for the complete list.
We will de-emphasize the overflow aspect and typically assume OF is false.
So for us figure 3.14 simplifies to the table below.

In all cases the low order byte is set to zero or one and the remaining bytes are unchanged (see below for movzbq that addresses this issue).

A Table of Sets

Set Instructions (Ignoring Overflows)
Instruction	Synonym	D Becomes	Set Condition
sete D	setz D	ZF	Equal / Zero
setne D	setnz D	~ZF	Not Equal / Not Zero
sets D		SF	Negative
setns D		~SF	Nonnegative

setg D	setnle D	~SF & ~ZF	(signed) Greater
setge D	setnl D	~SF	(signed) Greater or Equal
setl D	setnge D	SF	(signed) Less
setle D	setng D	SF | ZF	(signed) Less or Equal

seta D	setnbe D	~CF & ~ZF	Above (unsigned Greater)
setae D	setnb D	~CF	Above or Equal (unsigned Greater or Equal)
setb D	setnae D	CF	Below (unsigned Less)
setbe D	setna D	CF | ZF	Below or Equal (unsigned Less or Equal)
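
The table ignores overflow. As a sketch of what the hardware does when OF is not ignored, the C code below computes the flags that cmpq S1,S2 would set and then evaluates a few set conditions using the full formulas (setl is SF^OF, setg is ~(SF^OF) & ~ZF). The struct and function names are made up; the set functions are named after the instructions they imitate.

```c
#include <stdint.h>

struct cmp_flags {
    int cf, zf, sf, of;
};

/* cmpq S1,S2: set the flags as S2 - S1 would, keeping no result. */
struct cmp_flags cmpq_flags(uint64_t s1, uint64_t s2) {
    uint64_t t = s2 - s1;
    struct cmp_flags f;
    f.cf = s2 < s1;                                   /* unsigned borrow */
    f.zf = (t == 0);
    f.sf = (int)(t >> 63);
    /* Signed overflow: the operands differ in sign and the
       result's sign differs from that of S2. */
    f.of = (int)(((s2 ^ s1) & (s2 ^ t)) >> 63);
    return f;
}

/* Overflow-aware conditions. */
int setl(struct cmp_flags f) { return f.sf ^ f.of; }
int setg(struct cmp_flags f) { return !(f.sf ^ f.of) && !f.zf; }
int setb(struct cmp_flags f) { return f.cf; }
int seta(struct cmp_flags f) { return !f.cf && !f.zf; }
```

Comparing the pattern of all ones with 1 shows why there are separate signed and unsigned conditions: as signed values -1 is less than 1 (setl gives 1), but as unsigned values the all-ones pattern is above 1 (seta gives 1). Comparing the most-negative value with 1 shows why OF matters: the subtraction overflows, SF is 0, and only SF^OF gives the right answer for "less".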

An Example Using `movzbq`

Recall that the set instruction sets a byte. We will normally want values in C longs (which are 8 bytes). The solution is to use the 1 byte register that is contained inside the desired 8 byte register. There are names for the low byte of each of the 16 registers. In particular, %al is the low byte of %rax. We then must zero out the other 7 bytes. The movzbq does this and (as mentioned previously) so does (oddly) movzbl. I mention this oddity twice since, in the following example, the compiler on access uses movzbl (not movzbq) which, were it true to its name, would not zero the high-order 4 bytes.

Below is the assembler version, followed by the register usage and then the C program.

  cmpq   %rsi, %rdi  # compare x and y
  setg   %al         # set low byte %rax
  movzbq %al, %rax   # zero out the rest

Register	Use
%rdi	Argument x
%rsi	Argument y
%rax	Return value

  long GT(long x, long y) {
    return x > y;
  }

3.6.3 Jump Instructions

The operation specifies the condition that decides whether you jump, the operand specifies the target to jump to. Elsewhere in the program you have a statement label with that target.

  .always:
    ...
    je .goHere    // jumps if ZF is set
    ...
    jge .goThere  // jumps if ~(SF^OF) evaluates true (for us, just ~SF)
    ...
    jmp .always   // unconditional jump
  .goHere:
    ...
  .goThere:
    ...

Indirect Jump

In general jmp *operand evaluates the operand and jumps to that location. This usage of * is similar to C's. For example:

jmp *%r8: Evaluates register %r8 and jumps to the location having that address.
jmp *8(%rdx,%rax): First, evaluates the operand as 8 + the contents of %rdx + the contents of %rax. Second, fetch the value in that memory location. Third, jump to the location having that address.
Let's redo the above given that: %rdx contains 0x1000, %rax contains 0x100, and location 0x1108 contains 0xABD0
First evaluate 8(%rdx,%rax) as 8+0x1000+0x100 = 0x1108. Second, read location 0x1108 and get 0xABD0. Third, jump to location 0xABD0

A Table of Jumps

Jump Instructions (Ignoring Overflows)
Instruction	Synonym	Condition	Description
jmp Label		1	Direct Jump
jmp *Operand		1	Indirect Jump

je Label	jz	ZF	Equal / zero

jne Label	jnz	~ZF	Not equal / not zero

js Label		SF	Negative
jns Label		~SF	Nonnegative

jg Label	jnle	~SF&~ZF	Greater (signed)
jge Label	jnl	~SF	Greater or Equal (signed)
jl Label	jnge	SF	Less (signed)
jle Label	jng	SF|ZF	Less or Equal (signed)

ja Label	jnbe	~CF&~ZF	Above (unsigned)
jae Label	jnb	~CF	Above or equal (unsigned)
jb Label	jnae	CF	Below (unsigned)
jbe Label	jna	CF|ZF	Below or equal (unsigned)
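
To see how the signed and unsigned jumps differ, here is a small C sketch that computes the condition codes for cmpq b, a (that is, for a - b) and then evaluates a few of the conditions. Unlike the table, the sketch keeps the overflow flag, so it uses SF^OF where the table writes just SF. The struct and function names are assumptions made for this illustration.

```c
#include <stdint.h>

/* Condition codes after "cmpq b, a", which computes a - b (hypothetical struct). */
struct flags { int zf, sf, cf, of; };

struct flags compare(int64_t a, int64_t b) {
    struct flags f;
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ua - ub;               /* wraps modulo 2^64 */
    int sign_a = a < 0, sign_b = b < 0;
    f.zf = (diff == 0);
    f.sf = (int)(diff >> 63);              /* sign bit of the result */
    f.cf = (ua < ub);                      /* unsigned borrow */
    f.of = (sign_a != sign_b) && (f.sf != sign_a); /* signed overflow of a - b */
    return f;
}

int jg(struct flags f) { return !(f.sf ^ f.of) && !f.zf; } /* signed greater  */
int jl(struct flags f) { return  (f.sf ^ f.of); }          /* signed less     */
int ja(struct flags f) { return !f.cf && !f.zf; }          /* unsigned above  */
int jb(struct flags f) { return  f.cf; }                   /* unsigned below  */
```

For a = -1 and b = 1 both jl and ja would be taken: as a signed value -1 is less than 1, but its bit pattern, read as unsigned, is the largest possible value.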

3.6.4 Jump Instruction Encodings

Skipped.

3.6.5 Implementing Conditional Branches with Conditional Control

Below left is a C program (written to look a little like assembler).
On the right is the assembler version.
The register usage is in the middle.

  absdiff:
    cmpq  %rsi, %rdi // compare x, y
    jle   .L4
    movq  %rdi, %rax // x > y
    subq  %rsi, %rax
    ret
  .L4:               // x <= y
    movq %rsi, %rax
    subq %rdi, %rax
    ret

Register	Use
%rdi	x
%rsi	y
%rax	ans

  long absdiff (long x, long y) {
    long ans;
    if (x > y)
      ans = x - y;
    else
      ans = y - x;
    return ans;
  }
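
For comparison, here is a sketch of the same function rewritten in "goto C" so that each statement lines up with one or two assembler instructions. The name absdiff_goto and the label L4 are chosen only to mirror the assembler; this is an illustration, not compiler output.

```c
/* Sketch: absdiff in goto form, mirroring the assembler's control flow. */
long absdiff_goto(long x, long y) {
    long ans;
    if (x <= y)        /* cmpq %rsi, %rdi ; jle .L4 */
        goto L4;
    ans = x - y;       /* movq %rdi, %rax ; subq %rsi, %rax */
    return ans;        /* ret */
L4:
    ans = y - x;       /* movq %rsi, %rax ; subq %rdi, %rax */
    return ans;        /* ret */
}
```

Note that the test is inverted relative to the original C: the assembler jumps when x <= y and falls through to the x > y case.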

3.6.6 Implementing Conditional Branches with Conditional Moves

Instead of jumping to the correct case, it is sometimes faster to evaluate both possibilities and then move the right one to the answer. This is because, in modern machines, pipelining is very important and conditional branches (when mispredicted) break the pipeline.

Below left is the C program (written to look a little like assembler). On the right is the assembler version. The register usage is in the middle.

  absdiff:
    movq   %rdi, %rax
    subq   %rsi, %rax  // ans = x-y
    movq   %rsi, %rdx
    subq   %rdi, %rdx  // tmp = y-x
    cmpq   %rsi, %rdi  // cmp x : y
    cmovle %rdx, %rax  // if <=,
    ret                //   overwrite

Register	Use
%rdi	x
%rsi	y
%rax	ans
%rdx	tmp

  long absdiff (long x, long y) {
    long ans;
    if (x > y)
      ans = x - y;
    else
      ans = y - x;
    return ans;
  }

Bad Cases for Conditional Moves

If both computations are expensive, it is inefficient to evaluate both.
If one or both computations would be unsafe when not selected, it is unsafe to evaluate both.
If one or both computations have side effects, the unselected side effect occurs when not desired.
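
A short C sketch of a good case and a bad case. The function names are hypothetical. The first function computes both candidates and then selects one, which is the pattern a conditional move implements; the second shows a computation that must not be evaluated when it is not selected.

```c
#include <stddef.h>

/* Good case: both x - y and y - x are cheap and have no side effects. */
long absdiff_both(long x, long y) {
    long ans = x - y;      /* computed unconditionally */
    long tmp = y - x;      /* also computed unconditionally */
    if (x <= y)
        ans = tmp;         /* plays the role of cmovle */
    return ans;
}

/* Bad case: *p is unsafe when p is NULL, so both arms cannot be
   evaluated in advance; a real branch is needed here. */
long read_or_zero(const long *p) {
    return p ? *p : 0;
}
```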

O'H-3.6.7 Loops

Do-While Loops

The assembly language does not include a do-while construct. Hence we re-write the C using an if and a goto.

  long count (unsigned long x) {
    long ans = 0;
  loop:
    ans += x & 0x1;
    x >>= 1;
    if (x)
      goto loop;
    return ans;
  }

  
===>>>

  long count (unsigned long x) {
    long ans = 0;
    do {
      ans += x & 0x1;
      x >>= 1;
    } while (x);
    return ans;
  }

Counting 1 Bits in X

For this problem we are given a C long and wish to determine how many of its 64 bits are one. The algorithm is clear; check each bit and, if it is 1, increment ans.

First we show the easy conversion from normal C to goto C. We simply replace the do-while by an if plus a goto.

  count:
    movq  $0, %rax     // ans = 0
  .Loop:
    movq  %rdi, %rdx
    andq  $1, %rdx     // tmp = x & 1
    addq  %rdx, %rax   // ans += tmp
    shrq  %rdi         // x >>= 1
    jne   .Loop        // if (x)
    ret                //   goto loop

Register	Use
%rdi	x
%rax	ans
%rdx	tmp

Next we generate the assembler from the "goto C". For this simple program, that is an easy task and is shown on the right. Also shown is the usual register assignment table.

Note, however, that even in this easy program, one can stumble. Specifically, it is easy to mistakenly use sarq instead of shrq, thereby introducing an infinite loop: if the high-order bit of x is 1, sarq keeps shifting in 1s so x never becomes 0.
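
The difference between the two shifts can be checked with a small C sketch. Since right-shifting a negative signed value is implementation-defined in C, the arithmetic shift is modeled explicitly on an unsigned value; the names shr1, sar1, and steps_to_zero are assumptions for this illustration.

```c
#include <stdint.h>

/* Logical right shift by 1 (like shrq): a 0 enters at the top. */
uint64_t shr1(uint64_t x) { return x >> 1; }

/* Arithmetic right shift by 1 (like sarq), modeled portably:
   the old top bit is copied back into the top position. */
uint64_t sar1(uint64_t x) { return (x >> 1) | (x & 0x8000000000000000ULL); }

/* Number of shifts until x becomes 0, or -1 if that has not
   happened after limit shifts. */
int steps_to_zero(uint64_t x, uint64_t (*shift)(uint64_t), int limit) {
    for (int steps = 0; steps <= limit; steps++) {
        if (x == 0)
            return steps;
        x = shift(x);
    }
    return -1;
}
```

Starting from a value whose top bit is set, shr1 reaches 0 after at most 64 shifts, whereas sar1 converges to all 1s and stays there, which is the infinite loop mentioned above.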

While Loops

The idea is to convert a while loop into a do-while loop. All that is needed is to deal with testing the loop condition on entry to the loop.

We will consider two (fairly similar) methods and apply them to the same simple example as above.

Note that for our simple example, the same Body can be used with while and with do-while.

    goto test;
  loop:
    Body
  test:
    if (Test)
      goto loop;
  done:

  ===>>>

  while (Test)
     Body

Jump to the Middle

The idea is to keep the Body before the test as in do-while, but jump over Body when you first enter the loop.

On the right we show the C version of a generic while and then its conversion to do-while.

  long count (unsigned long x) {
    long ans = 0;
    goto test;
  loop:
    ans += x & 0x1;
    x >>= 1;
  test:
    if (x)
      goto loop;
    return ans;
  }

  
      ===>>>

  long count (unsigned long x) {
    long ans = 0;
    while (x) {
      ans += x & 0x1;
      x >>= 1;
    }
    return ans;
  }

On the right we show the conversion for the specific program used to count 1 bits. You can see how close the while and do-while are.

  long count (unsigned long x) {
    long ans = 0;
    if (!x) goto done;
  loop:
    ans += x & 0x1;
    x >>= 1;
    if (x)
      goto loop;
  done:
    return ans;
  }

  
      ===>>>

  long count (unsigned long x) {
    long ans = 0;
    while (x) {
      ans += x & 0x1;
      x >>= 1;
    }
    return ans;
  }

Guarded Do

The second conversion method is to introduce an initial test and goto to the do-while version to reproduce the while behavior.

In the example shown on the right the transformation is applied blindly, which results in an obvious inefficiency: Specifically, the first goto jumps to a return.

Any decent compiler would replace this goto with a return. It is normally silly to jump to a jump.

For Loops

Covered in recitation. Slides on home page.

3.6.8 Switch Statements

Slides on home page.

Basic Idea

I will start by showing the basic idea. Then I will give an example with actual assembly code.

  if (x==13)
    // do thing13;
  else if (x==22)
    // do thing22;
  else if (x==5)
    // do thing5
  ...
  else
    // do default;

  if (x==1)
    // do thing1;
  else if (x==2)
    // do thing2;
  else if (x==3)
    // do thing3
  ...
  else
    // do default;

You can always treat switch(x) as a big if-then-else, something like what is shown on the far right above. A disadvantage is that if you have n different cases you will execute on average about n/2 tests before you are successful.

The time when a switch statement is particularly efficient is when the various cases are selected by a (nearly) contiguous range of integers.

Highlighted in yellow we show the simplest case where the possible values for x are 1,2,3,... .

In this case we construct a jump table, i.e., a table of jumps. The first jump in the table jumps to thing1, the second to thing2, etc. See the diagram above in light green. You use the value in x to jump to the correct jump in the table and from there you jump to the correct thing.

Note that even if there are hundreds of things, you execute only two jumps: one into the table and then one to the correct thing.

Detailed Example

  // the jump table
    .align 8
  .JmpTbl:
    jmp    .L10
    jmp    .L11
    jmp    .L12
    jmp    .Ldefault
    jmp    .L1415
    jmp    .L1415

  // The switch code
  // preamble, assuming
  // s is stored in %rdi
  // ans stored in %rax
  .switch_eg:
    movq  $1, %rax  // ans = 1
    subq  $10, %rdi // index = s - 10
    cmpq  $5, %rdi
    ja    .Ldefault
    jmp   .JmpTbl(,%rdi,8)
  // the "things" go here

Note the possibilities.

Fall through cases
Here: s=11
Missing cases
Here: s=13
This gets the default action
Multiple case labels
Here: s=14 & 15
Out of bounds
Here: s<10, s>15
These get the default action

long sw_eg
  (long s, long y,
   long z) {
  long ans=1;
  switch (s) {
  case 10:
    ans = y+z;
    break;
  case 11:
    ans = y-z;
    // fall through
  case 12:
    ans += 7;
    break;
  case 14:
  case 15:
    ans = z-y;
    break;
  default:
    ans = 2;
  }
  return ans;
}

Above we first see a simple C case statement. Next to it we note that this simple example covers many possibilities.

Then we show the beginning of the switch code. We first initialize ans = 1; and then handle the out-of-bounds cases. Note that the ja (jump above) uses an UNsigned comparison so catches both %rdi>5 and %rdi<0. The jmp instruction jumps to a location 8*%rdi bytes past the .JmpTbl, making the assumption that each jmp instruction in the table is 8 bytes long.

Finally, we see the jump table. In this simple implementation the jump table is just a table of jumps to labels corresponding to the cases in the switch statement. So the switch statement translates to a jump into the table and then a jump to the specific case. We will soon see a better implementation that reduces these two jumps to just one (indirect) jump. This new implementation will also drop the (perhaps incorrect) assumption that jmp instructions are 8 bytes long.

Later we will show the assembly code for the various cases, which I referred to as things above.
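
The single unsigned comparison that catches both out-of-bounds directions can be sketched in C as follows. The function name needs_default is hypothetical; the constants 10 and 5 come from the example (cases 10 through 15).

```c
/* Sketch: one unsigned test handles both s < 10 and s > 15. */
int needs_default(long s) {
    unsigned long index = (unsigned long)s - 10UL; /* like subq $10, %rdi */
    return index > 5UL;                            /* like cmpq $5, %rdi ; ja .Ldefault */
}
```

When s is below 10 the subtraction wraps around to a huge unsigned value, so the same "above 5" test that rejects s > 15 rejects it as well.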

Reducing Two jmp's to One jmp*

  .align 8
.JmpTbl:
  .quad .L10      // s = 10
  .quad .L11      // s = 11
  .quad .L12      // s = 12
  .quad .Ldefault // s = 13
  .quad .L1415    // s = 14 
  .quad .L1415    // s = 15
  // replace jmp  .JmpTbl(,%rdi,8)
  // with    jmp  *.JmpTbl(,%rdi,8)

Instead of jumping to the correct jump, we can use a single jmp*, a so-called indirect jump. The idea is that instead of a table of jumps with the i^th jump targeting the i^th thing, we have, as shown on the right, a table of addresses, where the i^th address is the address of the i^th thing. We still have to write the things. When we do, the thing for s==10 will have statement label .L10, etc.

As mentioned above, the indirect jump is quite a powerful instruction. When executed it first does the address calculation specified in the instruction (for us multiplying %rdi by 8 and adding the address specified by the label .JmpTbl). It then accesses the resulting address and reads its contents. Finally, it jumps to the address just read. Phew.

Another advantage is that we can specify (via .quad) that each address will be 8 bytes in length (as needed by the jmp*).

The Things

We will show each case separately. The C code is on the right; the assembler code is in the middle with a yellow background; and the register assignments are in a table on the left.

  ans = 1;
  switch(s) {
  case 10:  // .L10
    ans = y+z;
    break;
  ...
  }

  .L10:
      movq  %rsi, %rax  # y
      addq  %rdx, %rax  # y + z
      ret               # return

Register	Use
%rdi	s
%rsi	y
%rdx	z
%rax	ans

Case 10

We will see very soon that the specific registers chosen were not arbitrary. There are definite conventions that we must follow.

  switch(s) {
  ...
  case 11:  // .L11
    ans = y-z;
    // fall through
  case 12:  // .L12
    ans += 7;
    break;

  .L11:
      movq  %rsi, %rax  # y
      subq  %rdx, %rax  # y - z
      jmp   .Merge1112
  .L12:
      movq  $1, %rax    # ans = 1
  .Merge1112:
      addq  $7, %rax
      ret               # return

Register	Use
%rdi	s
%rsi	y
%rdx	z
%rax	ans

Case 11 and case 12

Note that when case 12 is selected we don't want the effect of case 11. Hence we re-initialize ans.

Case 13

There is no case 13 so we use the default.

  switch(s) {
  ...
  // multiple cases 
  case 14:  // .L1415
  case 15:
    ans = z-y;
    break;

  .L1415:
      movq  %rdx, %rax  # z
      subq  %rsi, %rax  # z - y
      ret               # return

Register	Use
%rdi	s
%rsi	y
%rdx	z
%rax	ans

Case 14 and case 15

Since these cases are identical, both jump table entries point to the same label, .L1415.

    switch(s) {
    ...
    default:   // .Ldefault
      ans = 2;
      break;
  }

  .Ldefault:
      movq  $2, %rax  # ans = 2
      ret

Register	Use
%rdi	s
%rsi	y
%rdx	z
%rax	ans

Default case

The default case: ans = 2;
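
Putting the pieces together, the table of addresses can be imitated in C with an array of function pointers, one entry per value of s from 10 to 15. This is only an analogy under stated assumptions: the function names (thing10, sw_eg_table, and so on) are invented here, each "thing" is written as a separate function that returns ans, and the result for each case follows the assembler shown for that case.

```c
/* Sketch: a jump table as an array of function pointers (names hypothetical). */
static long thing10(long y, long z)      { return y + z; }            /* case 10 */
static long thing11(long y, long z)      { return (y - z) + 7; }      /* case 11, falls into 12 */
static long thing12(long y, long z)      { (void)y; (void)z; return 1 + 7; } /* ans was 1 */
static long thing1415(long y, long z)    { return z - y; }            /* cases 14 and 15 */
static long thing_default(long y, long z){ (void)y; (void)z; return 2; }

typedef long (*thing_fn)(long, long);

/* Entry i handles s == 10 + i; s == 13 has no case, so it gets the default. */
static const thing_fn jump_table[6] = {
    thing10, thing11, thing12, thing_default, thing1415, thing1415
};

long sw_eg_table(long s, long y, long z) {
    unsigned long index = (unsigned long)s - 10UL;
    if (index > 5UL)                 /* out of bounds in either direction */
        return thing_default(y, z);
    return jump_table[index](y, z);  /* one indirect transfer, like jmp* */
}
```

However many cases there are, the dispatch costs one bounds check and one indexed lookup rather than a chain of comparisons.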

3.7 Procedures

Slides on home page.

What Happens When Procedure f() Calls Procedure g()?

Memory allocation and de-allocation.

In our examples to date the local variables for g() were placed in registers.
But there are only 16 registers; if g() has more than 16 active local variables, we need to allocate memory space for the excess when g() starts and de-allocate this space when g() returns.

Transfer of control.
1. From the call point in f() to the beginning of g().
2. From the return point in g() to right after the call point in f().
Data movement.
1. Arguments in f() are passed to the corresponding parameters in g().
2. If g() returns a value, it is passed back to f()

We treat these three issues in turn

3.7.1 The Run-Time Stack

Memory allocation/de-allocation for variables local to a procedure follows a stack-like discipline. That is

While f() is running prior to calling g(), the local variables of f() can be accessed (but those of g() cannot).
When g() is called its local variables must now be accessible (but any old values they had are not restored). So the call of g() by f() necessitates space be set aside for g()'s local variables.
When g() eventually returns these local variables are no longer accessible and their previous values are lost. We can give back the space used for them and reuse this space later for other purposes.
Draw on the board the situation for f() calls g() calls h().
Conclusion: A stack like discipline is perfect for the run-time stack.

The Mechanisms of Stacks / Push / Pop / Top / LIFO

The basics from 101/102 will be enough.

Choices / Idiosyncrasies of the Intel Run-Time Stack

When I think of a (non-linked) stack, I visualize a column that gets taller on pushes and shorter on pops. That is, I view the stack as being grounded at a low place and growing and shrinking at its highest address.

The x86-64 run-time stack has the opposite properties: As indicated in the diagram on the upper right, the stack's fixed bottom is at a very high address. It is sort of fixed in the sky; it grows and shrinks by having its other end, its top, get lower and higher.

When implementing a stack, the designer must also decide whether top points to the location containing the last element inserted, or the space where the next element will go.

The x86-64 run-time stack uses the first technique. Hence a pop() first retrieves the value and then increments top, whereas a push() first decrements top and then stores the value.

Another choice made for the Intel stack is to dedicate a register (a precious commodity) to holding the top-of-stack pointer. Specifically, %rsp (register-stack-pointer) is used for this purpose.

So to push onto the stack the value currently in %rdx one could write.

  subq  $8, %rsp
  movq  %rdx, (%rsp)  // assume %rdx contains 0x12FA

This results in the picture on the bottom right showing a bigger stack and a smaller value in the stack pointer %rsp. Register %rdx itself is unchanged, but its value is now at the top of the stack. A pop would retrieve that value.

In fact there is a single instruction pushq SRC that both decrements %rsp and inserts SRC on the top of the stack. In our case
pushq %rdx
accomplishes the desired stack push.
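
The push just described, together with the matching pop, can be modeled in C. This is a toy sketch under stated assumptions: the struct toy_stack, its 16-word size, and the function names are invented for illustration, addresses are byte offsets into the array, and there is no overflow or underflow checking.

```c
#include <stdint.h>

/* Toy model: 16 eight-byte words; rsp is a byte offset that starts
   just past the high end, so the stack grows toward lower addresses. */
struct toy_stack {
    uint64_t words[16];
    uint64_t rsp;
};

void stack_init(struct toy_stack *s) {
    s->rsp = sizeof s->words;          /* empty stack: rsp at the "bottom" */
}

void toy_pushq(struct toy_stack *s, uint64_t value) {
    s->rsp -= 8;                       /* subq $8, %rsp      */
    s->words[s->rsp / 8] = value;      /* movq value, (%rsp) */
}

uint64_t toy_popq(struct toy_stack *s) {
    uint64_t value = s->words[s->rsp / 8]; /* read the current top */
    s->rsp += 8;                           /* then shrink the stack */
    return value;
}
```

Pushing 0x12FA onto an empty toy stack lowers rsp from 128 to 120, and a later pop returns 0x12FA and restores rsp to 128.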

Naturally there is also a popq. Both pushq and popq require that %rsp be used as the stack pointer. This last comment leads to the following table.

Start Lecture #17

Conventional Uses for the x86-64 Registers

Register Name	Conventional Use	Register Name	Conventional Use
%rax	Return Value	%r8	5th argument
%rbx	callee saved	%r9	6th argument
%rcx	4th argument	%r10	caller saved
%rdx	3rd argument	%r11	caller saved
%rsi	2nd argument	%r12	callee saved
%rdi	1st argument	%r13	callee saved
%rbp	callee saved	%r14	callee saved
%rsp	stack pointer	%r15	callee saved

The table on the right gives the conventional use of the 16 registers in the x86-64 architecture. With a very few exceptions, these are not enforced (or used) by the hardware but more by compilers.

In most cases it would not matter which registers were assigned to which purpose, providing it was done consistently. It would not work if, for example, compilers put the first argument in %r11 but looked for the first parameter to be in %r13.

Two examples where it does matter to the hardware are the 128-bit multiply and divide instructions (which use %rax and %rdx) and the pushq/popq pair (which uses %rsp); in each, specific registers have specific hardware functions.

Caller Saved vs. Callee Saved Registers

Assume we are studying the assembler code of g(), which was called by f(). If a register is labeled callee saved, and it is altered by g() (the callee), g() must save the register and restore it before returning since f() (the caller) is permitted to assume this behavior.

In contrast, if the register is labeled caller saved, g() can alter it and not restore the original value since that was the responsibility of f(), the caller.

Note that caller-saves implies callee-(might)-destroy.

Argument and Return Value Are Caller Saved

In addition to the registers explicitly listed as caller saved, all 6 registers used for arguments and the one register used for the return value may be altered by the callee, g() in our example. Hence these seven should also be considered caller save.

`f()` Calls `g()` Calls `h()`

Consider writing the function g() in the situation where f() calls g() and g() calls h(). In this (common) case g() may need to save every register that it modifies, both caller saved and callee saved. Explain why.

Two Trivial Examples: Addition

  long sum2(long x, long y) {
    return x+y;
  }
  long sum3(long x, long y,
            long z) {
    return x+y+z;
  }

  sum2:
    leaq (%rdi,%rsi), %rax
    ret
  sum3:
    addq %rsi, %rdi
    leaq (%rdi,%rdx), %rax
    ret

Register	Use
%rdi	x
%rsi	y
%rdx	z
%rax	retVal

The assembler was obtained via
cc -Og -S simple.c

Without the -Og the assembly language produced would have been much more complicated.

Notes:

The register conventions almost write the entire program.
sum3() overwrites the first parameter %rdi. Fear not, parameters are caller-saved registers so sum3() need not preserve their value.

Another Simple Example, But With a Trick

  long add2(long, long);
  void addStore(long x, long y,
                long *dest) {
    long t = add2(x,y);
    *dest = t;
  }

  addstore:
    pushq %rbx
    movq  %rdx, %rbx
    call  add2
    movq  %rax, (%rbx)
    popq  %rbx
    ret

Register	Use
%rdi	x
%rsi	y
%rdx	dest

Assume add2() is compiled separately and, like sum2(), calculates the sum of 2 integers. The trick in addstore() is that it needs to put that sum in the memory location given in dest. This seems easy since dest is given in register %rdx.

The trouble is that, since %rdx is caller-saved, add2() might change its value, hence addstore() must save %rdx before calling add2() and must restore it after that call. The simplest way would be to use the stack. The compiler chose instead to save %rdx in %rbx (a callee-saved register) and save/restore %rbx on the stack.

The third argument (in %rdx, naturally) is an address. Notice how it is enclosed in () to access memory. (Actually, as just mentioned, %rdx is copied into %rbx, which is then placed in parentheses).

addstore() is an example of a middle function g in the triple
f()->g()->h(). See how it treats the caller-saved %rdx and the callee-saved %rbx.

Start Lecture #18

Last Simple Example

  // return sum; set diff
  long sumDiff(long a, long b,
               long *diff) {
    *diff = a - b;
    return a + b;
  }

  sumDiff:
    movq %rdi, %rax
    subq %rsi, %rax
    movq %rax, (%rdx)
    leaq (%rdi,%rsi), %rax
    ret

Register	Use
%rdi	a
%rsi	b
%rdx	diff
%rax	tmp ret

We know sumDiff will need to set %rax (which is caller-saved) to the returned value. But before calculating the returned value, it can use that register as a temporary.

Note that it requires two instructions to set %rax=%rdi-%rsi, since subq has only two operands and overwrites its destination.

Yet Another Example: Multiplication

  long mult2 (long, long);
  void multStore(long x, long y,
                 long *dest) {
      long t = mult2(x, y);
      *dest = t;
  }
  long mult2 (long a, long b) {
      long ans = a * b;
      return ans;
  }

  <multStore>:
    pushq  %rdx         # caller save
    callq  mult2        # mult2(x,y)
    popq   %rdx         # restore reg
    movq   %rax,(%rdx)  # store at dest
    ret                 # return
  <mult2>:
    movq   %rdi,%rax    # a
    imulq  %rsi,%rax    # a * b
    ret                 # return

Register	Use
%rdi	x a
%rsi	y b
%rdx	dest
%rax	ans

The C program on the far right is simple and so is the assembly.

One point to note is the first two parameters multStore() receives are the same (and in the same order) as the first two arguments multStore() passes to mult2(). Were they in the reverse order, some movq's would be needed.

Another point is that, since %rdx is caller-saved, mult2() can destroy it and hence multStore() must save and restore it.

A third point is that, since %rdx contains dest (an address), the assembly syntax (%rdx) corresponds to *dest.

Homework: Assume the C call mult2(x,y) was instead mult2(y,x). What would the assembly language for multStore() look like? Do not make use of the mathematical identity
x * y = y * x

Stack Frames When `f()` Calls `g()`

The diagram on the far right shows the run-time stack just before f() calls g(). The diagram on the near right shows the stack after g() has begun execution.

The green region is the portion of the stack associated with the current invocation of f(). (If f is recursive there can be several stack frames for f(), but ignore that for now).

The blue region is for functions that are higher in the call chain leading to f().

When f() actually calls g(), the first thing that happens is that the return address (the address in f() where g() is to return when finished) is pushed on the stack and is momentarily the top-of-stack. The return address is considered part of f()'s stack frame.

What is in a Stack Frame?

The stack frame for g() (or any other function) typically contains three groups of items.

Saved registers. We have seen that certain registers are callee saved, which means f() can depend on them containing the same values when g() returns as they contained when f() called g(). So, if g() needs to modify any of those registers (perhaps to perform some computation), it needs to save them someplace and restore them when g() returns to f().
Local Variables. This is where variables local to g() are stored. These variables do not exist either before or after g() is executed (again ignore recursion for now).
Argument build area. This is used to calculate arguments if g() is to call another function say h() (remember that arguments are caller save registers). Also if g() is to call a function with more than 6 arguments, the excess must be stored on the stack.

The stack does not normally contain instructions to be executed. The instructions are stored in a different part of the memory and do not change during execution.

Note: It is possible for some parts of the stack frame for g() to be empty. Indeed, some functions g() don't need a stack frame at all. For example if g() doesn't call another function, there is no argument build area. If, furthermore, g() is simple, its local variables and computation may fit in the registers designated for its use (we shall discuss these caller-save registers, and their callee-save counterparts soon).

O'H-3.7.2 Control Transfer

Transfer of control from f() to g() is accomplished by the procedure call
callq target
which

first pushes the return address on the stack.
1. reduces %rsp by 8.
2. stores the address after the call (e.g., 0x401500) on the stack (in the location just added by the push).
then jumps to the target (i.e., sets %rip to e.g., 0x400550).

Eventually, the called program mult2() returns by executing a retq, which

Pops the stack.
1. Obtains the value at the current top-of-stack (i.e., obtains 0x40549 from location 0x108).
2. Increases %rsp by 8.
Jumps to the location popped (0x40549)
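
The two halves of a procedure call can be modeled in C in the same spirit. This is a toy sketch: struct toy_cpu, its 32-word stack, and the addresses used in the usage note are assumptions for illustration, and no bounds checking is done.

```c
#include <stdint.h>

/* Toy model of the two registers involved in call and return. */
struct toy_cpu {
    uint64_t rip;        /* address of the next instruction      */
    uint64_t rsp;        /* byte offset of the current stack top */
    uint64_t stack[32];  /* stack memory, 8 bytes per word       */
};

/* callq target: push the address of the instruction after the call,
   then transfer control to target. */
void toy_call(struct toy_cpu *c, uint64_t return_addr, uint64_t target) {
    c->rsp -= 8;
    c->stack[c->rsp / 8] = return_addr;
    c->rip = target;
}

/* retq: pop the return address and jump to it. */
void toy_ret(struct toy_cpu *c) {
    c->rip = c->stack[c->rsp / 8];
    c->rsp += 8;
}
```

For example, with rsp initially 256, toy_call(&c, 0x1005, 0x2000) leaves rip at 0x2000 and rsp at 248; a following toy_ret(&c) sets rip to 0x1005 and rsp back to 256.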

Note: In the examples we have seen, the target has been a label. This is the common situation and is the one we will emphasize. Also possible, however, is an indirect call
callq *operand
where operand is one of the address forms we have seen above (the most complicated being Disp(Rb,Ri,S)). In the callq *operand case, the jump is to the address that is the contents of operand.
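
In C the natural source of an indirect call is a call through a function pointer, since the target is not known until run time. A minimal sketch follows; the function apply is hypothetical, and add2 and mult2 are given trivial bodies here only so the example is complete. The exact instructions depend on the compiler and its options.

```c
long mult2(long a, long b) { return a * b; }
long add2(long a, long b)  { return a + b; }

/* The callee is chosen by the caller at run time, so the compiler
   typically emits an indirect call through a register here. */
long apply(long (*target)(long, long), long x, long y) {
    return target(x, y);
}
```

For instance, apply(mult2, 6, 7) returns 42 and apply(add2, 6, 7) returns 13, using the same call site in apply().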

A Simple Example Step by Step

Show the animation on slides 11-14, which corresponds to the multstore()/mult2() example just given.

O'H-3.7.3 Data Transfer

Register Name	Conventional Use	Register Name	Conventional Use
%rax	Return Value	%r8	5th argument
%rbx	callee saved	%r9	6th argument
%rcx	4th argument	%r10	caller saved
%rdx	3rd argument	%r11	caller saved
%rsi	2nd argument	%r12	callee saved
%rdi	1st argument	%r13	callee saved
%rbp	callee saved	%r14	callee saved
%rsp	stack pointer	%r15	callee saved

In the x86-64 architecture, the primary method of data transfer between the calling procedure (f() above) and the called procedure (g() above) is via machine registers used to transmit arguments in the caller to the corresponding parameters in the callee. In the other direction the return value in the callee is transmitted to the function value in the caller, again using a register. As mentioned above and repeated to the right, specific registers are designated for these purposes.

Thanks to these conventions if f(), containing a call g(x,y), is compiled on Monday in LA using a California C compiler, the values of x and y will be stored in %rdi and %rsi respectively. Then, if on Wednesday
g(long a, long b)
is compiled in Newark using a NJ compiler, g() will retrieve the values for a and b from %rdi and %rsi respectively.

3.7.4 Local Storage on the Stack

The first choice for local variables (in g() say) is to use some of the leftover registers (since registers can be accessed much faster than stack elements). However, if g() is complex, it probably has more local variables than would fit in the available registers.

A second reason for storing local variables in memory rather than a register is that the & operator (in C) may have been used. Remember, that when &var is used in C, the compiler is required to provide the address of var.

A third reason for stack usage is for large objects like arrays and structures.

A Simple Example: `incr()`

  long incr(long *p,
            long val) {
    long x = *p;
    long y = x + val;
    *p = y;
    return x;
  }

  incr:
    movq  (%rdi), %rax
    addq  %rax, %rsi
    movq  %rsi, (%rdi)
    ret

Register	Use
%rdi	p
%rsi	val, y
%rax	x, ret

The incr() function is like x++ in that it increases *p and returns the old (pre-incremented) value.

Note that the C code is not what you would normally write; rather it is there to help understand the assembler. In particular, C programmers would not have the variable y. Instead, they would write simply *p = x+val;

A Simple (But Not Completely Trivial) Example: `call_incr()`

  long call_incr() {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return v1 + v2;
  }

  call_incr:
    subq  $8, %rsp
    movq  $15213, (%rsp)
    movq  $3000, %rsi
    leaq  (%rsp), %rdi
    call  incr
    addq  (%rsp), %rax
    addq  $8, %rsp
    ret

Register	Use
%rdi	&v1
%rsi	3000
%rax	v2, ret

See slides 20-24 for diagrams illustrating the execution of call_incr. In those slides the lines in red have just been executed.

Brief Description of Slides

Slide #1
- The initial stack structure shows the result of the call to call_incr() (not the call to incr(), which has not yet occurred). In particular the return address on the stack is the address that call_incr() will return to. That address is not the address of anything on this slide.
- Since the C code includes &v1, we need a memory address for v1. The only memory addresses we can currently use are those on the stack so we push the value 15213 on the stack and will pass that stack address to incr() as the value of &v1.
- However, we don't use the pushq instruction since pushq has subtle effects when a constant is pushed. We will only use pushq to push register contents, in which case there are no subtleties. So we push the constant 15213 manually, i.e., we decrement the stack pointer in one instruction and store the constant in the second.
Slide #2
- The second slide establishes the two arguments to incr() in reverse order.
- I believe the first argument could have been established with the simpler movq %rsp, %rdi
Slide #3
- The third slide shows the effect of incr() having been run
- The in-memory (on stack) value of v1 has been incremented, as shown.
- Recall that incr() returns the un-incremented value of v1. The returned value is (as always) returned in %rax, which is the current home of v2.
Slide #4
- The updated v1 is added to the value of v2.
- The stack is popped to remove v1 resulting in %rsp now pointing to the return address.
Slide #5
- The last slide shows the return accomplished.
- The stack is restored to its original state before the call.
- Now %rax contains the return value of call_incr().
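
The values on the slides can be checked by running the C code itself. The block below simply repeats incr() and call_incr() with comments giving the value each variable holds at that point, so the arithmetic can be followed alongside the slides.

```c
long incr(long *p, long val) {
    long x = *p;          /* old value */
    long y = x + val;
    *p = y;               /* store the incremented value */
    return x;             /* return the old value */
}

long call_incr(void) {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);  /* afterwards v1 is 18213 and v2 is 15213 */
    return v1 + v2;             /* 18213 + 15213 = 33426 */
}
```

Because &v1 is taken, v1 must live in memory (the stack slot reserved by subq $8, %rsp), whereas v2 can stay in %rax.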

O'H-3.7.5 Local Storage in Registers

As mentioned a common method of transferring values from the calling program (say f()) to the called program (say g()) is for f() to put a value in an agreed upon register and for g() to access that register. For constants this works without issues.

But what if we want to share a variable x that both f() and g() use? If f() puts x into a register and calls g(), is g() permitted to say increment the register?

The answer is yes and no.

Caller-save Registers vs. Callee-save Registers.

Consider the situation where f() calls g(x) which in turn calls h(x). g() must answer two questions.

Can I update x freely or must I be sure to restore x to its original value before returning to f()?
Will h() alter x or can I assume the value of x is the same after I call h() as it was before I called h()

Some registers, e.g. %r12, are designated as callee-save. This means that, answering question 1 above, g() (the callee) must restore the register to the value it had when f() called g(). It also means that, answering question 2, g() can assume that h() restores the register to the value it had when g() called h().

Other registers are designated as caller-save. For these registers, answering question 1 above, g() need not restore the register before returning to f(). It also means that in answering question 2 g() cannot assume that h() restores the register when h() returns to g().

Our table of registers lists 6 registers as callee-save and only 2 as caller-save, but this is misleading. The 6 registers designated for arguments and the one register designated for the return value are also caller-save, for a total of 2+6+1=9 caller-save registers.

An Example with Callee-Save

  long call_incr2(long x) {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return x+v2;
  }

  call_incr2:
    pushq  %rbx
    subq   $8, %rsp
    movq   %rdi, %rbx
    movq   $15213, (%rsp)
    movq   $3000, %rsi
    leaq   (%rsp), %rdi
    call   incr
    addq   %rbx, %rax
    addq   $8, %rsp
    popq   %rbx
    ret

Register	Use
%rdi	x
%rax	ret
%rbx	tmp
%rsp	stk

The compiler likes to use %rbx for a temporary. Since it is callee-saved, call_incr2 must save the register before using it and must restore its original value before returning. The good news, however, is that call_incr2 can assume that incr will not change %rbx (more precisely, incr will restore any changes it makes).
The parameter x is saved in %rbx to be used in the eventual return.
A stack slot is reserved by subq $8, %rsp. It is used to hold the value of v1 so that call_incr2() can pass the address of v1 to incr().
The two arguments to incr() are then placed in the required registers and the call is made.
As required incr() has placed its result in %rax, to which call_incr2 adds x (saved in %rbx) to produce its own result.

3.7.6 Recursive Procedures

What is the Issue?

Look at the recursive routine pcount() below. It counts the total number of 1 bits in x. I realize that pcount() can be easily written without recursion.

When pcount calls pcount, we have two different x's. In particular the x in the child is a right-shifted version of the x in the parent. We need both and cannot overwrite one with the other.

If all the bits of x are 1, the recursion will go on for 64 levels and we must keep all that information around using only 16 registers.

Not So Bad

  /* Recursive popcount */
  long pcount
    (unsigned long x) {
    if (x == 0)
      return 0;
    else
      return (x & 1) +
             pcount(x >> 1);
  }

  pcount:
    movq    $0, %rax
    testq   %rdi, %rdi
    je      .L6
    pushq   %rbx
    movq    %rdi, %rbx
    andq    $1, %rbx
    shrq    %rdi
    call    pcount
    addq    %rbx, %rax
    popq    %rbx
  .L6:
    ret

Register	Use
%rdi	x
%rax	ret
%rbx	tmp

The assembly code is only about a dozen instructions and uses only 3 registers.

The first three instructions take care of the non-recursive (base) case where x==0.
Then save %rbx (which is callee-saved) on the stack and use that register to calculate x & 1.
Then shift x (i.e., %rdi) right 1 bit as required and call ourselves recursively.
Do the required addition, which updates the running sum from our child to include our bit of x.
Restore %rbx (callee-saved) and return the result already in %rax.

What is the Secret Sauce?

The stack.

Do the computation on the board with x=00...00101 binary (= 0x0000000000000005).

Imagine redoing it with x=0xFFFFFFFFFFFFFFFF.

There would still be only about a dozen instructions in the program (several executed many times) and still only 3 registers would be used. However, many different values of %rbx would be pushed on to the stack and subsequently popped off and used. At one point (when all the calls are done, but none of the returns) there would be about 64 values on the stack.

The register saving conventions (caller/callee) prevent one invocation of the function from altering registers that another invocation still is using.
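The stack growth described above can be observed from C. The sketch below instruments pcount() with a depth counter. The globals depth and max_depth and the loop version pcount_loop() are hypothetical additions for illustration and are not part of the original routine.

```c
/* Recursive popcount from the notes, instrumented with a depth counter.
   depth and max_depth are hypothetical additions for this illustration. */
static int depth = 0;
int max_depth = 0;

long pcount(unsigned long x) {
    if (x == 0)
        return 0;                 /* base case: nothing is pushed */
    depth++;
    if (depth > max_depth)
        max_depth = depth;
    long bit  = x & 1;            /* plays the role of %rbx */
    long rest = pcount(x >> 1);   /* the child sees the shifted x */
    depth--;
    return bit + rest;
}

/* The same computation without recursion, as the notes say is easy. */
long pcount_loop(unsigned long x) {
    long n = 0;
    while (x != 0) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}
```

Each level that increments depth corresponds to one saved copy of %rbx. With all bits of a 64-bit x set, max_depth reaches 64, one per bit.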

O'H-3.8 Array Allocation and Access

O'H-3.8.1 Basic Principles

Array Storage

  long A[5];   // 8B each
  char *B[5];  // 8B each
  double C[5]; // 8B each
  int D[5];    // 4B each
  float E[5];  // 4B each
  short F[5];  // 2B each
  char G[5];   // 1B each

Sizes: On the right are the sizes for elements of each C type.
Alignment: As we learned, types requiring n bytes of storage must be aligned so that their address is a multiple of n. Note that the type refers to the base type. For example, G is an array of chars so G[0] can begin at any byte address. Similarly, F[0] can begin at any even-numbered byte address.
Contiguity: Arrays are stored contiguously. That is, A[1] starts in the very next byte after A[0] ends.
Hence aligning A[0] properly guarantees that all A[i] will be properly aligned.
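The alignment and contiguity claims can be checked directly. This is a small sketch, assuming a platform (such as x86-64 Linux) where a long is aligned to its own size. The helper names is_aligned and check_long_array are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 if addr is a multiple of n, else 0. */
int is_aligned(const void *addr, size_t n) {
    return (uintptr_t)addr % n == 0;
}

/* 1 if every A[i] is aligned and starts right where A[i-1] ends. */
int check_long_array(void) {
    long A[5];
    for (int i = 0; i < 5; i++) {
        if (!is_aligned(&A[i], sizeof(long)))
            return 0;
        if (i > 0 &&
            (const char *)&A[i] != (const char *)&A[i - 1] + sizeof(long))
            return 0;
    }
    return 1;
}
```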

O'H-3.8.2 Pointer Arithmetic

Consider the declarations
long A[10], *p, i;
If we reference A[i] and increment i, we reference the next element of A, which is 8 bytes further in memory.
In C, the same thing occurs with pointers. If we write
p = &A[2]; p++;
again p advances not by one, but by eight, the space (in bytes) used for one long.
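A short sketch that measures the step. The helper names byte_distance and step_of_long_pointer are hypothetical; the distance is taken through char * so that it is counted in bytes rather than in elements.

```c
#include <stddef.h>

/* Distance in BYTES between two addresses, measured through char *. */
ptrdiff_t byte_distance(const void *from, const void *to) {
    return (const char *)to - (const char *)from;
}

/* How many bytes does a long * move when we execute p++ ? */
ptrdiff_t step_of_long_pointer(void) {
    long A[10];
    long *p = &A[2];
    long *before = p;
    p++;                          /* p now points at A[3] */
    return byte_distance(before, p);
}
```

On x86-64 Linux sizeof(long) is 8, so the step is 8 bytes, matching the scale factor 8 in the assembly that follows.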

Trivial Example: Obtaining an Array Element

  // Array access
  long arrElt(long z[], long idx) {
    return z[idx];
  }

  movq (%rdi,%rsi,8), %rax
  ret

Register	Use
%rdi	&z[0]
%rsi	idx
%rax	ret

Notes:

The desired element is at location
%rdi + 8*%rsi
The contents of this location is simply (%rdi,%rsi,8)

Another Simple Example Showing Off a Complicated Memory Address Form

  // Array addition
  void arrAdd(long A[],
              long B[],
              long C[]) {
    long i;
    for (i=0; i<10; i++)
	A[i] = B[i] + C[i];
  }

  arrAdd:
	xorl  %eax, %eax
  .L2:
	movq  (%rdx,%rax,8), %r8
	addq  (%rsi,%rax,8), %r8
	movq  %r8, (%rdi,%rax,8)
	incq  %rax
	cmpq  $10, %rax
	jne   .L2
	ret

Register	Use
%rdi	&A[0]
%rsi	&B[0]
%rdx	&C[0]
%rax	i
%r8	tmp

Notes:

The exclusive OR sets %rax to zero.
Each memory operand adds 8*i to the start of its array, which is perfect for subscripting elements of size 8.
We need the temporary register since no x86-64 instruction can reference two memory locations.

Start Lecture #19

O'H-3.8.3 Nested (i.e., Multidimensional) Arrays

Consider the declaration of twoD on the right. It is a two dimensional array of longs. As we saw earlier in the course, it can be viewed as a matrix with 2 rows and three columns. It is stored contiguously as shown in the bottom of the diagram.

In C, as in many languages (Fortran is a well-known exception), a 2D array is stored in row major order; each row is stored contiguously. When viewed as a matrix it is stored the way a book is read in English.

What Is the Address of twoD[i][j]?

Let's do the general case where twoD has R rows and C columns, and each element of the array requires K bytes. Let A be the address of twoD[0][0]. Then the address of twoD[i][j] is

A + i * (C * K) + j * K = A + (i*C + j) * K

A key to understanding the somewhat cryptic-looking formula is that C * K is the space required to store one complete row of the matrix.
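The formula can be compared against what the compiler actually does. The sketch below uses the 2 by 3 shape from the example. The array contents and the names ROWS, COLS, offset_2d, and formula_matches are assumptions made for illustration.

```c
#include <stddef.h>

#define ROWS 2
#define COLS 3

/* Byte offset of element [i][j] from the start of an array having
   `cols` columns of K-byte elements: (i*cols + j) * K. */
size_t offset_2d(size_t i, size_t j, size_t cols, size_t K) {
    return (i * cols + j) * K;
}

/* 1 if the formula agrees with the compiler for every element. */
int formula_matches(void) {
    long twoD[ROWS][COLS] = { {1, 5, 7}, {2, 4, 6} };  /* assumed contents */
    const char *A = (const char *)&twoD[0][0];
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            if (A + offset_2d(i, j, COLS, sizeof(long)) !=
                (const char *)&twoD[i][j])
                return 0;
    return 1;
}
```

For example, with 8-byte longs the element in row 1, column 2 sits (1*3 + 2) * 8 = 40 bytes past A.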

Accessing Multidimensional vs. Multi-Level Arrays in Assembler

    long oneD1[3] = {1, 5, 7},
         oneD2[3] = {2, 4, 6},
        *twoDv2[2] = {oneD1, oneD2};
                // = {&oneD1[0], &oneD2[0]}

We did this when learning C. See section K&R-5.9. C hides much of the details; now they will all come out.

The idea is that, instead of using a 2D matrix of elements, we can implement a 2D array as a 1D array of pointers to 1D arrays of elements.

To access the entry with value 7 in the nested array we would write twoD[0][2].

To access the entry with value 7 in the multi-level array we would write twoDv2[0][2], i.e., we write the same thing.

But they are implemented quite differently in assembler. The first version first does some arithmetic (see above) to calculate the needed address and then accesses that address. The second version accesses one element of twoDv2 to find (a pointer to) the correct oneD array and then accesses the appropriate entry in that array. The summary is that the first requires more arithmetic; the second an extra memory access.
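Here are the two organizations side by side. The declaration of twoD is in a figure that is not reproduced here, so its contents below are an assumption chosen to match oneD1 and oneD2. The accessor names get_nested and get_multilevel are hypothetical.

```c
/* Nested array: six longs stored contiguously (contents assumed). */
long twoD[2][3] = { {1, 5, 7}, {2, 4, 6} };

/* Multi-level array: two pointers, each to a separate 1D array. */
long oneD1[3] = {1, 5, 7},
     oneD2[3] = {2, 4, 6},
    *twoDv2[2] = {oneD1, oneD2};

/* One memory access; the address is twoD + (i*3 + j)*sizeof(long). */
long get_nested(long i, long j) {
    return twoD[i][j];
}

/* Two memory accesses: load the row pointer, then load the element. */
long get_multilevel(long i, long j) {
    return twoDv2[i][j];
}
```

The C source of the two accessors is identical apart from the array name; only sizeof reveals the difference, since twoD occupies six longs while twoDv2 occupies two pointers.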

See slide set machine-level-5.pptx slides number 11-13 for a larger example and more details

O'H-3.8.5 Fixed-Size Arrays

Skipped

O'H-3.8.6 Variable-Size Arrays

Skipped

O'H-3.9 Heterogeneous Data Structures

O'H-3.9.1 Structures

Arrays in Structures and the Ultimate Memory Addressing Mode

  struct st {
    long a;
    long b[10];
    long c[10];
  };
  void f(struct st *s) {
    long i;
    for (i=0; i<10; i++)
	s->b[i] = s->c[i];
  }

  f:
	movq  $0, %rax
	jmp   .L2
  .L3:
	movq  88(%rdi,%rax,8), %rdx
	movq  %rdx, 8(%rdi,%rax,8)
	addq  $1, %rax
  .L2:
	cmpq  $9, %rax
	jle	  .L3
	ret

Register	Use
%rdi	s
%rax	i
%rdx	tmp

The C code on the right is a simple loop copying one array to another, each of which happens to be part of the same structure. A pointer to this structure is the sole parameter of f().

Note that the address IN s (not of s) is the address of s->a. Also s->b[0] is located 8 bytes after the address in s and s->c[0] is 80 bytes after that.

Admire the last two movq's in the assembly code, which contain the most complicated memory address form.

Notes:

The for loop uses a jump-to-the-middle technique to ensure a test is done initially.
Since %rdi contains s, (%rdi) contains *s, which is s->a.
Eight bytes further 8(%rdi) is s->b[0] and s->b[i] is 8i bytes after that, which is 8(%rdi,%rax,8).
Finally, s->c[i] is 80 bytes after s->b[i], which explains 88(%rdi,%rax,8).
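The displacements 8 and 88 can be confirmed with offsetof. This is a small sketch; the helper names offset_b and offset_c are made up for this example, and the specific numbers assume 8-byte longs as on x86-64 Linux.

```c
#include <stddef.h>

struct st {
    long a;
    long b[10];
    long c[10];
};

/* Byte offset of s->b[i] from the address held in s. */
size_t offset_b(size_t i) {
    return offsetof(struct st, b) + i * sizeof(long);
}

/* Byte offset of s->c[i] from the address held in s. */
size_t offset_c(size_t i) {
    return offsetof(struct st, c) + i * sizeof(long);
}
```

With 8-byte longs, offset_b(i) is 8 + 8i and offset_c(i) is 88 + 8i, which are exactly the forms 8(%rdi,%rax,8) and 88(%rdi,%rax,8).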

Traversing a Linked List

  struct rec {
    long a[3];
    long i;
    struct rec *next;
  };
  void set_val
    (struct rec *r, long val) {
    while (r) {
	  long i = r->i;
	  r->a[i] = val;
	  r = r->next;
    }
  }

  .L3:                          # loop:
    movq   24(%rdi), %rax       # i = Mem[r+24]
    movq   %rsi, (%rdi,%rax,8)  # Mem[r+8*i] = val
    movq   32(%rdi), %rdi       # r = Mem[r+32]
    testq  %rdi, %rdi           # test r
    jne    .L3                  # loop if r != 0
    ret

Register	Use
%rdi	r
%rsi	val
%rax	i

On the right we see a small program involving a strange data structure. We have a C struct, containing an array a[] of three longs and another long i, indicating which of the 3 elements of the array is the active element. These structs are linked together via a next pointer.

The set_val() function is given a pointer to one of these structures (presumably the head of a list) and another long val. The goal is to set all the active elements in the list to val.

Try not to be confused by the fact that the pink diagram uses hex addresses whereas the assembler uses decimal. I used hex for the diagrams since all entries are multiples of 8, which is most readily expressed in hex.

The assembly program is simple but studying the addressing is worthwhile. The first line grabs i (remember decimal 24 is hex 18); the second line updates the i^th entry of a; finally we loop if next is not NULL.

Remark: The program assumes that the list has at least one entry, i.e., it assumes r≠NULL.
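A minimal sketch for trying set_val() on a two-node list and for checking the offsets 24 and 32 used by the assembly. The helpers off_i, off_next, and demo are hypothetical. Note that the C while loop tests r before the first iteration, so the C version itself tolerates an empty list.

```c
#include <stddef.h>

struct rec {
    long a[3];
    long i;
    struct rec *next;
};

/* Same loop as in the notes. */
void set_val(struct rec *r, long val) {
    while (r) {
        long i = r->i;
        r->a[i] = val;
        r = r->next;
    }
}

/* Offsets the assembly relies on: 24 and 32 with 8-byte longs/pointers. */
size_t off_i(void)    { return offsetof(struct rec, i); }
size_t off_next(void) { return offsetof(struct rec, next); }

/* Hypothetical demo: a two-node list whose active slots are 0 and 2.
   Returns n1.a[0] + n2.a[2] after the call, i.e. 2*val. */
long demo(long val) {
    struct rec n2 = { {0, 0, 0}, 2, NULL };
    struct rec n1 = { {0, 0, 0}, 0, &n2 };
    set_val(&n1, val);
    return n1.a[0] + n2.a[2];
}
```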

O'H-3.9.2 Unions

Skipped.

O'H-3.9.3 Data Alignment (and Padding)

  struct stt {
    char c1;
    long l1;
    char c2;
    long l2;
  } ss, *pstt;

How do we align ss, which is a struct stt? First we look at the components: c1 and c2 are each 1 byte and can be aligned on any byte. However, l1 and l2 are each 8 bytes and hence must be aligned on an 8-byte boundary. That is the address of each one must be a multiple of 8.

The four components of the structure have 2 different alignment requirements. The rule employed is that the structure itself must be aligned to conform to the strictest alignment of its components, which in this case says that every variable of type struct stt must be aligned on an 8-byte boundary.

So ss begins on an 8-byte boundary. c1 can begin anywhere; so far so good. But l1 must be aligned on an 8-byte boundary and that means we need 7 bytes of (wasted) padding. This repeats for c2 and l2.

Look how much better it lays out if we put first the bigger components (with the more stringent alignment requirements). The compiler is not permitted to change the order of components; the programmer must do it.
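The cost of the padding shows up in sizeof. In the sketch below struct stt is from the notes. The reordered version struct stt_reordered and the two helper functions are hypothetical names introduced for comparison.

```c
#include <stddef.h>

struct stt {               /* member order from the notes */
    char c1;
    long l1;
    char c2;
    long l2;
};

struct stt_reordered {     /* hypothetical reordering: big members first */
    long l1;
    long l2;
    char c1;
    char c2;
};

size_t size_original(void)  { return sizeof(struct stt); }
size_t size_reordered(void) { return sizeof(struct stt_reordered); }
```

On x86-64 Linux I would expect 32 bytes for the original (7 bytes of padding after each char) and 24 for the reordered version (6 bytes of padding at the end, so that arrays of the struct keep every element aligned).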

O'H-3.10 Combining Control and Data in Machine-Level Programs

skipped

O'H-3.11 Floating Point Code

skipped

O'H-3.12 Summary

Most programming is done in high level languages. In this chapter we looked under the covers at the level of instructions that the computer actually executes.

3.A Memory Layout (Linux x86-64)

The diagram on the right gives an overview of the memory regions in a running program under Linux. We will study this in more detail next chapter. For now we just want to mention examples of each region type.

Green Regions (Heap & Stack)

These regions can grow during execution, which requires extra attention by the system so that one such region does not overwrite another. They contain read/write data. The heap grows when malloc() is called; the stack grows when register %rsp is decremented, which occurs on most procedure invocations. We will see shared libraries soon.

Yellow Regions (Text & Shared Libraries)

These regions, which contain executable/non-writable instructions, do not grow and do not change their contents during execution. The text region contains the compiled assembler instructions we have studied. Shared libraries are a memory-saving idea that permits many running programs to share, for example, the library routine printf().

Pink Region (Data)

This region holds statically allocated data, including global variables, static variables, and string constants. Unlike local variables inside C functions, which come and go as procedures are called and return, these variables remain for the lifetime of the execution.

Chapter O'H-7 Linking

O'H-7.1 Compiler Drivers

The program cc (or gcc) is technically more than a compiler. It invokes a series of programs that step by step transform your C source program into a form executable by the computer hardware. Since cc controls or drives the compilation process it is sometimes called a compiler driver.

We have already seen that cc includes a compiler translating C code to assembly language and an assembler translating assembly to actual computer instructions. Those are the first two arrows in the diagram on the right.

Now we want to go further in the diagram.

Remark: Although we normally use cc to compile C programs, it can also be used to translate assembly language into native machine instructions. Indeed you can enter the pipeline at any point and can leave at any point downstream from the entry. The entry point is determined by the name of the input file: cc x.c starts at the first arrow, cc x.s starts at the second, etc. The exit point is determined by flags to the cc command: cc -S x.c stops after the first arrow (producing x.s); cc -c x.c stops after the second (producing x.o).

O'H-7.2 Static Linking

We first study static linking, which is the vertical line at the right of the diagram. Static linking is performed by a program normally called the linker, which is invoked automatically by the compiler driver cc. In linux/unix the linker program is (unfortunately) called ld. If you execute cc without options, ld is run on all the object files produced by the assembler to produce a single load file, which can be executed.

Note: The linker was originally called a linkage editor by IBM.

Historical note: The linker on Unix was mistakenly called ld (for loader), which is unfortunate since it links but does not load.

Unix was originally developed at Bell Labs; the seventh edition of Unix was made publicly available (perhaps earlier ones were somewhat available). The 7th ed man page for ld begins (see https://cm.bell-labs.com/7thEdMan).

    .TH LD 1
    .SH NAME
    ld \- loader
    .SH SYNOPSIS
    .B ld
    [ option ] file ...
    .SH DESCRIPTION
    .I Ld
    combines several
    object programs into one, resolves external
    references, and searches libraries.

By the mid 80s the Berkeley version (4.3BSD) man page referred to ld as link editor and this more accurate name is now standard in Unix/Linux distributions.

During the 2004-05 fall semester a student wrote to me:

BTW - I have meant to tell you that I know the lady who wrote ld. She told me that they called it loader, because they just really didn't have a good idea of what it was going to be at the time.

What is (Static) Linking

Linking is the process of collecting and combining various pieces of code and data into a single file that can be loaded into memory and executed.

Why Linkers: What is the Advantage?

Modularity
- Write a large application as a collection of small .c files, rather than as one giant .c file.
- Support libraries of common utilities
  - The math library implements trig and many other functions.
  - The string and i/o libraries that we have used in many applications.
Efficiency
- Recompile only the .c files that changed rather than the entire application.
- Only store the math/io/string/etc libraries once in the entire system.

Remarks on Midterm Grades

The midterm grade sent to the registrar is based solely on the midterm exam. Anyone who did not take the exam received a UE (unable to evaluate) as a midterm grade.
I do NOT give + or - grades as midterm grades.
I DO give + or - grades as final grades.
The midterm grades were assigned 90-100:A, 80-89:B, etc

7.2.A The Big Picture: What Do Linkers Do?

            file main.c
  #include <stdio.h>
  int x = 10;
  void f(void);

  int main(int argc, char *argv[]) {
    printf("main says x is %d\n", x);
    f();
  }
            file f.c
  #include <stdio.h>
  extern int x;

  void f(void) {
    int y = 20;
    printf("f says x is %d\n", x);
    printf("f says y is %d\n", y);
  }

Short Answer: Resolve External References and Relocate Relative Addresses

For a simple example of what the linker needs to do, consider the small example on the right consisting of two files main.c and f.c, which are compiled separately.

When compiling main.c, the compiler does not know the contents of f.c. In particular it does not know how big the resulting compiled code for f() will be. Since it doesn't know how much room to leave for f(), it leaves no room and assumes that main() starts at the beginning of the compiled code.
Similarly, when compiling f.c the size of main.c is not known so the compiler assumes that f() is at the beginning of the compiled code.
At least one of these assumptions must be wrong! One of the functions will have to come after the other one.
Put another way the local addresses generated by the compiler when compiling each file are based on the starting address for that compilation. In each case this base address is assumed to be zero (or some other constant) which can't be right for both compilations.
The terminology used is that these addresses are relative to the base address for the given compilation.
The linker sees the (compiled version of) both files and decides which goes first, second, etc.
Assuming the module to be placed first starts at the beginning of memory, the absolute address for each local address in that module is equal to its relative address.
For the module to be placed second, the size of the first module must be added to each relative address to obtain the corresponding absolute address.
So one task for the linker is to relocate relative addresses into the corresponding absolute addresses.
The second problem to be solved is that, when compiling main.c, the compiler does not know where the function f() will be loaded and, when compiling f.c, the compiler does not know where x will be loaded.
In both cases the symbol in question is external to the module being compiled, and the linker is required to resolve these external references.

The diagram on the far right illustrates relocating relative addresses. Specifically, it shows that the relocation constant is calculated as the sum of the lengths of the preceding modules. Once the relocation constant C is known, each absolute address in the module is calculated simply as the corresponding relative address + C.

The diagram on the near right illustrates resolving external references. In this case the reference is to f(). Note that the Base of M4 is the same as its relocation constant, i.e., the sum of the lengths of the preceding modules.

Two Passes Help

Note from the diagram on the near right, that the linker encounters the required address jump f before it knows the necessary relocation constant.

The simplest solution (but not the fastest) is for the linker to make two passes over the modules. During pass 1 the relocation constants for each module are determined and a symbol table is produced giving the absolute address for each global symbol. During pass 2, references to external addresses are resolved using the symbol table constructed during pass 1.
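The two passes can be sketched in a few lines of C. This is a toy model, not how ld is actually implemented: each module defines at most one global symbol and references at most one external symbol, and all the names (struct module, struct symbol, link_two_pass, MAX_SYMS) are invented for this illustration.

```c
#include <string.h>

#define MAX_SYMS 16

struct module {
    long length;           /* size of the module in bytes */
    const char *defines;   /* global symbol defined here, or NULL */
    long def_rel_addr;     /* its address relative to the module base */
    const char *uses;      /* external symbol referenced, or NULL */
    long resolved;         /* pass 2 output: absolute address of `uses` */
};

struct symbol {
    const char *name;
    long abs_addr;
};

/* Returns 0 on success, -1 if some external reference is undefined. */
int link_two_pass(struct module *m, int n) {
    struct symbol table[MAX_SYMS];
    int nsyms = 0;
    long base = 0;   /* relocation constant: sum of preceding lengths */

    /* Pass 1: compute relocation constants and build the symbol table. */
    for (int k = 0; k < n; k++) {
        if (m[k].defines && nsyms < MAX_SYMS) {
            table[nsyms].name = m[k].defines;
            table[nsyms].abs_addr = base + m[k].def_rel_addr;
            nsyms++;
        }
        base += m[k].length;
    }

    /* Pass 2: resolve each external reference using the table. */
    for (int k = 0; k < n; k++) {
        if (!m[k].uses)
            continue;
        int found = 0;
        for (int s = 0; s < nsyms; s++) {
            if (strcmp(table[s].name, m[k].uses) == 0) {
                m[k].resolved = table[s].abs_addr;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
    }
    return 0;
}
```

For instance, if a 100-byte module containing main references f, and f is defined at relative address 10 in the next module, pass 1 records f at absolute address 110 and pass 2 patches the reference in main, even though main was seen before the base of the module containing f was known.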

O'H-7.3 Object Files/Modules

Relocatable Object File/Module (a `.o` File)

Contains code and data in a form that can be combined with other .o files to form an executable object file.
Each .c file is compiled (and assembled) into a corresponding .o file.

Executable Object File/Module (the `a.out` File)

Contains code and data in a form that can be loaded directly into memory and executed.
They used to be called load files.
The linker produces a load file by linking together .o files.

Shared Object File/Module (a `.so` File)

Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run time.
Also called a Dynamic Link Library (DLL), especially in M/S Windows.
See diagram at beginning of the chapter and section 7.10 below.

ELF: Executable and Linkable Format (ELF)

Standard format for binary object files including
- Relocatable object files (.o).
- Executable object files (a.out).
- Shared object files (.so)
Generic name: ELF binaries.
Originally proposed for AT&T System V Unix.
Used by Linux and most modern Unix systems.

O'H-7.4 Relocatable Object Files

The table on the right tabulates the information contained in an ELF binary. The various fields contain the following information.

ELF Relocatable Object File
Elf header
Segment header table
.text section
.rodata section
.data section
.bss section
.symtab section
.rel.text section
.rel.data section
.debug section
Section header table

ELF header: Word size, byte ordering, machine type, etc., location of section header table.
Segment header table: Page size, segment sizes, et al (needed only for executables).
.text: The machine code of the compiled module.
.rodata: Read-only data, e.g., format strings.
.data: Initialized global and static C variables.
.bss: Uninitialized global and static variables; the file stores just the length; the loader fills the region with zeros.
.symtab: Symbol table used for linking.
.rel.text: Relocation info for .text section. Addresses of instructions that will need to be modified during linking.
.rel.data: Relocation info for .data section. Addresses of pointer data that will need to be modified during linking.
.debug: Info for symbolic debugging, e.g. local symbol table (gcc -g)
.line: Line numbers for use with gcc -g
.strtab: Used by .symtab and .debug

O'H-7.5 Symbols and Symbol Tables

The linker symbols (which are stored in the linker symbol table) come in three flavors.

Global symbols
- These are symbols defined in the current module that can be referenced by other modules.
- Examples include non-static C functions and non-static global variables.
External symbols
- Global Symbols that are referenced by the current module but defined in another module.
Local symbols
- Symbols defined in one module but used in many functions within that module.
- Examples include static C functions and static external (i.e., file-scope) variables.
NON-example
- C variables local to a function
- The linker doesn't deal with these

O'H-7.6 Symbol Resolution

The left figure above shows a very simple C program that nonetheless exercises many of the linker's abilities.

Note the distinction between global references, which define symbols, and external references that need to be resolved by the linker to equate to the corresponding global.

Also note that bufp1, although known only to swap.c, must be given a slot by the linker. Since it is static, bufp1 must maintain its value across multiple calls to swap() and hence, unlike tmp, cannot be stored on the stack.

As an optimization the executable file does not contain space for bufp1, just a count of how much space to reserve when the executable is loaded.

We see in the right figure that the linker combines corresponding segments from each object file when producing the executable that is the linker's output.

O'H-7.6.1 How Linkers Resolve Duplicate Symbol Names

Strong and Weak Symbols

Recall that declarations give just the type of an identifier. This tells the compiler how to interpret the identifier, but does not necessarily reserve space for the identifier. Declarations that reserve storage are called definitions.

External definitions (e.g., procedures and initialized globals) are considered strong symbols by the linker.
External declarations and un-initialized globals are considered weak symbols by the linker.
Internal declarations or definitions are not considered at all by the linker.

  file f1.c:
    int svar1=5;
    int sfun1(int x) {
      code
    }

  file f2.c:
    int wvar1;
    int wfun1(int z);
    int sfun2(void) {
      int igsym1=3;
    }

Looking at the code on the right

Functions sfun1() and sfun2(), and variable svar1 are each strong symbols.
In contrast wvar1 and wfun1() are weak symbols.
Finally, igsym1 is ignored (actually not seen) by the linker

The linker obeys the following rules.

There can be only one strong symbol with a given name. If more than one strong symbol has the same name, the linker reports a multiply defined symbol error.
If there are one or more weak symbols with the same name as a strong symbol, the linker resolves all references to that name to the strong symbol.
If there are one or more weak symbols with the same name, but no strong symbol with that name, the linker picks one of the weak symbols and resolves all references to that one.
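The three rules can be written down as a small decision procedure. This is only a model of the rules as stated, not the linker's real code. The types and the function resolve() are hypothetical, and where the rules say the linker picks one of the weak symbols, this sketch arbitrarily takes the first.

```c
#include <string.h>

enum strength { WEAK, STRONG };

struct def {
    const char *name;
    enum strength s;
};

/* Returns the index in defs[] of the definition chosen for `name`,
   -1 if there is no definition at all, or
   -2 for the multiply-defined-symbol error of rule 1. */
int resolve(const struct def *defs, int n, const char *name) {
    int chosen = -1;
    for (int k = 0; k < n; k++) {
        if (strcmp(defs[k].name, name) != 0)
            continue;
        if (defs[k].s == STRONG) {
            if (chosen >= 0 && defs[chosen].s == STRONG)
                return -2;        /* rule 1: two strong symbols */
            chosen = k;           /* rule 2: strong beats any weak */
        } else if (chosen < 0) {
            chosen = k;           /* rule 3: this sketch picks the first weak */
        }
    }
    return chosen;
}
```

One caution when experimenting: recent GCC releases default to -fno-common, under which two uninitialized definitions of the same global in different files are reported as a multiple definition rather than merged. The rules above describe the traditional behavior.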

A Few C Language Examples

  int x;          int x=7;         Both x's are the same; the second is strong.
  f1() {...}      f2() {...}       The second x is chosen.

  int x;                           Two strong symbols have the same name, f1.
  f1() {...}      f1() {...}       Link time error.

  int x;          int x;           Both x's are the same; each is weak.
  f1() {...}      f2() {...}       *Either* could become the location for x.

  int x=7;        double x;        The first x is strong and is chosen.
  int y=5;        f2() {...}       Writes to x in f2() WILL overwrite y!
  f1() {...}                       Scary!

  int x;          double x;        Both x's weak; either might be chosen
  int y;          f2() {...}       Writes to x in f2() MIGHT overwrite y!
  f1() {...}                       Maximum terror!!

Start Lecture #20

Remark: The final for this class will be on zoom, just like the midterm. I mistakenly copied something for 202 into our class notes for 201. The erroneous comment has been removed.

O'H-7.6.2 Linking with Static Libraries

Static vs. Dynamic Linking

The figure in 7.1 contains two kinds of libraries: statically-linked libraries that are processed by the linker and dynamically-linked libraries (DLLs) processed by the loader. How do they differ?

Static Linking

You know well that when your programs run, some functions are executed that you did not write (e.g. printf()). Many common routines are placed in libraries that the linker searches by default.

Two Bad Ideas:

Too coarse and too fine

Too coarse grained: all the library's functions are in one big .o file.
- Inefficient: Link in many functions when only a few are needed. For example, printf() is in libc.a, which contains over 1500 functions.
Too fine grained: Each function in a separate .o file.
- Inconvenient: May need to explicitly name dozens of library files (printf.o, getchar.o, strcpy.o, etc.) in the cc command.

One Good Idea:

One (or a few) big file(s), but (each) with an index stating which functions are contained and where in the file they are located, a so-called static library. A static library has a .a suffix because it is an archive of several .o files.

When linking a program the linker (ld) automatically searches (the index in) the standard C archive libc.a. If a function referenced by the user program is found in the index, the corresponding .o file is extracted from the archive and linked with the program.

So 1500+ functions are available in just one archive file, but only the dozen or so actually referenced are linked in.

Since cc knows to search libc.a, no further user action is needed. Actually cc knows to search a number of standard archives. For other library routines, the user must tell the linker to search the corresponding archive by giving the appropriate option to the cc command.

Creating a Static Library (a `.a` file)

Assume you (as system administrator) have compiled all the .c files you want to put in the standard library libc.a. Then you would execute one very long archive (ar) command, which concatenates the files and constructs the index. Specifically, you would write

  ar rs libc.a atoi.o printf.o strcpy.o random.o ... (1500+ entries)

O'H-7.6.3 How Linkers Use Static Libraries to Resolve References

The linker makes one pass over the .a and .o files in the order that the names were given on the command line. During this process, the linker maintains a list of currently unresolved references.

When a .a file is encountered in the list, the linker searches the archive's index and links in any file containing a definition for a currently unresolved reference.

If any references remain after the linker has completed its scan of the files mentioned on the command line, the linker prints an error message and the command fails.

For this reason the compiler drivers cc and gcc search the standard libraries after processing the files you mentioned explicitly on the command line.

  // f.c
  void f(void) {
    return;
  }

  // callf.c
  void f(void);
  int main (int argc, char *argv[]) {
    f();
  }

  $ cc -c f.c
  $ ar rs libf.a f.o
  $ cc -c callf.c
  $ cc libf.a callf.o
  callf.o: In function `main':
  callf.c:(.text+0x10): undefined reference to `f'
  collect2: error: ld returned 1 exit status
  $ cc callf.o libf.a
  $ ./a.out

A Simple Example Requiring Care

The example begins with a do-nothing function f().
Next we have a main program that invokes f.
Now we execute some simple commands, all of which work as expected.
- Compile f().
- Create an archive containing just f().
- Compile the main program in callf.c.

So far so good, we compiled a main program and a utility program and have placed the latter in an archive. You might want to think of f() as printf(), and the archive as libc.a.

Now look at the fourth command, which attempts to link f() with the main program in callf.o.

This command fails asserting that it cannot resolve the reference to f().

But why? We just put f() in libf.a and included the latter in our link command (remember that the compiler driver cc invokes the linker, ld).

The trouble is that the linker looks at libf.a before it looks at callf.o and thus sees no need to link in f().

When the compiler driver is given the archive at the end of the command (as in the last line), the linker detects the need for f() before processing the archive and hence extracts f.o and all is well.
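The order dependence falls out of a one-pass scan that keeps a list of unresolved references. The sketch below is a toy model under simplifying assumptions: each file defines at most one symbol and references at most one, and an archive is represented by its single member. All names (struct file, find, link_scan, MAX) are invented for illustration.

```c
#include <string.h>

#define MAX 16

struct file {
    int is_archive;        /* 1: archive member, linked only on demand;
                              0: .o file, always linked */
    const char *defines;   /* symbol defined, or NULL */
    const char *needs;     /* symbol referenced, or NULL */
};

/* Index of name in set[0..n-1], or -1. NULL entries are skipped. */
static int find(const char *set[], int n, const char *name) {
    for (int k = 0; k < n; k++)
        if (set[k] && strcmp(set[k], name) == 0)
            return k;
    return -1;
}

/* Scan the files left to right, as on the command line.
   Returns the number of references still unresolved at the end. */
int link_scan(const struct file *f, int n) {
    const char *defined[MAX];
    const char *unresolved[MAX];
    int nd = 0, nu = 0;

    for (int k = 0; k < n && nd < MAX && nu < MAX; k++) {
        /* Skip an archive member unless it satisfies a pending reference. */
        if (f[k].is_archive &&
            (!f[k].defines || find(unresolved, nu, f[k].defines) < 0))
            continue;
        if (f[k].defines) {
            defined[nd++] = f[k].defines;
            int u = find(unresolved, nu, f[k].defines);
            if (u >= 0)
                unresolved[u] = NULL;          /* now resolved */
        }
        if (f[k].needs &&
            find(defined, nd, f[k].needs) < 0 &&
            find(unresolved, nu, f[k].needs) < 0)
            unresolved[nu++] = f[k].needs;
    }

    int count = 0;
    for (int k = 0; k < nu; k++)
        if (unresolved[k])
            count++;
    return count;
}
```

Feeding the model the archive member first and then callf.o leaves f unresolved; reversing the order leaves nothing unresolved, mirroring the two cc commands above.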

O'H-7.7 Relocation

Now that we have shown how the linker resolves external references, we turn to its other main function, relocation. Recall that the linker receives as input a number of independently produced object files. Each of these object files likely contains a .text section (containing the executable code), a .data section (containing the read/write initialized data), and several other sections.

The linker combines all the .text sections into one, and similarly for the other sections. We gave an overview of this action when we discussed relocating relative addresses in section O'H-7.2.A. We are skipping the detailed explanation given in O'H-7.7.1 and O'H-7.7.2.

O'H-7.7.1 Relocation Entries

Skipped.

O'H-7.7.2 Relocating Symbol References

Skipped.

O'H-7.8 Executable Object Files

As indicated in section 7.6 the linker combines the .text (compiled code) sections from all the .o files into one such section for the executable file. Similarly all the .rodata sections are combined. The linker also generates a .init section containing system code run to start the program.

The diagram on the right shows all three in yellow. They are often referred to as read-only code, even though they contain data. The unifying property is that, when the program executes, the yellow section is read-only.

Similarly, the pink region is produced containing all the .data (initialized data) sections as well as all the .bss (uninitialized data) sections. These two sections are in pink and represent data that may be rewritten during execution.

Note: The .bss data are actually initialized to zero by the loader. They are called uninitialized since the initial value itself is not in the elf file (since it is known to be zero).

The Yellow and Pink Segments are Fixed Size, But Not the Green

The yellow and pink segments remain their original size during execution. This makes their run-time placement in RAM easy; just put one after the other. But we know programs grow (and shrink) in size during execution: We can call malloc()/free() and we have seen stack pushes and pops during function invocations and exits. Loading regions that grow comes next.

O'H-7.9 Loading Executable Files

Unlike the previous diagrams, the figure on the right does not show the contents of a file, but rather the contents of memory during execution of the user program.

As mentioned the yellow and pink sections are easy to load since they don't change size and don't need to move. We use green to indicate regions that can grow.

We have already met two of the green regions (the stack and heap), and have seen them grow during execution. We will discuss shared libraries next section.

If you ignore the middle green region, we have a great situation: one region is at high addresses, the other is at low addresses, and they grow toward each other. If they grow into each other the program aborts, but for a good reason: it needs more memory than we have.

But with three growing regions, no matter where we place them, a situation can arise where two of them collide, but there is still space available; the free space is in the wrong place.

Who Can Afford This Much Memory?

The previous comment about three regions is correct, but seems of no practical importance since the amount of initially free memory is enormous.
After all, 2⁴⁸ = 281,474,976,710,656 is almost 300 terabytes, which vastly exceeds the total RAM on all our laptops combined. Who can afford this much memory?

The answer comes in chapter 9.

But What About Multiple Jobs and/or Multiple Users

If two jobs are running at the same time how can both of their yellow sections start at 0x400000? Why don't their stacks collide?

The answers come in chapter 9.

O'H-7.10 Dynamic Linking with Shared Libraries

Again referring to the figure in section 7.1, we have just discussed static libraries (.a files) used by the linker and now wish to go one step further and discuss shared libraries used by the loader.

Disadvantages of Static Libraries

Duplication in the stored executables, i.e. the (perhaps renamed) a.out files. For example, nearly every compiled and linked C program would contain a copy of printf().
Duplication in the running programs (again think of printf()).
A bug fix to a function in a static library requires each application using this function to know about the fix and then to relink.

A Modern Solution: Shared Libraries

Shared libraries are object files that are loaded and linked into an application dynamically, at either load-time or run-time.
They are also called dynamic link libraries, DLLs, and .so files.
See the diagram on the right, which illustrates dynamic linking occurring when the executable is first loaded.
As indicated in the diagram, the normal linker only processes the relocation and symbol table information from the shared library (libc.so in the diagram). In particular the text and data segments of the shared library are not linked in at this point.
The linker leaves a marker in the executable that this executable must be further linked. When the program is loaded, the loader detects the marker and completes the linking.
Question: How is this better?
Answer #1: The compiled version of printf() is not included in nearly every a.out file (i.e., nearly every executable file).
Answer #2: The loader keeps track of already linked-in shared libraries from other, already running, programs and resolves references to shared libraries by using the already loaded copies.
This second answer requires position-independent code and virtual to physical address translation.
- The former is an option to the compiler requesting it generate only relative (not absolute) addresses. It is discussed in section 7.12 and enables the module to be loaded starting at any address.
- The second is an important component of memory management and is discussed in chapter 9, Virtual Memory.

O'H-7.11 Loading and Linking Shared Libraries from Applications

  #include <stdio.h>
  #include <stdlib.h>
  #include <dlfcn.h>
  int x[2] = {1,2}, y[2] = {3,4}, z[2];
  int main(int argc, char *argv[]) {
    void *handle;
    void (*addvec)(int *, int *, int *, int);
    char *error;
    // dynamically load the shared lib containing addvec()
    handle = dlopen("./libvector.so", RTLD_LAZY);
    if (!handle) {
      fprintf(stderr, ...);
      exit(1);
    }
    // get ptr to addvec()
    addvec = dlsym(handle, "addvec");
    if ((error = dlerror()) != NULL) {
      fprintf(stderr, ...);
      exit(1);
    }
    // now addvec can be called like a "normal" function
    addvec(x, y, z, 2);
    return 0;
  }

Two Real World Examples

The book gives two real world uses of run-time dynamic linking and loading, specifically software distribution and high performance web servers.

A Simple Example

On the right is a skeleton program that dynamically links during run time.

The function addvec() adds two 1D-arrays producing a third array (the 4th argument is the dimensionality of the arrays). We assume addvec() has been previously compiled and placed in the shared library ./libvector.so.

Multiple steps are involved in executing the addvec() function contained in the libvector library.

First, the libvector library is opened (but symbols from the library are not yet linked in; a sort of lazy linking).
Second, the symbol addvec() is searched for in the library and extracted.
The loader loads addvec(), unless it has already been loaded for another running program.
Finally, addvec() is used.

O'H-7.12 Position Independent Code (PIC)

We have seen that an advantage of dynamic shared libraries is that, when many processes are all using the same library (e.g., most C programs use printf() from the standard C library, libc.so), we need only one copy of the code in memory.

A difficulty arises with jump instructions. If the jump instruction includes the target address explicitly, all programs linking that shared function would need to put it in the same place so that the given address will jump to the same instruction in all copies. This is not practical; it would essentially require a fixed starting memory address for every program in all the shared libraries.

Instead PIC is employed, i.e., the instructions work correctly no matter where they are loaded. Here is a simple example (making the simplifying assumptions that every instruction is 4 bytes in length and that the argument of the jump instruction is the address to jump to).

  0x100000 addl  %rsi, %rdi
  0x100004 subl  $5,   %rsi
  0x100008 cmpl  %rsi, %rdi
  0x10000C jg    $0x100000

This program assumes that the addl instruction is loaded into location 0x100000 and would not work if all the instructions were loaded starting in 0x200000. We say the code sequence is not position independent.

Now consider a fictitious instruction picjg that, when the condition holds, jumps to the current location plus the argument. Then the following code

    0x100000 addl  %rsi, %rdi
    0x100004 subl  $5,   %rsi
    0x100008 cmpl  %rsi, %rdi
    0x10000C picjg $-0xC

works the same as the above but also works if loaded in 0x200000 or any place else. It is called position independent code and is just what we needed at the end of section 7.10.

There is more to PIC than this, but the idea remains the same: have the compiler generate code that works correctly no matter where the code is loaded. See the book for more complicated examples.

O'H-7.13 Library Interpositioning

Skipped

O'H-7.14 Tools for Manipulating Object Files

List of useful utilities.

O'H-7.15 Summary

Now we understand all the boxes and arrows in the diagram that began our study of linkers and that is repeated on the right.

However, we will get a better understanding of the advantages of dynamic linking when we study virtual memory in chapter 9.

Chapter O'H-6 Memory Hierarchy

O'H-6.1 Storage Technology

What do we want from an ideal memory?

Big (in capacity)
Fast (to access; also high throughput)
Cheap (always helps)
Non-volatile (maintains data when power is off)
Secure (doesn't leak data)

We will emphasize the first two, skip the third, and mention the last two.

Unfortunate Laws of Hardware: The Basic Memory Trade-off

Big and Fast are essentially contradictory; they are traded off against each other.
Big, Fast, and Cheap is basically hopeless.

The Basic Trade-off Becomes the Basic Goal

We can get/buy/build small and fast and big and slow.

Our goal is to mix the two and get a good approximation to the impossible big and fast.

O'H-6.1.1 Random Access Memory (RAM)

DRAM vs SRAM
Name	Trans per bit	Access time	Needs refresh	Volatile	Cost	Where used
SRAM	4 or 6	1x	No	Yes	100x	Cache Memory
DRAM	1	10x	Yes	Yes	1x	Main Memory

Two varieties: Static RAM (SRAM) and Dynamic RAM (DRAM).

RAM constitutes the memory in most computer systems. Unlike tapes or CDs, RAM is not limited to sequential access. The table on the right compares the two varieties.

SRAM is much faster but (for the same cost) has much lower capacity. Specifically, trans per bit gives the number of transistors needed to implement one bit of each memory type. The 4-transistor SRAM is harder to manufacture than the 6-transistor version.

Volatility and Refresh requirements

Both SRAM and DRAM are volatile, which means that, if the power is turned off, the memory contents are lost. Due to the volatility of both RAM varieties, when a computer is powered on, its first accesses are to some other memory type (normally a ROM—read-only memory).

DRAM, in addition to needing power, needs to be refreshed. That is, even if power remains steady, DRAM will lose its contents if it is not accessed. Hence there is circuitry to periodically generate dummy accesses to the DRAM, even if the system is otherwise idle.

O'H-6.1.2 Disk Storage

Magnetic Disks (Hard Drives)

I have in my office some disks from the 1980s and 90s. Unlike modern disks, these relics are big enough to see the active components. In normal years, I drag some of these monsters to class and show their internals. For today, we will have to settle for some pictures (one picture is from Henry Muhlpfordt).

Platter
Surface
Head
Cylinder
Track
Sector
Seek time
Rotational latency
Transfer rate

Performance

Disks are huge (~1TB) and slow (10ms).
Unlike RAM, disks have moving parts.
Unlike tape, disks support random access.

Consider the following characteristics of a disk.

Seek time. The time to move the heads radially to get to the desired cylinder. (This is actually quite complicated to calculate exactly since you need to consider acceleration, travel time, deceleration, and settling time.)
RPM (revolutions per minute).
Rotational latency. The average value is the time for one half a revolution.
Transfer rate. This is determined by the RPM and bit density.
Sectors per track. This is determined by the bit density.
Tracks per surface (i.e., the number of cylinders). This is determined by the bit density.
Tracks per cylinder (i.e., the number of surfaces).

Disk Performance: Choice of Block Size

It is important to realize that a disk always transfers (reads or writes) a fixed-size sector.

If the OS needs to read part of a sector, it must read a full sector and discard the part not needed.
If it needs to write part of a sector, it reads the full sector, modifies the part it wants to change and writes back the modified full sector. This requires two I/Os.
If the system wants to write a piece that occupies the end of one sector and the beginning of the next, it needs four(!) I/Os even if the piece is only two bytes in size.
Until recently all sectors were 512 Bytes; some new systems are 4KB = 4096 Bytes.

Current commodity disks have (roughly) the following performance.

Seek time is around 5ms (amazing!).
The rotation rate for a desktop disk is often 7200 RPM (revolutions per minute) with 10k, 15k and 20k available. Note that 6000 RPM is 100 revolutions per second or one revolution per 10ms. So half a revolution (the average rotation needed to reach a given point) for a 7200RPM disk is a little less than 5ms.
Transfer rates are around 100MB/sec = 100KB/ms.
Laptop disks normally spin a little slower (5400 RPM) to save power.
Note that these times are in ms, processors work in ns, and 1ms = 1,000,000ns.
If multiple requests are rapidly issued to the same disk, queuing delays occur. We don't study these, but take 202 for more on this issue.

The above performance figures are quite extraordinary. For a large sequential transfer, in the first 10ms, no bytes are transmitted; in the next 10ms, 1,000,000 bytes are transmitted.

The OS actually reads blocks, each of which is 2^k sectors. The OS sets the number of sectors per block when it creates a filesystem.

This analysis suggests using large disk blocks, 100KB or more. But much space would be wasted since many files are small. Moreover, transferring small files would take longer with a 100KB block size.

In practice typical block sizes are 4KB-8KB.

Multiple block sizes have been tried (e.g., blocks are 8KB but a file can also have fragments that are a fraction of a block, say 1KB).

O'H-6.1.3 Solid State Disks (SSDs)

This is flash RAM (the same stuff that is in thumb drives) organized in sector-like blocks as is a disk.

Unlike RAM, an SSD is nonvolatile.
SSDs are slower than RAM, but cheaper per byte.
Unlike a hard disk an SSD has no moving parts (and hence is much faster).
An SSD is more expensive per byte than a hard disk.

The blocks in an SSD can be written a large number of times (thousands or tens of thousands). However, this large number is not large enough to be ignored. Instead frequently written data is moved to previously unused portions of the device (a technique called wear leveling).

O'H-6.1.4 Storage Technology Trends

Summary: Everything is getting better but the rates of improvement are quite different for different technologies.

Cost per Byte Decrease from 1985 to 2015

  SRAM:  factor of 100
  DRAM:  factor of 50,000
  DISK:  factor of 3,000,000

Speed Increase from 1985 to 2015

  SRAM:  factor of 100
  DRAM:  factor of 10
  DISK:  factor of 25
  CPU:   factor of 2,000 (includes multiprocessor effect)

Importance of the Memory Hierarchy

The hierarchy is needed to close the processor-memory performance gap, i.e., the gap between processor speed improvement and DRAM speed improvement.

Alternatively said, it is the gap between the processor's need for data and the (DRAM) memory's ability to supply data.

O'H-6.2 Locality

Remember we want to cleverly mix some small/fast memory with a large pile of big/slow memory and get a result that approximates well the performance of the impossible big/fast memory.

The idea will be to put the important stuff in the small/fast and the rest in big/slow. But what stuff is important?

The answer is that we want to put into small/fast the data and instructions that are likely to be accessed in the near future and leave the rest in big/slow. Unfortunately this involves knowing the future, which is impossible.

We need heuristics for predicting what memory addresses will likely be accessed in the near future. The heuristic used is the principle of locality: programs will likely access in the near future addresses close to those they accessed in the near past.

The principle of locality is not a law of nature: One can write programs that violate the principle, but normally the principle works very well. Unless you want your programs to run slower, there is no reason to deliberately violate the principle. Indeed, programmers seeking high performance try hard to increase the locality of their programs.

We often use the term temporal locality for the tendency that referenced locations are likely to be re-referenced soon and use the term spatial locality for the tendency that locations near referenced locations are themselves likely to be referenced soon.

We will have more to say about locality when we study caches.

O'H-6.3 The Memory Hierarchy

In fact there is more than just small/fast vs big/slow. We have minuscule/blitz-speed, tiny/super-fast, ..., enormous/tortoise-like. Starting from the fastest/smallest, a modern system will have the following.

Registers
Level 1 (L1) Cache
L2 Cache
L3 Cache
Main Memory
Secondary Storage (Local Disks)
Tertiary Storage (Robotic Disks, Web Servers, LAN Disks)
Offline Storage

Registers

Today a register is typically 8 bytes in size and a computer will have a few dozen of them, all located in the CPU. A register can be accessed in well under a nanosecond and modern processors access at least one register for most operations.

In modern microprocessor designs (think phones, not laptops), arithmetic and many other operations are performed on values currently in registers. Values not in registers must be moved there prior to operating on them.

Registers are a very precious resource and the decision which data to place in registers and when to do so (which normally entails evicting some other data currently in that register) is a difficult and well studied problem. The effective utilization of registers is an important component of compiler design—we will not study it in this course.

Caches

For the moment ignore the various levels of caches and think of a single cache as an intermediary between the main memory, which (conceptually, but not in practice) contains the entire program, and the registers, which contain only the currently most important few dozen values.

In this course we will study the high-level design of caches and the performance impact of successful caching.

A memory reference that is satisfied by the cache requires much less time (say one tenth to one hundredth the time) than a reference satisfied by main memory.

Our primary study of the memory hierarchy will be at the cache/main-memory boundary. (In 202, we emphasize the main-memory/local-disk boundary.) In 201 we will see the performance effects of various hit ratios, i.e., the percentage of memory references satisfied in the cache vs satisfied by the main memory.

Multilevel Caches

When first introduced, a cache was the small and fast storage class and main memory was the big and slow. Later the performance gap widened between main memory and caches so intermediate memories were introduced to bridge the gap. The original cache became the L1 cache, and the gap bridgers became the L2 and L3.

The fundamental ideas remained the same: if we make it smaller it can be faster; if we let it be slower, it can be bigger.

Main Memory

For now, we shall pretend that the entire program including its data resides in main memory. Later in this course and again in 202, operating systems, we will study the effect of demand paging, in which the main memory acts as a cache for the disk system that actually contains the program.

Secondary Storage (Local Disks)

We know that the disk subsystem holds all our files and thus is much larger than main memory, which holds only the currently executing programs. It is also much slower: a disk access requires several MILLIseconds; whereas a main memory access is a fraction of a MICROsecond. The time ratio is about 100,000.

Tertiary Storage

One possibility is robot controlled storage, where the robot automatically fetches the requested media and mounts it. Tertiary Storage is sometimes called nearline storage because it is nearly online.

Other possibilities are web servers and local-area-network-accessible disks. We shall not discuss these possibilities.

Offline (Removable) Storage

Offline storage requires some human action to mount the device (e.g., inserting a CD). Hence the data is not always available.

O'H-6.3.1 Caching in the Memory Hierarchy

In the hierarchy diagram above, we see three levels of caches. But in a sense every level, except the bottom, is a cache of the level below it. The main memory in your laptop is a cache of your disk. Compared to disk, the main memory is small and fast. We will see in chapter 9 another sense in which main memory is a cache of the disk.

O'H-6.3.2 Summary of Memory Hierarchy Concepts

If a program chose to reference random locations, the hierarchy would not work since we would have no clue what portion of the big/slow memory we should place in the small/fast memory. But in practice programs do not reference random locations; rather references exhibit locality.

Temporal locality is the observed property of programs to reference a few locations many times each. Such locations should be moved to small/fast, and modern caches do just that.
Spatial locality is the observed property of programs to soon reference locations near the currently referenced location. Modern caches exploit this property by grabbing from big/slow not just frequently used locations but nearby locations as well.

O'H-6.4 Cache Memories

In this chapter, we will concentrate on the cache-to-main-memory interface. That is, for us the cache will be the small/fast memory and the main (DRAM) memory will be big/slow.

O'H-6.4.1 Generic Cache Memory Organization

A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

A cache is organized in units of blocks or lines. Common block sizes are 16, 32, and 64 bytes.

A block is the smallest unit we can move between a cache and main memory (some designs move subblocks, but we will not discuss such designs).

We view memory as organized in blocks as well. The size of a memory block is the same as the size of a cache block. If the block size is 16B, then bytes 0-15 of memory constitute memory block 0, bytes 16-31 constitute block 1, etc.
Transfers from memory to cache and back are one block in size.
Big blocks make good use of spatial locality. Explain why.

Start Lecture #21

Remark: The final for this class will be on zoom, just like the midterm. I mistakenly copied something for 202 into our class notes for 201. The erroneous comment has been removed.

The Basic Performance Equation

Definitions

A hit occurs when a memory reference is found in the upper level (small/fast) of the memory hierarchy.
We will be interested in cache hits, when the reference is found in the cache.
A miss is a non-hit.
The hit rate is the fraction of memory references that are cache hits.
The miss rate is 1 - hit rate, which is the fraction of references that are misses.
The hit time is the time required for a hit.
The miss time is the time required for a miss.
The miss penalty is miss time - hit time.

  Let m be the cache hit time.
  Let M be the miss penalty, i.e., the additional time for a cache miss.
  Let p be the probability that a memory access is a cache hit.

Then the average time is

  Avg access time = p*m + (1-p)(m+M)
                  = m + (1-p) M

The goal is to run fast, i.e. to have the average access time small. So we want to

decrease M (buy faster DRAM)
decrease m (buy faster SRAM)
increase p (design a bigger and/or better cache)

A Simple Example

Assume the following (somewhat reasonable) data.

m = 1ns
M = 0.1μs = 100ns
p = 0.9

What is the average access time?
First note that 0.1μs = 100ns. Then the above equation tells us that the average access time is
1ns + (1-0.9) * 100ns = 1ns + 0.1(100ns) = 11ns

Let's spend more and get double speed SRAM (i.e., m=0.5ns), but save money and get half speed DRAM (i.e., M=200ns). Then the average access time is
0.5ns + (1-0.9) * 200ns = 0.5ns + 0.1(200ns) = 20.5ns
Bad idea.

Let's try again. Forget the double speed SRAM. Instead, spend the money saved on half speed DRAM and get a cache with one quarter the miss rate. Then the average access time is
1ns + (1-0.975) * 200ns = 1ns + 0.025(200ns) = 6ns.
Good. But how do we lower the miss rate? Stay tuned.

Addressing Bytes, (4-byte) Words, and Blocks

Consider the following address (in binary). 10101010_11110000_00001111_11001010.
This is a 32-bit address. I used underscores to separate it into four 8-bit pieces just to make it easy to read; the underscores have no significance.

Machine addresses are non-negative (unsigned) so the address above is a large positive number (greater than 2 billion).

All the computers we shall discuss in this section are byte addressed. Thus the 32-bit number references a byte. So far, so good.

The (4-Byte) Word Addressed and the Byte Offset

We will assume in our study of caches that each word is four bytes. That is, we assume the computer has 32-bit words. This is not always true (many old machines had 16-bit, or smaller, words; and many new machines have 64-bit words), but to repeat, in our study of caches, we will always assume 32-bit words.

Since 32 bits is 4 bytes, each word contains 4 bytes. We assume aligned accesses, which means that a word (a 4-byte quantity) must begin on a byte address that is a multiple of the word size, i.e., a multiple of 4. So word 0 includes bytes 0-3; word 1 includes bytes 4-7; word n includes bytes 4n, 4n+1, 4n+2 and 4n+3. The four consecutive bytes 6-9 do NOT form a word.

Question: What word includes the byte address given above, 10101010_11110000_00001111_11001010?
Answer: 10101010_11110000_00001111_110010, i.e., the address divided by 4 (discarding the remainder).
Question: What are the other bytes in this word?
Answer: 10101010_11110000_00001111_11001000, 10101010_11110000_00001111_11001001, and 10101010_11110000_00001111_11001011

Question: What is the byte offset of the original byte in its word?
Answer: 10 (i.e., two), the address mod 4.
Question: What are the byte-offsets of the other three bytes in that same word?
Answer: 00, 01, and 11 (i.e., zero, one, and three).

The 32-Byte Block Addressed and the Word and Byte Offset

Blocks vary in size. We will not make any assumption about the block size, other than that it is a power of two number of bytes. For the examples in this subsection, assume that each block is 32 bytes.

Since we assume aligned accesses, each 32-byte block has a byte address that is a multiple of 32. So block 0 is bytes 0-31, which is words 0-7. Block n is bytes 32n, 32n+1, ..., 32n+31.

Question: What block includes our byte address 10101010_11110000_00001111_11001010?
Answer: 10101010_11110000_00001111_110, i.e., the byte address divided by 32 (the number of bytes in the block) or the word address divided by 8 (the number of words in the block).

O'H-6.4.2 Direct-Mapped Caches

We start with a tiny cache having a very simple cache organization, one that was used on the Decstation 3100, a 1980s workstation. In this design, cache lines (and hence memory blocks) are one word long.

Small blocks like this do not take advantage of spatial locality so are not used in modern machines.
Later we shall briefly consider better caches with bigger blocks.

Also in this Decstation 3100 design each memory block can only go in one specific cache line.

This is called a Direct Mapped organization.
The location of the memory block in the cache (i.e., the block number in the cache or the cache block number) is the memory block number modulo the number of blocks in the cache.
For example, if the cache contains 100 blocks, then memory block 34452 is stored in line 52. Memory block 352 is also stored in line 52 (but not at the same time, of course).
In real systems the number of lines in the cache is a power of 2 so taking modulo is just extracting low order bits.
Example: if the cache has 4 lines, the location of a block in the cache is the memory block number mod 4, which is the low order 2 bits of the memory block number.
A direct mapped cache is simple and fast, but has more misses than the set associative caches we will also study.

We shall assume that each memory reference issued by the processor is for a single, complete word.

Accessing a Direct-Mapped Cache

On the right is a diagram representing a direct mapped cache with C=4 blocks and a memory with M=16 blocks.

Let's assume each cache block and each memory block is one word long and to keep it simple we assume that each reference is to a single aligned word.

How can we find a memory block in such a cache? This is actually two questions in one.

Is the memory block present in the cache?
Where in the cache is the memory block, assuming it is present?

The second question is the easier. Let C be the number of blocks in the cache. Then memory block number N can be found only in cache line number N mod C (it might not be present at all).

Why Put Memory Block Number N in Cache Block N mod C?

Referring to the diagram we have M=16 memory blocks and C=4 cache blocks so we will have to assign M/C=16/4=4 memory blocks to each cache block. (In this example M=C², but that is a coincidence.)

In fact we assign memory block N to cache block N mod C

For example, in the upper diagram to the right all the green blocks in memory are assigned to the one green block in the cache.

Contrast this diagram with the bad design immediately below it.

The good design has the important property that consecutive memory blocks are assigned to different cache blocks. Consider an important array. Its elements will be spread out in the cache and will not fight with each other for the same cache slot. For example in the picture any 4 consecutive memory slots will be assigned to 4 different cache slots.

So the first question reduces to: Is memory block N present in cache block N mod C?

Start Lecture #22

Referring to the diagram we note that, since only a green memory block can appear in the green cache block, we know that the rightmost two bits of the number of any memory block in the green cache block are 10 (the number of the green cache block). So to determine if a specific green memory block is in the green cache block we need the rest of the memory block number. Specifically, is the memory block in the green cache block 0010, 0110, 1010, or 1110? It is also possible that the green cache block is empty (called invalid), i.e., it is possible that no memory block is in this cache block.

We need the rest of the address (i.e., red digits lost when we reduced the block number modulo the size of the cache) to see if the block in the cache is the memory block of interest. That number is N/C (integer division), using the terminology above. (Again, don't be confused by the coincidence that in this example N/C and N mod C have the same number of bits. The coincidence occurs because M=C².)
The cache stores the rest of the address, called the tag and we check the tag when looking for a block.
Since we will always choose C to be a power of 2, the tag (N/C) is simply the high order bits of N and the cache slot used is N mod C, the low order bits of N.
Also stored is a valid bit per cache block so that we can tell if there is a memory block stored in this cache block.

A sequence of memory references with a cold cache
Addr(10)	Addr(2)	hit/miss	block#
22	10110	miss	110
26	11010	miss	010
22	10110	hit	110
26	11010	hit	010
16	10000	miss	000
3	00011	miss	011
16	10000	hit	000
18	10010	miss	010

When the system is first powered on, all the cache blocks are invalid so all the valid bits are off.

On the right is a table giving a larger example, with C=8 (rather than 4, as above) and M=32 (rather than 16).

We still have M/C=4 memory blocks eligible to be stored in each cache block. Thus there are two tag bits for each cache block.

The example gives a sequence of memory references.
Do this example on the board showing the addresses stored in the cache at all times. The cache is initially empty, i.e., all cache blocks are invalid.
Also show the tags.
In the table on the right, all the addresses are word addresses. For example the reference to 3 means the reference to word 3 (which includes bytes 12, 13, 14, and 15).

Remarks:

Naturally, if a reference is a hit, the current memory block remains assigned to this cache block (what other memory block could you choose to assign?).
If the reference experiences a miss because the cache block is currently invalid, we assign this memory block to the cache block (there are no other contenders).
If the reference experiences a miss and the cache block is valid, we have a dilemma. Should we leave the existing memory block in the cache block or should we replace it with the newly referenced memory block? Our choice (in this example, not in all possible designs) is to discard the current contents of the cache block and let the new reference take its place.
Remember that in this very simple design, we have a direct-mapped cache with block size one word. Also, all memory references are for one word. I often abbreviate this situation by saying blksize=refsize=1.

Cache Contents, Hits, and Misses

Shown on the right is an eight-entry, direct-mapped cache with block size one word. As usual all references are for a single word (blksize=refsize=1). In order to make the diagram and arithmetic smaller, the machine has only 10-bit addressing (i.e., the memory has only 2¹⁰=1024 bytes), instead of more realistic 32- or 64-bit addressing.

Above the cache we see a 10-bit address issued by the processor.

There are several points to note.

The valid bit. If a valid bit is not set, then the corresponding line is invalid. When the system is first powered on all the lines are invalid.
This machine, like all the ones we will study, is byte addressed and has 4-byte words. Since the cache only handles references to a word, the rightmost two bits of the address from the processor, which specify the byte offset within the word, are ignored for cache access.
The (byte) address given is 1101010011.
Once we drop the byte-offset bits, the word address of the reference is 11010100. Since the block size is one word, the block number is also 11010100.
Since the cache has eight entries, the cache line number is 11010100 mod 8 = 11010100 mod 2³ = 100 (the low order three bits) and the tag is 11010100 div 8 (the high order 5 bits).
We see that the valid bit is on for entry 100 (i.e., entry 4) so the line contains valid data.
However, the tags do not match. Hence the reference is a cache miss.
Question: Would a memory reference 1000001001 be a hit or miss?
Answer: A hit since the tags would match (both are the 5-bit value 10000).
Question: Would a memory reference 0000001001 be a hit or miss?
Answer: A miss since the 5-bit tags do not match (10000 vs 00000).
Explain in class how we know that the data field in entry 2 contains the contents of word 130.
Make sure you understand why the other data fields contain the contents indicated.

Circuitry Needed to Detect Hits and Misses

The circuitry needed for a simple cache (direct mapped, blksize=refsize=1) is shown on the right. The only difference between this cache and the example above is size. This cache holds 1024 blocks (not just 8) and the memory holds 2³⁰ = 2^(10×3) = (2¹⁰)³ ≈ 1,000,000,000 blocks (not just 256). That is, the cache size is 4KB and the memory size is 4GB.

Determining whether we have a hit or a miss, and returning the data in case of a hit, is quite easy, as the circuitry indicates.

Make sure you understand the division of the 32 bit address into 20, 10, and 2 bits.

Calculate on the board the total number of bits in this cache and the number used to hold data.

Processing a Read for this Simple Cache

The action required for a read hit is clear, namely return to the processor the data found in the cache.

For a read miss, the best action is fairly clear, but requires some thought.

Clearly, we must go to central memory to fetch the requested data since it is not available in the cache.
If the cache line was invalid, we store the memory block in the cache as well as returning it to the processor.
The question remaining is: If the current cache line is valid, but has the wrong tag, should we replace the current line with the data just fetched from memory, evicting the old data (which was for a different address), or should we keep the old data in the cache.
We definitely want to store the new data and evict the old.
Question: Why?
Answer: Temporal Locality.
Question: What should we do with the old data? Can we just toss it or do we need to write it back to central memory?
Answer: It depends! Specifically, it depends on whether the data in central memory is up-to-date (i.e., on whether the cache is write-through).
We will see shortly that the action needed on this read miss depends on our choice of action for a write hit. The easiest solution is to employ write-through (see immediately below).

Processing a Write for this Simple Cache

The simplest write policy is write-through, write-allocate (see below for definitions). The DECstation 3100 discussed above adopted these policies and performed the following actions for any write, hit OR miss. (The 3100 was a personal workstation, not a fancy supercomputer costing millions of dollars, so simplicity of the design was important. This desire for simplicity also explains why, for the 3100, block size = reference size = 1 word and the cache is direct mapped.)

Index the cache using the correct LOBs (i.e., not the very lowest order bits as these give the byte offset).
Write the data and the tag into the cache.
- For a hit, we are overwriting the tag with itself. It is almost always easier, and sometimes faster, to always do something than to test if you need to do it.
- For a miss to an invalid line, we are performing a write allocate, i.e. we are allocating a cache line for this write.
- For a miss to a valid line, we overwrite the existing entry and, since the cache is write-through (see just below), memory is up-to-date with respect to the replaced entry so we can simply overwrite the current entry.
Set Valid to true (it may already be true).
Send the current request to main memory, ensuring that the memory always contains the up-to-date value of every cache line. This action is called write-through.

Although the above policy has the advantage of simplicity (it performs the same actions for all writes, hits or misses; and simplifies the handling of read misses), it is out of favor due to its poor performance (other designs make fewer requests to main memory).

A Key Formula

Divide the memory block number by the number of cache blocks. The quotient is the tag and the remainder is the cache block number.

  MBN
  --- = tag    MBN % NCB = CBN
  NCB

Improvement: Use a Write Buffer

Unified vs Split I and D (Instruction and Data) Caches

Analogy: If you have N numerical addresses but only n<N mailboxes available, one possibility (the one we use in caches) is to put mail for address M in mailbox M%n. Then to distinguish addresses assigned to the same mailbox you need the quotient M/n. In caches we call the mailbox assigned the cache index (or cache line or cache block number) and we call the quotient needed for disambiguation the tag.

The key principle is the Fundamental Theorem of Fifth Grade

  Dividend = Quotient * Divisor + Remainder

We divide the dividend (the memory block number) by the divisor (the number of cache blocks) and look in the cache slot whose number is the Remainder (the cache index or line number). We check whether the Quotient (the tag) matches the stored value.

Homework: Consider a cache with the following properties, which are essentially the ones we have been using to date:

movl	$0x11ff,	0x0
movl	0x0,	%r8
movl	$0x22FF,	0x80
movl	0x0,	%r9
movl	$0x33FF,	0x8
movl	$0x44FF,	0x8
movl	$0x55FF,	0x38
movl	$0x66FF,	0x28
movl	0x38,	%r10

Direct mapped (there is only one possible location in the cache for the contents of a given memory location).
Write-through and store-allocate (i.e., the DECstation 3100).
All references are to (4-byte) words.
The block size is one word.
The memory uses 32-bit addresses.
The cache has 16 entries (i.e., 16 blocks or 16 lines).

The cache is initially empty, i.e. all the valid bits are 0. Then the references on the right are issued in the order given.

For each instruction tell whether the memory reference is a cache hit, a cache miss due to the entry being invalid, or a cache miss due to a tag mismatch.
After all the instructions have been executed, what are the contents of the cache? That is, for each line, give its validity, its tag, and its contents. If one or more fields are still junk, label them junk.

Remind me to do this one in class next time.

Improvement: Multiword Blocks

The setup we have described does not take much advantage of spatial locality. The idea of having multiword blocks is to bring into the cache words near the referenced word since, by spatial locality, they are likely to be referenced in the near future.

We continue to assume that all references are for one word and that all memory addresses are 32 bits and reference a byte. For a while, we will continue to assume that the cache is direct mapped.

The figure on the right shows a 64KB direct mapped cache with 4-word (16-byte) blocks.
Questions: For this cache, when the memory word referenced is in a given block, where in the cache does the block go, and how do we find that block in the cache?
Answers:

Byte n is in word n/4, for the 4-byte words we are assuming. So bytes 4n...4n+3 are in word n.
For a cache like the one on the right with 16B (4 word) blocks, byte n is stored in memory block n/16 and word w is stored in memory block w/4. So bytes 16n...16n+15 are stored in block n.
The word-in-block = the word address modulo 4 (the number of words per block).
The number of blocks in the cache = the size of the cache divided by the size of each block. For the pictured cache this is 64KB / 16B = 4K.
The cache block number or cache line number = the memory block number modulo the number of blocks in the cache (for the direct mapped caches we are now studying).
The tag = the memory block number / the number of blocks in the cache.

Show from the diagram how this gives the pink portion for the tag and the green portion for the index or cache block number.

Consider the cache shown in the diagram above and a reference to word 17003.

17003 / 4 = 4250 with a remainder of 3. Hence the memory block number is 4250 and the word-in-block is 3.
A 64KB cache with 16B blocks has 64KB/16B=4K=4096 entries. Since 4250 / 4096 gives 1 with a remainder of 154, memory block number 4250 is stored in cache line 154 with a tag of 1.

Summary: Memory word 17003 resides in word 3 of cache block 154 with tag 154 set to 1 and with valid 154 true.

Cache Sizes

The cache size or cache capacity is the size of the data portion of the cache (normally measured in bytes).

For the caches we have seen so far this is the block size times the number of entries. For the diagram above this is 16B * 4K = 64KB. For the simpler direct mapped caches block size = word size. So the cache size is the word size times the number of entries. Specifically the cache in the previous diagram, which has 1024 entries, has size 4B * 1K = 4KB.

You should not be surprised to hear that a bigger cache has a higher hit rate. The interesting comparison is between the last cache and an enlarged version of the previous cache with 16K entries, i.e., comparing two caches of size 64KB. Experiments have shown that spatial locality does occur and real programs have higher hit rates on caches with 4-word blocks than they do on caches with 1-word blocks.

For a simple example imagine a program that sweeps once through an array of one million entries, each one word in size. For our simple cache, the hit rate is zero! For the last cache, the hit rate is .75.

Total Size of a Cache

Note that the total size of the cache includes all the bits. Everything except for the data portion is considered overhead since it is not part of the running program. For the caches we have seen so far the total size is
(block size + tag size + 1) * the number of entries
We shall not emphasize total size in this class, but we do in 436, computer architecture.

Processing Hits and Misses for a Cache With Multi-word Blocks

Read hit: As before, return the cached data to the processor.
Read miss, block invalid (assuming write allocate): Fetch the needed line from memory, store it in the cache, and return the referenced word to the processor.
Read miss, tag mismatch: (assuming write allocate and store through): Read the new line from memory replacing the old line in the cache and return the referenced word to the processor.
Write hit: As before, write the word in the cache and memory (assuming store through).
Write miss: A new consideration arises. As before we might or might not decide to replace the current block with the referenced block and, if we do decide to replace the block, we might or might not have to write the old block back. The new consideration is that if we decide to replace the block (i.e., if we are implementing store-allocate), we must remember that we only have a new word and the unit of cache transfer is a multiword block. We will not discuss this (advanced) consideration.
Question: Since bigger blocks take advantage of spatial locality and have a lower percentage of the cache memory used for overhead, why not have enormous blocks? For example, why not have the cache be one huge block?
Answer: Not all accesses are sequential. With too few blocks, misses go up again.

Harder Questions that we will Not Discuss in Detail

Write-through vs write-back:
On a write hit, we update both the cache and the memory. This is called write-through, to indicate that the write goes through the cache and reaches memory.
- Updating memory is slow, why not skip it?
- If memory is permitted to be out of date (i.e., only the cache has the current value), we must remember to update memory when we evict the cache entry.
- This write-back policy is harder but has the advantage that if we update the variable z 100 times before evicting its cache block, we only send the last value of z to memory. Is this reduction of memory accesses worth the extra complexity?
Write-allocate vs no-write-allocate:
- Are we wise in storing write misses in the cache (sometimes evicting other data to permit this)? This is called write-allocate.
- Write-allocate with multiword cache blocks:
  If a cache block contains many words and we get a write miss, do we fetch from memory the rest of the block or is just the referenced word kept in the cache?

Homework: Consider two 256KB direct-mapped caches (i.e., each cache contains 256KB of data). As always, a memory (byte) address is 32 bits and all references are for a 4-byte word. The first cache has a block size of one word, the second has a block size of 32 words.

What is the total size (in bits) of each cache.
Given a 32-bit memory reference, for each cache, which bits are used to index the cache and which bits are matched against the tag.

O'H-6.4.3/4 Set/Fully Associative Caches

Consider the following sad story. Jane's computer has a cache that holds 1000 blocks and Jane has a program that only references 4 (memory) blocks, namely blocks 13, 1013, 113013, and 7013. In fact the references occur in order: 13, 1013, 113013, 7013, 13, 1013, 113013, 7013, 13, 1013, 113013, 7013, 13, 1013, 113013, 7013, etc. Referencing only 4 memory blocks and having room for 1000 blocks in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero. She was so sad, she gave up her job as a web-mistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.

The Easy Way Out: Direct Mapped

So far we have studied only direct mapped caches, i.e., those for which the block number in the cache is determined by the memory block number, i.e., there is only one possible location in the cache for any block. In Jane's sad story I picked four memory blocks so that they were all assigned to the same cache block and hence kept evicting each other. The rest of the cache was unused and essentially wasted.

Although this direct-mapped organization is no longer used because it gives poor performance, it does have one performance advantage: To check for a hit we need compare only one tag with the high-order bits of the address.

Going All the Way: Fully Associative

The direct-mapped organization, in which a given memory block can be placed in only one possible cache block, is one extreme. The other extreme is called a fully associative cache in which a memory block can be placed in any cache block. Since any memory block can be in any cache block, the cache index tells us nothing about which memory block is stored there. Hence the tag must be the entire memory block number. Moreover, we don't know which cache block to check so we must check all cache blocks to see if we have a hit.

The larger tag would be a problem.
The search over all cache blocks would be a disaster.
- The search could be done sequentially (one cache block at a time), but this is much too slow.
- We could have a comparator with each tag and test all the blocks at once to select the one that matches.
  - This is too big due to both the many comparators and the humongous multiplexor needed to combine all the results.
  - However, it is exactly what is often done when implementing translation lookaside buffers (TLBs), which are used with demand paging (as you will see soon and see again if you take 202, operating systems).
  - Question: Are the TLB designers magicians?
    Answer: No, TLBs are small.
An alternative would be to have a table with one entry per memory block telling if the memory block is in the cache and if so giving the corresponding cache block number. This is too big and too slow for caches but is exactly what is used for demand paging (again, 202) where the memory blocks in cso correspond to pages on disk in os and the table we would need in cso is called the page table in os.

The Middle Ground: Set Associative

Most common for caches is an intermediate configuration called set associative or n-way associative (e.g., 4-way associative). The value of n is typically a small power of 2.

If the cache has B blocks, we group them into B/n sets each of size n. Since an n-way associative cache has sets of size n blocks, it is often called a set size n cache. For example, you often hear of set size 4 caches.

In a set size n cache, memory block number K is stored in set number K mod (the number of sets), which equals K mod (B/n).

The picture on the right shows a system storing memory block 12 in three caches, each cache having 8 blocks. The left cache is direct mapped; the middle one is 2-way set associative; and the right one is fully associative.

The blue (both light and dark) indicates the cache blocks in which memory block 12 might have been stored.
The dark blue is the cache block in which the memory block 12 is stored.
The arrows show the blocks (i.e., tags) that must be searched to look for memory block 12. The arrows point to the tags corresponding to blue blocks.
We explain the figure below.

The Left-Hand Diagram: Direct Mapped

We have already done direct mapped caches but to repeat:

If a given memory block resides in the cache, the cache block number containing the block equals the memory block number mod the number of cache blocks.
Expressed algebraically: CBN = MBN mod NCB.
Similarly, the tag = MBN / NCB.
In the example, the memory block number is 12 and there are 8 cache blocks so memory block 12 is stored in cache block 12 mod 8 = 4 and the tag is 12 div 8 = 1.

The Middle Diagram: Set Associative

The middle picture shows a 2-way set associative cache also called a set size 2 cache. A set is a group of consecutive cache blocks.

As the name indicates the size of each set in the picture is 2 (cache blocks) and hence the number of cache sets = the number of cache blocks / 2.
In general, the number of cache sets is the number of cache blocks divided by the size of each set. Algebraically: NCS = NCB / SS.
If a given memory block resides in the cache, the set number containing the block equals the memory block number mod the number of sets.
Algebraically: SN = MBN mod NCS.
Similarly, the tag = MBN / NCS.
Specifically, in the example, memory block 12 is stored in cache set 12 mod (8/2) = 12 mod 4 = 0.
Similarly, the tag is 12/(8/2) = 3.
We do not know which (if any) of the cache blocks in the set contains the memory block. Hence we must check the tags for all entries in the set to see if any match, which explains the two arrows in the middle diagram.

The Right-Hand Diagram: Fully Associative

The right picture shows a fully associative cache, i.e. a cache where there is only one set and it is the entire cache.

In the example the one set has 8 elements so memory block 12 is stored in set 12 mod (8/8) = 12 mod 1 = 0, which is the only set.
The tag is 12 / (8/8) = 12/1 = 12, i.e., the entire memory block number.
We do not know which (if any) of the cache blocks in the set contain the desired memory block. Hence we must check the tags for all entries in the entire cache to see if any match. This explains the eight arrows in the right diagram.
These three points hold for fully associative caches of any size.
1. There is only one set, set 0.
2. The tag is the entire memory block number
3. We need check the tags for all entries.

For a cache holding n blocks, a set-size n cache is fully associative. Any set-size 1 cache is direct mapped.

When the cache was organized by blocks and we wanted to find a given memory word we first converted the word address to the MemoryBlockNumber (by dividing by the #words/block) and then formed the division

  MemoryBlockNumber / NumberOfCacheBlocks

The remainder gave the index in the cache and the quotient gave the tag. We then referenced the cache using the index just calculated. If this entry is valid and its tag matches the tag in the memory reference, that means the value in the cache has the right quotient and the right remainder. Hence the cache entry has the right dividend, i.e., the correct memory block.

Determining the Set Number and the Tag

Recall that for a direct-mapped cache, the cache index is the cache block number (i.e., the cache is indexed by cache block number). For a set-associative cache, the cache index is the set number.

Just as the cache block number for a direct-mapped cache is the memory block number mod the number of blocks in the cache, the set number for a set-associative cache is the (memory) block number mod the number of sets.

Just as the tag for a direct mapped cache is the memory block number divided by the number of blocks in the cache, the tag for a set-associative cache is the memory block number divided by the number of sets in the cache.

Summary: Divide the memory block number by the number of sets in the cache. The quotient is the tag and the remainder is the set number. (The remainder is normally referred to as the memory block number mod the number of sets.)

Do NOT make the mistake of thinking that a set size 2 cache has 2 sets. It has NCB/2 sets, each set containing 2 blocks.

Ask in class.

What is another name for an 8-way associative cache having 8 blocks?
What is another name for a 1-way set associative cache?

Question: Why is set associativity good? For example, why is 2-way set associativity better than direct mapped?
Answer: Consider referencing two arrays of size 50KB that start at location 1MB and 2MB.

Both contend for the same cache blocks in a direct mapped 128KB cache.
They fit together in a 128KB 2-way associative cache.

Question: What is the advantage of associativity?
Answer: The advantage of increased associativity is normally an increased hit ratio.

Question: What are the disadvantages?
Answer: It is slower, bigger, and uses more energy due to the extra logic.

Remark: Go over the homework from last time. Note that an absolute memory address, say location 0x0, is written without ().

Start Lecture #23

Mapping the Big (and Slow) to the (Fast and) Small

We know that a cache is smaller than a central memory and realize that at any one time only a subset of the central memory can be cache resident. Given a central memory address A we want to know

Is A in the cache.
If so, where is it in the cache.

Actually we answered them in reverse order. We first determined where A must be in the cache if it's there at all, and then we look to see if it is there.

Our First Attempt

We started with a simple cache (direct mapped, blocksize one word). This cache contained only 4 words (2-bit addresses) and the central memory had only 16 words (4-bit addresses). Given a word in memory, we divided its MBN (here, its memory word address) by the NCB (# words in the cache) and examined the quotient and remainder (div and mod). By coincidence 16 = 4² so both the div and mod are 2-bit numbers; in general they are not the same size. We used the remainder (mod) to specify the index, i.e., the cache location for this word (in the diagram the index is the color) and used the quotient (called the tag) to determine, for example, if the green cache entry is the particular green memory block we desire.

Summary of Bit Division in General

The basic idea is to first number the units in the big and small memories, second divide the number given to a unit in the big memory by the number of units in the small. In the simplest example above, the number of a memory unit was its word address and the number of cache units was the number of words in the cache.

In reality, memory is composed of blocks (each several words) and caches are composed of sets (each several blocks). Specifically, given the cache parameters and memory byte address (32-bits) we proceed as follows.

Calculate MBN, the memory block number.
1. Divide the byte address by 4 (drop low order 2 white bits) to get the memory word number MWN.
2. Divide MWN by #words/block to get MBN (drop magenta bits used for word in block).
Calculate NCS, the number of cache sets (the number of rows in the cache diagrams, the index in the cache).
1. NCB (number of cache blocks) = Cache size / block size
2. NCS = NCB / (number of blocks per set, i.e., the set size)
Divide MBN / NCS.
1. The remainder is the set number, which is the index (i.e., the row) in the cache. These are low order (green) bits.
2. The quotient is the tag; these are high order (pink) bits.

Dividing Decimal Numbers by 10ⁿ or binary numbers by 2ⁿ

To divide 134782993 by 100, you reach for a pencil not a calculator! You draw a vertical line with the pencil; the left part is the div (aka quotient) and the right part is the mod (aka remainder). You use the same vertical line technique to divide a binary number by 8=2³ (or by 4K=2¹²).

Locating a Block in a Set Associative Cache

Question: How do we find a memory block in a 4KB 4-way set associative cache with block size 1 word?
Answer: This is more complicated than for the simple direct mapped caches we started with. The three macro steps are:

Calculate MBN
Calculate NCS
Divide MBN/NCS to get the quotient (which is the tag) and the remainder (which is the set number).

We proceed as follows. (Do on the board an example: address 0x000A0A08 = 00000000_00001010_00001010_00001000)

First drop the low 2 bits (byte in word) of the memory address, leaving 30 bits for the memory word number (MWN). The dropped bits are white in the diagram to the right.
The MBN = the MWN since, in this example, the block size is 1 word. In general the MBN = MWN / numberOfWordsInABlock
The cache contains NCB blocks, NCB = (size of cache)/(size of block) = 4KB / 4B = 1024.
Each set contains 4 blocks.
Hence the cache has NCS = NCB / NumberBlocksPerSet = 1024 / 4 = 256 sets
Divide the memory block number by the number of sets. The quotient is the tag and the remainder is the set number. That is, tag = MBN / NCS = MBN / 256 and the set number = MBN mod 256. Since 256=2⁸, dividing a binary number by 256 is simply separating the dividend into two pieces: the right 8 bits are the remainder and the rest is the quotient.
The quotient, i.e, the tag, is shown in pink in the diagram.
The remainder is the set number (i.e., the index of the entry). This portion of the address is shown in green. Again note that no division is required.
Compare all the tags and valid bits in the set with the tag of the memory block.
If exactly one valid tag matches, a hit has occurred and the corresponding data entry contains the memory block.
If more than one valid tag matches, an error has occurred: the same memory block is stored in two or more blocks of the set. The diagram does not check for this error.
If no valid tag matches, a miss has occurred.

An Example Done on The Board

In 2020-21, the class is taught remotely and the Zoom whiteboard crashes. So we fake it. The example has address hex 0x000A0A08 = 00000000_00001010_00001010_00001000 in binary. The cache remains 4KB, 4-way set associative (hence 256 sets), with blocksize one word.

First drop the low 2 bits (byte in word) of the memory address, leaving 30 bits for the memory word number (MWN). The dropped bits are white in the circuit diagram on the right. MWN = 00000000_00001010_00001010_00001000 / 4 = 00000000_00001010_00001010_000010
The MBN = the MWN since, in this example, the block size is 1 word.

MBN = 00000000_00001010_00001010_000010

The cache contains NCB blocks. NCB = (size of cache) / (size of block) = 4KB / 4B = 1024.
Each set contains 4 blocks (the cache is 4-way set associative).
Hence the cache has NCS = NCB / (numBlks/set) = NCB / setSize. NCS = 1024 / 4 = 256 = 2⁸ sets.
Divide the memory block number by the number of cache sets. The quotient gives the tag and the remainder gives the set number.
That is, tag = MBN / NCS = 00000000_00001010_00001010_000010 / 256 = 00000000_00001010_000010 and is shown in pink in the circuit diagram.
The remainder (i.e., the memory block number mod the number of cache sets) is the set number (i.e., the index of the entry). The set number is MBN mod NCS = 00000000_00001010_00001010_000010 mod 256 = 10000010 and is green in the circuit diagram.
(Since 256=2⁸, dividing by 256 is simply separating the dividend into two pieces: the right 8 bits are the remainder and the rest is the quotient. Again note that no actual division is performed.)
Compare all the tags and valid bits in the set with the tag of the memory block.
If exactly one valid tag is 00000000_00001010_000010, a hit has occurred and the corresponding data entry contains the memory block.
If more than one valid tag matches, an error has occurred: the same memory block is stored in two or more blocks of the set. The circuit diagram does not check for this error.
If no valid tag matches, a miss has occurred.

Combining Set-Associativity with Multiword Blocks

This is a fairly simple combination of the two ideas and is illustrated by the diagram on the right.

Start with the picture just above for a set-associative cache with blocksize = 1 word.
Each blue portion of the cache is now a 4-word block, not just a single word.
Hence the data coming out of the multiplexor at the bottom right of the previous diagram is now a block. In the diagram on the right, the block is 4 words.
As with direct-mapped caches having multi-word blocks, we again use the word-within-block bits to choose the proper word. In the diagram this is performed by the very bottom multiplexor, using the magenta word-within-block bits as the selector line.
Note that this cache is bigger than the one above. Each has 256 sets. Each has 256*4=1024 blocks. But the bottom one has bigger blocks.

Our description and picture of multi-word block, direct-mapped caches appears earlier (in the section on multiword blocks), and our description and picture of single-word block, set-associative caches is just above. It is useful to compare those two pictures with the one on the right to see how the concepts are combined.

Below we give a more detailed discussion of which bits of the memory address are used for which purpose in all the various caches.

Choosing Which Block to Replace

When an existing block must be replaced, which victim should we choose? The victim must be in the same set (i.e., have the same index) as the new block. With direct mapped (a.k.a 1-way associative) caches, this determines the victim so the question doesn't arise.

With a fully associative cache all resident blocks are candidate victims. For an n-way associative cache there are n candidates. Victim selection in the fully-associative case is covered extensively in 202. We will only mention some possible algorithms.

LRU: Least Recently Used
LFU: Least Frequently Used
FIFO: First-In First-Out
Random

Summary of What We Know How to Do For Loads and Stores

When you write a C language assignment statement y = x+1; the processor must first read the value of x from the memory. This is called a load instruction. The processor also must write the new value of y into memory. This is called a "store" instruction.

Direct Mapped Caches

For a direct mapped cache with 1-word blocks we know how to do everything (we assume Store-Allocate and Write-Through).

Load Hit: The cache returns the stored value to the processor. The memory is not involved.
Load Miss: The cache obtains the value from memory, stores it in the cache and returns it to the processor.
Store Hit: Cache and memory are updated with the new value.
Store Miss: Memory is updated. The cache stores the new value and tag overwriting whatever was there.

If a block contains multiple words the only difference for us is that on a miss the rest of the block must be obtained from memory and stored in the cache.

Set Associative Caches

An extra complication arises on a cache miss (either a load or a store). If the set is full (i.e., all blocks are valid) we must replace one of the existing blocks in the set, and we will not study how to choose which one to replace. As mentioned previously, in 202 you will learn how operating systems deal with a similar problem. However, caches are all hardware and hence must be fast, so they cannot adopt the complicated OS solutions.

We will not deal with this replacement question seriously in 201.

How Big Is a Cache?

There are two notions of size.

Definition: The cache size is the capacity of the cache.

This means, the total size of all the blocks.
In the diagrams above it is the size of the blue portions.
The size of the cache in the last diagram is 256 * 4 * 16B = 16KB.
The size of the cache in the previous diagram is 256 * 4 * 4B = 4KB.
So you should not compare the performance of these two; of course the bigger cache will do better.
Instead the 16KB cache should be reduced to 64 sets or the 4KB cache increased to 1024 sets.

Another size of interest is the total number of bits in the cache, which includes tags and valid bits. For the 4-way associative, 1-word per block cache shown above, this size is computed as follows.

The 32 address bits contain 8 bits of index and 2 bits giving the byte offset.
So the tag is 22 bits (more examples just below).
Each cache entry contains 1 valid bit, 22 tag bits and 32 data bits, for a total of 55 bits.
There are 256*4=1K entries.
So the total size is 55Kb (kilobits).

Question: For this cache, what fraction of the bits are user data?
Answer: 4KB / 55Kb = 32Kb / 55Kb = 32/55.

Calculate in class the equivalent fraction for the last diagrammed cache, having 4-word blocks (and still 4-way set associative).

Tag Size and Division of the Address Bits

As always we assume a byte addressed machine with all references to a 4-byte word.

The 2 LOBs are not used (they specify the byte within the word, but all our references are for a complete word). We show these two bits in white. We continue to assume 32-bit addresses so there are 2³⁰ words in the address space.

Let us review various possible cache organizations and determine for each the tag size and how the various address bits are used. We will consider four configurations each a 16KB cache. That is the size of the data portion of the cache is 16KB = 2¹⁴ bytes = 2¹² words.

Direct Mapped, Block Size 1 (Word)

This is the simplest cache.

Since the block size is one word, there are 2³⁰ memory blocks and all the address bits (except the white 2 LOBs that specify the byte within the word) are used for the memory block number (MBN). Specifically, 30 bits are so used.
Any 16KB cache contains 2¹⁴ bytes = 2¹² words, which, for this direct mapped, blocksize=wordsize cache, is 2¹² blocks = 2¹² sets. So NCS = 2¹².
Divide the MBN by NCS. The remainder, i.e., the low order 12 bits of the MBN, gives the index in the cache (the Cache Set Number, CSN, shown in green).
The quotient (the remaining 30-12=18 high-order bits) is the tag, shown in pink.

Direct Mapped, Block Size 8

Modestly increasing the block size is an easy way to take advantage of spatial locality.

Three bits of the address give the word within the 8-word block. These bits are shown in magenta.
The remaining 27 HOBs of the memory address give the memory block number.
The cache is still 2¹² words, but this is now only 2⁹ blocks = 2⁹ sets, i.e., NCS=2⁹.
Divide MBN/NCS. The remainder, the low order 9 green bits of the MBN, gives the CSN, the index in the cache.
The quotient (the remaining 27-9=18 pink bits) is the tag.

4-Way Set Associative, Block Size 1

Increasing associativity improves the hit rate but only a modest associativity is practical.

The block size is 1 word so there are 2³⁰ memory blocks and the MBN is again 30 bits.
The cache has 2¹² words = 2¹² blocks = 2¹⁰ sets (each set has 4=2² blocks).
Divide MBN/NCS. The remainder, the low order 10 green bits of MBN, gives the index in the cache (the CSN).
The 20 bit pink quotient is the tag.
Question: Why does the tag get bigger as the associativity grows?
Answer: Growing associativity (for a fixed size cache) reduces the number of sets into which a block can be placed, which increases the number of memory blocks eligible to be placed in a given set. Hence more bits are needed to see if the desired block is there.

4-Way Set Associative, Block Size 8

The two previous improvements are often combined.

Three low order magenta bits of the memory address give the word within the block.
The remaining 27 HOBs give the MBN.
The cache has 2¹² words = 2⁹ blocks = 2⁷ sets. NCS=2⁷.
Divide MBN/NCS. The remainder (the low order 7 green bits) gives the index in the cache.
The remaining 20 pink bits form the tag.

On the board calculate, for each of the four caches, the memory overhead percentage. For all four, the cache size is 16KB.

Direct mapped, block size one word
We have seen that the cache has 2¹² sets (i.e., 2¹² rows) and that each tag is 18 bits. Each row contains 1 valid bit, one tag, and a 1-word data block. So each row contains 1 + 18 + 32 = 51 bits of which 32 bits are data so the memory overhead is (51-32)/32 ~ 59%.
Direct mapped, block size eight words
This cache has 2⁹ sets (i.e, 2⁹ rows) and 18-bit tags. Each row contains 1 valid bit, one tag, and an 8-word data block for a total of 1 + 18 + 8*32 = 275 bits of which 256 bits are data so the memory overhead is (275-256)/256 ~ 7.4%.
4-way set associative, block size one word
This cache has 2¹⁰ sets (rows) and 20-bit tags. Each of the 4 entries in a row contains 1 valid bit, one tag, and a 1-word data block for a total of 1 + 20 + 32 = 53 bits of which 32 bits are data so the memory overhead is (53-32)/32 ~ 66%.
4-way set associative, block size eight words
This cache has 2⁷ sets (rows) and 20-bit tags. Each of the 4 entries in a row contains 1 valid bit, one tag, and an 8-word data block for a total of 1 + 20 + 8*32 = 277 bits of which 256 bits are data so the memory overhead is (277-256)/256 ~ 8.2%.

Homework: Redo the four caches above with the size of the cache increased from 16KB to 64KB determining the number of bits in each portion of the address as well as the overhead percentages.

Summary of Bit Division (Repeat)

Given the cache parameters and memory byte address (32-bits).

Calculate MBN, the memory block number.
1. Divide the byte address by 4 (drop low order 2 white bits) to get the memory word number MWN.
2. Divide MWN by #words/block to get MBN (drop magenta bits used for word in block).
Calculate NCS, the number of cache sets (the number of rows in the cache diagrams).
1. NCB (number of cache blocks) = Cache size / block size
2. NCS = NCB / (number of blocks per set, i.e., the set size)
Divide MBN / NCS.
1. The remainder is the set number, which is the index (i.e., the row) in the cache. These are low order (green) bits.
2. The quotient is the tag; these are high order (pink) bits.

First Example

The memory blksize is 1 word. The cache is 64KB direct mapped. To which set is each of the following 32-bit memory addresses (given in hex) assigned and what are the associated tags?

10000000₁₆
0101F0F0₁₆

Answer. Let's follow the three step procedure above for each address.

10000000₁₆ = 2²⁸.
1. The first step is to find MBN. The memory word number for this address is 2²⁶ (there are 4 bytes in a word). For this example the memory block number MBN is also 2²⁶, since the blocksize is one word.
2. The second step is to find NCS. We are given that the cache has size 64KB so it has 2¹⁶ bytes or 2¹⁴ words. Since the blocksize is one word, the cache has 2¹⁴ blocks and, since the set size is 1 (direct mapped), NCS = 2¹⁴/1 = 2¹⁴.
3. Divide MBN / NCS. 2²⁶ / 2¹⁴ has quotient 2¹² and remainder 0. So the set assigned to 10000000₁₆ is set 0 and the tag is 2¹².
0101F0F0₁₆
1. Let's write the address in base 2 with a - between each group of 4 bits. So the address is 0000-0001-0000-0001-1111-0000-1111-0000. That is the byte number. There are again 4 bytes in a block so, to get the MBN, we need to divide the above number by 4, which means drop the low order two bits. This gives MBN=00-0000-0100-0000-0111-1100-0011-1100.
2. The cache is the same as above so NCS=2¹⁴.
3. Divide MBN / NCS. Since the divisor is again 2¹⁴, the set assigned is again the low order 14 bits of the MBN and the tag is the remaining high order bits. Specifically, the assigned set is 11-1100-0011-1100 = 3C3C₁₆ and the tag is 0000-0001-0000-0001 = 101₁₆.

Second Example

The block size is 64B. The cache is 64KB, 2-way set associative. To which set is each of the following 32-bit memory addresses (given in hex) assigned and what are the associated tags?

10000000₁₆
0101F0F0₁₆

Answer. Same 3-step procedure.

10000000₁₆ = 2²⁸.
1. Now the block size is 64=2⁶ Bytes so the MBN = 2²⁸/2⁶ = 2²². You could also do this in two steps: byte number 2²⁸ is word number 2²⁶ and with 16 words/block we have MBN equals 2²⁶/2⁴=2²².
2. NCB = 64KB / (64 Bytes/Block) = 2¹⁰. Set size is 2 so NCS = NCB / 2 = 2⁹.
3. Divide MBN / NCS. 2²² / 2⁹ = 2¹³ remainder 0. So the set assigned is set 0 and the tag is 2¹³.
0101F0F0₁₆
1. MBN = 0101F0F0₁₆ / 2⁶ = 0000-0001-0000-0001-1111-0000-1111-0000 / 2⁶ = 0100-0000-0111-1100-0011.
2. NCS is still 2⁹.
3. Divide MBN / NCS. The set number is the remainder (the low order 9 bits) and the tag is the quotient (the remaining high order bits). Specifically the set number is 1-1100-0011 = 1C3₁₆ and the tag is 10-0000-0011 = 203₁₆.

Homework: Redo the second example just above for a 2MB set size 16 cache with a block size of 64B (these are the sizes of one of the caches on Intel i7 processors). What is the total size of this cache?

O'H-6.4.5 Issues with Writes

Write-through vs write-back
Store-allocate vs no-store-allocate (aka write-allocate vs no-write-allocate)

We have already (briefly) discussed both these choices.

O'H-6.4.6 Anatomy of a Real Cache Hierarchy

The issue of unified vs split I and D caches is covered in 436, Computer Architecture. We are not covering it in 201.

Similarly we are leaving the analysis of multilevel-caches to 436.

O'H-6.4.7 Performance Impact of Cache Parameters

Trade-offs, Trade-offs, and more Trade-offs

Cache Size: Bigger caches have higher hit rates (good), but are slower (remember big implies slow).
Block Size: Bigger blocks help exploit spatial locality, but imply fewer blocks, which can hurt hit rates. Also bigger blocks imply more traffic on misses so can hurt the miss penalty.
Associativity: Increased associativity lowers the miss rate but makes the cache bigger and hence slower.
Write Strategy: Write-through is simpler and lowers the cost of a read miss (a memory write is not needed); write-back generates fewer memory updates.

O'H-6.5 Writing Cache-Friendly Code

For compute-intensive programs with significant run times, the programmer can often speed up execution by making the program cache-friendly, i.e., by increasing locality.

This is a much-studied problem, especially by programmers of numerical algorithms on supercomputers. Often one can reorder operations to improve spatial locality. Specifically (in say C) one can try to reference a 2D matrix by rows (i.e., the second subscript varies faster) rather than by columns.

  double A[1024][2048], sum = 0;
  // by rows
  for (int i=0; i<1024; i++)
      for (int j=0; j<2048; j++)
          sum += A[i][j];

  double A[1024][2048], sum = 0;
  // by columns
  for (int j=0; j<2048; j++)
      for (int i=0; i<1024; i++)
          sum += A[i][j];

For example consider the trivial by-rows loop above and assume a cache block is the size of 8 C doubles and the cache holds 128 blocks. The elements of matrix A are referenced in the order
A[0][0], A[0][1], A[0][2], ... A[0][2047], A[1][0], ..., A[1023][2047]
A[0][0] will be a miss, but (since the elements are stored consecutively), the next 7 references will be to the same block and hence will be hits. This pattern repeats and the hit rate is 7/8.

In contrast the similar by-columns loop references the same elements but in the order
A[0][0], A[1][0], ...
Consecutive references are to far apart memory locations and hence target distinct cache blocks. A column touches 1024 distinct blocks but the cache holds only 128, so a block is normally evicted before it is referenced again and we do not get any hits at all.

Serious Business

Please don't let the above trivial example give you the misimpression that improving cache performance just involves interchanging the order of nested loops. Just do a google search for high performance matrix multiply to get an idea of the serious effort that is involved.

O'H-6.6 Putting it Together: The Impact of Caches on Program Performance

Skipped

O'H-6.7 Summary

The desire for memory to be big and fast (and other properties) meets the reality that memory is either big and slow or small and fast. Real systems contain a multilevel memory hierarchy where higher levels are smaller and faster than lower levels.

Successful designs put the important data in higher levels. Choosing which data to put in higher levels is guided by the locality principles.

In this chapter we emphasized the cache (sram) / main memory (dram) boundary. Chapter 9 will look at the lower boundary between main memory (now considered small and fast) and local disk (big and slow).

Chapter O'H-9 Virtual Memory

O'H-9.1 Physical and Virtual Addressing

Virtual and Physical Addresses

We have been a little casual about memory addresses. When you write a program you view the memory addresses as starting at a fixed location, probably 0. But there are often several programs running at once. They can't all start at 0! In OS we study this topic extensively.

Monoprogramming

Way back when (say 1950s), the picture on the right was representative of computer memory. Each tall box is the memory of the system. Three variants of the OS location are shown, but we can just use the one on the left.

Note that there is only one user program in the system so, we can imagine that it starts at a fixed location (we use zero for convenience).

Using the appropriate technical terms we note that the virtual addresses, i.e., the addresses in the program, are equal to the physical addresses, i.e., the addresses in the actual memory (i.e., the RAM). The virtual address is also called the logical address and the physical address is also called the real address.

Simplest Multiprogramming

The diagram on the right illustrates the memory layout for multiple jobs running on a very early IBM multiprogramming system entitled MFT (multiprogramming with a fixed number of tasks).

When the system was booted (which took a number of minutes) the division of the memory into a few partitions was established. One job at a time was run in each partition, so the diagrammed configuration would permit 3 jobs to be running at once. That is it supported a multiprogramming level of 3.

If we ignore the OS or move it to the top of memory instead of the bottom, we can say that the job in partition 1 has its memory starting in location 0 of the RAM, i.e., its logical addresses (the addresses in the program) are equal to its physical addresses (the addresses in the RAM).

However, for the other partitions, this situation does not hold. For example assume two copies of job J are running, one copy in partition 1 and another copy in partition 2. Since the jobs are the same, all the logical addresses are the same. However, every physical address in partition 2 is greater than every physical address in partition 1.

Specifically, equal logical addresses in the two copies have physical addresses that differ by exactly the size of partition 1.

Multiprogramming Via Swapping

The picture on the right shows a swapping system. Each tall box represents the entire memory at a given point in time. The leftmost box represents boot time when only the OS is resident (blue shading represents free memory). Subsequent boxes represent successively later points in time.

The first snapshot after boot time shows three processes A, B, and C running. Then B finishes and D starts. Note the blue hole where B used to be.

The system needs to run E but each of the two holes is too small. In response the system moves C and D so that E can fit. Then F temporarily preempts C (C is swapped out then swapped back in). Finally D shrinks and E expands.

In summary, not only does each process have its own set of physical addresses, but, even for a given unchanging process, the physical addresses change over time.

However, each process stays consistent, i.e., the physical address space remains contiguous for each process. The processes are not interleaved with each other. When you are seated in a plane that is climbing, your waist stays the same distance below your shoulders.

Simple Paging

Now it gets crazy.

Moving a process is an expensive operation. Part of the cause for this movement is that, in a swapping system, the process must be contiguous in physical memory.

As a remedy the (virtual) memory of the process is divided into fixed size regions called virtual pages and the physical memory is divided into fixed size regions called physical pages. Virtual pages are often called pages and physical pages are often called frames.

All pages are the same size; all frames are the same size; and the page size equals the frame size. So every page fits perfectly in any frame.

The pages are indiscriminately placed in frames without trying to keep consecutive pages in consecutive frames. The mapping from pages to frames is indicated in the diagram by the arrows.

But this can't work! Programs are written under the assumption that, in the absence of branches, consecutive instructions are executed consecutively. In particular, after executing the last instruction in page 4, we should execute the first instruction in page 5. But page 4 is in frame 0 and the last instruction in frame 0 is followed immediately by the first instruction in frame 1, which is the first instruction in page 3.

In summary the program needs to be executed in the order given by its pages, not by its frames.

This is where the page table is used. Before fetching the next instruction or data item, its virtual address is converted into the corresponding physical address as follows. Similar to the procedure with caches, we divide the virtual address by the page size and look at the quotient and remainder. (The former is the page number and the latter the offset in the page.) We look up the page number p# in the page table to find the corresponding entry, called the page table entry or PTE. The PTE contains the associated frame number f#. The offset in the frame is the same as the offset in the page. (Since the page size is always a power of 2, the division is done using a pencil.)

Start Lecture #24

Remark: The cheat sheets for the final are on Brightspace.

Summary of the Historical Development

Uniprogrammed: Unused memory wasted; cpu idle during I/O and between jobs.
Small-scale Multiprogrammed: Rigid; less wasted memory.
Swapping: Better memory utilization, but bad external fragmentation.
Paging: Even better memory utilization, but needs page tables.

O'H-9.2 Address Spaces

We see in the paging example directly above that there are two kinds of addresses: virtual addresses and physical addresses. The set of virtual addresses in the program is called its virtual address space. We also call it the virtual memory of the process. Similarly we have the physical address space (or physical memory), which is composed of all the physical addresses. These two address spaces will help answer some previously raised questions.

In the diagram showing the layout of a process in memory, the read-only segment started at address 0x400000 and the stack started at 2⁴⁸-1 for every process. Why won't the processes clobber each other?
Short answer: All the processes have their stacks begin at virtual address 2⁴⁸-1. Each process has its own page table and hence the same virtual address can be mapped to different physical addresses when different processes run.
With the Intel x86-64 instruction set, addresses are 64-bit quantities, which means each process can address 2⁶⁴=18,446,744,073,709,551,616 bytes. But no computer I know of has (close to) that much memory.
Short answer: That big number is the size of the virtual address space. The physical address assigned to a virtual address will be much smaller and will be no bigger than the amount of real (DRAM) memory on the computer.
It is possible to run simultaneously five 10GB programs on a 32GB computer. How can all five fit?
Very short answer: The five programs will require a total of 50GB of virtual memory (5 chunks of 10GB each). At any given time, most of these virtual addresses will not have any corresponding physical addresses.

And Now the Bad News

The program contains virtual addresses. The real memory is accessed via physical addresses. Hence we must convert (on the fly) each virtual address to a physical address. This requires separating the page# from the offset, which is easy and fast, and reading the page table, which is easy but slow. We are essentially saying that each memory access in the program requires two accesses, first the page table and then the memory itself.

First access the page table to convert the virtual address to its corresponding physical address.
Then access the physical address itself.

We must eliminate this two-to-one slowdown, or the whole idea is doomed.

The MMU (Memory Management Unit)

Translating each virtual address into the corresponding physical address is the job of the Memory Management Unit or MMU. So far this looks to be a simple task.

Divide the Virtual Address by the pagesize (using a pencil). The quotient is the virtual page number (VPN) and the remainder is the offset.
Index the page table by the virtual page number and get the physical page number (PPN).
The required physical address is the physical page number concatenated with the offset as shown on the right.

As noted this simplicity belies the performance penalty of converting each memory reference into two references. We will fix this with a caching-like approach, specifically we shall cache the page table in a structure called a TLB (translation lookaside buffer).

In addition we shall make the scheme more complicated in order to permit modern computers to concurrently execute programs whose total memory requirement exceeds the memory present on the system.

Demand Paging

In section 9.1 we have seen a series of historical advances enabling computer systems to fit more and more jobs in memory simultaneously.

Modern systems have gone a step beyond the simple paging scheme previously described and use instead demand paging in which it is no longer true that the entire program is in memory all the time it runs. Instead all the program's virtual pages are on disk. Only some pages are, in addition, in physical pages as in the figures above. For other pages the page table simply records that the virtual page is not resident in memory, i.e., there is no physical page containing this virtual page.

A program reference to a non-resident virtual page is called a page fault and triggers much activity. Specifically, an unused physical page must be found (often by evicting the virtual page currently residing in this physical page) and the referenced virtual page must be read from the disk into this newly available physical page.

If the above sounds similar to caching, you are right! For caching, the SRAM acts as a small/fast cache of the big/slow DRAM. For the demand paging scheme just described the DRAM acts as a small/fast cache of the big/slow disk.

How Big Should a Page Be (Granularity)?

This question comes up in caching as well (how big should a cache block be?).

Because of the differing speed characteristics of disks and RAM, the typical page size is a few thousand bytes (4K and 8K are common) instead of the tens of bytes common for cache blocks.

O'H-9.3 VM as a Tool for Caching

Skipped.

O'H-9.4 VM as a Tool for Memory Management

The separation of Virtual Memory from Physical Memory offers several advantages.

Simple view of (virtual) memory: Each process sees the virtual memory as its own, with no other process involved. The virtual memory of a process is a large contiguous range (i.e., there are no holes).
Protection: Since separate processes do not normally share memory, normally one process cannot alter the memory of another process.
Easy sharing when desired: Sometimes it is desired to have processes share code and read-only data (e.g., printf() and other routines in shared libraries). The routines can be at different virtual addresses in different processes. The only requirement is that the physical addresses are the same in the different processes. This is accomplished by having them linked at different addresses in the different processes but having the page table entries for these virtual pages be equal. In the diagram on the right printf() has the pink virtual pages in P1 and the green virtual pages in P2. But these refer to the same physical pages so there is only one copy of printf() in the memory.
Efficient usage of (physical) memory: By not assigning physical memory to the entire process, parts of the process not currently in use are present only in the slow/big/cheap disk, not in the fast/small/expensive DRAM.
Recall the memory layout for a process. Certain sections started at fixed locations for all processes, but, for example, not all stacks can begin at the same real address. The addresses given in the diagram are virtual addresses and the real addresses are different for each process, which is accomplished by the page tables for the different processes having different entries for those virtual addresses.

Multiprocessing: Running Multiple Processes Simultaneously

Our treatment emphasizes one process running on one CPU. Today, even simple systems have multiple processors running many processes. We will ignore multiple processors in this course, but must acknowledge multiple processes. In 202 Operating Systems we consider multiple processes and multiple processors in more detail.

One change needed is that my pictures indicate one page table, which maps virtual page numbers to physical page numbers. In reality there is a separate page table for each process. For 201, we simply note that when the OS switches from running one process to running another it must determine the location of the new process's page table. Our diagrams ignore this detail.

O'H-9.5 VM as a Tool For Memory Protection

Since the page table is read for each memory reference, we can use it to hold protection information. (A similar technique is used to have some files belonging to one user not writable by other users.)

The diagram on the right shows the first three entries in the page tables for two processes i and j with three permission bits per virtual page.

Sup: supervisor mode required. Must the process be in supervisor mode to access this page? We will not emphasize the Sup protection bit in 201; supervisor mode will be discussed in 202 Operating Systems.
Rd: can the process read this page?
Wr: can the process write this page?

We see that each process can read each of its first three virtual pages.

Process j cannot write physical page 9; perhaps its virtual page 0 contains text or read-only data.

Most interesting is physical page 6, which is shared between the two processes. Each process can read the physical page, but using different virtual addresses in the two processes: Physical page 6 is virtual page 0 in process i, but it is virtual page 1 in process j. Process j can write the shared page, but process i cannot. Perhaps this page contains data produced by process j and consumed by process i, the so called producer-consumer problem, which you will study in 202.

Note that the diagram is definitely not drawn to scale. Each blue box is a physical page, which is a few thousand bytes in size. The protection flags are a single bit each. The physical page number boxes are pointers to physical memory; these pointers need log₂(N) bits where N is the maximum possible number of physical pages. Since log₂(N) will probably not exceed 64 in our lifetime, 8 bytes is a good estimate for the size of these boxes.

O'H-9.6 Address Translation

Page Hits and Page Faults

Recall that in section 9.2 when describing demand paging we wrote that all the program's virtual pages are on disk and that only some pages are, in addition, in physical pages.

To distinguish those virtual pages that have corresponding physical pages from those that do not, we define another bit stored in each PTE, the valid bit. When the valid bit is true (the good case) there is a copy of the virtual page in memory and the PTE contains the number of that physical page, as in the example just above in section 9.5. Copying the terminology from caches, we call this case a page hit.

If there is no physical page for a given virtual page, we have the bad case, commonly called a page fault (or a page miss). In this case the physical page number field contains junk.

The Good Case: A Page Hit

The diagram on the right illustrates the steps performed in the good case. Although the terminology used is most appropriate for a load, the steps are very similar for a store as well.

When a process runs, the MMU references that process's page table, which we treat as a simple 1-dimensional array of PTEs.

Loading a data item in the good case then involves the 5 steps shown on the right.

First the CPU sends the desired virtual address to the MMU (the CPU doesn't deal with physical addresses).
The MMU extracts the virtual page number and offset. It uses the physical address of the page table together with the virtual page number to calculate the physical address of the relevant PTE. This physical address is sent to the memory (i.e. the memory hierarchy).
The memory responds with the contents of the requested PTE.
The MMU extracts the physical page number from the PTE and combines it with the offset calculated in 2 to obtain the physical address of the virtual address sent in 1. This physical address is sent to the memory.
The memory responds to the request: it returns the requested value if the operation was a load and stores a value if the operation was a store.

The Bad Case: A Page Fault

The full story for the bad case involves the operating system and we cover it in more detail in 202. The first three steps from the previous diagram remain intact. Steps 4 and 5 differ.

If the virtual address is invalid (for example outside the range of virtual addresses for this process), the OS kills the process.

If the virtual address is valid, but there is no physical address (the valid bit is off), a page fault has occurred and again the OS (and 202) are involved. Very briefly (and grossly over-simplified) an existing virtual page is evicted from its physical page and this now-empty physical page is used to hold the requested virtual page.

O'H-9.6.1 Integrating Caches and VM

Virtual Address Caches vs Physical Address Caches

Now that we understand the difference between virtual and physical addresses, we can discuss the trade-off between caching based on each.

Advantage of Virtual Addressed Caches: Speed

An address from the program itself is the virtual address; the system then translates it to the physical address using the page table, as described above. Thus, with a virtual address based cache, the cache lookup can begin right away; whereas, with a physical address based cache, the cache lookup must be delayed until the translation to physical address has completed.

Advantage of Physical Addressed Caches: No Aliasing

Many concurrently running processes will have the same virtual addresses (for example, all processes have their stacks starting at the same virtual address). However, all these virtual addresses map to different physical addresses and represent parts of different programs. Hence they must be cached separately. But with a straightforward virtual address cache, all the virtual addresses for the base of the stack would be assigned to the same cache slot. Instead, the virtual address caching scheme adds complexity to the cache hardware to distinguish identical virtual addresses issued by different processes.

O'H-9.6.A Two Remaining Challenges

We have two remaining problems to solve.

Speed: Referring to our diagram at the beginning of 9.6 we see that when the processor issues one request (the blue circled 1), the MMU generates two requests to the memory hierarchy (2 and 4). Doubling the work must be too slow.
Size: Page tables can be very big; and we have one per process!

O'H-9.6.2 Speeding Up Address Translation with a TLB

Modern systems dedicate a small virtually addressed cache, called the translation lookaside buffer (TLB), to hold a small number of PTEs. This cache is very fast because it is small and located within the MMU (hence on the CPU chip itself). Thus, when the PTE is found in the TLB (a TLB hit), transmissions numbered 2 and 3 in the above diagram are avoided.

TLB misses (fortunately a rare occurrence) proceed as above: all 5 steps are performed.

Often the TLB has a high degree of associativity, which improves its hit rate.

Start Lecture #25

Remark: A practice final is on Brightspace (resources tab). IMPORTANT: The practice final only covers the material since the midterm. Don't be misled. The real final will be cumulative, i.e. will cover all the material in the course.

Remark: Two cheat sheets are in the resources tab of Brightspace: one has some C library names; the other is for assembler. Both may be used on the final.

O'H-9.6.3 Multi-Level Page Tables

At first glance the size of a page table does not seem significant: For each physical page, we need 1 PTE. Since each physical page is several KB and each PTE is just several bytes, the latter is merely about 0.1% overhead.

The problem is that, as described so far, the entire page table must be stored in physical memory, even though the vast majority of the entries indicate that the corresponding virtual page is invalid (and hence there is no corresponding physical page).

For the Intel architecture we have used, the virtual address space (see this previous diagram) starts at 0x400000 and ends at 2⁴⁸-1. Since the machine is byte addressable, the virtual size of the process includes nearly 2⁴⁸ bytes.

If the page table were just 0.1% of this number, it would require physical memory (the kind you must buy) of about 2⁴⁸⁻¹⁰ = 2³⁸ bytes or 256GB for each process!

Recall that most of the virtual space in the process diagram is in gray, i.e., is unused.

A Better Way

I hope the diagram on the right is helpful. Everything in the diagram is in physical memory. For example, the label virt page 96 abbreviates the physical page containing virtual page 96.

Consider a process with exactly 100=10² virtual pages. Instead of defining a single page table with 100 entries, we partition those 100 entries into 10 groups of 10.

For the moment imagine all 10 of these tables exist: The first table points to the first 10 pages (0-9). The second table points to pages 10-19, etc. These are called level 2 tables. We then need a level 1 table pointing to the level 2 tables. The level 1 table has a blue border in the diagram.

Now look for the physical page corresponding to virtual page 11. Yesterday we would have referenced the single page table, selected its 11th (or 12th) entry, and followed the pointer to the physical page containing virtual page 11.

Today it is harder. We first go to the blue (i.e. level 1) table and follow entry number 1 (because 11 starts with 1), which takes us to the correct level 2 table. From there we follow entry number 1 (because 11 ends with 1) and arrive finally at the physical page corresponding to virtual page 11.

I suggest you try to find the physical page containing virtual page 57.

Sizes

Pages are commonly 4KB or 8KB as are level 2 tables; let's say each is 8KB to be definite. Each PTE is typically 8B so each level 2 table contains 1K=1024 PTEs. These 1K PTEs refer to 1K pages or 8MB of the process's virtual address space.

Why is Multilevel Better?

The big advantage occurs because most virtual pages are not in physical memory. For example, none of the virtual pages 20-29 have physical pages. Hence slot 2 in the blue table is null and there is no corresponding level 2 table.

Even including the overhead of the blue table, the diagram shows an improvement: There are only 40 total table entries; whereas yesterday's simple page table would have had 100 entries.

The advantage is greater for bigger examples and especially when you consider 3- and 4-level tables.

Additionally, note that only the level 1 table needs to be permanently memory resident. The level 2 tables can be created as needed and can be paged in and out as needed.

O'H-9.6.4 Putting it Together: End to End Address Translation

The book does a lengthy example including caches and a TLB. We will concentrate on just the paging aspect.

Each reference is to a byte: there are no alignment constraints.
The page size is 64B: The virtual page offset is 6 bits; the low order 6 bits of a virtual address give the virtual page offset. The physical page offset is also 6 bits; the low order 6 bits of a physical address give the physical page offset.
Each virtual address is 14 bits wide: The virtual address space is 16KB. The high order 14-6=8 bits of the virtual address are the virtual page number. Hence there are up to 256 virtual pages in a process and 256 entries in its page table.
Each physical address is 12 bits wide: The physical address space is 4KB. The high order 12-6=6 bits of the physical address are the physical page number. Hence a process contains up to 64 physical pages each 64B in size.

Contents of a Page Table Entry (PTE)

Page Table
VPN	PPN	Valid
00	28	1
01	-	0
02	33	1
03	02	1
04	-	0
05	16	1
06	-	0
07	-	0
08	13	1
09	17	1
0A	09	1
0B	-	0
0C	-	0
0D	2D	1
0E	11	1
0F	0D	1

The page table contains one PTE for each virtual page. A PTE contains several components; for simplicity we consider only the valid bit and the PPN.

Valid: A 1-bit field. The bit is on if there is a physical page associated with this virtual page.
Physical Page Number PPN: If the valid bit is off, the PPN is undefined. Otherwise it is the number of the physical page assigned to this virtual page.
Protection: A few bits. We considered 3 protection bits in section O'H-9.5. For simplicity we ignore protection in this example.

Doing a (Virtual to Physical) Address Translation

For simplicity we assume a 1-level page table. Since there are 256 virtual pages, the table has 256 rows. On the right, we show only the first 16 rows (0-15 in decimal). Speaking in hex, we show the first 10₁₆ rows (00₁₆-0F₁₆).

Question: What are the physical address, the PPN, and the PPO for the virtual address 0234₁₆?
Answer: 0234₁₆ = 00_0010_0011_0100₂. The low order 6 bits of the address are the VPO and the high order 8 bits are the VPN. (Throughout this example binary page numbers are red and binary page offsets are green). So VPO = 11_0100₂ = 34₁₆ = 52.
Similarly, VPN = 0000_1000₂ = 08₁₆ = 8. As always PPO = VPO = 11_0100₂ = 34₁₆ = 52.
The page table tells us that virtual page number 08₁₆ is valid and can be found in physical page 13₁₆ = 01_0011₂. Hence the physical address is 01_0011_11_0100₂ = 0100_1111_0100₂ = 4F4₁₆.

O'H-9.7 Case Study: The Intel Core i7/Linux Memory System

Skipped

O'H-9.8 Memory Mapping

Skipped

O'H-9.9 Dynamic Memory Allocation

For many programs, the (maximum) size of each data structure is known to the programmer at the time the program is written. That is the easy case. Sometimes (in C) we use #define to make explicit these maximums and thereby ease the burden of changing the maximums when needed.

O'H-9.9.1 The malloc() and free() Functions

Other times we want to let the running program itself determine the amount of memory used. For example we have malloc() and its variants in C and new in Java.

Moreover, sometimes we want to return no-longer-needed memory prior to program termination. This is free() in C. What about Java?

The Heap

The malloc()/free() team deals with allocating virtual memory from (and returning virtual memory to) a region that grows and shrinks dynamically called the heap (see this previous diagram).

In the diagram on the right (which is from the book) we see malloc()/free() in action. Each small square represents 4B (the size of an int) and we will ensure alignment on an 8B boundary. Initially, malloc's internal pointer P points to the beginning of the heap, which we assume is properly aligned. You should imagine the diagram extending to the right with many, many, many more white boxes.

The first malloc() is for 4*sizeof(int)=16B or 4 boxes. malloc() returns p1 and sets an internal pointer to p2. The light blue block has been given to the user.
The next request is for 20B (5 boxes), which would leave p3 improperly aligned. So malloc() wastes 4B (in darker green) and gives the user p2, which points to 24 available bytes.
The user then requests 24B and is given the pink p3.
The user then frees p2. This gives malloc back both the light and dark greens.
Finally, the user asks for 8B and gets the yellow p4.
If the last request had been for 32B (i.e., 8 boxes), p4 would have been after the pink region pointed to by p3.

Note that achieving the above semantics is not trivial. That is, malloc() and free() are not trivial programs like the alloc()/afree() pair we did earlier in the semester. In particular, the alloc()/afree() pair required that the user could afree() only the most recent block obtained from alloc(). The alloc()/afree() scheme for obtaining and returning blocks required a stack-like LIFO ordering.

O'H-9.9.2 Why Dynamic Memory Allocation

When the size of a data structure depends on a parameter known only at run time, the best / most natural response is to first read (or compute) the parameter and pass it to malloc() to obtain the right size data structure.

Also you may wish to return some of the memory before termination.

O'H-9.9.3 Allocator Requirements and Goals

Random ordering: As mentioned, allocation/freeing would be much easier if we require a stack-like ordering—but we do not.
Use only the heap: Scalars can come from the stack, but nonscalar objects must come from the heap itself (technical; we can ignore this).
Alignment: Allocated blocks must have maximum alignment.
Cannot modify/move allocated blocks: User needs non-varying addresses. So compaction is not possible.
Maximize throughput: Easy solution is to make free() a nop and never reuse memory. But this wastes memory.
Maximize memory utilization: Minimize any memory overhead, i.e., memory used but not specifically requested.

O'H-9.9.4 Fragmentation

Internal Fragmentation

This is wasted space within an allocated region. One example was the padding used to maintain alignment. Some allocators only dispense blocks of certain sizes (e.g. buddy system and powers of 2).

  p1 = malloc(100);
  p2 = malloc(100);
  p3 = malloc(100);
  free(p1);
  free(p3);
  p4 = malloc(200);

External Fragmentation

This is wasted space outside any allocated region. It occurs when there is enough free memory but not in one piece.

For example, consider the code on the right and assume that after the third malloc(), no space remained. Then the two end blocks are freed giving a total of 200B free, but split in two 100B pieces. Hence the fourth malloc() fails.

This is a difficult problem whose occurrence is impossible to predict since it depends on knowing the future. Common heuristics employed are to keep free memory in a few large pieces rather than a large number of small pieces.

O'H-9.9.5 Implementation issues

Free block organization: Keeping track of free blocks.
Placement: Which free block to use.
Splitting: After using part of a free block, what to do with the rest.
Coalescing: What to do with a newly freed block.

O'H-9.9.6 Implicit Free Lists

Format of a Block

The format of a block on an implicit free list is shown on the right. It consists of a header followed by the payload and any padding needed to ensure proper alignment. The header contains the size of the block (including the header and any padding) as well as a flag indicating whether the block is free or has been allocated. The block size can be thought of as a pointer to the next block and that is how we show it in the diagrams below.

A Typical Implicit Free List

The name implicit free list is a little funny. It is really a list of all blocks, both free and allocated. All blocks have a header, which contains the block length and status (free or allocated).

In the diagram that follows, we have four (green) free blocks and one (red) allocated block.

The block given to the user includes the payload and any possible padding. It does not include the size and allocated bit. Indeed, the user must not alter that word.

Note that with an implicit list we must also keep a color bit with each block stating if the block is free.

Recall that the green portion includes any padding needed for alignment (or other) purpose.

Alternatives

Explicit list: Have a separate list for just free blocks. Since all the blocks on the list are free, no color bit is needed, but I colored the blocks anyway, for clarity.
Segregated list: Separate lists for free blocks of different sizes.

O'H-9.9.7 Placing Allocated Blocks (Really Finding a Free Block)

Three methods have been considered.

First fit. Start at the beginning of the free list, choose the first block big enough, and return the leftover portion.
Next fit. Like first fit, but start where you left off last time.
Best fit. Search the free list and find the best fitting block (i.e., the smallest one that is big enough).

O'H-9.9.8 Splitting Free Blocks

Often the free block chosen by the given algorithm is bigger than the user requested. There are two possible continuations.

Give the user the full block found. This is fast and easy, but gives internal fragmentation.
Give the user her requested amount and put the remainder back on the free list. On the right is the result of satisfying a request for a block of size one for the previous implicit free list. First fit was used so the first block was split.

O'H-9.9.9 Getting Additional Heap Memory

What happens if malloc() cannot satisfy the current request for a free block of size 5? If the state is as in the first diagram, the only free blocks are of size 4,2,4,2 and we cannot satisfy the request. In this case, the next section will show us how to coalesce the first two free blocks into one of size 7. But if the request was for a size 20 free block, we would fail: there simply isn't enough space available.

On the right is a familiar diagram showing the (virtual) memory allocated for a running process. Note that the green heap section has an arrow indicating that it can grow.

What happens is that malloc() executes an sbrk() system call and poof the black line moves up and the heap gets bigger. (You will learn more about system calls in 202).

O'H-9.9.10 Coalescing (Adjacent) Free Blocks

Freeing a Block

The simplest implementation is to just change the color of the returned block from red to green. But there is a problem. The previous free list diagram ends with two adjacent green blocks of size 4 and 2, but we cannot satisfy a request for a size 5 block. This is called false fragmentation. Fortunately, a simple coalescing of the last two blocks gives a size 7 free block.

So a linear traversal of the free blocks would enable us to coalesce adjacent free blocks. In an implicit free list we would have to traverse the entire (i.e., free plus allocated) list.

Immediate Coalescing

Instead of scanning the list to find blocks to coalesce, we might want to coalesce when we free a block. Since the block header is not given to the user, it still points to the next block. If the latter is free, we can easily coalesce. Therefore, if blocks are freed in reverse order, then checking the successor will accomplish all possible coalescing.

O'H-9.9.11 Coalescing With Boundary Tags

As just noted, when a block is freed, free() can check if the successor is free and, if so, coalesce. The boundary tag method of Knuth extends this to the predecessor as well. The difficulty was that although the block's header points to the successor, there is no pointer to the predecessor. The boundary tag method adds a footer at the end of the block that contains a reverse pointer (making the list doubly linked).

Assume that the first red block is freed by the user and is merged with the first green block to give a free (i.e., green) block containing four boxes. If we were using boundary tags (i.e., doubly linking the list), the implicit free list would become the one shown in the boundary-tag diagram. Remember that the allocated/free bit, which I color-code red/green, is actually stored in both the header and footer.

The Four Cases

The block being freed has two neighbors and, assuming an implicit free list, each can be allocated or free. This gives the four cases shown on the upper right.

In all four cases the middle (white) block becomes green and is then merged with any adjacent green block, which can be located above, below or both.

The four possible results are shown in the lower diagram. The lengths in the header and footer must be updated to reflect any merges that occur.

O'H-9.9.12 Putting it Together: Implementing a Simple Allocator

Largely skipped; just a few comments about an implicit list allocator.

Allocating a block can take time linear in the number of blocks on the list (perhaps only the last block is big enough). This is bad.
Freeing a block is fast (constant time), even with coalescing.
Memory utilization depends on first-fit, best-fit, next-fit.

O'H-9.9.13 Explicit Free Lists

Since the free blocks have their own list, it is faster (but still linear-time) to find a big enough free block.
Must store in the free block a pointer to next block on the free list, and normally also a pointer to the previous one. This can usually be written in the body of the free block since no user is using the block.
When a block is freed, it can be inserted at the front of the list (fast) or the list can be kept in address order and coalesced with neighbors (better memory utilization).

O'H-9.9.14 Segregated Free Lists

Have multiple free lists each holding free blocks of roughly the same size.

To allocate a block, first try the free list containing blocks of the appropriate size. If there is no block available in this list, either try a list of bigger size blocks or go to the OS for more memory blocks of this size.
To free a block (optionally try to coalesce and) insert on appropriate list.
The buddy system uses blocksizes that are powers of 2. It is very fast but can have significant internal fragmentation.

O'H-9.10 Garbage Collection

As mentioned previously, Java has an analogue of malloc(), namely new, but has no analogue of free(). What is going on?

Java systems (and some others) automatically determine when dynamically allocated memory can no longer be referenced by the user's program (such memory is called garbage). The system then automatically frees such memory.

This procedure is called garbage collection and is a serious subject that we will not treat in depth.

O'H-9.11 Common Memory-Related Bugs in C Programs

Skipped.

O'H-9.12 Summary

Skipped.

Start Lecture #26

Chapter O'H-8 Exceptional Control Flow

The normal (and simplest) instruction ordering is one after the other in order of their address. Now we consider more complex changes to the basic sequential control flow. Much of this involves the OS and you will see more in 202.

Some alterations of control flow are familiar and do not involve the OS, e.g., jump, call, return.

"Exceptional control flow" occurs in reaction to changes in system state: e.g., keyboard Ctrl-C, divide-by-zero.

User-level Programs Interacting with the OS

The user program can explicitly call the OS (system calls). For example, this is done to read data from a file. Since I/O is slow, the issuing process is blocked. An exception occurs when the data has been read and then the issuing process is awakened. Another example is one program executing or killing another.
The user program can generate an exception (to be handled by the OS), such as dividing by zero.
A user program can be interrupted or preempted by the OS so that the latter can switch to another user program.

O'H-8.1 Exceptions

Exceptions are a form of exceptional control flow that is implemented partly in hardware and partly in software (typically in the OS).

O'H-8.1.1 Exception Handling

Well prior to the exception occurring, the OS sets up a jump table. This is similar to the jump table we saw when implementing a switch statement. Each exception type has a number and the system branches into the jump table indexed by that number and from there branches to the handler for that exception. Again this is similar to our implementation of the switch statement in the assembler section of the course.

The entries are sometimes called interrupt vectors. The table is established during OS boot time.

The memory used for the table is not accessible to user programs, which is important since the jump table is executed in supervisor (privileged) mode.

O'H-8.1.2 Classes of Exceptions

Synchronous events, i.e. events caused by executing an instruction.
- Intentional (Traps): This is how system calls get into the OS.
- Unintentional but recoverable (Faults): For example page faults.
- Unintentional and unrecoverable (Aborts): For example, a parity error in memory.
Asynchronous events (interrupts). These are not directly caused by executing a specific instruction. They are generally caused by an event external to the CPU.
- Typing Ctrl-C at the keyboard on Linux.
- A packet arrives on the network.
- A data block arrives from a disk in response to a pending read system call.

O'H-8.1.A Examples

Opening a file.
To open a file, the OS open() system call is used. The user's program calls an unprivileged library routine fopen(), which executes an unprivileged assembly language instruction (sometimes called trap() or int) passing the number of the open system call. The instruction shifts the processor to privileged mode and uses the jump table discussed above to pass control to the OS code for open. After the file is opened, control is returned to the instruction immediately following the trap/int and the system leaves supervisor mode. (Recall that the jump table is established by the OS at boot time and is not directly accessible to user programs.)
Processing a page fault.
When the user program generates a page fault, the operating system is executed to process the generated exception. The OS loads the required page and returns to the user's program, which re-executes the instruction (this time without a page fault).
Referencing an Invalid Address (e.g., de-referencing an invalid pointer).
- Starts as a page fault.
- The OS detects an invalid address and sends a SIGSEGV signal to the process.
- The process exits with a segmentation fault.

O'H-8.2 Processes

Note: We are drifting into OS from CSO. I cover this material in much more detail when teaching 202.

A process is a program in execution.

Note: Process is a software concept. Do not confuse it with the hardware concept processor.

O'H-8.2.A Abstractions Provided by the OS

The OS works hard to provide the following illusions to each running process.

Each (non-threaded, ignore this) process can (more or less) assume it has exclusive use of the CPU and of memory. This illusion is courtesy of process scheduling and context switching.
- Process A runs (user mode)
- OS runs (the scheduler)
- Process A runs
- OS runs (the scheduler)
- Process C runs
- Etc.
Each (non-threaded) process appears to have exclusive use of the main memory on the system. This illusion is courtesy of virtual to physical address translation.

O'H-8.3 System Call Error Handling

Skipped.

O'H-8.4 Process Control

O'H-8.4.A The Four Stars of Process Control

fork(): This system call generates a (very nearly identical) clone of the current process. The new process is a child of the original process.
exit(): The current process is terminated. It enters a zombie state until its parent reaps it.
wait()/waitpid(): Wait for and reap zombie children.
exec()/execve(): Overwrite current program with a new one.

To show how the four stars enable much of process management, consider the following highly simplified shell (the Unix command interpreter).

  while (true)
     display_prompt()
     read_command(command)
     if (fork() != 0)
        waitpid(...) <--Omit this line; get a background job
     else
        execve(command)
     endif
  endwhile

The fork() system call duplicates the process. That is, we now have a second process, which is a child of the process that actually executed the fork(). The parent and child are very, VERY nearly identical. For example they have the same instructions, they have the same data, and they are both currently executing the fork() system call.

But there is a difference!

The fork() system call returns a zero in the child process and returns a positive integer in the parent. In fact the value returned to the parent is the PID (process ID) of the child.

Thus, the parent and child execute different branches of the if-then-else in the code above.

Removing the waitpid(...) lets the child run in the background while the parent (the shell) can start another job.

Remark: Next class I shall go over the practice final. I suggest you work on it before then. The last class will be devoted to answering any questions you have.

Start Lecture #27

Remarks

Labs submitted on linserv1 vs Brightspace:
We need labs on Brightspace. Graders cannot read your files on linserv1. If your lab is ON linserv1 and NOT ON Brightspace:
1. Copy it to NYU Brightspace.
2. State in textbox on Brightspace what the linserv1 dates are.
Final exam time arrangement will be like midterm, but you have 1:50 not 1:15.

Review of practice final

Start Lecture #28

General question answer session

Redo 8.4.A with ipad to draw diagram dynamically.

Remark: End of material eligible for the CSO final exam.

Measuring Cache Performance

Clocks, Cycles, Rates, and Times

A clock on a computer is an electronic signal. If you plot a clock with the horizontal axis time and the vertical axis voltage, the result is a square wave as shown on the right.

A cycle is the period of the square wave generated by the clock.

You can think of the computer doing one instruction during one cycle. That is not correct: The truth is that instructions take several cycles but they are pipelined so, in the ideal case, one instruction finishes each clock cycle.

We shall assume the clock is a perfect square wave with all periods equal.

Note: I added interludes because I realize that CS students have little experience in these performance calculations.

Interlude on Solving Rate and Time Equations

A cycle (or clock cycle) is the time for the clock to go from one transition from low to high voltage to the next such transition.

Hertz means CPS, i.e., cycles per second. It is a rate (like MPH) not a time. So a clock rate of 50 Hz means 50 cycles per second, which also means 50 cycles = 1 second or 1 cycle = 1/50 second.

Continuing with the same example, 1 cycle = 2*10⁻² sec = 20*10⁻³ sec = 20ms = 20,000*10⁻⁶ sec = 20,000us = 20*10⁶*10⁻⁹ sec = 20 million ns.

KHz is kilohertz = 1,000Hz. MHz is megahertz = 10⁶Hz; GHz is gigahertz = 10⁹Hz.

Question: Which takes longer 1GHz or 10MHz?
Answer: Nonsense! Hz is a rate NOT a time.

Question: Which takes longer one cycle at 1GHz or one cycle at 10MHz?
Answer: 1GHz means 10⁹ cycles = 1 sec; so 1 cycle = 10⁻⁹ sec. 10MHz means 10*10⁶ cycles = 1 sec; so 1 cycle = 0.1*10⁻⁶ sec = 10⁻⁷ sec. So a cycle at 10MHz takes 100 times as long as a cycle at 1GHz.

Question: At a rate of 2GHz, how long is one cycle?
Answer: 2GHz means 2*10⁹ cycles = 1 second. Hence 1 cycle = 0.5*10⁻⁹ sec = 0.5ns.

Question: What megahertz clock has a 300ns cycle time?
Answer: 300ns cycle time means 1 cycle = 300ns = 300*10⁻⁹ sec. So 1 sec = (1/300)*10⁹ cycles = (10/3)*10⁶ cycles, i.e., the clock rate is about 3.33MHz.

Interlude on Averages Given Base Plus Extra

Question: In 2/5 of the cases X=A, in 3/5 of the cases X=B. What is the average X?
Answer: (2/5)A + (3/5)B = (2A+3B)/5.

Question: In 30% of the cases X=A; in the rest, X=B. What is the average X?
Answer: Average X =(30/100)A + (70/100)B

Question: In p% of the cases X=A; in the rest X=B. What is the average X?
Answer: Average X = (p/100)A + ((100-p)/100)B

Question: In p% of the cases X=A+E; in the rest X=A (E stands for extra). What is the average X?
Answer: Average X = (p/100)(A+E) + ((100-p)/100)A = (p/100)A + (p/100)E + ((100-p)/100)A = (100/100)A + (p/100)E = A + (p/100)E

Question: Base cost is 13; 17% of the cases have an extra cost of E. What is the average cost?
Answer: Average cost = 13 + (17/100)E = 13 + .17E

CPI means (average number of) cycles per instruction. Base CPI means the CPI when all references hit in the cache.

Question: Base CPI = 13; 17% of refs miss cache with a penalty of 8 cycles. What is the overall CPI?
Answer: Overall CPI = 13 + .17(8) = 14.36 cycles.

Instruction and Data Caches

Modern processors have several caches. We shall study just two, the instruction cache and the data cache, normally called the I-Cache and D-Cache.

Every instruction that the computer executes has to be fetched from memory and the I-Cache is used for such references. So the I-cache is accessed once for every instruction.

In contrast only some instructions access the memory for data. The most common instructions making such accesses are the load and store instructions. For example the C assignment statement
y = x + 1;
generates a load to fetch the value of x and a store to update the value of y. There is also an add that does not reference memory. The diagram on the right shows all the possibilities. If both caches have a miss, the misses are processed one at a time because there is only one central memory.

An Example

We assume separate instruction and data caches.

Do the following performance example on the board. It would be an appropriate final exam question.

Assume
- 5% I-cache misses.
- 10% D-cache misses.
- 1/3 of the instructions access data.
- The CPI = 4 if there are no cache misses. This (unrealistic) situation of no cache misses is sometimes called the base CPI.
What is the CPI if the miss penalty is 12 cycles?
What is the CPI if we upgrade to a double speed cpu+cache, but keep a single speed memory (i.e., a 24 cycle miss penalty)?
How much faster is the double speed machine? It would be double speed if there was a 0% miss rate.

A lower base (i.e. miss-free) CPI makes misses appear more expensive since waiting a fixed number of cycles for the memory corresponds to losing more instructions if the CPI is lower.

A faster CPU (i.e., a faster clock) makes misses appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).

A Second Example

Assume
1. I-cache miss rate 3%.
2. D-cache miss rate 5%.
3. 40% of instructions reference data.
4. miss penalty of 50 cycles.
5. Base CPI is 2.
What is the CPI including the misses?
How much slower is the machine when misses are taken into account?
Redo the above if the I-miss penalty is reduced to 10 (D-miss still 50).
If the clock rate is 100MHz, how many instructions are executed per second?
With I-miss penalty back to 50, what is the performance if the CPU (and the caches) are 100 times faster?

Homework: Consider a system that has a miss-free CPI of 2, a D-cache miss rate of 5%, an I-cache miss rate of 2%, has 1/3 of the instructions referencing memory, and has a memory that gives a miss penalty of 20 cycles.

What is the CPI?
If the clock rate is 1.5GHz, how many instructions per second does the system execute?
What would be the CPI if the memory was double speed, but the CPU+caches remained the same as the original?
What would be the CPI if the memory remained the same as the original but the CPU+cache were double speed?
How fast would the CPU+cache have to be so that the system, with the original memory, was twice as fast as the original?

Note: Larger caches typically have higher hit rates but longer hit times.

Cache Review

Do Another CPI Example

Material on Cache Size and Number of Bits from Lecture #23

Reviewed caches again and answered students' questions.

As requested I wrote out another example. Here it is.

Extra: One More Problem

At the end of the last class I was asked to do another problem with sizes, in particular one that finds which address bits are the tag and which are the cache index.

Extra.1: Basic Assumptions

In this class we will always make the following assumptions with regard to caches.

All addresses are 32 bits.
The machine is byte addressable, so the 32-bit address specifies a byte.
All references from the process to memory (and hence to the cache) are for a single 4-byte word.

One conclusion is that the low-order (i.e., the rightmost) two bits of the 32-bit address specify the byte in the word and hence are not used by the cache (which always supplies the entire word).

Extra.2: The Cache in the Example

We will use the following cache.

The size (not the total size) is 256KB.
The blocksize is 8 words.
The set size is 2 (i.e., we have a 2-way set associative cache).

Extra.3: The General Procedure For Assigning the Address Bits

I use a three step procedure.

Find which bits of a 32-bit memory address give the MBN, or Memory Block Number.
Find the number of sets in the cache NCS.
Divide MBN by NCS and look at the quotient Q and remainder R.

Extra.4: The Problem to be Solved

For the cache just described

Which address bits contain the tag?
How big is each tag?
How many lines (i.e., sets, i.e., rows) are in the cache?
What is the total size of the cache?

Extra.5: Solving the Problem

We will use the three step procedure mentioned in Extra.3.

Extra.5.1: Finding MBN, the Memory Block Number

The top picture shows the 32-bit address.

The rightmost 2 bits give the byte in word, which we don't use since we are interested only in the entire word not a specific byte in the word. That is shown in the second picture. Note that there are 4 = 2² bytes in the word. The exponent 2 is why we need 2 address bits.

The next 3 bits from the right give the word-in-block. There are 8 words in the block (see Extra.2) and 8=2³ so we need 3 bits.

The remaining 27 bits are the MBN.

Extra.5.2: Finding NCS, the Number of Sets

The cache has 256KB = 2¹⁸ bytes. This is given above in Extra.2.
2¹⁸ bytes is 2¹⁶ 4-byte words.
2¹⁶ words is 2¹³ 8-word blocks.
2¹³ blocks is 2¹² 2-block sets.

So NCS = 2¹², which answers question 3 of Extra.4.

Extra.5.3: Dividing to Find the Cache Row (Set) and Tag

The MBN is 27 bits and NCS is 2¹².

Dividing a 27-bit number by 2¹² gives a (27-12)-bit quotient and a 12-bit remainder.

(This last statement is analogous to the more familiar statement that dividing a 5-digit number by 100=10² gives a (5-2)-digit quotient and a 2-digit remainder. To divide a 5 digit number by 100, you don't use a calculator, you just chop off the rightmost 2 digits as the remainder and remaining (5-2) digits form the quotient. Example 54321/100 equals 543 with a remainder of 21.)

The remainder is the cache set (the row in a diagram of the cache). It is shown in green. In blue we see the quotient, which is the tag.

So, to answer questions 1 and 2: the high-order 15 (blue) bits of the address form the 15-bit tag.

The Total Size of the Cache.

In the cache each 8-word block comes with a 15-bit tag and a 1-bit valid flag. Each of these cells (I don't know if they have a name) thus contains 8 32-bit words + 16 bits. (I realize 16 bits is 2 bytes, but in general the number of extra bits is not a multiple of 8.) So each cell is 8*32+16 bits. There are 2 cells in each set and 2¹² sets in the cache, so the total size of the cache is

    2¹² × 2 × (8×32 + 16) bits
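The whole calculation can be sketched in C under the basic assumptions of Extra.1 (32-bit byte addresses, 4-byte words, one valid bit per block). The struct and function names below are mine and are only for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct cache_bits {
    unsigned offset_bits;           /* byte-in-word plus word-in-block bits */
    unsigned index_bits;            /* log2(NCS): selects the set (row)     */
    unsigned tag_bits;              /* the remaining high-order bits        */
    unsigned long long total_bits;  /* data + tag + valid bit, all blocks   */
};

/* log2 of a power of two (the sizes in these problems always are). */
static unsigned log2_exact(unsigned long long x)
{
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) { x >>= 1; n++; }
    return n;
}

struct cache_bits cache_layout(unsigned long long data_bytes,
                               unsigned words_per_block, unsigned ways)
{
    const unsigned addr_bits = 32, bytes_per_word = 4;
    unsigned long long block_bytes =
        (unsigned long long)words_per_block * bytes_per_word;
    unsigned long long blocks = data_bytes / block_bytes;
    unsigned long long sets = blocks / ways;               /* NCS */
    struct cache_bits c;

    c.offset_bits = log2_exact(block_bytes);
    c.index_bits  = log2_exact(sets);
    c.tag_bits    = addr_bits - c.index_bits - c.offset_bits;
    c.total_bits  = blocks * (block_bytes * 8 + c.tag_bits + 1);
    return c;
}

/* Quotient and remainder of MBN / NCS, done by chopping bits. */
uint32_t addr_tag(uint32_t addr, struct cache_bits c)
{
    return addr >> (c.offset_bits + c.index_bits);
}

uint32_t addr_set(uint32_t addr, struct cache_bits c)
{
    return (addr >> c.offset_bits) & ((1u << c.index_bits) - 1u);
}
```

For the cache in this example, cache_layout(256 * 1024, 8, 2) gives 5 offset bits, 12 index bits, 15 tag bits, and a total of 2,228,224 bits, which is 272KB.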

End of Computer Systems Organization (201)

What follows is NOT Part of the 2020-21 Course

Chapter T-1: Introduction to Computer Organization and Assembler Language

Homework: Read Tanenbaum chapter 1 (T-1 above stands for Tanenbaum 1).

Remark: Everyone with a Windows laptop should install cygwin as follows (I don't run windows so cannot test this procedure; apparently it worked well last semester).

Browse http://www.cygwin.com
Click on install now.
Where you are asked to select packages, choose devel and then check that the box in the bin column next to gcc is checked. This will ensure that the gcc compiler is included.
Also in devel, select make.
Finally select editor and choose emacs. There are other editors available if you prefer.

Remark: If you have a linux laptop (or dual boot linux), you are set. The gcc on linux supports both variants of assembler syntax for the x86 CPU. We will be using the Intel syntax.

Remark: The mac story is interesting.

Download XCode from the Apple Developer Connection.
Pick up Carbon Emacs here.
Now the fun. The version of gcc on the Mac does not use the same assembler syntax (the intel syntax) that nearly everyone else uses. You can fairly easily translate from the (ATT) syntax used to the Intel syntax.
Our machine i5.nyu.edu does not use an intel cpu (instead it uses a SUN sparc), but I am getting the mac users accounts on a departmental linux box, where you can select the intel syntax.
You can also install windows or Linux under MacOS; the gcc under Linux or cygwin supports the Intel syntax.

Remark: Midterm exams returned.

If you did not read the mailing list, please read my comment on the exam and midterm (letter) grades. You can find it off the course home page (announcements).

Remark: Some comments on catDiff from the midterm.

Many forgot to allocate space for various strings. The maximum penalty (no allocations) was -3.
Several had trouble with this construct
```
      char *str1 = "something";
      char *str2 = str1;    /* declares str2 (type char *) and copies the pointer str1 into it */
```
They wrote char *str2 = *str1; I can see why, as it looks symmetric. Remember that you are declaring (and initializing) str2, not *str2.

Remark: Preview on final project.

You will be writing a (trivial) video game. For now, install the graphics library and put up a picture (find a *.bmp file). Here is the installation procedure

Browse libsdl.org.
Click on the SDL 1.2 link on the left column. Note where the file is downloaded.
If necessary, move the file to your home directory (on Cygwin, your home directory is c:\cygwin\home\yourname).
In the shell (under cygwin) type tar -xzvf SDL-1.2.14.tar.gz (the f must come last since the file name is its argument).
Type cd SDL-1.2.14
Type ./configure
If this fails try ./configure CC=/usr/bin/gcc-3 CXX=/usr/bin/g++-3
Type make
Type sudo make install.
If this fails try simply make install
If both fail, try $@ make install
You may be asked for your password (for your computer, not for an NYU machine). If you run as administrator, I don't believe you will be asked.
Type cd test
Type ./configure
If this fails try ./configure CC=/usr/bin/gcc-3 CXX=/usr/bin/g++-3
Type make
Type ./testsprite
Gasp, in awe.

Here is a windows/cygwin tip from Prof. Goldberg. Be sure that the name of your home directory does not have a space in it. For example, if your name is Joe Smith, be sure that your home directory on cygwin is not "c:\cygwin\home\joe smith", but rather something like "c:\cygwin\home\joe_smith". The SDL configure function gets confused by spaces in a directory name. If cygwin has created your home directory name with a space, change the name of the directory using Windows. Then, create an environment variable called HOME and set it to c:\cygwin\home\joe_smith, except with joe_smith replaced by the actual name of your home directory. To set an environment variable in Windows, go to Start->Control Panel->System->Advanced->Environment Variables. It should be obvious from there.

Let me call the subdirectory SDL-1.2.14 the sdl directory. In there you will find a README file containing the web address of a wiki about the library. Browse that wiki and follow the guide. I had never used any of this before and within 20 minutes I had a picture up.

Chapter T-2: Computer Systems Organization

Some diagrams of the overall structure of a computer system are in section 1.3 of my OS class notes

T-2.1: Processors

The processor (or CPU, Central Processing Unit) performs the computations on data. The data enter and leave the system via I/O devices and are stored in the memory (the last part is oversimplified, as you will learn in OS, but it is good enough for us).

T-2.1.1: CPU Organization

Simple processors have (had?) three basic components: a register file, an ALU (Arithmetic Logic Unit), and a control unit. Oversimplified, the control unit fetches the instructions and determines what needs to be done, the data to be processed is often in the registers (which can be accessed much faster than central memory), and the ALU performs the operation (e.g., multiply).

In addition to the (assembly-language) programmer-visible registers mentioned above, the CPU contains several internal registers, two of which are the PC (Program Counter, a.k.a. ILC or LC), which contains the address of the next instruction to execute and the Instruction Register (IR), which contains (a copy of) the current instruction.

T-2.1.2: Instruction Execution

There are three parts to executing an instruction: obtaining the instruction, determining what it needs to do, and doing it. Repeatedly performing these three steps for all the instructions in a program is normally referred to as the fetch-decode-execute cycle.

In slightly more detail, the CPU executes the following program.

Fetch the next instruction into the IR.
Update the PC.
Determine the type of the instruction.
If memory is referenced, determine the address.
Fetch the referenced word, if needed, into a register.
Execute the instruction.
Repeat.
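The steps above can be sketched as a tiny interpreter in C. The machine here is a toy I made up for illustration (one accumulator, four opcodes, an instruction encoded as opcode*256 + address); it is not the x86 or any machine from the book.

```c
#include <assert.h>

enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };  /* hypothetical opcodes */
#define MEM_WORDS 64
#define INSTR(op, addr) (((op) << 8) | (addr))

/* Runs the program that starts at address 0.
 * Returns the number of instructions executed (including the HALT). */
int run(int mem[MEM_WORDS])
{
    int pc = 0, acc = 0, count = 0;
    for (;;) {
        assert(pc >= 0 && pc < MEM_WORDS);
        int ir = mem[pc];                 /* 1. fetch the next instruction into the IR */
        pc = pc + 1;                      /* 2. update the PC                          */
        int op = ir >> 8;                 /* 3. determine the type of the instruction  */
        int addr = ir & 0xFF;             /* 4. determine the address                  */
        assert(addr < MEM_WORDS);
        int operand = 0;
        if (op == OP_LOAD || op == OP_ADD)
            operand = mem[addr];          /* 5. fetch the referenced word, if needed   */
        count++;
        switch (op) {                     /* 6. execute the instruction                */
        case OP_HALT:  return count;
        case OP_LOAD:  acc = operand;   break;
        case OP_ADD:   acc += operand;  break;
        case OP_STORE: mem[addr] = acc; break;
        default:       assert(!"unknown opcode");
        }
    }                                     /* 7. repeat                                 */
}
```

To try it, put INSTR(OP_LOAD, 10), INSTR(OP_ADD, 11), INSTR(OP_STORE, 12), INSTR(OP_HALT, 0) in words 0-3, put two numbers in words 10 and 11, and call run; the sum ends up in word 12.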

Architecture vs. Micro-Architecture

The architecture is the instruction set, i.e., the (assembly-language) programmer's view of the computer.

The micro-architecture is the design of the computer; it is the architect's/designer's/engineer's view of the system.

The interesting case is when you have a computer family, e.g., the IBM 360 or 370 line or the x86 microprocessor architecture, which has several different implementations, with different microarchitectures.

T-2.1.3: RISC vs. CISC

Reduced Instruction Set Computer versus Complex Instruction Set Computer. Clear implementation advantages for RISC. But CISC has thrived! Intel found an excellent RISC implementation of most of the very CISC x86.

T-2.1.4: Design Principles for Modern Computers

The RISC design principles below are generally agreed to be favorable, but are not absolute. For example, backwards compatibility with previous systems forces compromises.

All instructions executed directly by hardware (not interpreted by microinstructions).
Maximize instruction issue rate (e.g., easy to find where the next instruction begins).
Instructions should be easy to decode.
Only loads and stores reference memory.
Provide ample registers.

Remark: I mentioned the wrong guide last time. The notes are now correct.

T-2.1.5: Instruction Level Parallelism

Skipped.

T-2.1.6: Processor-Level Parallelism

Skipped.

T-2.2: Primary Memory

T-2.2.1: Bits

Abbreviates binary digit, which is rather contradictory.

T-2.2.2: Memory Addresses

The smallest addressable unit of memory is called a cell. Recently, for nearly all computers, a cell is an 8-bit unit called a byte (or octet). Bytes are grouped into words, which form the units on which most instructions operate.

T-2.2.3: Byte Ordering

This has caused decades of headaches.

Memory is addressed in bytes. But we also need larger units, e.g., a 4-byte word. If memory contains a big collection of bytes, the bytes are stored in address 0, 1, 2, 3, etc. If memory contains a big collection of words, the words are stored in address 0, 4, 8, 12, etc. So far no problem.

Consider a 32-bit integer stored in a (4-byte) word. If the integer has the value 5 then the bit string will be 00000000|00000000|00000000|00000101. So the lower order byte of the integer is 00000101 and the three high order bytes are each 00000000. Still no problem.

Let's assume this is the first word in memory, i.e., the one with address 0. It contains 4 bytes: 0, 1, 2, and 3. We are closing in on the problem.

Which of those four bytes is the low order byte of the word? Answer from IBM: byte 3 (IBM machines are big endian). Answer from Intel: byte 0 (Intel processors are little endian).

Either answer makes sense and if you stay on one machine, there is no problem at all since either system is consistent. But let's try to move data from one machine to another.

Say we have an integer containing 5 (as above) and a 4-byte character string "ABC" stored on an IBM machine. The layout is

  00000000  00000000  00000000  00000101     A     B     C    00000000
     0         1         2         3         4     5     6       7

The ABC are expressed in bits, but the specific bit string is not important.
The last byte of all 0s is the ASCII null ending the string.

We send these 8 bytes via ethernet to an Intel machine where we again store them starting at location 0, and get the same layout as above. However, byte 3 is now the most-significant (rather than the least-significant) byte. Gack. The integer 5 has become 5*(256)³!

If the internet software reverses every set of four bytes, we fix the integer, but screw up the string.
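Here is a small C sketch of the same 8 bytes being interpreted both ways. The helper names read_be32, read_le32, and host_is_little_endian are mine. The first two build the integer with shifts, so they give the same answer on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Big endian (the IBM answer): byte 0 is the MOST significant byte. */
uint32_t read_be32(const unsigned char b[4])
{
    assert(b != 0);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Little endian (the Intel answer): byte 0 is the LEAST significant byte. */
uint32_t read_le32(const unsigned char b[4])
{
    assert(b != 0);
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Ask the host which answer it gives: look at the byte stored at the
 * lowest address of an integer containing 1. */
int host_is_little_endian(void)
{
    uint32_t one = 1;
    return *(const unsigned char *)&one == 1;
}
```

With the bytes {0, 0, 0, 5, 'A', 'B', 'C', 0}, read_be32 on the first four gives 5, whereas read_le32 gives 5*(256)³ = 83886080. The string bytes are unaffected either way since each character is a single byte.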

T-2.2.4: Error-Correcting Codes

The Hamming Distance between two equal-length bit strings is the number of bits in which they differ. If you arrange that all legal bit strings have Hamming distance at least 2 from each other, then you can detect any 1-bit error. This explains parity.

More generally if all legal bit strings have Hamming distance at least d+1, then you can detect any d-bit error since changing d bits of a legal string cannot reach another legal string.

To enable correction of errors you need greater Hamming distances: specifically Hamming distance 2d+1 is needed to enable correction of d bits. This is not too hard to see. If you have a valid string and change d bits, the result is at distance d from the original valid string and at least distance d+1 from any other valid string (since the valid strings are at least 2d+1 apart). Hence the original is the unique closest valid string.
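A short C sketch of Hamming distance and of even parity as a distance-2 code. The function names are mine, and the choice of 7 data bits plus one parity bit is just for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
unsigned hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;      /* 1 exactly where the strings differ */
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;       /* clear the lowest 1 bit */
        count++;
    }
    return count;
}

/* Put an even-parity bit in bit 7 of a 7-bit data value, so that every
 * legal 8-bit codeword has an even number of 1s. */
uint32_t add_even_parity(uint32_t data7)
{
    assert(data7 < 128);
    uint32_t parity = hamming_distance(data7, 0) & 1u;
    return data7 | (parity << 7);
}

/* A received codeword is legal only if its number of 1s is even. */
int parity_ok(uint32_t codeword)
{
    return (hamming_distance(codeword, 0) & 1u) == 0;
}
```

Any two distinct codewords produced by add_even_parity are at distance at least 2, so flipping one bit always yields an illegal string, which parity_ok rejects.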

The harder part is designing the code. That is, given n, assume you are storing and fetching n data bits at a time, how many extra check bits must be stored and what must they be in order that all the resulting strings are at least distance d apart?

The book gives Hamming's method, but we are skipping the algorithm and are content with just one fact.

If the size of the data word is 2^k (i.e., the number of bits in a word is 2^k, with k at least 2), then k+1 check bits are necessary and sufficient to obtain a code that can correct all single-bit errors. (One more check bit, an overall parity bit, is needed if the code must also detect all double-bit errors.)

For example if we are dealing with bytes, k is 3 so 4 check bits are required; a heavy overhead (4 check bits for every 8 data bits). If we are only transporting 64-bit words, k is 6 and 7 check bits are required, which is a much milder overhead.
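The count of check bits for single-error correction comes from a standard counting argument that these notes do not derive: with m data bits and r check bits, the r check bits must distinguish the m+r possible single-bit errors plus the no-error case, so 2^r ≥ m + r + 1. A minimal C sketch (the name check_bits_sec is mine):

```c
#include <assert.h>

/* Smallest number r of check bits with 2^r >= m + r + 1, i.e., enough
 * to correct any single-bit error in a word of m data bits. */
unsigned check_bits_sec(unsigned m)
{
    unsigned r = 0;
    assert(m > 0);
    while ((1ULL << r) < (unsigned long long)m + r + 1)
        r++;
    return r;
}
```

For example, check_bits_sec(8) is 4 and check_bits_sec(64) is 7, matching k+1 for k=3 and k=6.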

Remark: Term Project assigned. Due in three weeks, 27 apr 2010

T-2.2.5: Cache Memory

The ideal memory is

Big
Fast
Cheap
Impossible

Commodity memory is big, slow, cheap, and possible.

Caches are small, fast, cheap enough because they are small, and possible.

Concentrating on the first two criteria we can build big and slow and we can build small and fast, but we want big and fast. This is where the idea of caching comes in.

A cache is small and fast. A significant portion of its speed is because it is close to the CPU and clearly if an object is big its (average or worst-case) distance from another object can't be small. For example, no matter where you park a car you can't have all (or half) of it within a foot of a given point.

The idea of caching is that we arrange (somehow) for almost all of the important data to be in the small, fast cache and use the big and slow memory to hold the rest (actually it holds all the data).

Since the portion of memory that is important changes with time, caches exchange data with memory as the program executes.

With clever algorithms for choosing which data to exchange with memory, surprisingly small caches can service a great deal of the memory activity of the processor.

There is no reason to stop with just one cache level. Today it is common to have a tiny, blistering-fast level-1 cache connected to a small, real-fast level-2 cache connected to a medium-size, fast level-3 cache connected to huge, slow memory.

This same issue of a small, fast red-memory supporting a large, slow blue-memory is studied in Operating Systems (202). In the OS setting, the small and fast memory is our big and slow central memory and the big and slow OS memory is a disk. Unfortunately, nearly all the terminology in the OS case (demand paging) is different from the terminology in the computer design case (caching).

T-2.2.6: Memory Packaging and Types

T-2.3: Secondary Memory

T-2.3.1: Memory Hierarchies

The example of multiple cache levels can be carried further. The processor registers are smaller and faster than a cache. As mentioned, disks are bigger and slower than central memory, and robotically accessed tape storage is bigger and slower than a disk. Again the goal is to use smarts to approximate the impossible big and fast and cheap storage.

T-2.3.2: Magnetic Disks

Disks are covered in OS (202) so we will just define some terms (plus I demo'ed a bunch of disks last class).

Platter
Surface
Head
Track
Sector
Cylinder
Seek time: Average-case times are given, and often also the minimum, which is the time to move from one cylinder to the next one.
Rotational latency: Given as the RPM or given directly in milliseconds.
Transfer rate: Given directly as MB/sec or indirectly by RPM and track capacity.

Homework: 19.

T-2.3.3: Floppy Disks

Demoed last class.

T-2.3.4: IDE Disks

Describes the specific protocol, cabling, and speed.

T-2.3.5: SCSI Disks

Describes the specific protocol, cabling, and speed.

T-2.3.6: RAID

Done in OS (202).

T-2.3.7: CD-ROMs

Done in OS (202). Just one comment: unlike magnetic disks, CD-ROMs and friends do not have circular tracks; instead the data spirals out from the center.

T-2.3.8: CD-Recordables (CD-R)

Done in OS (202).

T-2.3.9: CD Rewritables (CD-RW)

Done in OS (202).

T-2.3.10: DVD

Done in OS (202).

T-2.3.11: Blu-Ray

Done in OS (202).

T-2.4: Input/Output

T-2.4.1: Buses

Last class I demoed a computer main board (a.k.a. motherboard, system board, or mobo) and showed the slots where a controller would plug in.

I brought in an ethernet controller that fit onto the PCI bus of the main board. The different buses (PCI, PCIe, SCSI, ATA, etc.) describe the wiring and protocols used to connect the different controllers to the CPU.

T-2.4.2: Terminals

Keyboards

Done in 202

CRT Monitors

Obsolete.

Flat Panel Displays

Very important, but a little too engineering-oriented for us to cover. You might want to read it for your own curiosity.

Video RAM

One value per pixel on the screen. These values together are often called a bit map. In fact systems often contain several bit maps to enable fast switching.

T-2.4.3: Mice

Covered in 202.

T-2.4.4: Printers

Monochrome Printers

Color Printers

T-2.4.5: Telecommunications Equipment

Modems

Digital Subscriber Lines (DSL)

Internet over Cable

T-2.4.6: Digital Cameras

T-2.4.7: Character Codes

ASCII

Unicode

T-2.5: Summary

Read.

Chapter T-3: The Digital Logic Level

This is the bottom of the abstraction hierarchy.

T-3.1: Gates and Boolean Algebra

T-3.1.1: Gates

(Bipolar) Transistors and the Device Level

When the Base is high (positive voltage, say 5 volts, a digital 1) the transistor turns on, i.e., acts like a wire and the Collector is pulled down to ground (zero volts, a digital zero).

When the Base is low (zero volts, a digital zero), the transistor turns off, i.e., acts like an open circuit. Thus the collector is essentially the same as the voltage supply +V_cc; it is a digital one.

In summary, when the base is zero, the collector is one, and vice versa. That is, viewing the base as the input and the collector as the output, the transistor computes the logic function f having the property that f(0)=1 and f(1)=0. This function is called an inverter.

NAND and NOR

The diagram on the right shows two additional logic functions built from transistors. These logic functions take two arguments and are called NAND (not AND) and NOR (not OR) respectively.

Ignoring the above, which is one level below what we are studying, we define 5 logic gates by the truth tables given below their diagrams.

gates

NOT Truth Table
A	X
0	1
1	0

NAND Truth Table
A	B	X
0	0	1
0	1	1
1	0	1
1	1	0

NOR Truth Table
A	B	X
0	0	1
0	1	0
1	0	0
1	1	0

AND Truth Table
A	B	X
0	0	0
0	1	0
1	0	0
1	1	1

OR Truth Table
A	B	X
0	0	0
0	1	1
1	0	1
1	1	1

XOR Truth Table
A	B	X
0	0	0
0	1	1
1	0	1
1	1	0
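The truth tables above can be written directly as C functions on the values 0 and 1. This is just a sketch; the gate_ names and the print_truth_table helper are mine.

```c
#include <assert.h>
#include <stdio.h>

/* Checks that a signal is a legal digital value. */
static int bit(int x) { assert(x == 0 || x == 1); return x; }

int gate_not(int a)         { return !bit(a); }
int gate_and(int a, int b)  { return bit(a) & bit(b); }
int gate_or(int a, int b)   { return bit(a) | bit(b); }
int gate_nand(int a, int b) { return !gate_and(a, b); }
int gate_nor(int a, int b)  { return !gate_or(a, b); }
int gate_xor(int a, int b)  { return bit(a) ^ bit(b); }

/* Prints the 4-row truth table of any two-input gate. */
void print_truth_table(const char *name, int (*gate)(int, int))
{
    printf("%s Truth Table\nA B X\n", name);
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d %d\n", a, b, gate(a, b));
}
```

For example, print_truth_table("NAND", gate_nand) prints the four rows of the NAND table in the same row order as above.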

Homework: Using truth tables, prove DeMorgan's Laws

  NOT(A AND B) = (NOT A) OR (NOT B)
  NOT(A OR B) = (NOT A) AND (NOT B)

Homework: Show that all Boolean functions with two inputs (and one output) can be generated by just using NAND. This can be done two ways.

Draw all the possible truth tables and for each draw a circuit with just NAND that generates that same truth table.
Show how to get NOT, AND, and OR from just NAND and then show how any truth table can be generated from NOT, AND, and OR.

The book does method 2 in section 3.1.3. You should do method 1. How many truth tables are there?

Remark: The honors supplement has been added to the final project.

Jumped to Chapter T-5 (for assembler part of term project)

T-3.1.2: Boolean Algebra

Using truth tables, we can prove various formulas such as DeMorgan's Laws from the last homework. From these laws we can prove other laws.

Standard notation is to use + for OR, * for AND, and ⊕ for XOR (exclusive or). As in regular algebra the * is often dropped.

From these formulas, and algebraic manipulation we can get other formulas. This is called Boolean algebra (named after George Boole).

For example (I am using ' to signify NOT), you use truth tables to prove both distributive laws

  A(B+C) = AB + AC       * has higher precedence than +
  A+(BC) = (A+B)(A+C)    looks wrong but is correct

and then calculate

  A+(A'B)  =  (A+A')(A+B)      NOT has higher precedence than + or *
           =  (1)(A+B)         1 is the constant function (true)
           =  A+B              1 is the * identity (truth table)
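A truth-table proof is just an exhaustive check, so it is easy to mechanize. Here is a sketch in C that compares two three-input functions on all 8 rows; the type bool3 and the function names are mine.

```c
#include <assert.h>

typedef int (*bool3)(int a, int b, int c);

/* Returns 1 if f and g have the same truth table, 0 otherwise. */
int equivalent3(bool3 f, bool3 g)
{
    assert(f != 0 && g != 0);
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                if (f(a, b, c) != g(a, b, c))
                    return 0;
    return 1;
}

/* The second distributive law: A+(BC) = (A+B)(A+C). */
int distrib_lhs(int a, int b, int c) { return a | (b & c); }
int distrib_rhs(int a, int b, int c) { return (a | b) & (a | c); }

/* The calculated result: A+(A'B) = A+B (C is unused). */
int absorb_lhs(int a, int b, int c) { (void)c; return a | (!a & b); }
int absorb_rhs(int a, int b, int c) { (void)c; return a | b; }
```

Both equivalent3(distrib_lhs, distrib_rhs) and equivalent3(absorb_lhs, absorb_rhs) return 1, whereas comparing distrib_lhs with absorb_rhs returns 0 (they differ when A=0, B=1, C=0).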

T-3.1.3: Implementation of Boolean Functions

There is a standard procedure to generate any Boolean function using just AND, NOT, and OR. I did an example of this last time. Here is the general procedure.

Write the truth table.
Generate rails with each input and its complement.
Use an AND for each set of inputs with a 1 in the result column of the truth table.
Wire the ANDs to a final OR
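The procedure above can be sketched in C. I encode the truth table as a bit mask (bit r of table is the result for row r, where r is the inputs read as a binary number); this encoding and the name sop_eval are my own choices for the illustration.

```c
#include <assert.h>

/* Evaluates the sum-of-products circuit for the given truth table.
 * n is the number of inputs (at most 5 so the table fits in 32 bits);
 * bit i of inputs is the value of input i. */
int sop_eval(unsigned table, unsigned n, unsigned inputs)
{
    assert(n >= 1 && n <= 5 && inputs < (1u << n));
    int out = 0;                                   /* the final OR */
    for (unsigned row = 0; row < (1u << n); row++) {
        if (((table >> row) & 1u) == 0)
            continue;                              /* an AND only for rows whose result is 1 */
        int product = 1;
        for (unsigned i = 0; i < n; i++) {
            int x = (int)((inputs >> i) & 1u);
            int want = (int)((row >> i) & 1u);
            product &= want ? x : !x;              /* the input rail or its complement rail */
        }
        out |= product;                            /* wire this AND into the OR */
    }
    return out;
}
```

For example, XOR has results 0,1,1,0 for rows 0-3, so its table is 0x6, and sop_eval(0x6, 2, r) reproduces the XOR column for every row r.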

T-3.1.4: Circuit Equivalence

As we have seen the same truth table can result from different Boolean formulas and hence from different circuits. Naturally, circuit designers might prefer one over the other (faster, less heat, smaller, etc.).

T-3.2: Basic Digital Logic Circuits

T-3.2.1: Integrated Circuits

I showed a discrete circuit last time as well as a Pentium II main board containing many integrated circuits. Initially these circuits had only a few components; now they have millions.

T-3.2.2: Combinational Circuits

These are circuits in which the outputs are uniquely determined by the inputs.
Isn't this always true?
Certainly not! Some circuits have memory (e.g., RAM). If you give a RAM an input of (12,read) the output is the last value that was stored in 12. So you need to know more than (12,read) to know the answer; you need to know the history.

Multiplexors (Muxes)

Have 2ⁿ inputs plus n select inputs. The select inputs are read as a binary value and thus specify a number from 0 to 2ⁿ-1. This number is used to select one of the inputs to be the output.

Construct on the board an equivalent circuit for a 4-input mux (n=2) with ANDs and ORs in three ways:

Construct the truth table (64 rows!) and write the sum of products form, one product (6-input AND) for each row and a gigantic 64-way OR. Just start this, don't finish it.
A simpler (more clever) two-level logic solution. Four ANDs (one per input), each gets one of the inputs and both select lines with appropriate bubbles. The four outputs go into a 4-way OR.
Construct a 2-input mux (using the clever solution). Then construct a 4-input mux using a tree of three 2-input muxes. One select line is used for the two muxes at the base of the tree, the other is used at the root.
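The second and third constructions can be sketched in C with signals as 0/1 ints. The names mux2, mux4_two_level, and mux4_tree are mine.

```c
#include <assert.h>

/* 2-input mux, the clever two-level way: two ANDs into an OR. */
int mux2(int in0, int in1, int s)
{
    assert(s == 0 || s == 1);
    return (in0 & !s) | (in1 & s);
}

/* 4-input mux: four 3-input ANDs (one data input plus both select
 * lines, with bubbles as needed) feeding a 4-way OR. */
int mux4_two_level(const int in[4], int s1, int s0)
{
    assert((s1 == 0 || s1 == 1) && (s0 == 0 || s0 == 1));
    return (in[0] & !s1 & !s0) | (in[1] & !s1 & s0) |
           (in[2] &  s1 & !s0) | (in[3] &  s1 & s0);
}

/* 4-input mux as a tree of three 2-input muxes: s0 drives the two
 * muxes at the base of the tree, s1 drives the root. */
int mux4_tree(const int in[4], int s1, int s0)
{
    int low  = mux2(in[0], in[1], s0);
    int high = mux2(in[2], in[3], s0);
    return mux2(low, high, s1);
}
```

Both 4-input versions output in[2*s1 + s0] for all 64 rows of the truth table, which is a quick way to convince yourself the circuits are equivalent.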

Decoders (and Encoders)

Imagine you are writing a program and have 32 flags, each of which can be either true or false. You could declare 32 variables, one per flag. If permitted by the programming language, you would declare each variable to be a bit. In a language like C, without bits, you might use a single 32-bit int and play with shifts and masks to store the 32 flags in this one word.

In either case, an architect would say that you have these flags fully decoded. That is, you can detect the value of any combination of the bits.

Now imagine that for some reason you know that, at all times, exactly one of the flags is true and the others are all false. Then, instead of storing 32 bits, you could store a 5-bit integer that specifies which of the 32 flags is true. This is called fully encoded.

A 5-to-32 decoder converts an encoded 5-bit signal into 32 signals with exactly one signal true.

A 32-to-5 encoder does the reverse operation. Note that the output of an encoder is defined only if exactly one input bit is set (recall set means true).

The diagram on the right shows a 3-to-8 decoder.

Note the 3 with a slash, which signifies a three bit input. This notation represents three (1-bit) wires.
A decoder with n input bits, produces 2ⁿ output bits.
View the input as k written as an n-bit binary number and view the output as 2ⁿ bits with the k-th bit set and all the other bits clear.
Implement the 3-to-8 decoder on the board with simple gates.
Why do we use decoders and encoders?
- The encoded form takes (MANY) fewer bits so is better for communication.
- The decoded form is easier to work with in hardware since there is no direct way to test if 3 wires represent a 5 (101). You would have to test each wire. But it is easy to see if the decoded form is a five; just test the fifth wire, out5.

The truth table for an 8-3 encoder has 256 rows; for a 32-5 encoder we need 4 billion rows.

There is a better way! Make use of the fact that we can assume exactly one input is true.

For each output bit, OR the inputs that set this bit. For example the low-order output of an 8-3 encoder is the OR of input bits 1,3,5,7.
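Here is a C sketch of a 3-to-8 decoder and of an 8-to-3 encoder built the better way, with one OR per output bit. The function names are mine, and the 8 decoded wires are packed into the low 8 bits of an unsigned.

```c
#include <assert.h>

/* 3-to-8 decoder: output wire k is set and the rest are clear. */
unsigned decode3to8(unsigned k)
{
    assert(k < 8);
    return 1u << k;
}

/* 8-to-3 encoder. The output is defined only when exactly one input
 * wire is set, so each output bit is just an OR of four inputs. */
unsigned encode8to3(unsigned in)
{
    unsigned w[8];
    assert(in != 0 && in < 256 && (in & (in - 1)) == 0);
    for (unsigned i = 0; i < 8; i++)
        w[i] = (in >> i) & 1u;
    unsigned out0 = w[1] | w[3] | w[5] | w[7];   /* low-order output bit */
    unsigned out1 = w[2] | w[3] | w[6] | w[7];
    unsigned out2 = w[4] | w[5] | w[6] | w[7];   /* high-order output bit */
    return (out2 << 2) | (out1 << 1) | out0;
}
```

For every k from 0 to 7, encode8to3(decode3to8(k)) gives back k.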

Comparators

Programmable Logic Arrays (PLAs)

T-3.2.3: Arithmetic Circuits

Shifters

Do you want to rotate/0-fill/sign-extend?

Do you want to shift left or right?

Use muxes to give all the choices you want. The operation forms the select lines.

Adders

Draw a half adder (AND and XOR) that takes two inputs and produces two outputs, the sum and the Carry-out.

Full Adder

Really we want a full adder that has three inputs (A, B, Carry-in) and produces two outputs (Sum, Carry-out). The Sum is 1 exactly when the total number of 1s in A, B, and Ci is odd. The Carry-out is 1 exactly when at least two of A, B, and Ci are 1.

The diagram above uses logic formulas for Sum and Carry-out equivalent to the definitions just given (see homework just below).
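The definition of the full adder (Sum is 1 when the number of 1s is odd, Carry-out is 1 when at least two inputs are 1) can be sketched in C, together with a ripple-carry adder that chains full adders. I deliberately code the definition rather than the gate-level formulas, and the names are mine.

```c
#include <assert.h>

struct adder_out { int sum; int carry_out; };

/* Full adder straight from the definition. */
struct adder_out full_adder(int a, int b, int carry_in)
{
    assert((a == 0 || a == 1) && (b == 0 || b == 1) &&
           (carry_in == 0 || carry_in == 1));
    int ones = a + b + carry_in;            /* how many inputs are 1 */
    struct adder_out r = { ones % 2, ones >= 2 };
    return r;
}

/* nbits-wide adder: bit i's carry-out feeds bit i+1's carry-in. */
unsigned ripple_add(unsigned x, unsigned y, unsigned nbits, int *carry_out)
{
    unsigned result = 0;
    int carry = 0;
    assert(nbits >= 1 && nbits <= 32);
    for (unsigned i = 0; i < nbits; i++) {
        struct adder_out r =
            full_adder((int)((x >> i) & 1u), (int)((y >> i) & 1u), carry);
        result |= (unsigned)r.sum << i;
        carry = r.carry_out;
    }
    if (carry_out != 0)
        *carry_out = carry;
    return result;
}
```

For a 4-bit adder, ripple_add(5, 6, 4, &c) gives 11 with c = 0, while ripple_add(9, 8, 4, &c) gives 1 with c = 1 since 17 does not fit in 4 bits.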

Homework:

Draw the truth table for the full adder (8 rows) based on the definition. Note that the circuit has 3 inputs and 2 outputs so the truth table has 3+2=5 columns and 2³=8 rows (the second 2 is NOT the number of outputs).
Show S = X ⊕ Y ⊕ Ci
Show Co = XY + (X ⊕ Y)Ci

T-3.2.4: Clocks

The period is also called the cycle time. The number of cycles per second/hour/day/etc is called the frequency. So a clock with a 2 nanosecond cycle time has a frequency of 1/2 a gigahertz or 500 megahertz (one hertz is one cycle per second).

T-3.3: Memory

T-3.3.1: Latches

The only unclocked memory we will use is a so called S-R latch (S-R stands for Set-Reset).

When we define latch below to be a level-sensitive, clocked memory, we will see that the S-R latch is not really a latch.

The circuit for an S-R latch is on the right. Note the following properties.

The S-R latch is constructed from cross-coupled NOR gates.
Consider the four possible inputs.
We do NOT assert both S and R at the same time (the output is not defined in this case).
When S is asserted (i.e., S=1 and R=0):
- The latch is Set (that's why it is called S).
- Q becomes true (Q is the output of the latch).
- Q' becomes false (Q' is the complemented output).
When R is asserted:
- The latch is Reset.
- Q becomes false.
- Q' becomes true.
When neither one is asserted:
- The latch retains its value, i.e. Q and Q' stay as they were.
- This last statement is the memory aspect.
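The behavior listed above can be reproduced by simulating the two cross-coupled NOR gates, Q = NOR(R, Q') and Q' = NOR(S, Q), a few times until the feedback settles. This is only a software sketch (real gates settle continuously, not in steps), and the names are mine.

```c
#include <assert.h>

struct sr_latch { int q; int q_bar; };

/* Apply inputs S and R and let the feedback loop settle. */
void sr_step(struct sr_latch *l, int s, int r)
{
    assert(l != 0);
    assert(!(s && r));                 /* we never assert both S and R */
    for (int i = 0; i < 4; i++) {      /* a few passes are enough to settle */
        l->q     = !(r || l->q_bar);   /* top NOR gate    */
        l->q_bar = !(s || l->q);       /* bottom NOR gate */
    }
}
```

Starting from Q=0, Q'=1: sr_step with S=1 sets Q to 1, a following step with S=R=0 leaves Q at 1 (the memory aspect), and a step with R=1 resets Q to 0.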

Clocked SR Latches

The clocked version on the right has the property that the values of S and R are only relevant when the clock is high (i.e., true not false). This is sometimes convenient, but we will not use it. Instead we will use the important D-latch that we describe next that is very similar.

Clocked D Latches

The D stands for data.

The extra inverter (the bubble on the top left) and the rewiring prevents R and S from both being 1.

Specifically, there are three cases.

When the clock is low (false), both R and S are false and, as we saw before, Q and Q' remain unchanged.
When the clock is high and D is high, Q becomes true and Q' false.
When the clock is high and D is low, Q becomes false and Q' becomes true.

The summary is that, when the clock is asserted, the latch takes on the value of D, which is then held while the clock is low. The value of D is latched when the clock is high and held while the clock is low.

The smaller diagram shows how the latch is normally drawn.

In the traces to the right notice how the output follows the input when the clock is high and remains constant when the clock is low. We assume the stored value was initially low.

Remark: Grishman did 4-bit adders and subtracters. If you wish (on line) pictures, you can look at my architecture notes.

T-3.3.2: Flip-Flops

D or Master-Slave Flip-flop

This structure was our goal. It is an edge-triggered, clocked memory. The term edge-triggered means that values change at edges of the clock, either the rising edge or the falling edge. The edge at which the values change is called the active edge.

The circuit for a D flop is on the right. It has the following properties.

The D-flop is built from D-latches, which are transparent, i.e., the output equals the input when the clock is high.
The flop, however, is NOT transparent.
- Changes to the output occur only at the active edge.
- The circuit in the diagram has the falling edge as the active edge.
The structure is sometimes called a master-slave flip-flop: the left latch is the master and the right the slave.
The substructures reuse the same letters as the main structure but have different meaning (similar to block structured languages in the algol style).
The master latch is set during the time the clock is asserted. Remember that the latch is transparent, i.e. it follows its input when its clock is asserted. But the second latch is ignoring its input at this time. When the clock falls, the 2nd latch pays attention and the first latch keeps producing whatever D was at fall-time.
Actually D must remain constant for some time around the active edge.
- The set-up time before the edge.
- The hold time after the edge.

The picture on the right is for a master-slave flip-flop. Note how much less wiggly the output is in this picture than before with the transparent latch. As before we are assuming the output is initially low.
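The difference between the two traces can be reproduced with a small C simulation that samples the clock and D once per time step. This is a sketch with names of my own choosing; it ignores set-up and hold times, and it assumes the falling-edge arrangement described above (master gets the clock, slave gets the inverted clock).

```c
#include <assert.h>

/* D latch: transparent while the clock is high, holds while it is low. */
int d_latch(int *q, int clk, int d)
{
    assert(q != 0 && (clk == 0 || clk == 1) && (d == 0 || d == 1));
    if (clk)
        *q = d;
    return *q;
}

struct d_flop { int master; int slave; };

/* Master-slave flop built from two D latches; the output can change
 * only when the clock goes from high to low. */
int d_flop_step(struct d_flop *f, int clk, int d)
{
    assert(f != 0);
    d_latch(&f->master, clk, d);                  /* master follows D while clk is high */
    return d_latch(&f->slave, !clk, f->master);   /* slave follows master while clk is low */
}
```

Feeding both the same inputs, say clock 1,1,0,0,1,1,0,0 and D 1,0,0,1,1,1,1,0 with everything initially low, the latch output wiggles with D while the clock is high (1,0,0,0,1,1,1,1), whereas the flop output changes only at a falling edge (0,0,0,0,0,0,1,1).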

Homework: In the D-flop diagram, move the inverter to the other latch, i.e., the inverted clock goes to the left latch and the positive clock goes to the right. What has changed in the D-flop?

Homework: Which code better describes a flip-flop and which a latch?

  repeat {
      while (clock is low) {do nothing}
      Q=D
      while (clock is high) {do nothing}
  } until forever

  repeat {
      while (clock is high) {Q=D}
  } until forever

Show how to make a register out of FFs (easy: just use a bunch).

Show how to make a register file out of registers. Not too hard: use a BIG mux.

Describe how to write a register. Actually the trick is how to not write a register. Recall that the constituent FFs are written at every falling edge. The idea is to introduce a signal that is ANDed with the clock to eliminate edges you don't want (this takes some care).

Then the diagram on the right shows the basic workings of a register based ADD (or SUB or OR or AND)

    add regA,regB,regC

Chapter T-4: The Microarchitecture Level

Chapter T-5: The Instruction Set Architecture Level

Remark: We jump ahead (out of order) so that I can cover enough x86 assembly language for you to do the assembler portion of the project. I am not following Tanenbaum's order here as the goal is just x86.

Remark: I believe this reference is a good resource for x86 assembly programming.

T-5.1: Overview of the ISA Level

5.1.1: Properties of the ISA Level

5.1.2: Memory Models

5.1.3: Registers

5.1.4: Instructions

T-5.1.5: Overview of the Pentium 4 ISA Level

In order to maintain compatibility with previous, long-out-of-date members of the x86 processor family, modern members can execute in three modes.

Real Mode: The processor acts just like a 1979 8088, the processor used in the original PC. If any program screws up, the machine crashes.
If Intel had designed human beings, it would have put in a bit that made them revert back to chimpanzee mode (most of the brain disabled, no speech, sleeps in trees, eats mostly bananas, etc.)
While perhaps humorous (Tanenbaum certainly writes well) the quote does hide the tremendous user advantages of having a new computer that can still execute old (often awful) programs, in particular old games, which were notorious for not being clean.
Virtual 8086/8088 mode: Now if a program crashes the OS is notified, rather than having the machine crash. I am not sure why (indeed, if) real mode is still needed, i.e., why virtual 8086 mode did not simply replace it.
Protected mode: This is the mode we will study and is the mode used by all modern operating systems and applications.

Registers on the x86

We will mainly use the 32-bit registers; their names begin with E, standing for extended. They extend the 16-bit registers of early members of the family.

The four main registers are EAX, EBX, ECX, and EDX. Each is 32 bits. We will make most use of EAX, which is the main arithmetic register. Also, functions returning a single word return it in EAX.

  mov   EAX, ECX
  mov   EAX, [EBX]
  mov   EAX, [EBX+4]

If an address is in any one of these four registers, the contents of that address can be specified as an operand for an instruction. An offset can also be added. For example, the first instruction on the right simply copies (the contents of) ECX into EAX. The second instruction does a de-reference: if EBX contains 1000, the contents of memory location 1000 are loaded into EAX. Finally, the last instruction would load the contents of location 1004 into EAX (again assuming EBX contains 1000).
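
For those who think more easily in C, here is a rough analogue of the three mov instructions. It is only a sketch: memory is simulated by a byte array, registers by ordinary variables, and the function names (load32, mov_reg, mov_indirect, mov_offset) are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Read the 32-bit word stored at byte address addr of a simulated memory. */
static uint32_t load32(const uint8_t *mem, uint32_t addr) {
    uint32_t word;
    memcpy(&word, mem + addr, sizeof word);
    return word;
}

/* mov EAX, ECX      : copy one register to another */
static uint32_t mov_reg(uint32_t ecx) {
    return ecx;
}

/* mov EAX, [EBX]    : load the word whose address is in EBX */
static uint32_t mov_indirect(const uint8_t *mem, uint32_t ebx) {
    return load32(mem, ebx);
}

/* mov EAX, [EBX+4]  : the same, with a byte offset of 4 added */
static uint32_t mov_offset(const uint8_t *mem, uint32_t ebx) {
    return load32(mem, ebx + 4);
}
```

With EBX equal to 1000, mov_indirect returns the word stored at location 1000 and mov_offset the word stored at location 1004.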

As you can see from the sheet I handed out and from Tanenbaum's figure 5-3, these registers contain named 16-bit and 8-bit subsets.
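
The relationship between a 32-bit register and its named pieces can be mimicked in C with masks and shifts. The sketch below uses EAX as the example: AX is the low-order 16 bits, AL the low-order 8 bits, and AH bits 8-15. The function names are invented for illustration.

```c
#include <stdint.h>

/* AX: the low-order 16 bits of EAX. */
static uint16_t ax_of(uint32_t eax) {
    return (uint16_t)(eax & 0xFFFFu);
}

/* AL: the low-order 8 bits of EAX (and of AX). */
static uint8_t al_of(uint32_t eax) {
    return (uint8_t)(eax & 0xFFu);
}

/* AH: bits 8-15 of EAX, i.e., the high-order byte of AX. */
static uint8_t ah_of(uint32_t eax) {
    return (uint8_t)((eax >> 8) & 0xFFu);
}
```

So if EAX holds 0x12345678, then AX is 0x5678, AH is 0x56, and AL is 0x78.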

The two registers ESI and EDI are mostly used for the hardware string manipulation instructions. I don't think you will need those instructions, but you can also use ESI and EDI to hold other values (scratch storage).

The EBP register is normally used to hold the frame pointer FP, that will be described below. The ESP is the stack pointer (again described below).

5.1.6: Overview of the UltraSPARC III ISA Level

5.1.7: Overview of the 8051 ISA Level

5.2: Data Types

T-5.2.1: Numeric Data Types

T-5.2.2: Nonnumeric Data Types

T-5.2.3: Data Types on the Pentium 4 (x86)

The x86 architecture supports signed and unsigned integers of three sizes.

8 bit (one byte): used for ASCII characters (C char).
16 bit (two bytes): used for Unicode characters and for 16-bit integers (C short int).
32 bit (4 bytes): used for integers (C int).

There is also support for 32-bit and 64-bit floating point, which are used for C float and double respectively.

Finally, there is support for 8-bit BCD (binary coded decimal), which is not used in C.

T-5.2.4: Data types on the UltraSPARC III

T-5.2.5: Data types on the Java Virtual Machine

T-5.2A: Argument Passing from C to Assembler on x86

This is not from the book.

Since you will be writing an assembler subroutine called by a C program and your subroutine might call another C program, we need to understand how arguments, the return address, and the returned value are passed from caller to callee. The short answer is via the stack.

Each routine places its local variables on the stack, a region of memory that grows and shrinks during execution. (We are ignoring variables created via malloc as they are not allocated on the stack.) Due to the LIFO nature of calls and returns, stack allocation works perfectly for such variables.

As shown in the diagrams, the stack starts at a high address and grows towards location zero.

Each routine uses a region of the stack called its stack frame or simply frame. The C convention (really the C-compiler convention) is that the frame is specified by two pointers: the frame pointer fp, which points to the beginning (bottom) of the frame, and the stack pointer sp that points to the current end (top) of the frame. As the routine places more information on the stack, sp moves (towards 0) to enlarge the stack. As the routine removes entries at the top of the stack, sp again moves (in this case away from 0).

In the left diagram the currently running procedure has just called another procedure. The caller has pushed the arguments onto the stack (in reverse order) and then pushed the return address (actually the call instruction did the last part). Also the caller has saved EAX, ECX, and EDX if necessary (these are referred to as caller-save registers).

    # add2.s
        .intel_syntax noprefix
        .globl add2
    add2:
        push    ebp
        mov     ebp, esp
        mov     eax, DWORD PTR [ebp+12]
        add     eax, 1
        push    eax
        call    g
        add     esp, 4          # undo the push
        pop     ebp
        ret

    // add2.c
    #include <stdio.h>
    int add2(int unused, int x);    // written in assembler (add2.s)
    int main(int argc, char *argv[]) {
        int i;
        for (i=0; i<10; i++)
            printf("i is %d and add2(1,i) is %d\n",
                   i, add2(1,i));
        return 0;
    }
    int g(int x) {
        return x * x;
    }
    // Local Variables:
    // compile-command: "cc -O add2.c add2.s \
    // -mpreferred-stack-boundary=2; ./a.out"
    // End:

We are the callee and first must set fp to the bottom of OUR frame (it is currently the bottom of the caller's frame). We also must save the current value of fp so that when we return to the caller, we can restore fp to the bottom of the caller's frame.

That is, we want to move from the left diagram to the right one. The first two assembler statements on the right do exactly this. The register EBP holds the current fp (I believe B is for base; the fp points to the base of the current stack frame). The register ESP holds sp.

The purpose of the program is to compute (x+1)² for x from 0 through 9. The main program calls us with two arguments: the first is unused (I wanted to illustrate the order the arguments appear on the stack); the second is the value to be operated on.

We want to move the second argument to EAX for processing. This will overwrite whatever the caller had in EAX, but recall that it is one of the caller-saved registers (mentioned above) so we do not have to save it. How do we reference the second parameter? It is in the caller's stack frame, the one below ours. Since the stack grows towards zero, going backwards means increasing the fp.

Why is it 12 to go back only 3, and what is the DWORD PTR nonsense? The 12 is easy: 3 words equals 12 bytes.

The DWORD PTR is because a pointer (in this case ebp) can point to a byte, a 2-byte word, or a 4-byte doubleword. We think of 32-bit words, but the x86 family started out with 16-bit machines and it shows.

Next we add 1. Note that x86 is a 2-operand architecture: you can compute x=xOPy or x=yOPx but not x=yOPz.

Now that we have x+1 we want to call a function to do the squaring. Thus, we are now the caller.

You might think that we need to save EAX since it is a caller-save register, but the value it contains is the first argument of the new callee so when we push that argument, we have saved EAX as well. We then issue the call instruction.

The function g, like all functions, returns its result in EAX. As it happens that value is the result we are charged with returning as well. Thus we just leave it there and return to our caller.

But wait, we have messed up the pointers to the stack! Hence, the end of our routine restores them before returning. The first diagram on the right shows the stack just after we have called g(), but before g() has executed.

When g executes ret, sp is lowered one word. When we execute add, sp is lowered again, returning us to the right stack in the previous diagram. The pop gives us the left stack in the previous diagram. Finally, our ret restores the stack to the 2nd diagram on the right, which is the same as it was before the main program called us.

Note that the values above sp are still there but the space would be reused if main() called another routine.
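
To make the pointer bookkeeping concrete, here is a toy C model of the stack during a call to add2. It is a sketch under several simplifying assumptions, none of which come from the real compiler output: the stack is a small array of words, return addresses are dummy zeros, g is modeled as reading its argument at [esp+4] without building a frame of its own, and the caller is assumed to remove its own arguments after the call. All names (stack_mem, push, pop, load, g_body, add2_body, call_add2) are invented for the example.

```c
#include <stdint.h>

/* A toy model of the 32-bit stack; every name here is illustrative. */
enum { STACK_WORDS = 64 };
static int32_t stack_mem[STACK_WORDS];   /* word i lives at byte address 4*i */
static int32_t esp = 4 * STACK_WORDS;    /* empty stack: sp just past the end */
static int32_t ebp = 0;
static int32_t eax = 0;

static void push(int32_t v) { esp -= 4; stack_mem[esp / 4] = v; } /* grows toward 0 */
static int32_t pop(void) { int32_t v = stack_mem[esp / 4]; esp += 4; return v; }
static int32_t load(int32_t addr) { return stack_mem[addr / 4]; }

/* Stand-in for g: its argument sits one word past the return address. */
static void g_body(void) {
    int32_t x = load(esp + 4);
    eax = x * x;
}

/* The body of add2, one statement per assembler instruction. */
static void add2_body(void) {
    push(ebp);               /* push ebp */
    ebp = esp;               /* mov  ebp, esp */
    eax = load(ebp + 12);    /* mov  eax, DWORD PTR [ebp+12] */
    eax += 1;                /* add  eax, 1 */
    push(eax);               /* push eax */
    push(0);                 /* call g: pushes a (dummy) return address ... */
    g_body();
    (void)pop();             /* ... which g's ret pops */
    esp += 4;                /* add  esp, 4 */
    ebp = pop();             /* pop  ebp */
}

/* What main does for add2(a, b): push args in reverse order, then call. */
static int32_t call_add2(int32_t a, int32_t b) {
    push(b);
    push(a);
    push(0);                 /* dummy return address pushed by call */
    add2_body();
    (void)pop();             /* add2's ret pops the return address */
    esp += 8;                /* assumed: the caller removes the two arguments */
    return eax;
}
```

Inside add2_body the words at ebp, ebp+4, ebp+8, and ebp+12 are the saved fp, the return address, the first argument, and the second argument, which is why the offset is 12. After call_add2 returns, esp and ebp are back where they started.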

T-5.3: Instruction Formats

5.3.1: Design Criteria for Instruction Formats

5.3.2: Expanding Opcodes

T-5.3.3: The Pentium II (x86) Instruction Formats

The formats are very complicated, as is clear from looking at Figure 5-14 and reading the accompanying text. This makes it difficult for the hardware designer, which is not our problem.

It also makes the assembly language somewhat irregular. Specifically, it is not true that the 8 main registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP can be used interchangeably; certain instructions can use only certain registers.

5.3.4: The UltraSPARC II Instruction Formats

5.3.5: The JVM Instruction Formats

T-5.4: Addressing

T-5.4.1: Addressing Modes

Most instructions have one or more operands, each of which is specified by a corresponding field in the instruction. It is the addressing mode that determines how the operand is found given the address field.

T-5.4.2: Immediate Addressing

In this, the simplest form, the address field is not an address at all but is the operand. In this case we say the instruction has an immediate operand, because it can be determined immediately from the instruction (without requiring an additional memory reference).

T-5.4.3: Direct Addressing

Almost as simple, and better fitting the name address field, is for the address field to contain the address of the operand. So if the address field is 12, the operand is the contents of the 32-bit word (or 64-bit word, or 16-bit word, or 8-bit byte) specified by location 12.

T-5.4.4: Register Addressing

In this mode the operand is the register specified by the address field. So if the address field is 12 the operand is the contents of the register with address 12 (normally called simply register 12). This mode is very common and very fast.

T-5.4.5: Register Indirect Addressing

Using the terminology of C (and other high-level languages), this mode is just the de-reference operator applied to the previous mode. So if the address field is 12 the operand is determined by a two step process: first register 12 is examined. Say its value is 22888. Then the operand is the contents of the word (or byte, etc.) specified by location 22888.

T-5.4.6 Indexed Addressing

In this addressing mode, two values are used to determine the address: one is used to specify a register and the second is a constant that is added to the contents of the register. The resulting sum is used as a memory address, the contents of which is the operand.

Why is this useful and why is it called indexed? Consider

    for (i=0; i<10; i++)
        A[i] = 0;

and assume that the array A is global (so that its address is known before the program begins execution).

What is the address referred to by A[i]?
It is the address of A[0] plus 4 times the value of i. The former is a constant (let's say it is 1280) and we use a register for the latter so the assembler loop would have body

    mov DWORD PTR [1280+EAX], 0     # 1280 is the address of A[0], a known constant
    add EAX, 4

Note that EAX is serving as the index i in the C code (more precisely, it holds 4×i, the byte offset of A[i]). Hence the name, indexed addressing.
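
The same address arithmetic can be written out by hand in C. The sketch below zeros an array by adding a byte offset to the base address, which is what the assembler loop does; the function name zero_indexed is made up, and the variable off plays the role of EAX.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero n 32-bit elements, addressing each one as base + byte offset,
   the way the assembler loop uses [1280+EAX]. */
static void zero_indexed(int32_t *A, size_t n) {
    unsigned char *base = (unsigned char *)A;     /* plays the role of 1280 */
    const int32_t zero = 0;
    for (size_t off = 0; off < 4 * n; off += 4)   /* off plays the role of EAX */
        memcpy(base + off, &zero, sizeof zero);   /* mov DWORD PTR [base+off], 0 */
}
```

Normally you would just write A[i] = 0 and let the compiler generate the offset; the point here is only that the address of A[i] is the address of A[0] plus 4 times i.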

T-5.4.7: Based-Indexed Addressing

If one register is good, two are better (or at least more general). In this mode, the contents of two registers are added to a constant. Consider again

    for (i=0; i<10; i++)
        A[i] = 0;

but this time assume the array A is on the stack. Specifically, assume A[0] is 1000 bytes below SP, the top of the stack. Register ESP typically holds SP so the loop body in assembler would be

    mov DWORD PTR [ESP+EAX+1000], 0
    add EAX, 4

T-5.4.8: Stack Addressing

Reverse Polish Notation

Evaluation of Reverse Polish Notation Formulas

T-5.4.9: Addressing Modes for Branch Instructions

T-5.4.10: Orthogonality of Opcodes and Addressing Modes

T-5.4.11: The Pentium 4 Addressing Modes

The x86 is quite irregular: not all addressing modes are available for all instructions and not all registers can be used for all addressing modes.

The machine has both 16-bit and 32-bit flavors of operations; we are only studying the 32-bit versions.

The x86 is a two operand machine, but at most one operand can be a memory location.

The x86 supports immediate, direct, register, register indirect, indexed, and based-indexed addressing. Based-indexed uses an extra byte of instruction called the SIB (Scale, Index, Base), which specifies both the base and index registers and also a scale of 1, 2, 4, or 8 that is multiplied with the index register. The scale permits the index register to represent the number of bytes, (16-bit) words, double words, or quad words by which the effective address is displaced from the base.
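
The address computed in this mode is base + index×scale + constant. A minimal C sketch of that formula follows; the function name effective_address is invented, and returning 0 for a scale other than 1, 2, 4, or 8 is just a convention chosen for this example, not something the hardware does.

```c
#include <stdint.h>

/* Effective address for based-indexed addressing with a scale factor:
   base + index*scale + displacement.  Returns 0 for a scale the SIB byte
   cannot encode (a convention chosen just for this sketch). */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  uint32_t scale, uint32_t disp) {
    if (scale != 1 && scale != 2 && scale != 4 && scale != 8)
        return 0;
    return base + index * scale + disp;
}
```

With a scale of 4 the index register can hold i itself rather than 4×i: for base 1000, index 3, scale 4, and displacement 8 the effective address is 1020.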

I do not understand why Tanenbaum does not consider addresses using SIB to be employing based-index mode.

T-5.4.12: The UltraSPARC Addressing Modes

T-5.4.13: The 8051 Addressing Modes

T-5.4.14: Discussion of Addressing Modes

T-5.6: Flow of Control

T-5.7: A Detailed Example: The Towers of Hanoi

T5.8: The IA-64 Architecture and the Itanium2

T5.6: Summary

Appendix TA: Binary Numbers

T-A.1: Finite Precision Numbers

In mathematics, integers have infinite precision. That is, we use as many digits as are needed, without limit.

Some software systems offer this as well (up to the memory limits of the computer). However, we will be looking at the native hardware support for integers (we will not do floating point, which is a little more complicated). On most systems you can buy today, the normal integer is 32 bits or 64 bits. That means you write integers using 32 bits (or 64 bits, but we will concentrate on 32-bit systems). If an integer requires more than 32 bits it cannot be expressed using the native hardware representation of integers.

This possibility of a number not being expressible leads to anomalies, such as overflow. We will learn the representation shortly, but for the moment note that the largest integer expressible in the native 32-bit system is 2³¹-1=2,147,483,647. Thus

    (2,000,000,000-1,000,000,000)+1,000,000,000 ≠
    (2,000,000,000+1,000,000,000)-1,000,000,000

Specifically, the first computation yields the mathematically correct answer of 2,000,000,000; whereas, the second gives no answer since an overflow occurs during the addition.
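
Here is a small C sketch of the same experiment. To avoid relying on what the hardware (or the C language) does when a 32-bit addition overflows, it does the arithmetic in 64 bits and checks after each step whether the intermediate value still fits in 32 bits. The names fits_int32 and checked_eval are made up for this example.

```c
#include <stdint.h>

/* Does the mathematically correct value v fit in a 32-bit signed integer? */
static int fits_int32(int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Evaluate (a op1 b) op2 c with each step checked; ops are '+' or '-'.
   Returns 1 and stores the result if no step overflows 32 bits, else 0. */
static int checked_eval(int32_t a, char op1, int32_t b, char op2, int32_t c,
                        int32_t *result) {
    int64_t t = (op1 == '+') ? (int64_t)a + b : (int64_t)a - b;
    if (!fits_int32(t))
        return 0;
    t = (op2 == '+') ? t + c : t - c;
    if (!fits_int32(t))
        return 0;
    *result = (int32_t)t;
    return 1;
}
```

Subtracting first succeeds and yields 2,000,000,000; adding first fails at the intermediate 3,000,000,000, which exceeds 2,147,483,647.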

T-A.2: Radix Number Systems

We write our numbers in the radix-10 system. That is the digits read from the right tell you how many 1s, how many 10s, how many 100s, etc. Note that 1=10⁰, 10=10¹, 100=10², etc. (Some ancient civilizations used other radices—or radixes.)

Almost all computers use radix 2; that is what we shall use. So the bits (sometimes called binary digits) from right to left tell you how many 1s, 2s, 4s, 8s, etc., where 1=2⁰, 2=2¹, 4=2², 8=2³, etc.

A.3: Conversion from one Radix to Another

It is very easy to convert from radix 2 to any radix 2ⁿ. You simply group n of the bits together to form a single digit in radix 2ⁿ.

Do this on the board for octal (radix 8=2³) and hexadecimal (radix 16=2⁴).
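
For reference, the grouping can also be sketched in C. The function below handles hexadecimal (groups of 4 bits); octal would be the same with groups of 3. The name binary_to_hex is invented, and the input is assumed to be a string containing only the characters 0 and 1.

```c
#include <string.h>

/* Convert a string of '0'/'1' characters to hexadecimal by grouping the
   bits four at a time, starting from the right.  out must have room for
   (strlen(bits)+3)/4 + 1 characters. */
static void binary_to_hex(const char *bits, char *out) {
    static const char digits[] = "0123456789ABCDEF";
    size_t n = strlen(bits);
    size_t ndigits = (n + 3) / 4;
    for (size_t d = 0; d < ndigits; d++) {
        /* The group for output digit d ends at bit position end (exclusive). */
        size_t end = n - 4 * (ndigits - 1 - d);
        size_t start = end >= 4 ? end - 4 : 0;   /* leftmost group may be short */
        int value = 0;
        for (size_t i = start; i < end; i++)
            value = 2 * value + (bits[i] - '0');
        out[d] = digits[value];
    }
    out[ndigits] = '\0';
}
```

For example, 1100111 splits into the groups 110 and 0111, giving the hex digits 6 and 7.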

Converting from Binary to Decimal

You can simply follow the definition. For the binary number ABCDEFG (each letter is a bit), the decimal equivalent is

    A×2⁶+B×2⁵...+F×2¹+G×2⁰ = A×128+B×64...+F×2+G

Less work is to evaluate the equivalent expression

    G+2×(F+2×(E+2×(D+2×(C+2×(B+2×A)))))

from right to left (start with A, double it and add B, double the sum and add C, ...).
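
The double-and-add procedure is a one-line loop in C. This is a minimal sketch; the name binary_to_unsigned is made up, the input is assumed to contain only the characters 0 and 1 with the high-order bit first, and the value is assumed to fit in an unsigned.

```c
/* Evaluate a string of '0'/'1' characters, high-order bit first, as an
   unsigned integer: start with the first bit, then repeatedly double
   and add the next bit. */
static unsigned binary_to_unsigned(const char *bits) {
    unsigned value = 0;
    for (const char *p = bits; *p != '\0'; p++)
        value = 2 * value + (unsigned)(*p - '0');
    return value;
}
```

Applied to the string 1100111 the function returns 103.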

Converting from Decimal to Binary

Take the remainders obtained with successive divisions by two.

For example, take 103.

Dividing 103 by two gives a quotient Q=51 and a remainder R=1. This says the low order bit is 1.
Next divide the quotient 51 by 2 and get Q=25 & R=1. So the low order bits are now 11.
Dividing 25 by 2 gives Q=12 & R=1; low order bits are 111.
Dividing 12 by 2 gives Q=6 & R=0; low order bits are 0111.
Dividing 6 by 2 gives Q=3 & R=0; low order bits are 00111.
Dividing 3 by 2 gives Q=1 & R=1; low order bits are 100111.
Dividing 1 by 2 gives Q=0 & R=1; the quotient is now 0, so we are done and 103 is 1100111 in binary.
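
The successive-division procedure can be sketched in C as follows. The function name to_binary is invented for this example; since the remainders come out low-order bit first, they are collected in a scratch array and then reversed.

```c
/* Write the binary representation of n into out (high-order bit first)
   using successive division by two.  out needs room for one character
   per bit of an unsigned plus a terminating '\0'. */
static void to_binary(unsigned n, char *out) {
    char reversed[sizeof(unsigned) * 8];
    int len = 0;
    do {
        reversed[len++] = (char)('0' + n % 2);   /* remainder: next low-order bit */
        n /= 2;                                  /* quotient feeds the next step */
    } while (n != 0);
    for (int i = 0; i < len; i++)                /* remainders came out low bit first */
        out[i] = reversed[len - 1 - i];
    out[len] = '\0';
}
```

Calling to_binary(103, buf) with a sufficiently large buf fills it with 1100111.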

Homework: 1, 2, 3

T-A.4: Negative Binary Numbers

There are several schemes for representing negative binary numbers; we will study the one that is used on essentially all modern machines: two's complement.

Although we are interested in 32-bit systems, let's use 4 bits in this section since it will ease our task when we do arithmetic and draw pictures. There is basically no difference between n-bit and m-bit systems provided n and m are at least 3.

How Many Positive Values and How Many Negative Values

With 4 bits, we can express 16 numbers. One of these bit patterns must be used for zero, leaving 15 for positive and negative. Thus we cannot achieve the ideal of using all 16 values, having exactly one representation for each expressible value, and having the same number of positive and negative values. So we must give up one of these ideals. Possibilities include

1. Having two (or more) different representations for some values.
2. Having at least one bit pattern declared illegal.
3. Having a different number of positive and negative values.

Possibility 1 has been done (in this very building!). The CDC 6600, then the fastest computer in the world, used one's complement arithmetic, which has two expressions for zero (0000 and 1111).

Also one could use the bottom three bits to express 0-7 and declare the top bit as the sign so both 0000 and 1000 would be zero.

I don't know of a machine that ever did possibility 2.

The third possibility dominates and that is what we will study.

A Conceptual Understanding of Two's Complement

This text (and many others) just tells you how to do it: take the ordinary (bitwise) complement (called the one's complement) and then add 1 to get the two's complement.

That sounds too much like instructions to Merlin for my liking. So I will try to explain why it is done.

Recall we have zero and 15 additional numbers to split among the positive and negative values. Seven will be positive and eight negative. It will become clear later why we don't have 8 positive and seven negative.

Good news. The values from 0-7 are expressed as you would expect:
0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111.
The high order bit gives the sign and the bottom n-1=4-1=3 bits give the magnitude.

Bad news. The value -3 is not just the value for three with the sign bit 1.

Let's begin.

Note that the number 16 in binary is 10000, it is the first number that cannot be expressed in 4 bits. If we chop off the high bit (normally called HOB, high-order bit), 16 becomes 0. Said mathematically 16 mod 16 is 0.

Now what about -3? The definition of -3 is that it is the (unique) number that, when added to 3, gives 0. Instead of demanding to get 0 when added, we loosen the requirement to say that we want a number that, when added to 3, gives 0 mod 16.

There are lots of them: 13+3=16, which is 0 mod 16; 29+3=32, which also is 0 mod 16. But only one number in the range 0-15 has this property, and the numbers 0-15 are precisely those expressible in 4 bits.

Mathematically we are simply taking -3 mod 16.

So, for us -3 is the 4-bit representation of 13, which is 1101. If we simply add and throw away the 5th bit we see that 13+3 is indeed 0: 1101+0011=10000, which becomes 0000 when we throw away bit 5.

Recall that we define -n to be the number that, when added to n, gives 16; i.e., we just express -n as (-n) mod 16. This is all for n between 1 and 7.

The properties of mod permit us to prove the normal laws of inverses. For example, the inverse of n+m is

    -(n+m) mod 16 = ((-n)+(-m)) mod 16 = ([(-n) mod 16] + [(-m) mod 16]) mod 16

which is precisely the inverse of n plus the inverse of m (with the sum again reduced mod 16).

OK, but how do you calculate inverses on the computer? For our 4-bit system, do we calculate the inverse of 3 by evaluating -3 mod 16? With a 32-bit system, do we calculate the inverse of 1,000,000 by evaluating -1,000,000 mod 2³²?

There is indeed an easier way. Recall that to find the inverse of 3 we needed to find the number with the property that when added to 3 we get 16.

Written in 4-bit binary we want the number that when added to 0011 gives us 10000 (I know that is 5-bits).

Let's ask a different question. What is the number that, when added to 0011, gives 1111? That is easy: look at 0011 and take the complement, 1100. Then between the original and the complement each bit position has exactly one 1, so the sum is clearly 1111. (This number 1100 is called the one's complement.)

Since we really wanted to get 10000 not just 1111, we need to add one. This gives 1101, which is indeed the two's complement of 0011.

So the rule is: Take the bitwise complement and add 1, just as all the text books say. So for 4-bit numbers using two's complement arithmetic, -0011 is 1101. Said more simply -3 is 1101 in 4-bit 2's complement arithmetic.
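
Both views of negation, the complement-and-add-1 rule and the (-n) mod 16 definition, are easy to write down in C for our 4-bit system. This is just a sketch; the names negate4 and negate4_mod are made up, and a 4-bit value is kept in the low-order 4 bits of an unsigned.

```c
/* Negate a 4-bit two's complement pattern: complement the bits, add 1,
   and keep only the low-order 4 bits (i.e., toss any carry-out). */
static unsigned negate4(unsigned pattern) {
    return (~pattern + 1u) & 0xFu;
}

/* The same value computed from the definition: (-n) mod 16. */
static unsigned negate4_mod(unsigned pattern) {
    return (16u - (pattern & 0xFu)) % 16u;
}
```

For the pattern 0011 both functions give 1101, and the two agree for all 16 patterns.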

Negating Negative Numbers and the Lack of Ideal Behavior

It is not too hard to see that this same procedure works when the original number is negative. Let's try -(-3). We already know -3 is 1101. Complementing gives 0010 and adding 1 gives 0011. Success.

Addition works too. Compute -2=-0010=1101+1=1110; then (-2)+(-3)=1110+1101=11011; toss the HOB and the answer is 1011. Is this really -5? Does 1011+0101 give 16? Yes!

Is -(1011) equal to 5? Take the complement and add 1: 0100+1=0101=5.

Now the sad news. (-4)+(-4)=1100+1100=11000; toss the HOB and get 1000, which actually is -8. But negating 1000 gives 0111+1=1000, which is -8 again, not 8. Remember we can't have the same number of positives as negatives. So the range for 4-bit two's complement is -8,-7,...,0,...,6,7.
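
To check examples like these mechanically, it helps to have a way to read a 4-bit pattern as a signed value. The C sketch below does that and also adds two patterns while tossing the HOB carry; the names to_signed4 and add4 are invented for the example.

```c
/* Interpret a 4-bit pattern as a two's complement value in -8..7:
   patterns with the high-order bit set stand for pattern - 16. */
static int to_signed4(unsigned pattern) {
    pattern &= 0xFu;
    return (pattern & 0x8u) ? (int)pattern - 16 : (int)pattern;
}

/* Add two 4-bit patterns and toss the carry-out of the high-order bit. */
static unsigned add4(unsigned a, unsigned b) {
    return (a + b) & 0xFu;
}
```

Adding 1110 and 1101 gives 1011, which to_signed4 reads as -5; adding 1100 and 1100 gives 1000, which it reads as -8.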

T-A.5: Binary Arithmetic

Tanenbaum does both one's complement and two's complement arithmetic. We will just do the latter. As we indicated above you simply add the two's complement numbers with no thought of signs or complements. If you add two n-bit numbers you might get an (n+1)-bit number, i.e., you might get a carry-out of the high order bit. But the rule is simple: toss it!

Subtraction

The rule is the same as what you learned in elementary school, a-b=a+(-b). That is, you negate b (take its two's complement) and add. For example 5-3 is (0101)-(0011)=(0101)+(1101)=10010; toss the HOB and get 0010, which is 2.
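
In C the subtraction rule for our 4-bit system is a two-liner. This is only a sketch with a made-up name, sub4; a 4-bit value lives in the low-order 4 bits of an unsigned.

```c
/* a - b for 4-bit two's complement patterns, computed as a + (-b):
   negate b (complement and add 1), add, and toss the carry-out. */
static unsigned sub4(unsigned a, unsigned b) {
    unsigned neg_b = (~b + 1u) & 0xFu;
    return (a + neg_b) & 0xFu;
}
```

So sub4 applied to 0101 and 0011 gives 0010, and applied to 0011 and 0101 it gives 1110, the pattern for -2.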

Homework: 7.

Overflow

Unfortunately, although the above does describe (part of) the hardware, it doesn't always give the correct answer. As a simple example, with our 4-bit system we can express -8...7, but if you add 5+6 you should get 11. We cannot possibly get 11 since we can't express 11. Similarly if you add (-5)+(-6), you should get -11, which again we cannot even express.

When the result falls outside the expressible range, an overflow has occurred.

When you add numbers of opposite sign, overflow is impossible (the result is between the two original numbers).

As we have seen, subtracting numbers of the same sign is the same as adding numbers of opposite sign, so again overflow is impossible.

When you add numbers of the same sign (or subtract numbers of the opposite sign) overflow is possible. The question is: when does it occur?

The answer is simple to state but not so simple to explain (you need to analyze several cases): An overflow occurs if and only if the carry into the HOB does not equal the carry out from the HOB.
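
The carry rule can be tried out on the 4-bit system with a few lines of C. In this sketch (the name add4_overflow is made up) the carry into the HOB is found by adding just the low-order 3 bits, and the carry out of the HOB by adding all 4 bits.

```c
/* Add two 4-bit two's complement patterns.  Stores the 4-bit sum in *sum
   and returns 1 if overflow occurred, detected by comparing the carry
   into the high-order bit with the carry out of it. */
static int add4_overflow(unsigned a, unsigned b, unsigned *sum) {
    a &= 0xFu;
    b &= 0xFu;
    unsigned carry_in  = ((a & 0x7u) + (b & 0x7u)) >> 3;   /* carry into bit 3 */
    unsigned carry_out = (a + b) >> 4;                      /* carry out of bit 3 */
    *sum = (a + b) & 0xFu;
    return carry_in != carry_out;
}
```

For 5+6 (0101+0110) the carry in is 1 and the carry out is 0, so overflow is reported; for 5+(-3) (0101+1101) both carries are 1, so there is no overflow and the sum is 0010.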

Homework: 9.

VPN	PPN	Valid
00	28	1
01	-	0
02	33	1
03	02	1
04	-	0
05	16	1
06	-	0
07	-	0
08	13	1
09	17	1
0A	9	1
0B	-	0
0C	-	0
0D	2D	1
0E	11	1
0F	0D	1
