CSCI-UA.0202 Spring 2015 Lab 1: C review, WeensyOS MiniprocOS, git tutorial

Handed out Thursday, January 29, 2015
Due 9:00 PM, Friday, February 6, 2015

Lab 1: Setup and C review, WeensyOS MiniprocOS, git

Introduction

This lab is split into three parts. In the first part, you will get your development environment set up, and quickly review C pointers. In the second part, you will do the heart of this lab: a small coding exercise (WeensyOS, written by Eddie Kohler) in which you manipulate processes (you will implement a basic fork call!). The third part gives an introduction to Git, a version control system that you will use this semester to manage your work.

This lab does not require you to write much code. So take time to understand the code, tools, and environment that you have been given; you will need to be fluent in these things in order to be productive this semester.

You will electronically hand in code and answers (the mechanics are described at the end of the lab; sneak preview). You will hand in responses to numbered exercises. Do your work in the code files themselves and in a file called answers.txt (your answers must be in this file in text; a blank answers.txt will receive no credit).

Please note the following expectation, which we will have throughout the semester. You must comment every block of code that you write, and you will be graded on these comments. As a rough guideline, figure that you need a one-sentence comment every four lines or so. In these comments, you must explain the purpose of the code, rather than narrating the instruction that you gave to the computer. Examples of not-good comments are, "assign variable X to variable Y", "dereference pointer P", or "the following code is a for loop that iterates over the foo structure." An example of a better comment would be, "Initialize the frobnitz data structure." To be clear: in the real world, it is terrible style to comment code that is obvious or self-documenting. Why are we departing from that? Because, this being a class, we want to make sure that you understand and can explain your code.

As an exception, the preceding expectation does not apply to the exercises in part 1 below.

Part 1: Getting started, C review

Set up your development environment

You will perform the labs on one of two platforms: a virtual devbox (a machine with all of the required development software installed on it), or the machines provided to us by CIMS (The Courant Institute). It's time for you to set up that platform:

Go to the setup page, and follow the instructions there. Come back here when that step is complete.

Get the code

The files that you will need for this and subsequent lab assignments in this course are distributed using the Git version control system. We will learn a bit more about Git at the end of this lab.

The URL for the course Git repository is http://www.cs.nyu.edu/~mwalfish/classes/15sp/cs202-labs.git

To get the labs, go into your development environment (the devbox or the CIMS machines) and run the commands below (the lines preceded with '#' are comments):

$ mkdir ~/cs202
$ cd ~/cs202
# You must execute the line below; NOT doing so violates the course policies
$ chmod 0700 . 
# The 'dot' at the end of the next line tells git not to add another 
# layer in the directory hierarchy: the clone will be into the current directory
$ git clone http://www.cs.nyu.edu/~mwalfish/classes/15sp/cs202-labs.git . 
Cloning into ....
$

Git allows you to keep track of the changes you make to the code. For example, if you are finished with one of the exercises, and want to checkpoint your progress, you can commit your changes by running:

$ git commit -am 'my solution for lab1 exercise1'
Created commit 60d2135: my solution for lab1 exercise1
 1 files changed, 1 insertions(+), 0 deletions(-)
$

You can keep track of your changes by using the git diff command. Running git diff will display the changes to your code since your last commit, and git diff origin/lab1 will display the changes relative to the initial code supplied for this lab. Here, origin/lab1 is the name of the git branch with the initial code you downloaded from our server for this assignment.

Review of C pointers

All of the projects in this class will be done in C or C++, for two main reasons. First, some of the things that we want to implement require direct manipulation of hardware state and memory, which are operations that are naturally expressed in C. Second, C and C++ are widely used languages, so learning them is a useful thing to do in its own right. While you have some experience in C from CS201, you will probably need greater comfort and familiarity here than was required there. You will truly need to "think in C."

If you are interested in why C looks like it does, we encourage you to look at Ritchie's history. Here, perhaps, is the key quotation: "Despite some aspects mysterious to the beginner and occasionally even to the adept, C remains a simple and small language, translatable with simple and small compilers. Its types and operations are well-grounded in those provided by real machines, and for people used to how computers work, learning the idioms for generating time- and space-efficient programs is not difficult. At the same time the language is sufficiently abstracted from machine details that program portability can be achieved."

You will do four short exercises in C, as a warmup. From your cs202 directory:

$ cd lab1/review-pointers

Exercise 1. Implement the function set_to_five in ex1.c.

For Exercises 1 through 4, you can test with:

$ make

Later in the lab, we will revisit the program make. For now, think of it as an automatic way to invoke the compiler, and link your compiled code to a small test harness that we provide. Now invoke ex1:

$ ./build/ex1
ex1: OK

If your ex1.c is correctly implemented, you will see OK, as above.

Exercise 2. Implement the function swap in ex2.c.

You will need to remove the assert(0); line. Remove this wherever you see it as you implement functions in future exercises; it is a reminder that you have yet to implement a particular function.

As above, you compile and test with:

$ make

Now you should see

$ ./build/ex2
ex2: OK

The assert line you saw in the previous exercise is a simple application of a powerful tool. In C, an assert is a preprocessor macro which effectively enforces a contract in the program. (You can read more about macros in C here.) The contract that assert enforces is simple: when program execution reaches assert(<condition>), if condition is true, execution continues; otherwise, the program aborts with an error message.

Assertions, when used properly, are powerful because they allow the programmer who uses them to guarantee that certain assumptions about his or her code hold. For example, in the swap function that you just implemented, there are two assertions at the beginning of the function, before where your code should have been placed:

void
swap(int *p1, int *p2)
{
        assert(p1 != NULL);
        assert(p2 != NULL);

	...
}

These two assertions combined enforce the contract that neither of the parameters p1 or p2 can be NULL. If these assertions were not present and either of these passed parameters were NULL, if we tried to swap them, we would encounter a type of error called a segmentation violation (or segmentation fault). This is because dereferencing the NULL address is invalid; NULL points to "nothing". By using assertions, we guarantee that swap will never try to swap the value of a variable at NULL, saving us the headache of having to debug a segmentation fault if some code tried to pass swap a NULL value. Instead, we will get a handy error message describing exactly what contract of the function was invalidated.

Exercise 3. Implement the function array_sum in ex3.c.

Test it:

$ make
gcc -m32 -g -c ex3.c -o build/ex3.o
gcc -m32 static/part3_harness.o build/ex3.o -lm -o build/ex3
$ ./build/ex3
ex3: OK

Again, you should see "OK".

Implement the set_point and point_dist functions in ex4.c.

Look in ex4.h for the definition of struct point, which represents a point in a two-dimensional plane.

Test it:

$ make
gcc -m32 -g -c ex4.c -o build/ex4.o
gcc -m32 static/part4_harness.o build/ex4.o -lm -o build/ex4
$ ./build/ex4
set_point ok
point_dist ok

If you implemented ex4.c correctly, you will see the printout above.

Part 2: WeensyOS assignment: MiniprocOS

The heart of this lab is the first WeensyOS. WeensyOS refers to a series of small coding assignments that are also complete operating systems. You could boot a WeensyOS operating system on real x86-compatible hardware! The purpose of the WeensyOS assignments is first, to teach some of the class's concepts through example, and second, to demystify operating systems in general.

WeensyOS was developed by Eddie Kohler; the code that we use, and the description below (narrative, exercises, etc.), is borrowed from him. One of his goals in developing these labs was to make them fun. We think he succeeded!

The first WeensyOS problem set is called MiniprocOS, and it concerns processes. MiniprocOS is a tiny operating system that supports the major process primitives: creating a new process with fork, exiting a process, and waiting on a process that has exited. The only thing missing -- and it's a big one -- is process isolation: MiniprocOS processes actually share a single address space. (In a later WeensyOS assignment you will implement process isolation for memory.) In this assignment, you will actually implement the code that forks a new process, and see how system calls are implemented. You will also update the code that waits on an exited process to avoid busy waiting.

x86 emulator and other tools

You could take the disk image file built in this assignment, write it to your laptop's hard drive, and boot up your operating system directly if you wanted! However, it's much easier to work with a virtual machine or PC emulator.

An emulator mimics, or emulates, the behavior of a full hardware platform. A PC emulator acts like a Pentium-class PC: it emulates the execution of Intel x86 instructions and the behavior of other PC hardware. For example, it can treat a normal file in your home directory as an emulated hard disk; when the program inside the emulator reads a sector from the disk, the emulator simply reads 512 bytes from the file. PC emulators are slower than real hardware, since they do all of the regular CPU's job in software -- not to mention the disk controller's job, the console's job, and so forth. However, debugging with an emulator is a whole lot friendlier, and you can't screw up your machine!

We will be using the QEMU emulator. We will also be using a C compiler (gcc), configured to compile code for an x86 ELF target. (ELF, or Executable and Linkable Format, is a particular format for storing machine language programs on disk.) All of the required tools are installed on our two lab platforms (the virtual devbox or the CIMS machines), which you set up above.

Note that if you are using the virtual devbox option, then you are using an additional layer of emulation to run the devbox itself (in that Ubuntu Linux is made to "run" on your own laptop or desktop). You are probably using Virtual Box for this purpose. Note that Virtual Box is solving almost the same problem that QEMU is. This is explained further on our setup page.

Browsing MiniprocOS

You fetched the source at the beginning of the lab. Look over the source:


# this assumes we were in the review-pointers directory
$ cd ..
# now we are in the lab1 directory

$ ls lab1
answers.txt  conf       GNUmakefile  k-loader.c  mergedep.pl      process.h        x86.c
boot.c       const.h    kernel.c     lib.c       p-procos-app2.c  review-pointers  x86.h
bootstart.S  COPYRIGHT  kernel.h     lib.h       p-procos-app3.c  submit.py
build        elf.h      k-int.S      link        p-procos-app.c   types.h
$

Now it's time to give the OS a whirl.

Change into the lab1 directory and run the make program (which must be GNU make).

make, if you haven't heard of it, is a program that simplifies the process of building software projects. The user writes a set of rules, called a Makefile (ours is called GNUmakefile) that tells the make program what to build. For example, a Makefile might say, "to compile a C program, run the gcc compiler; and by the way, I want to compile the program named hello, which depends on the C source file hello.c". Makefiles can be quite simple, although most medium-to-large projects have complex Makefiles. You'll be invoking a couple of Makefiles in the labs.

The WeensyOS GNUmakefile builds a hard disk image called procos.img, which contains the MiniprocOS "kernel" and two applications, p-procos-app.c and p-procos-app2.c.

Make's output should look something like this:

$ make
  HOSTCOMPILE build/mkbootdisk.c
  ASSEMBLE bootstart.S
  COMPILE boot.c
  LINK obj/bootsector
  ASSEMBLE k-int.S
  COMPILE kernel.c
  COMPILE x86.c
  COMPILE k-loader.c
  COMPILE lib.c
  COMPILE p-procos-app2.c
  LINK obj/p-procos-app2
  COMPILE p-procos-app3.c
  LINK obj/p-procos-app3
  COMPILE p-procos-app.c
  LINK obj/p-procos-app
  LINK obj/kernel
  CREATE procos.img
$

Now that you've built the OS disk image, it's time to run it! We've made it very easy to boot a given disk image; just run this command:

$ make run

This will start up QEMU. After a moment you should see a window like this!

Hit "1" to try to run the first application, and you should see a window like this:

To quit QEMU, type control-C in the terminal.

MiniprocOS Application

You're now ready to start learning about the OS code!

Start first with the application, p-procos-app.c. This application simply starts a single child process and waits for it to exit. It uses system calls that you saw in CS201 and that we also cover in this class: fork starts a new process; exit exits a process; and wait returns a process's exit status.

Read and understand the code in p-procos-app.c.

How are those system calls implemented? As you saw in CS201 (and as we will review in this class): to make a system call, the application program executes a trap: an instruction that initiates a protected control transfer to the kernel. The system call's arguments are often stored in machine registers, and that's how MiniprocOS does it. Likewise, the system call's results are often returned in a machine register. On Intel 80386-compatible machines (colloquially called "x86es"), the interrupt instruction is called int, and registers have names like %eax, %ebx, and so forth. A special C language statement, called asm, can execute the interrupt instruction and connect register values with C-language variables.

Read and understand the comments in process.h. This file defines MiniprocOS's system calls. Also glance through the code, to see how system calls actually work!

The MiniprocOS kernel handles these system calls.

This kernel is different from conventional operating system kernels in several ways, mostly to keep the kernel as small as possible. For one thing, the kernel shares an address space with user applications, so that user applications could write over the kernel if they wanted to. This isn't very robust, since the kernel is not isolated from user faults, but for now it is easier to keep everything in the same address space. Another difference is that MiniprocOS implements cooperative multitasking, rather than preemptive multitasking. That is, processes give up control voluntarily, and if a process went into an infinite loop, the machine would entirely stop. In preemptive multitasking, the kernel can preempt an uncooperative process, which forces it to give up control. Preemptive multitasking is more robust than cooperative multitasking, meaning it's more resilient to errors, but it is slightly more complex. All modern PC-class operating systems use preemptive multitasking for user-level applications, but the kernel itself usually switches between internal tasks using cooperative multitasking.

MiniprocOS's main kernel structures are as follows.

struct process_t: This is the process descriptor structure, which stores all the relevant information for each process. It is defined in kernel.h.
process_t proc_array[];: This is an array of process descriptor structures, one for each possible process. MiniprocOS supports up to 15 concurrent processes, with process IDs 1 to 15. The process descriptor for process I is stored in proc_array[I]. Initially, only one of these processes is active, namely proc_array[1]. The proc_array[0] entry is never used.
process_t *current;: This points to the process descriptor for the currently running process.

The code in kernel.c sets up these structures. In particular, the start() function initializes all the process descriptors.

Read and understand the code and comments in kernel.h. Then read and understand the memory map in kernel.c, the picture at the top that explains how MiniprocOS's memory is laid out. Then look at start().

The code you'll be changing in MiniprocOS is the function that responds to system calls. This function is called interrupt().

Read and understand the code for interrupt() in kernel.c. Concentrate on the simplest system call, namely sys_getpid/INT_SYS_GETPID. Understand how the sys_getpid application function (in process.h) and the INT_SYS_GETPID clause in interrupt() (in kernel.c) interact.

Exercise 5. Answer the following question (in answers.txt): Say you replaced run(current) in the INT_SYS_GETPID clause with schedule(). The process that called sys_getpid() will eventually run again, picking up its execution as if sys_getpid() had returned directly. When it does run, will the sys_getpid() call have returned the correct value? Why or why not?

You may have noticed, though, that the sys_fork() system call isn't working! Your job is to write the code that actually creates a new process.

Exercise 6. Fill out the do_fork() and copy_stack() functions in kernel.c.

Congratulations, you've written code to create a process -- it's not that hard, no? (Our version is less than 20 lines of code.) Here's what you should see when you're done:

Now take a look at the code in p-procos-app.c that calls sys_wait(). Also look at the INT_SYS_WAIT implementation in kernel.c. The current system call design uses a polling approach: to wait for process 2 to exit, a process must call sys_wait(2) over and over again until process 2 exits and the sys_wait(2) system call returns a value different from WAIT_TRYAGAIN.

We'll see more about polling later in the semester, but for now, notice that polling approaches like this often reduce utilization. A process uses CPU time to call sys_wait(2) over and over again, leaving less CPU time for others. An alternative approach, which can improve utilization, is called blocking. A blocking implementation would put sys_wait(2)'s caller to sleep, then wake it up once process 2 had exited and a real exit status was available. The sleeping process doesn't use any CPU. A process that is asleep because the kernel is waiting for some event is called blocked.

Exercise 7. Change the implementation of INT_SYS_WAIT in kernel.c to use blocking instead of polling. In particular, when the caller tries to wait on a process that has not yet exited, that process should block until the process actually exits.

Important Hint: Make sure that your blocking version of sys_wait() has exactly the same user-visible behavior as the original version, except that it blocks and so never returns -2. See process.h for an English description of the current behavior.

To implement Exercise 7, you will probably want to add a field to the process descriptor structure. This field will indicate whether or not a process is waiting on another process. You will change INT_SYS_WAIT to add the calling process to this "wait queue", and INT_SYS_EXIT to wake any processes that were on the "wait queue". There are several ways to do this; describe how you did it in answers.txt.

To check your work, try changing the sys_wait() loop in p-procos-app.c to look like this:

		do {
			status = sys_wait(p);
			app_printf("W");
		} while (status == WAIT_TRYAGAIN);

A polling implementation of sys_wait would produce output like this:

You want it to produce output like this:

Cleaning Up Processes

Now try running the other MiniprocOS application. You should see something like this (different processes generally print their lines in different colors):

The MiniprocOS2 application, in p-procos-app2.c, tries to run 1024 child processes.

Read and understand p-procos-app2.c.

Unfortunately, your current kernel code doesn't seem able to run more than 15 total processes, ever! It looks like old, dead processes aren't being cleaned up, even after we call sys_wait() on them. This is what we call a bug.

Exercise 8. Find and fix this bug.

When you've completed this exercise, the application should count up to 1024, like this:

Your colors may differ, however, depending on how you implement sys_wait(). One common implementation strategy ends with several red lines in a row. If you see this in your code, try to figure out why!

Honors supplement. Honors students (section 001) should choose ONE of the following two exercises; students in section 002 are free (even encouraged) to do the exercises, but there will be no credit given:

Exercise 9 (Honors). MiniprocOS miniprocesses have some aspects of threads. For instance, like threads, they all share a single address space. A big difference from threads is that we create a new process by forking. New threads are created in a different way. Introduce a new system call,

pid_t sys_newthread(void (*start_function)(void));

that creates a new process in a thread-like way. The new process should start with an empty stack, not a copy of the current stack. Rather than starting at the same instruction as the parent, the new thread should start by executing the start_function function: that is, that function's address becomes the new thread's instruction pointer.

Exercise 10 (Honors). Introduce a sys_kill(pid) system call by which one process can make another process exit. Use this system call to alter p-procos-app2.c's run_child() function so that the even-numbered processes kill off all odd-numbered processes (except for thread 1). Running the application should print out "Process N lives" messages only for even-numbered values of N.

Extra credit. For extra credit (honors or regular), do the following exercise.

Exercise 11 (Extra-credit). Our version of sys_fork(), with its dirt simple stack copying strategy, works only for simple programs. For example, consider the following function definition:

void pmain(void) {
        int x = 0;  /* note that local variable x lives on the stack */
        /* YOUR CODE HERE */
        pid_t p = sys_fork();
        if (p == 0)
                /* YOUR CODE HERE */;
        else if (p > 0)
                sys_wait(p); // assume blocking implementation
        app_printf("%d", x);
        sys_exit(0);
}

In a system with true process isolation, the child process's x and the parent process's x would be different variables, and changes in one process would not affect the other's x. But in MiniprocOS, this is not always the case! For this exercise, produce a version of that code with the following properties:

The code uses only local variables.
In a system with correct process isolation, the code would print "10".
In MiniprocOS, the code would print "11".

Hint: It isn't easy to get this to work because the compiler tends to optimize away important assignment statements or shift them to unfortunate places. Mark a variable as volatile to tell the compiler not to optimize references to it. Doing this correctly is tricky, but if you can understand the difference between

volatile
int *x

and int * volatile x you can do this problem.

This completes the WeensyOS portion of this lab. At this point, you should commit your work:

# see what files you changed
$ git status

# examine your changes
$ git diff

$ git commit -am "My solutions to lab1"

The next (final) part of the lab will shed some light on what the commands above are actually doing.

When you try to commit for the first time, you may see an error message:

Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com
...

This happens because you have not configured git, and it cannot figure out your name and email address (required by default for creating a commit). Type the following:

$ git config --global user.name "<your_name>" 
$ git config --global user.email <your_email_address>

and then re-run the git commit command above.

Part 3: Using Git

This course will use Git to distribute code for all programming assignments. Git is a distributed (as opposed to centralized) version control system. If used correctly, it can be very useful for tracking your work, undoing mistakes, trying different approaches, etc.

Read Eddie Kohler's excellent introduction to git. Where this page mentions code.seas.harvard.edu, read that as cs.nyu.edu (where you cloned from, above); where it mentions "the CS50 appliance", read that as our lab platform (the devbox or your CIMS account).

Note that life with git will be far easier at first if you use a graphical browser. Gitk is a good one that is installed on the lab platform. Just type gitk while you are in a git repo (if you haven't heard the term "repo" before, it's short for "repository"; we and many others sometimes use this abbreviation).

Last, note that our reference page has tutorials and references on git.

Other git commands that may ultimately come in handy (if not in this class than in later projects) are git stash (type

man
git-stash

to learn about it), git rebase (ditto: man git-rebase), and git-cherry-pick.

We have prepared a couple of very brief exercises using git. First, go to your cs202/ directory. Now, type:

$ git checkout -b git-lab/mergeA origin/git-lab/mergeA

This will create a new branch named git-lab/mergeA. This branch should only have a single file called merge.c. Type ls to verify. This is a simple program that simply prints three characters to the screen. (Note: You may see other files besides merge.c, if you created them but did not check them into your lab1 branch. These files are not tracked by git, so they persist across checkouts). You may compile and run this program by typing:

$ make merge   # To compile
$ ./merge

Exercise 12. (This goes in answers.txt.) What is the name of the first commit to the tree in which this branch resides? (Hint: This is much easier if you use a git browser. An example is gitk, which is installed on our lab platform.)

Now we will create another branch called git-lab/mergeB. Type:

$ git checkout -b git-lab/mergeB origin/git-lab/mergeB

This branch also contains a single file called merge.c. Verify with ls. This program is very similar to the one you saw before. The only difference is that it prints out a different set of characters. Make sure that this is so.

One of the strengths of git is that it allows multiple users to work from the same repository independently from each other. Eventually, though, all of the work must be merged together into a final product Usually, git will do this automatically for you. However, there are times when multiple users modify the same place in a file, in which case git cannot know whose work should be used (only a human can manually resolve a "conflict" of this kind). You will be doing such conflict resolution, but here and throughout the semester, you must be careful: a botched merge is a reliable source of headaches. The two branches that you have just created have been set up so that they will cause exactly such a conflict when merging. Type the following into a console:

$ git branch
  git-lab/mergeA
* git-lab/mergeB
  lab1
$ git merge git-lab/mergeA
Auto-merging merge.c
CONFLICT (content): Merge conflict in merge.c
Automatic merge failed; fix conflicts and then commit the results.

Now would be a good time to go back over Eddie Kohler's excellent introduction to git. See especially the section on Conflicts; search for the word "scary".

Find out what the base version of merge.c does, then resolve the merge conflict in merge.c so that the merged file behaves the same as the base version. Don't forget to commit the merge when you are done. Hint: Look at the common parents of the two branches. gitk will be useful for this.

Make sure your merged merge.c compiles and runs correctly.

Exercise 14. What is the name of your commited merge? Place your answer in answers.txt.

This completes the work of the lab. The final step is handing it in. Switch back to the lab1 branch (git checkout lab1).

Handing in the lab

To hand in the lab, you will interact with a submission server that we are running. This requires a one-time setup, good for this and all labs in this course. Here are the steps:

Go here, and fill in your NYU NetID.
Your NYU email account will receive a key that is associated with you and only you. (Do not share this key. Doing so would allow someone else to submit in your name, which not only exposes you to vandalism but also violates the course policies.)
Place this key into a file in your cs202 directory, on your platform (either the devbox or your account on the CIMS machines). Name the file cs202-auth-<your-netid>. To do this step, you can use scp, copy/paste, http fetch, etc.

Finally! You're ready to submit. Here's a checklist:

Make sure that you are on the lab1 branch, and inside your lab1 directory.
Make sure that you have answered all written, numbered exercises above in answers.txt.
Make sure you have answered all code exercises in the files, with all blocks of code commented, as described in the introduction.
Create a file called slack.txt noting how many slack hours you have used for this assignment. (This is to help us agree on the number that you have used.) Include this file even if you didn't use any slack hours.
Then do:
```
  $ make handin
```
This command does the following, on your behalf. It executes make clean (which removes .o files, executables, etc.). It then creates a copy of your lab1 directory, placing the copy in an archive (it uses the tar program for this purpose. Historical note: tar stands for "Tape ARchive", because the program was designed to backup files to tape). It then compresses the archive, creating a bundle called lab1-handin.tgz (such bundles are sometimes known as tarballs). Last, make handin invokes a script that submits this tarball to our submission server. You should see something like this:
```
  CLEAN
  CREATE lab1-handin.tgz
  SUBMITTING ... 
  ... progress ...
  Submission successful
```
If you submit multiple times, we will take the latest submission and count slack hours accordingly.

This completes the lab.

Last updated: Mon May 04 11:24:46 -0400 2015 [validate xhtml]