Lecture Notes for Computer Systems Design

======== START LECTURE #24 ========

Obtaining bus access

The simplest scheme is to permit only one bus master.
- That is, on each bus only one device is permitted to initiate a bus transaction.
- The other devices are slaves that only respond to requests.
- With a single master, there is no issue of arbitrating among multiple requests.
One can have multiple masters with daisy chaining of the grant line.
- Any device can assert the request line, indicating that it wishes to use the bus.
  - This is not trivial: uses ``open collector drivers''.
  - If no output drives the line, it will be ``pulled up'' to 5v, i.e., a logical true.
  - If one or more outputs drive the line to 0v it will go to 0v (a logical false).
  - So if a device wishes to make a request it drives the line to 0v; if it does not wish to make a request it does nothing.
  - This is (another example of) active low logic. The request line is asserted by driving it low.
- When the arbiter sees the request line asserted (and the previous grantee has issued a release), the arbiter raises the grant line.
- The grant signal is passed from one device to another if the first device is not requesting the bus. Hence devices near the arbiter have priority and can starve the ones further away.
- The device whose request is granted asserts the release line when done.
- Simple, but not fair and not of high performance.
Centralized parallel arbiter: Separate request lines from each device and separate grant lines. The arbiter decides which device should be granted the bus.
Distributed arbitration by self-selection: Requesting processes identify themselves on the bus and decide individually (and consistently) which one gets the grant.
Distributed arbitration by collision detection: Each device transmits whenever it wants, but detects collisions and retries. Ethernet uses this scheme (but modern switched Ethernets do not).
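The daisy-chain scheme above can be sketched in a few lines of Python. This is only an illustration, not real hardware: the function names are made up for these notes, device 0 is assumed to be the one nearest the arbiter, and the release line is not modeled.

```python
def wired_request_line(requests):
    """Level of the open-collector request line.

    The line is pulled up (True) unless at least one device drives it low.
    requests[i] is True when device i wants the bus.
    """
    return not any(requests)


def daisy_chain_grant(requests):
    """Return the index of the device that keeps the grant, or None."""
    if wired_request_line(requests):
        return None                 # line is high: nobody asserted the request
    for position, wants_bus in enumerate(requests):
        if wants_bus:
            return position         # a requesting device keeps the grant
        # a non-requesting device passes the grant down the chain
    return None


def count_grants(rounds, always_requesting, num_devices):
    """Count who wins when the same set of devices requests every round."""
    wins = {}
    for _ in range(rounds):
        requests = [i in always_requesting for i in range(num_devices)]
        winner = daisy_chain_grant(requests)
        wins[winner] = wins.get(winner, 0) + 1
    return wins


print(count_grants(10, {1, 3}, 4))   # {1: 10}
```

Device 3 never wins while device 1 keeps requesting, which is the starvation problem noted above.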

Option	High performance	Low cost
bus width	separate addr and data lines	multiplex addr and data lines
data width	wide	narrow
transfer size	multiple bus loads	single bus loads
bus masters	multiple	single
clocking	synchronous	asynchronous

Do on the board the example on pages 665-666

Memory and bus support two widths of data transfer: 4 words and 16 words
64-bit synchronous bus; 200MHz; 1 clock for addr; 1 for data.
Two clocks of ``rest'' between bus accesses
Memory access times: 4 words in 200ns; additional 4 word blocks in 20ns per block.
Can overlap transferring data with reading next data.
Find
1. Sustained bandwidth and latency for reading 256 words using both size transfers
2. How many bus transactions per sec for each (1 trans is addr+data).
Four word blocks
- 1 clock to send addr
- 40 clocks read mem
- 2 clocks to send data
- 2 idle clocks
- 45 total clocks
- 256/4=64 transactions needed so latency is 64*45*5ns=14.4us
- 64 trans per 14.4us = 64/14.4 trans per 1us = 4.44M trans per sec
- Bandwidth = 1024 bytes per 14.4us = 1024/14.4 B/us = 71.11MB/sec
Sixteen word blocks
- 1 clock for addr
- 40 clocks for reading first 4 words
- 2 clocks to send
- 2 clocks idle
- 4 clocks to read next 4 words. But this is free! Why?
  Because it is done during the send and idle of previous block.
- So we only pay for the long initial read
- Total = 1 + 40 + 4*(2+2) = 57 clocks.
- 16 transactions needed; latency = 57*16*5ns=4.56us, which is much better than with 4 word blocks.
- 16 transactions per 4.56us = 3.51M transactions/sec
- Bandwidth = 1024B per 4.56us = 224.56MB/sec
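The arithmetic for both block sizes can be checked with a short Python sketch. The names are invented for illustration. It assumes 4-byte words (256 words = 1024 bytes, as above), MB meaning 10^6 bytes, and that every 4-clock read after the first is fully hidden behind the send and idle clocks of the previous chunk.

```python
CLOCK_NS = 5          # 200MHz bus
BYTES_PER_WORD = 4    # assumption: 256 words = 1024 bytes
TOTAL_WORDS = 256


def clocks_per_transaction(block_words):
    """Clocks for one bus transaction that moves block_words words."""
    chunks = block_words // 4            # memory delivers 4 words at a time
    address = 1
    first_read = 200 // CLOCK_NS         # 200ns = 40 clocks
    send_and_idle = chunks * (2 + 2)     # per chunk: 2 clocks send, 2 idle
    return address + first_read + send_and_idle


def summarize(block_words):
    """Return (clocks, latency in us, M transactions/sec, MB/sec)."""
    transactions = TOTAL_WORDS // block_words
    clocks = clocks_per_transaction(block_words)
    latency_us = transactions * clocks * CLOCK_NS / 1000
    mtrans_per_sec = transactions / latency_us
    mbytes_per_sec = TOTAL_WORDS * BYTES_PER_WORD / latency_us
    return clocks, latency_us, mtrans_per_sec, mbytes_per_sec


for words in (4, 16):
    clocks, latency, mtrans, mbytes = summarize(words)
    print(f"{words:2d}-word blocks: {clocks} clocks, {latency:.2f}us, "
          f"{mtrans:.2f}M trans/sec, {mbytes:.2f}MB/sec")
```

Bytes per microsecond equals MB per second, so no further unit conversion is needed.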

8.5: Interfacing I/O Devices

Giving commands to I/O Devices

This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller. Note that a controller normally contains a microprocessor, but when we say the processor, we mean the central processor not the one on the controller.

The controller has a few registers that can be read and/or written by the processor, similar to how the processor reads and writes memory. These registers are also read and written by the controller.
Nearly every controller contains
- A data register, which is readable (by the processor) for an input device (e.g., a simple keyboard), writable for an output device (e.g., a simple printer), and both readable and writable for input/output devices (e.g., disks).
- A control register for giving commands to the device.
- A readable status register for reporting errors and announcing when the device is ready for the next action (e.g., for a keyboard telling when the data register is valid, and for a printer telling when the character to be printed has been successfully retrieved from the data register). Remember the communication protocol we studied where ack was used.
Many controllers have more registers

Communicating with the Processor

Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?

We get mail at home once a day.
At some business offices mail arrives a few times per day.
No problem checking once an hour for mail.
If email weren't buffered, you would have to check several times per minute (second?, millisecond?).
Checking email this often is too much of a burden and most of the time when you check you find there is none so the check was wasted.

Polling

Processor continually checks the device status to see if action is required.

Like the mail example above.
For a general purpose OS, one needs a timer to tell the processor it is time to check (OS issue).
For an embedded system (microwave) make the checking part of the main control loop, which is guaranteed to be executed at a minimum frequency (application software issue).
For a keyboard or mouse, which have very low data rates, the system can afford to have the main CPU check. We do an example just below.
It is a little better for slave-like output devices such as a simple printer. Then the processor only has to poll after a request has been made until the request has been satisfied.
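Polling a slave-like output device can be sketched as follows. The SimplePrinter class is a made-up software stand-in for a controller with one data register and one status register. It is not a real device interface, and the number of polls the device stays busy is an arbitrary assumption.

```python
class SimplePrinter:
    """Hypothetical printer controller with a data and a status register."""

    READY = 0x1   # status bit: data register may be written

    def __init__(self, busy_polls=3):
        self.status = self.READY
        self.printed = []
        self._busy_polls = busy_polls   # assumed device speed, in status reads
        self._countdown = 0
        self._data = None

    def write_data(self, ch):
        """Processor writes the data register; device becomes busy."""
        self._data = ch
        self.status = 0
        self._countdown = self._busy_polls

    def read_status(self):
        """Processor reads the status register (each read advances time)."""
        if self.status == 0:
            self._countdown -= 1
            if self._countdown <= 0:
                self.printed.append(self._data)   # character retrieved
                self.status = self.READY
        return self.status


def print_with_polling(printer, text):
    """Send text one character at a time; return the number of polls."""
    polls = 0
    for ch in text:
        printer.write_data(ch)
        while True:                      # poll only after a request is made
            polls += 1
            if printer.read_status() & SimplePrinter.READY:
                break
    return polls


printer = SimplePrinter()
print(print_with_polling(printer, "hi"), printer.printed)   # 6 ['h', 'i']
```

The processor polls only between issuing a request and seeing it satisfied, which is why this case is cheaper than polling an input device that may produce data at any moment.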

Do on the board the example on pages 676-677

Cost of a poll is 400 clocks.
CPU is 500MHz.
How much of the CPU is needed to poll
1. A mouse that requires 30 polls per sec?
2. A floppy that sends 2 bytes at a time and achieves 50KB/sec?
3. A hard disk that sends 16 bytes at a time and achieves 4MB/sec?
For the mouse, we use 12,000 clock cycles each second for polling (30 polls at 400 clocks each). The CPU runs at 500*10⁶ cycles/sec. So polling the mouse requires 12,000 / (500*10⁶) = 2.4*10⁻⁵ of the CPU. A very small penalty.
The floppy delivers 25,000 (two byte) data packets per second so we must poll at that rate not to miss one. CPU cycles needed each second is (400)(25,000)=10⁷. This represents 10⁷ / (500*10⁶) = 2% of the CPU.
To keep up with the disk requires 250K polls/sec or 10⁸ clock cycles per second, which is 20% of the CPU.
The system need not poll the floppy and disk until the CPU has issued a request. But then it must keep polling until the request is satisfied.
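The three polling costs follow from one formula, sketched here in Python. The function name is illustrative, and KB and MB are taken to mean 10^3 and 10^6 bytes as in the figures above.

```python
CPU_HZ = 500_000_000   # 500MHz processor
POLL_CLOCKS = 400      # cost of one poll


def polling_fraction(polls_per_sec):
    """Fraction of the CPU spent polling at the given rate."""
    return polls_per_sec * POLL_CLOCKS / CPU_HZ


mouse = polling_fraction(30)                   # 30 polls per sec
floppy = polling_fraction(50_000 // 2)         # 50KB/sec, 2 bytes per poll
disk = polling_fraction(4_000_000 // 16)       # 4MB/sec, 16 bytes per poll

print(f"mouse  {mouse:.6%}")    # 0.002400%
print(f"floppy {floppy:.0%}")   # 2%
print(f"disk   {disk:.0%}")     # 20%
```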

Interrupt driven I/O

Processor is told by the device when to look. The processor is interrupted by the device.

Dedicated lines (i.e. wires) on the bus are assigned for interrupts.
When a device wants to send an interrupt it asserts the corresponding line.
The processor checks for interrupts after each instruction. This requires ``zero time'' as it is done in parallel with the instruction execution.
If an interrupt is pending (i.e., if a line is asserted) the processor does the following (this is mostly an OS issue, covered in 202):
1. Saves the PC and perhaps some registers.
2. Switches to kernel (i.e., privileged) mode.
3. Jumps to a location specified in the hardware (the interrupt handler).
At this point the OS takes over.
What if we have several different devices and want to do different things depending on what caused the interrupt?
Use vectored interrupts.
- Instead of jumping to a single fixed location, the system defines a set of locations.
- The system might have several interrupt lines. If line 1 is asserted, jump to location 100, if line 2 is asserted jump to location 200, etc.
- Alternatively, the system could have just one line and have the device send the address to jump to.
There are other issues with interrupts that are taught in OS. For example, what happens if an interrupt occurs while an interrupt is being processed. For another example, what if one interrupt is more important than another. These are OS issues and are not covered in this course.
The time for processing an interrupt is typically longer than the time for a poll. But interrupts are not generated when the device is idle, a big advantage.
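The three hardware steps and the vectored jump can be sketched as a toy Python model. The dictionary standing in for the processor state and the rule that the lowest-numbered line wins are assumptions made only for this illustration; real priority handling is an OS topic, as noted above.

```python
# Line number -> handler location, using the example values from the notes.
VECTOR_TABLE = {1: 100, 2: 200}


def make_cpu():
    """Toy processor state: program counter, mode, and a saved PC slot."""
    return {"pc": 0, "mode": "user", "saved_pc": None}


def check_interrupts(cpu, asserted_lines):
    """Run after each instruction; vector to a handler if a line is asserted."""
    if not asserted_lines:
        return cpu                          # nothing pending: keep going
    line = min(asserted_lines)              # assumption: lowest line wins
    cpu["saved_pc"] = cpu["pc"]             # 1. save the PC
    cpu["mode"] = "kernel"                  # 2. switch to kernel mode
    cpu["pc"] = VECTOR_TABLE[line]          # 3. jump to that line's handler
    return cpu


cpu = make_cpu()
cpu["pc"] = 40
print(check_interrupts(cpu, {2}))
# {'pc': 200, 'mode': 'kernel', 'saved_pc': 40}
```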

Do on the board the example on pages 681-682.

Same hard disk and processor as above.
Cost of servicing an interrupt is 500 cycles.
The disk is active only 5% of the time.
What percent of the processor would be used to service the interrupts?
Cycles/sec needed for processing interrupts while the disk is active is 125 million.
This represents 25% of the processor cycles available.
But the true cost is only 1.25%, since the disk is active only 5% of the time.
Note that the disk is not active (i.e., actively generating interrupts) right after the request is made. Interrupts are not generated during the seek and rotational latency. They are generated only during the transfer itself.
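The interrupt cost can be checked the same way. This sketch assumes, as in the polling example, that the disk delivers 16 bytes per interrupt at 4MB/sec (MB = 10^6 bytes); the function name is illustrative.

```python
CPU_HZ = 500_000_000        # 500MHz processor
INTERRUPT_CLOCKS = 500      # cost of servicing one interrupt
DISK_BYTES_PER_SEC = 4_000_000
BYTES_PER_INTERRUPT = 16
ACTIVE_FRACTION = 0.05      # disk transfers only 5% of the time


def interrupt_overhead():
    """Return (cycles/sec while active, fraction while active, true fraction)."""
    interrupts_per_sec = DISK_BYTES_PER_SEC // BYTES_PER_INTERRUPT   # 250,000
    cycles_per_sec = interrupts_per_sec * INTERRUPT_CLOCKS
    while_active = cycles_per_sec / CPU_HZ
    return cycles_per_sec, while_active, while_active * ACTIVE_FRACTION


cycles, active, overall = interrupt_overhead()
print(cycles, f"{active:.0%}", f"{overall:.2%}")   # 125000000 25% 1.25%
```

Compare with polling, which costs 20% of the CPU whenever a request is outstanding, including during the seek and rotational latency.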

Direct Memory Access (DMA)

The processor initiates the I/O operation then ``something else'' takes care of it and notifies the processor when it is done (or if an error occurs).

Have a DMA engine (a small processor) on the controller.
The processor initiates the DMA by writing the command into data registers on the controller (e.g., read sector 5, head 4, cylinder 123 into memory location 34500).
For commands that are longer than the size of the data register(s), a protocol must be used to transmit the information.
(I/O done by the processor as in the previous methods is called programmed I/O, PIO).
The controller collects data from the device and then sends it on the bus to the memory without bothering the CPU.
- So we have a multimaster bus and need some sort of arbitration.
- Normally the I/O devices are given higher priority than the CPU.
- Freeing the CPU from this task is good but isn't as wonderful as it seems since the memory is busy (but cache hits can be processed).
- A big gain is that only one bus transaction is needed per bus load. With PIO, two transactions are needed: controller to processor and then processor to memory.
- This was for an input operation (the controller writes to memory). A similar situation occurs for output (the controller reads from memory). Once again one bus transaction per bus load.
When the controller detects that the I/O is complete or if an error occurs, it sets the status register accordingly and sends an interrupt to the processor to notify the latter that the I/O has finished.
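A toy Python model of a DMA input operation is given below. The DmaController class, its method names, and the 16-byte bus load are assumptions for illustration only; bus arbitration and error handling are left out.

```python
class DmaController:
    """Hypothetical controller that copies device data straight to memory."""

    def __init__(self, memory, on_interrupt):
        self.memory = memory              # bytearray standing in for RAM
        self.on_interrupt = on_interrupt  # called once when the I/O is done
        self.status = "idle"
        self.bus_transactions = 0

    def start_read(self, device_data, dest_addr, bus_load_bytes):
        """Input operation: the controller writes device_data into memory."""
        self.status = "busy"
        for offset in range(0, len(device_data), bus_load_bytes):
            chunk = device_data[offset:offset + bus_load_bytes]
            start = dest_addr + offset
            self.memory[start:start + len(chunk)] = chunk
            self.bus_transactions += 1    # one transaction per bus load
        self.status = "done"              # set the status register
        self.on_interrupt(self)           # then interrupt the processor


def pio_bus_transactions(total_bytes, bus_load_bytes):
    """PIO needs two transactions per bus load (to the CPU, then to memory)."""
    loads = -(-total_bytes // bus_load_bytes)   # ceiling division
    return 2 * loads


memory = bytearray(64)
interrupts = []
dma = DmaController(memory, interrupts.append)
dma.start_read(bytes(range(32)), dest_addr=8, bus_load_bytes=16)
print(dma.bus_transactions, pio_bus_transactions(32, 16), len(interrupts))
# 2 4 1
```

The processor is involved only at the start (issuing the command) and at the end (the single completion interrupt).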