Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109
Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best
======== START LECTURE #25 ========
Obtaining bus access
- The simplest scheme is to permit only one bus
master.
- That is, on each bus only one device is permitted to
initiate a bus transaction.
- The other devices are slaves that only
respond to requests.
- With a single master, there is no issue of arbitrating
among multiple requests.
- One can have multiple masters with daisy
chaining of the grant line.
- Any device can assert the request line, indicating that it
wishes to use the bus.
- This is not trivial: the request line uses ``open collector drivers''.
- If no output drives the line, it will be ``pulled up'' to
5v, i.e., a logical true.
- If one or more outputs drive the line to 0v it will go to
0v (a logical false).
- So if a device wishes to make a request it drives the line
to 0v; if it does not wish to make a request it does nothing.
- This is (another example of) active low logic. The
request line is asserted by driving it low.
- When the arbiter sees the request line asserted (and the
previous grantee has issued a release), the arbiter raises the
grant line.
- The grant signal is passed from one device to another if the
first device is not requesting the bus. Hence
devices near the arbiter have priority and can starve the ones
further away.
- The device whose request is granted asserts the release line
when done.
- Simple, but neither fair nor high performance. (A minimal Python
sketch of this scheme appears after the list of arbitration schemes
below.)
- Centralized parallel arbiter: Separate request lines from each
device and separate grant lines. The arbiter decides which device
should be granted the bus.
- Distributed arbitration by self-selection: Requesting
devices identify themselves on the bus and decide individually
(and consistently) which one gets the grant.
- Distributed arbitration by collision detection: Each device
transmits whenever it wants, but detects collisions and retries.
Ethernet uses this scheme (but modern switched ethernets do not).
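To make the daisy-chain scheme concrete, here is a minimal Python sketch
(my own illustration, not code from the notes or the text). It models the
shared request line as an open-collector, active-low signal that goes to 0
whenever any device drives it low, and models the grant as entering at the
device nearest the arbiter and being passed along until a requesting device
keeps it. Names such as request_line and daisy_chain_grant are invented for
the example.

    def request_line(wants_bus):
        """Open-collector line: driven to 0 (asserted, active low) if any
        device pulls it low; otherwise pulled up to 1 (deasserted)."""
        return 0 if any(wants_bus) else 1

    def daisy_chain_grant(wants_bus):
        """Pass the grant from the arbiter down the chain (device 0 is
        nearest).  The first requesting device keeps the grant; the others
        pass it on.  Returns the granted device's index, or None."""
        if request_line(wants_bus) == 1:   # line high: no request asserted
            return None
        for dev, wants in enumerate(wants_bus):
            if wants:
                return dev                 # this device keeps the grant
            # otherwise it forwards the grant to the next device in the chain
        return None

    # Devices 0 and 2 both request; device 0 wins because it is nearer
    # the arbiter.
    print(daisy_chain_grant([True, False, True]))    # -> 0
    print(daisy_chain_grant([False, False, True]))   # -> 2
    print(daisy_chain_grant([False, False, False]))  # -> None (line stays high)

Note how device 0 wins whenever it requests; if it requests continuously,
devices further down the chain never see the grant, which is exactly the
fairness/starvation problem mentioned above.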
Option        | High performance             | Low cost                       |
bus width     | separate addr and data lines | multiplex addr and data lines  |
data width    | wide                         | narrow                         |
transfer size | multiple bus loads           | single bus loads               |
bus masters   | multiple                     | single                         |
clocking      | synchronous                  | asynchronous                   |
Do on the board the example on pages 665-666
- Memory and bus support two widths of data transfer: 4 words and 16
words
- 64-bit synchronous bus; 200MHz; 1 clock for addr; 1 for data.
- Two clocks of ``rest'' between bus accesses
- Memory access times: 4 words in 200ns; additional 4 word blocks in
20ns per block.
- Can overlap transferring data with reading next data.
- Find
- Sustained bandwidth and latency for reading 256 words using
both transfer sizes
- How many bus transactions per sec for each (addr+data)
- Four word blocks
- 1 clock to send addr
- 40 clocks to read memory (200ns at 5ns per clock)
- 2 clocks to send the data (4 words over the 64-bit bus)
- 2 idle clocks
- 45 total clocks
- 256/4=64 transactions needed so latency is 64*45*5ns=14.4us
- 64 trans per 14.4us = 64/14.4 trans per 1us = 4.44M trans per
sec
- Bandwidth = 1024 bytes per 14.4us = 1024/14.4 B/us = 71.11MB/sec
- Sixteen word blocks
- 1 clock for addr
- 40 clocks for reading first 4 words
- 2 clocks to send the data
- 2 clocks idle
- 4 clocks to read next 4 words. But this is free! Why?
Because it is done during the send and idle of previous block.
- So we only pay for the long initial read
- Total = 1 + 40 + 4*(2+2) = 57 clocks.
- 16 transactions needed; latency = 16*57*5ns=4.56us, which is
much better than with 4 word blocks.
- 16 transactions per 4.56us = 3.51M transactions/sec
- Bandwidth = 1024B per 4.56us = 224.56MB/sec
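The arithmetic above is easy to check with a short script. This is just a
sketch using the parameters stated in the example (64-bit bus at 200MHz,
hence 5ns per clock; 32-bit words, so 4 words take 2 data clocks; 40 clocks
for the first 4-word memory read; 2 idle clocks between accesses); the names
in it are invented for the illustration.

    CLOCK_NS = 5            # 200MHz bus -> 5ns per clock
    TOTAL_WORDS = 256
    BYTES_PER_WORD = 4      # 32-bit words, so 4 words = 2 clocks on a 64-bit bus

    def block_transfer(words_per_block):
        """Clocks per bus transaction and number of transactions for 256 words."""
        if words_per_block == 4:
            clocks = 1 + 40 + 2 + 2          # addr + mem read + send + idle = 45
        else:                                # 16-word blocks
            # only the first 4-word read (40 clocks) is exposed; later reads
            # overlap the 2 send + 2 idle clocks of the previous block
            clocks = 1 + 40 + 4 * (2 + 2)    # = 57
        txns = TOTAL_WORDS // words_per_block
        return clocks, txns

    for words in (4, 16):
        clocks, txns = block_transfer(words)
        latency_ns = clocks * txns * CLOCK_NS
        txns_per_sec = txns / (latency_ns * 1e-9)
        mb_per_sec = TOTAL_WORDS * BYTES_PER_WORD / (latency_ns * 1e-9) / 1e6
        print(f"{words:2d}-word blocks: {latency_ns/1000:.2f}us, "
              f"{txns_per_sec/1e6:.2f}M transactions/sec, {mb_per_sec:.2f}MB/sec")

Running it reproduces the numbers above: 14.40us, 4.44M transactions/sec and
71.11MB/sec for 4-word blocks versus 4.56us, 3.51M transactions/sec and
224.56MB/sec for 16-word blocks.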