There is usually a gap between real-world data formats and the data structures that are theoretically the most appropriate for algorithms. Witness the plethora of graphics file formats, and the many trade books that treat them. In this lecture, we will describe an important format called TIGER (Topologically Integrated Geographic Encoding and Referencing System) developed by the U.S. Bureau of the Census (http://www.census.gov). For more information, we recommend going to the TIGER homepage: http://www.census.gov/geo/www/tiger.
The TIGER data set contains the geographic encoding of the whole USA, and is especially appropriate for representing census information, its primary application. For our purposes, we are mainly interested in its map data.
TIGER files. The data set is county-based, so that each county in the country has its own data set. The whole USA has over 3200 counties. Each county is given a 5 digit identifier called a FIPS. For instance, the FIPS for New York County (i.e., Manhattan) is 36061 while Kings County (i.e., Brooklyn) is 36047. The first two digits of the FIPS identify the state (New York State is 36).
Each county has up to 17 different record types, and all the records of a given type for the county are stored in a single file. So, there are up to 17 files for each county. The files for Manhattan are named TGR36061.RT1, TGR36061.RT2, TGR36061.RTA, etc. The suffix RT1, RT2, etc., tells us that the file contains ``Record Type 1'', ``Record Type 2'', etc. The file suffixes are RTn, where n is a single character identifying the record type.
Each file is an ASCII file. Each record is stored in one line of the file, and each line has a fixed number of ASCII characters. Each record also has a fixed number of fields, and fields occupy predetermined positions in its line. Below, we give more details about the contents of these files.
Coordinate System and Accuracy. The coordinate system is based on longitude and latitude. Each longitude is given as a signed 9 digit sequence, with an implied 6 decimal places. Thus -123456789 really represents -123.456789. Each latitude is given as a signed 8 digit sequence, also with an implied 6 decimal places. Thus +12345678 really represents +12.345678. For instance, the bounding box for New York State is (minLon, maxLon, minLat, maxLat) = (-79.762418, -71.778137, 40.477408, 45.010840).
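The implied-decimal convention can be sketched in a few lines. This is only an illustration: the function name parse_tiger_coord is our own, and we assume the field has already been cut out of its record as a string.

```python
def parse_tiger_coord(text: str) -> float:
    """Convert a signed TIGER digit string with 6 implied decimals to degrees.

    Works for both longitudes (9 digits) and latitudes (8 digits), since
    only the number of implied decimal places matters.
    """
    # int() accepts a leading '+' or '-' and surrounding blanks.
    return int(text) / 1_000_000


print(parse_tiger_coord("-123456789"))  # longitude example from the text
print(parse_tiger_coord("+12345678"))   # latitude example from the text
```

A blank field makes int() raise ValueError, which is a reasonable signal that the attribute is not present.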
You will need to make a conversion from the Tiger coordinates to conventional Lat/Lon coordinates, which are specified in degrees and minutes. Moreover, latitudes are relative to the Equator (North or South) and longitudes are relative to the Greenwich Meridian (East or West). For instance, the Tiger coordinates for Courant Institute (NYU) are roughly (40.729, -73.995). This converts to (40 deg 43.74' North, 73 deg 59.70' West). Thus, negative coordinates are South of the Equator or West of the Greenwich Meridian.
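As a sketch of this conversion, the following function turns a signed decimal coordinate into degrees and minutes with a hemisphere name. The function name and the output format are our own choices; rounding is done on hundredths of a minute so that the minutes never print as 60.00.

```python
def to_degrees_minutes(value: float, is_latitude: bool) -> str:
    """Format a signed decimal-degree coordinate as degrees and minutes."""
    if is_latitude:
        hemisphere = "North" if value >= 0 else "South"
    else:
        hemisphere = "East" if value >= 0 else "West"
    # Work in hundredths of a minute (6000 per degree) to round only once.
    total = round(abs(value) * 6000)
    degrees, rest = divmod(total, 6000)
    return f"{degrees} deg {rest / 100:.2f}' {hemisphere}"


# The Courant Institute example from the text:
print(to_degrees_minutes(40.729, is_latitude=True))
print(to_degrees_minutes(-73.995, is_latitude=False))
```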
The accuracy of the map is that of a 1:100,000 scale map, which is correct up to ±167 feet. However, the relative positions of any specified points (with respect to the plane subdivision of the Tiger data) are correct.
TIGER Geometry. There are three kinds of geometric objects: points, polygonal lines and polygons. The polygonal lines in TIGER files have the registered name of TIGER/Line (R). They are also known as ``complete chains'', but in this lecture, we simply call them Tiger Lines. Each Tiger Line has a unique ID (or TLID). Each polygon, which we will call a Tiger Polygon, also has an ID (or POLYID), but this ID is unique only within each county. But it is easy to make a POLYID unique across the whole country by concatenating it with the FIPS, for instance. Each Tiger Line is basically a maximal polygonal chain whose segments share the same ``line features''. We will discuss line features below, but the type of a line feature might be ``road'' or ``county boundary''. The basic Tiger Line information is stored in RT1 and RT2.
Points (or vertices) in a Tiger Line are of two kinds: endpoints of Tiger Lines, and non-endpoints. The latter are called detail points. Every Tiger Line has two endpoints, and zero or more detail points. In the Tiger File RT1, the endpoints are stored with each Tiger Line. The detail points are stored in RT2. Since Tiger files have fixed size records, while the number of detail points for a Tiger Line has no a priori bound, the solution is to group the detail points for a Tiger Line into groups of ten detail points per record in RT2 files. At most one group has fewer than 10 detail points. If a Tiger Line has no detail points, then it has no corresponding entry in the RT2 file.
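To recover the full vertex sequence of each Tiger Line, the endpoints from RT1 must be combined with the detail point groups from RT2. Here is a minimal sketch, assuming the records have already been parsed into tuples. The tuple layouts and the name build_polylines are our own, and we assume that the detail points are listed in order from the start endpoint to the end endpoint, with groups ordered by their record sequence number.

```python
from collections import defaultdict


def build_polylines(rt1_rows, rt2_rows):
    """Return {tlid: [start, detail points..., end]}.

    rt1_rows: iterable of (tlid, start_point, end_point)
    rt2_rows: iterable of (tlid, sequence_number, [detail points])
    """
    groups = defaultdict(list)
    for tlid, seq, points in rt2_rows:
        groups[tlid].append((seq, points))

    polylines = {}
    for tlid, start, end in rt1_rows:
        chain = [start]
        # Lines without detail points simply have no RT2 groups.
        for _, points in sorted(groups.get(tlid, []), key=lambda g: g[0]):
            chain.extend(points)
        chain.append(end)
        polylines[tlid] = chain
    return polylines
```

For example, a line with endpoints (0, 0) and (5, 0) and two RT2 groups is assembled in sequence-number order regardless of the order in which the groups are read.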
It is important to note that points in the Tiger Line do not have independent existence (i.e., no unique IDs). This could potentially lead to inconsistencies, both within a county and across two adjacent counties. Inconsistencies cannot happen with detail points, as they are not shared with other Tiger Lines. But endpoints will generally be shared by two or more Tiger Lines. If two Tiger Lines L, L′ share an endpoint P, then the coordinates of P in L and in L′ must agree. This consistency is not encoded into the Tiger data organization, but is an implicit guarantee.
The set of Tiger Polygons forms a subdivision of its county (and by extension, of the whole USA). Each Tiger Polygon is associated with an interior point. The boundary of each Tiger Polygon consists of a whole number of Tiger Lines; this implies that each Tiger Line bounds one or two Tiger Polygons. A Tiger Line bounds only one Tiger Polygon if the other side of the line is outside the county or outside the map coverage.
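Since each Tiger Line knows the polygon on its left and on its right (the POLYIDL and POLYIDR fields of Record Type I), the boundary of every polygon can be collected in one pass over the lines. The sketch below assumes the records are already parsed into (TLID, POLYIDL, POLYIDR) tuples, with None standing for a blank side; this representation and the function name are our own.

```python
from collections import defaultdict


def polygon_boundaries(rti_rows):
    """Return {polyid: [tlid, ...]} listing the lines that bound each polygon."""
    boundary = defaultdict(list)
    for tlid, left, right in rti_rows:
        # A set avoids listing a line twice when both sides are the same polygon.
        for polyid in {left, right} - {None}:
            boundary[polyid].append(tlid)
    return dict(boundary)
```

The lines are returned in file order, not in cyclic order around the polygon; ordering them requires matching up shared endpoints.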
Fix any county. Each Record Type (RT) is stored in a single file. To understand the basic information in these files, we use some basic concepts from Relational Database theory. Each file is thus a Relation. A relation has a fixed set of attributes. We suggest two steps to figuring out the Tiger files:
(1) To understand an individual file, you need to figure out which of its attributes is the "KEY". E.g., in RT1, the TLID field is the key. This KEY is unique within the current county in the sense of relational databases. With the exception of TLID, most keys are not unique across counties or states (so they need to be combined with some other attributes). So, the first step for each file is to identify its KEY.
(2) Second, to connect the information across files, you need to find shared attributes. For instance, the POLYID attribute is found in RT9 and RTA. By "joining" these two files in the sense of relational databases, you can cross reference properties.
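The join in step (2) can be sketched as follows, with each record represented as a dictionary from attribute names to strings. The function join_on and the sample attribute OTHER are placeholders of our own, not part of the Tiger format.

```python
def join_on(key, left_rows, right_rows):
    """Join two relations (lists of dicts) on one shared attribute."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    # Each pair of rows that agrees on `key` yields one merged row.
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[key], [])]


rt9 = [{"POLYID": "7", "FEAT": "12"}]
rta = [{"POLYID": "7", "OTHER": "x"}, {"POLYID": "8", "OTHER": "y"}]
print(join_on("POLYID", rt9, rta))
```

When rows from several counties are mixed, the join attribute should first be made unique, for instance by prefixing the POLYID with the FIPS.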
Note that blank entries are very common; a blank entry means that the corresponding attribute is not applicable for that row. Alternatively, you might say that the database scheme for these relations
The table below lists some important fields in the various Tiger files:
Record Type | Field Name | Field Position | Description |
RT 1 (Basic info for Lines) | TLID | 6-15 | TIGER/Line ID |
RT 1 | SIDE1 | 16 | Single side? (blank = both sides, 1 = single side) |
RT 1 | FRLONG | 191-200 | Start (from) longitude |
RT 1 | FRLAT | 201-209 | Start (from) latitude |
RT 1 | TOLONG | 210-219 | End (to) longitude |
RT 1 | TOLAT | 220-228 | End (to) latitude |
RT 2 (Detail points for Lines) | TLID | 6-15 | TIGER/Line ID (not all lines have an entry here) |
RT 2 | RTSQ | 16-18 | Record Sequence Number (1, 2, etc) |
RT 2 | LONG1 | 19-28 | Longitude of detail point 1 |
RT 2 | LAT1 | 29-37 | Latitude of detail point 1 |
RT 2 | LONG2 | 38-47 | Longitude of detail point 2 |
RT 2 | LAT2 | etc | etc |
RT A (Basic info for Polygons) | POLYID | 6-15 | Polygon ID |
RT I (Line-Polygon info) | TLID | 6-15 | TIGER/Line ID |
RT I | POLYIDL | 27-36 | Polygon ID on the left side |
RT I | POLYIDR | 42-51 | Polygon ID on the right side |
RT P (More info on Polygons) | POLYID | 16-25 | Polygon ID |
RT P | POLYLONG | 26-35 | Internal Point Longitude |
RT P | POLYLAT | 36-44 | Internal Point Latitude |
N.B. The html version of this table may not be completely correct; refer to the postscript version.
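Because the field positions are fixed, a record can be decoded by slicing its line. Here is a sketch for RT1 using the positions tabulated above, which are 1-based and inclusive. The names RT1_FIELDS, field and parse_rt1 are our own, and only the tabulated fields are extracted.

```python
RT1_FIELDS = {
    "TLID": (6, 15),
    "SIDE1": (16, 16),
    "FRLONG": (191, 200),
    "FRLAT": (201, 209),
    "TOLONG": (210, 219),
    "TOLAT": (220, 228),
}


def field(line, start, end):
    """Return the field at 1-based inclusive positions start..end, without blanks."""
    return line[start - 1:end].strip()


def parse_rt1(line):
    """Decode the tabulated fields of one RT1 record into a dictionary."""
    raw = {name: field(line, s, e) for name, (s, e) in RT1_FIELDS.items()}
    record = {"TLID": raw["TLID"], "SIDE1": raw["SIDE1"]}
    for name in ("FRLONG", "FRLAT", "TOLONG", "TOLAT"):
        # Coordinates carry 6 implied decimal places.
        record[name] = int(raw[name]) / 1_000_000
    return record
```

The same field helper works for the other record types once their position tables are written down.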
Example: Metropolitan Areas. Suppose you are interested in Metropolitan Areas (MAs). This information is found in Record Type S (RTS). The KEY of RTS is POLYID (polygons). Two attributes found in RTS are MSA/CMSA and PMSA. What are these? Well, there are two basic kinds of MAs: Metropolitan Statistical Areas or MSAs (under 1 million population), and Consolidated MSAs or CMSAs (over 1 million population). CMSAs are in turn subdivided into Primary MSAs or PMSAs. E.g., New York City is a CMSA, but it has many PMSAs. Thus, to identify an MA, you look for entries in the MSA/CMSA attribute. If a row (i.e., a polygon) is part of an MSA, then its PMSA attribute is blank; if it is part of a CMSA, it will have a PMSA attribute.
Non-Geometric and Non-topological Data. Such data include census data (of course), postal address ranges, zip codes, land types and metropolitan areas. Also of practical interest are landmarks (school, park, airport, etc).
Of course, the conversion issue is to extract from the Tiger dataset the usual connectivity expected in structures such as half-edge data structures. What makes this interesting is that we may want to process this incrementally (such as across the network).
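A first step toward such connectivity is to discover which Tiger Lines meet at each endpoint. Since points carry no IDs, we can only match them by their coordinates. The sketch below is not a full half-edge construction; it assumes endpoints are kept as raw integer (longitude, latitude) pairs so that equality tests are exact, and the names are our own.

```python
from collections import defaultdict


def incident_lines(rt1_rows):
    """Map each endpoint to the Tiger Lines that start or end there.

    rt1_rows: iterable of (tlid, start_point, end_point)
    Returns {point: [(tlid, "from" or "to"), ...]}.
    """
    incidence = defaultdict(list)
    for tlid, start, end in rt1_rows:
        incidence[start].append((tlid, "from"))
        incidence[end].append((tlid, "to"))
    return dict(incidence)
```

Because rows are consumed one at a time, the map can be extended incrementally as more records arrive.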
When we display maps based on a global referencing system (usually this means the longitude/latitude system), it is seldom acceptable to treat the coordinates as if they came from a rectangular coordinate system. Otherwise, the map would look distorted. We need to choose some map projection scheme. That is, if G is the surface of the spherical globe, we want to choose a projection surface P and a partial map π: G → P. The simplest projection is the gnomonic (central) projection. In this case, P is a plane tangent to G at any chosen point p0 ∈ G. For any point p ∈ G, we define π(p) ∈ P to be the intersection of P with the ray r(p) that emanates from the center of the globe and passes through p. Note that π(p) is undefined for points in a hemisphere and π(p0)=p0.
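The tangent-plane projection just described has a short description in terms of unit vectors. If the globe is the unit sphere centered at the origin, the tangent plane at p0 is the set of points x with x·p0 = 1, so the ray through p meets it at p/(p·p0). The following sketch assumes this unit-sphere model; the function names are our own.

```python
import math


def lonlat_to_unit_vector(lon_deg, lat_deg):
    """Point on the unit sphere for a longitude/latitude given in degrees."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def tangent_plane_projection(p, p0):
    """Intersect the ray from the center through p with the plane tangent at p0."""
    dot = sum(a * b for a, b in zip(p, p0))
    if dot <= 0:
        return None  # the ray never meets the plane: projection undefined
    return tuple(a / dot for a in p)
```

The result is a point in space lying on the plane; to draw a map, one still has to express it in a 2-dimensional coordinate frame chosen on the plane.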
Another simple projection is to take P to be a cylinder that touches G at the equator. Again, the projection point π(p) is defined to be the intersection of r(p) with P. In this case, π(p) is undefined for only two points, the North and South poles. This cylindrical projection is easy to calculate if p ∈ G is given in terms of longitude-latitude (θ,φ) ∈ [-π,π)×(-π/2,π/2) and the cylinder P has the rectangular coordinates [-π,π) × ℝ. Assume that points (θ,φ) on the equator project to π(θ,φ)=(θ, 0). We leave this as an exercise.
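For readers who want to check their answer, here is one possible solution sketch, assuming a globe of radius 1 and angles in radians, with θ the longitude and φ the latitude. A point at latitude φ lies at distance cos φ from the axis and at height sin φ, so scaling it out to the cylinder of radius 1 gives the height tan φ.

```python
import math


def cylindrical_projection(theta, phi):
    """Central projection of (longitude, latitude) onto the unit cylinder."""
    if not -math.pi / 2 < phi < math.pi / 2:
        raise ValueError("projection is undefined at the poles")
    # The longitude is unchanged; the height grows like tan(phi).
    return (theta, math.tan(phi))
```

Note how the height grows without bound near the poles, which is why practical maps clip the latitude range.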
When we merge counties, there may be inconsistencies or unexpected topology. For instance, we discovered in merging Manhattan with neighboring New Jersey counties that Liberty Island belongs to Manhattan County but is surrounded by water belonging to New Jersey. How do we check for topological consistency and regularity?
In the above description of the Tiger data, we focused on the geometric information (lines and polygons, lat/long, adjacency relations, etc). But of human interest are the names and types of various features, location of landmarks, etc. Maps would be useless without this information.
Coloring of Map Areas. The first basic data in any map is the distinction between water (blue) and land. For land, we want to color code park areas (green) and public facilities or institutions such as airports and hospitals (pink), and the rest (yellow). Record Type S has a water flag for polygons.
Landmarks. Record Type 7 contains landmark features. Each landmark is given an ID (positions 11-20), name (positions 25-54) and longitude/latitude (positions 55-64/65-73). There is also a census feature class code (CFCC) for each landmark. This is a 3 byte code in positions 22-24. This information is critical in our map color coding.
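Using the positions just listed, a landmark record can be decoded by slicing. This is a sketch only: the function name and the dictionary keys are our own, and we assume that the coordinate fields may be blank, in which case None is returned for them.

```python
def parse_landmark(line):
    """Decode the ID, CFCC, name and position of one Record Type 7 line."""
    def field(start, end):
        # Positions are 1-based and inclusive.
        return line[start - 1:end].strip()

    lon, lat = field(55, 64), field(65, 73)
    return {
        "id": field(11, 20),
        "cfcc": field(22, 24),
        "name": field(25, 54),
        # Coordinates carry 6 implied decimal places.
        "lon": int(lon) / 1_000_000 if lon else None,
        "lat": int(lat) / 1_000_000 if lat else None,
    }
```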
Some landmarks are area landmarks (like parks and hospitals). Others are point landmarks (for instance a building). For area landmarks, their associated polygon (POLYID) can be found in Record Type 8.
Key Geographical Locations (KGL). These are found in Record Type 9. Examples of KGLs are public squares and plazas. Each KGL has two associated fields, POLYID and FEAT, which can be used to locate it on the map. To get to the POLYID location, use RT I, and to get to the FEAT location, use RT 5.
Line Features. A ``line feature'' is an informal concept that is typically a sequence of one or more contiguous Tiger Lines that share common attributes such as feature identifiers. For instance, a street named ``Broadway'' is a line feature that may be decomposed into many Tiger Lines. The Tiger data set does not guarantee that line features can be reconstructed easily from its data.
Each Tiger Line has a feature identifier in file RT1, stored in the field positions 18 to 55. Examples of feature identifiers: N Adams Av, US Highway 1, Jefferson St, Providence St NE. Feature identifiers are subdivided into 4 subfields:
However, each Tiger Line also has a census feature class code (CFCC) in positions 56-58. This code is highly correlated with the Feature Type (FETYPE), but the two are independently assigned.
Zip Code and Address Range. The zip codes of the polygons to the left and right of each Tiger Line are stored in positions 107-116 of the RT1 records. The address ranges on the left and right side of each street are also recorded in positions 59-106.
CFCC Code. Such codes are found in record types RT1 (for Tiger Lines), RT7 (for point or area landmarks), RT9 (for key geographic locations). Here is an illustrative list of the codes.
Class A code is for roads:
Class E is for physical features:
Class F is for nonvisible features such as property areas, legal and administrative entities. These are only identified if they do not follow visible features such as roads or streams.
Class G is for US Census Bureau internal usage. Class H is hydrography.
Feature Class P, Provisional Features, is a new class that may only appear on street features. These features are treated exactly like Class A.
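For map drawing, it is often enough to look at the class letter of a CFCC. The sketch below assumes that the class letter is the first character of the 3 byte code, and it lists only the classes named above; the table and function names are our own.

```python
CFCC_CLASSES = {
    "A": "road",
    "E": "physical feature",
    "F": "nonvisible feature",
    "G": "Census Bureau internal use",
    "H": "hydrography",
    "P": "provisional feature (treated like class A)",
}


def cfcc_class(cfcc):
    """Describe the class of a CFCC, or 'unknown' for classes not listed."""
    return CFCC_CLASSES.get(cfcc[:1].upper(), "unknown")


def is_road(cfcc):
    """Class P is treated exactly like class A."""
    return cfcc[:1].upper() in ("A", "P")
```

A blank code falls through to 'unknown', which matches the convention that blank entries mean the attribute does not apply.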
No. | FIPS | Name |
1. | 36103 | Suffolk |
2. | 36059 | Nassau |
3. | 36081 | Queens |
4. | 36047 | Kings (Brooklyn) |
5. | 36061 | New York (Manhattan) |
6. | 36085 | Richmond (Staten Is) |
7. | 36005 | Bronx |
8. | 36087 | Rockland |
9. | 36071 | Orange |
10. | 36119 | Westchester |
11. | 34003 | Bergen |
12. | 34017 | Hudson |
13. | 34013 | Essex |
14. | 34039 | Union |
15. | 34023 | Middlesex |
16. | 34025 | Monmouth |