There is usually a gap between real-world data formats and the data structures that are theoretically most appropriate for algorithms. Witness the plethora of graphics file formats, and the many trade books that treat them. In this lecture, we will describe an important format called TIGER (Topologically Integrated Geographic Encoding and Referencing System) developed by the U.S. Bureau of the Census (http://www.census.gov). For more information, we recommend going to the TIGER homepage: http://www.census.gov/geo/www/tiger.
The TIGER data set contains the geographic encoding of the whole USA and is especially appropriate for representing census information, its primary application. For our purposes, we are mainly interested in its map data.
TIGER files. The data set is county-based, so that each county in the country has its own data set. The whole USA has over 3200 counties. Each county is given a 5-digit identifier called a FIPS code. For instance, the FIPS for New York County (i.e., Manhattan) is 36061 while that for Kings County (i.e., Brooklyn) is 36047. The first two digits of the FIPS identify the state (New York State is 36).
Each county has up to 17 different record types, and all the records of a given type for the county are stored in a single file. So, there are up to 17 files for each county. The files for Manhattan are named TGR36061.RT1, TGR36061.RT2, TGR36061.RTA, etc. The suffix RT1, RT2, etc., tells us that the file contains ``Record Type 1'', ``Record Type 2'', etc. The file suffixes are RTn, where n is a single character: a digit (as in RT1, RT2) or a letter (as in RTA).
Each file is an ASCII file. Each record is stored in one line of the file, and each line has a fixed number of ASCII characters. Each record also has a fixed number of fields, and fields occupy predetermined positions in its line. Below, we give more details about the contents of these files.
Coordinate System and Accuracy. The coordinate system is based on longitude and latitude. Each longitude is given as a signed 9-digit sequence, with an implied 6 decimal places. Thus -123456789 really represents -123.456789. Each latitude is given as a signed 8-digit sequence, also with an implied 6 decimal places. Thus +12345678 really represents +12.345678. For instance, the bounding box for New York State is (minLon, maxLon, minLat, maxLat) = (-79.762418, -71.778137, 40.477408, 45.010840).
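The implied-decimal convention can be sketched in a few lines of Python. This is only an illustration: the function names parse_tiger_coord and in_bounding_box are our own, and the two sample fields are made-up values chosen to fall near New York City.

```python
def parse_tiger_coord(field: str) -> float:
    """Convert a signed TIGER coordinate field (6 implied decimals) to degrees."""
    text = field.strip()
    if not text:
        raise ValueError("blank coordinate field")
    # The sign is part of the field; int() accepts a leading '+' or '-'.
    return int(text) / 1_000_000


def in_bounding_box(lon, lat, box):
    """box is (minLon, maxLon, minLat, maxLat), in the order used in the text."""
    min_lon, max_lon, min_lat, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


NEW_YORK_STATE = (-79.762418, -71.778137, 40.477408, 45.010840)

# Made-up sample fields (hypothetical values, not taken from a real file).
lon = parse_tiger_coord("-73995000")
lat = parse_tiger_coord("+40729000")
print(lon, lat, in_bounding_box(lon, lat, NEW_YORK_STATE))
```

Dividing the integer by one million, rather than splicing a decimal point into the string, keeps the sign handling trivial.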
You will need to convert from the Tiger coordinates to conventional Lat/Lon coordinates, which are specified in degrees and minutes. Moreover, latitudes are relative to the Equator (North or South) and longitudes are relative to the Greenwich Meridian (East or West). For instance, the Tiger coordinates (latitude, longitude) for the Courant Institute (NYU) are roughly (40.729, -73.995). This converts to (40 deg 43.74' North, 73 deg 59.70' West). Thus, negative coordinates are South of the Equator or West of the Greenwich Meridian.
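The conversion to degrees and minutes is simple arithmetic: the fractional part of the degree value is multiplied by 60, and the sign selects the hemisphere. A minimal sketch (the function name to_degrees_minutes is ours):

```python
def to_degrees_minutes(value: float, is_latitude: bool):
    """Split signed decimal degrees into (whole degrees, minutes, hemisphere)."""
    if is_latitude:
        hemisphere = "North" if value >= 0 else "South"
    else:
        hemisphere = "East" if value >= 0 else "West"
    magnitude = abs(value)
    degrees = int(magnitude)                 # whole degrees
    minutes = (magnitude - degrees) * 60.0   # fractional degrees -> minutes
    return degrees, minutes, hemisphere


for value, is_lat in ((40.729, True), (-73.995, False)):
    d, m, h = to_degrees_minutes(value, is_lat)
    print(f"{d} deg {m:.2f}' {h}")
```

Applied to the Courant Institute coordinates, 0.729 x 60 = 43.74 and 0.995 x 60 = 59.70, which agrees with the values quoted above.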
The accuracy of the map is that of a 1:100,000 scale map, which is correct up to ±167 feet. However, the relative positions of any specified points (with respect to the plane subdivision of the Tiger data) are correct.
TIGER Geometry. There are three kinds of geometric objects: points, polygonal lines and polygons. The polygonal lines in TIGER files have the registered name of TIGER/Line (R). They are also known as ``complete chains'', but in this lecture, we simply call them Tiger Lines. Each Tiger Line has a unique ID (or TLID). Each polygon, which we will call a Tiger Polygon, also has an ID (or POLYID), but this ID is unique only within its county. It is easy, however, to make a POLYID unique across the whole country, for instance by concatenating it with the county's FIPS. Each Tiger Line is basically a maximal polygonal chain whose segments share the same ``line features''. We will discuss line features below; the type of a line feature might be ``road'' or ``county boundary''. The basic Tiger Line information is stored in RT1 and RT2.
Points (or vertices) in a Tiger Line are of two kinds: endpoints of Tiger Lines, and non-endpoints. The latter are called detail points. Every Tiger Line has two endpoints, and zero or more detail points. In the Tiger File RT1, the endpoints are stored with each Tiger Line. The detail points are stored in RT2. Since Tiger files have fixed-size records, while the number of detail points for a Tiger Line has no a priori bound, the solution is to split the detail points for a Tiger Line into groups of ten detail points per record in RT2 files. At most one group has fewer than 10 detail points. If a Tiger Line has no detail points, then it has no corresponding entry in the RT2 file.
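The grouping scheme, and its inverse, can be sketched as follows. The function names are ours, and the sketch assumes (it is not stated above) that the groups are numbered from 1 in the order in which the detail points occur along the line from the start endpoint to the end endpoint.

```python
def group_detail_points(points, size=10):
    """Split detail points into numbered groups of at most `size` points."""
    return [
        (seq, points[i:i + size])
        for seq, i in enumerate(range(0, len(points), size), start=1)
    ]


def assemble_chain(start, end, groups):
    """Rebuild the full vertex list of a Tiger Line.

    start, end: the (lon, lat) endpoints, as stored in RT1.
    groups: (sequence number, [detail points]) pairs, as stored in RT2;
            they may arrive in any order and may be empty.
    """
    detail = []
    for _, pts in sorted(groups, key=lambda g: g[0]):
        detail.extend(pts)
    return [start, *detail, end]


# Made-up detail points, for illustration only.
points = [(-74.0 + 0.001 * k, 40.7 + 0.001 * k) for k in range(23)]
groups = group_detail_points(points)
print([len(pts) for _, pts in groups])
```

A line with no detail points simply has an empty list of groups, and assemble_chain then returns just its two endpoints.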
It is important to note that points in a Tiger Line do not have independent existence (i.e., no unique IDs). This could potentially lead to inconsistencies, both within a county and across two adjacent counties. Inconsistencies cannot happen with detail points, as they are not shared with other Tiger Lines. But endpoints will generally be shared by two or more Tiger Lines. If two Tiger Lines L, L' share an endpoint P, then the coordinates of P in L and in L' must agree. This consistency is not encoded into the Tiger data organization, but is an implicit guarantee.
The set of Tiger Polygons forms a subdivision of its county (and, by extension, of the whole USA). Each Tiger Polygon is associated with an interior point. The boundary of each Tiger Polygon consists of a whole number of Tiger Lines; this implies that each Tiger Line bounds one or two Tiger Polygons. A Tiger Line bounds only one Tiger Polygon if the other side of the line is outside the county or outside the map coverage.
Fix any county. Each Record Type (RT) is stored in a single file. To understand the basic information in these files, we use some basic concepts from Relational Database theory. Each file is thus a relation. A relation has a fixed set of attributes. We suggest two steps to figuring out the data in Tiger files:
(1) To understand a particular record type, you need to figure out which of its attributes are "KEY" attributes. E.g., in RT1, the TLID field is the key. In general you need more than one attribute to form a KEY. So, the first task is to identify its KEY attributes (if any). One big hint in this task is to look at those attributes that are NOT allowed to have blank values. We next explain this.
In Chapter 6 of the TIGER Manual, there is a table for each record type. For each record type, we find a list of its attributes and their properties. The properties are NAME, BV, FMT, TYPE, BEG, END, LEN and DESC. E.g., in RT1, we have attributes with NAMEs such as TLID, FRLONG, FRLAT, TOLONG, TOLAT, etc. The FMT tells you whether the values are left- or right-justified within their allocated bytes. TYPE is either "A" (for alphabetic string) or "N" (for numeric data). BEG and END are the beginning and end positions for the attribute value. LEN is redundant, and is equal to END-BEG+1. DESC gives a brief informal description. For our purposes, BV ("blank value") is most interesting: this property has a "Yes/No" value, where "Yes" means that a record is allowed to have a blank value for this field. Clearly, KEY attributes cannot be blank.
(2) Second, to connect the information across files, you need to find shared attributes. For instance, the POLYID attribute is found in RT9 and RTA. By "joining" these two files in the sense of relational database, you can cross reference properties.
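Step (1) can be checked mechanically once a file has been read into rows. The sketch below is our own illustration: it assumes each row is a dictionary from attribute name to the raw field text, and tests whether a proposed set of attributes is never blank and never repeated.

```python
def is_candidate_key(rows, attributes):
    """Return True if `attributes` are non-blank and unique over all rows."""
    seen = set()
    for row in rows:
        value = tuple(row[name].strip() for name in attributes)
        if any(part == "" for part in value):
            return False      # a KEY attribute may not be blank
        if value in seen:
            return False      # two rows share the same key value
        seen.add(value)
    return True


# Made-up rows in the spirit of RT2: TLID alone repeats, (TLID, RTSQ) does not.
rows = [
    {"TLID": "1001", "RTSQ": "1"},
    {"TLID": "1001", "RTSQ": "2"},
    {"TLID": "1002", "RTSQ": "1"},
]
print(is_candidate_key(rows, ["TLID"]), is_candidate_key(rows, ["TLID", "RTSQ"]))
```

Passing such a test on one county does not prove that the attributes form a KEY in general; it only rules out candidates that fail.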
Note that blank entries are very common, and it means that the corresponding attribute is not applicable for that row. Alternatively, you might say that the database scheme for these relations
The table below lists some important fields in the various Tiger files:
Record Type | Field Name | Field Position | Description |
RT 1 (Basic info for Lines) | TLID | 6-15 | TIGER/Line ID |
RT 1 | SIDE1 | 16 | Single side? (blank = both sides, 1 = single side) |
RT 1 | CFCC | 56-58 | Census feature class code |
RT 1 | FRLONG | 191-200 | Start (from) longitude |
RT 1 | FRLAT | 201-209 | Start (from) latitude |
RT 1 | TOLONG | 210-219 | End (to) longitude |
RT 1 | TOLAT | 220-228 | End (to) latitude |
RT 2 (Detail points for Lines) | TLID | 6-15 | TIGER/Line ID (not all lines have an entry here) |
RT 2 | RTSQ | 16-18 | Record Sequence Number (1, 2, etc) |
RT 2 | LONG1 | 19-28 | Longitude of point 1 |
RT 2 | LAT1 | 29-37 | Latitude of point 1 |
RT 2 | LONG2 | 38-47 | Longitude of point 2 |
RT 2 | LAT2 | etc | etc |
RT 2 | LONG10 | 190-199 | Longitude of point 10 |
RT 2 | LAT10 | 200-208 | Latitude of point 10 |
RT A (Basic info for Polygons) | POLYID | 6-15 | Polygon ID |
RT I (Line-Polygon info) | TLID | 6-15 | TIGER/Line ID |
RT I | POLYIDL | 27-36 | Polygon ID on the left side |
RT I | POLYIDR | 42-51 | Polygon ID on the right side |
RT P (More info on Polygons) | POLYID | 16-25 | Polygon ID |
RT P | POLYLONG | 26-35 | Internal Point Longitude |
RT P | POLYLAT | 36-44 | Internal Point Latitude |
N.B. The html version of this table may not be completely correct; refer to the postscript version.
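Since field positions are 1-based and inclusive, a field occupying positions BEG to END is the Python slice line[BEG-1:END]. The following sketch uses the RT 1 positions from the table; the helper names and the sample values are our own, and make_record exists only to build a synthetic line for the demonstration.

```python
# 1-based, inclusive (BEG, END) positions for RT 1, copied from the table.
RT1_FIELDS = {
    "TLID": (6, 15),
    "SIDE1": (16, 16),
    "CFCC": (56, 58),
    "FRLONG": (191, 200),
    "FRLAT": (201, 209),
    "TOLONG": (210, 219),
    "TOLAT": (220, 228),
}


def extract_fields(line, layout):
    """Cut named fields out of one fixed-width record."""
    return {name: line[beg - 1:end].strip() for name, (beg, end) in layout.items()}


def make_record(values, layout, width):
    """Build a synthetic fixed-width line (demonstration only)."""
    chars = [" "] * width
    for name, text in values.items():
        beg, end = layout[name]
        chars[beg - 1:end] = text.rjust(end - beg + 1)  # right-justify in its slot
    return "".join(chars)


# Hypothetical values, not taken from a real TIGER file.
sample = make_record(
    {"TLID": "59650001", "CFCC": "A41",
     "FRLONG": "-73995000", "FRLAT": "+40729000",
     "TOLONG": "-73996000", "TOLAT": "+40730000"},
    RT1_FIELDS, 228,
)
print(extract_fields(sample, RT1_FIELDS))
```

A blank SIDE1 comes back as the empty string, which matches the table's convention that blank means the line has polygons on both sides.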
Example: Metropolitan Areas and Townships. Suppose you are interested in Metropolitan Areas (MA's). This information is found in Record Type S (RTS). The KEY of RTS is POLYID (polygons). Two attributes found in RTS are MSA/CMSA and PMSA. What are these? Well, there are two basic kinds of MA's: Metropolitan Statistical Areas or MSAs (under 1 million population) and Consolidated MSAs or CMSAs (over 1 million population). CMSA's are in turn subdivided into Primary MSAs or PMSAs. E.g., New York City is a CMSA, but it has many PMSA's. Thus, to identify an MA, you look for entries in the MSA/CMSA attribute. If a row (i.e., a polygon) is part of an MSA, then its PMSA attribute is blank; if it is part of a CMSA, it will have a PMSA attribute.
Suppose you want to identify all the polygons related to a single township (or incorporated entity). You need to join the FIPS 55 Code in RTC with a particular FIPS 55 Code in RTS. In RTC, the code is the value of the attribute named FIPS. In RTS, there are several FIPS 55 Codes, and the one you need is in the attribute named COUSUB. In short, you must perform a join between RTS and RTC in which RTS:COUSUB = RTC:FIPS.
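The join RTS:COUSUB = RTC:FIPS can be sketched as a hash join over already-parsed rows. Everything beyond the three attribute names from the text is an assumption made for illustration: rows are dictionaries, RTC is taken to carry an entity name in an attribute we call NAME, and the codes and names are made up.

```python
from collections import defaultdict


def polygons_by_township(rts_rows, rtc_rows):
    """Join RTS and RTC on RTS:COUSUB = RTC:FIPS; return {entity name: [POLYID, ...]}."""
    polys_by_code = defaultdict(list)
    for row in rts_rows:
        if row["COUSUB"]:                  # skip rows with a blank code
            polys_by_code[row["COUSUB"]].append(row["POLYID"])
    result = {}
    for entity in rtc_rows:
        code = entity["FIPS"]
        if code in polys_by_code:
            result[entity["NAME"]] = polys_by_code[code]
    return result


# Made-up rows (hypothetical codes and names).
rts = [
    {"POLYID": "101", "COUSUB": "00001"},
    {"POLYID": "102", "COUSUB": "00001"},
    {"POLYID": "103", "COUSUB": "00002"},
    {"POLYID": "104", "COUSUB": ""},
]
rtc = [
    {"FIPS": "00001", "NAME": "Example Town"},
    {"FIPS": "00003", "NAME": "Other Town"},
]
print(polygons_by_township(rts, rtc))
```

Indexing RTS by COUSUB first means each file is scanned once, which matters because RTS has one row per polygon.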
Non-Geometric and Non-topological Data. Such data include census data (of course), postal address ranges, zip codes, land types and metropolitan areas. Also of practical interest are landmarks (schools, parks, airports, etc).
Of course, the conversion issue is to extract from the Tiger dataset the usual connectivity expected in structures such as half-edge data structures. What makes this interesting is that we may want to process this incrementally (such as across the network).
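A first step toward such connectivity is to invert the line-to-polygon information in RT I, so that each polygon knows which Tiger Lines bound it and on which side. The sketch below is a simplified illustration, not a full half-edge structure; it assumes the RT I rows have already been parsed into dictionaries with the attributes TLID, POLYIDL and POLYIDR, and that a blank polygon ID means there is no polygon on that side.

```python
from collections import defaultdict


def boundary_lines_by_polygon(rti_rows):
    """Map each POLYID to the (TLID, side) pairs of the lines on its boundary."""
    boundary = defaultdict(list)
    for row in rti_rows:
        if row["POLYIDL"]:
            boundary[row["POLYIDL"]].append((row["TLID"], "left"))
        if row["POLYIDR"]:
            boundary[row["POLYIDR"]].append((row["TLID"], "right"))
    return dict(boundary)


def adjacent_polygons(rti_rows):
    """Return the set of unordered polygon pairs that share a Tiger Line."""
    pairs = set()
    for row in rti_rows:
        left, right = row["POLYIDL"], row["POLYIDR"]
        if left and right and left != right:
            pairs.add(frozenset((left, right)))
    return pairs


# Made-up rows: line 3 lies on the county boundary (no polygon on its right).
rti = [
    {"TLID": "1", "POLYIDL": "A", "POLYIDR": "B"},
    {"TLID": "2", "POLYIDL": "B", "POLYIDR": "C"},
    {"TLID": "3", "POLYIDL": "A", "POLYIDR": ""},
]
print(boundary_lines_by_polygon(rti))
print(adjacent_polygons(rti))
```

Because each row is handled independently, the same code works when rows arrive incrementally.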
Projections. When we display maps based on a global referencing system (usually this means the longitude/latitude system), it is seldom acceptable to treat the coordinates as if they came from a rectangular coordinate system; the map would look distorted. We need to choose some map projection scheme. In the following discussion, assume $G$ is the surface of the spherical globe with unit radius. A central problem in map making is to choose a suitable plane projection surface $P$ and a partial 1-1 map $\pi: G \to P$. The simplest projection is the gnomonic (central) projection. In this case, $P$ is a plane tangent to $G$ at any chosen point $p_0 \in G$. For any point $p \in G$, we define $\pi(p) \in P$ to be the intersection of $P$ with the ray $r(p)$ that emanates from the center of the globe and passes through $p$. Note that $\pi(p)$ is undefined for points in a hemisphere and $\pi(p_0)=p_0$.
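The tangent-plane construction just described can be written directly in vector form. With the globe centered at the origin, the tangent plane at p0 is the set of points x with x . p0 = 1, so the ray through p meets it at p / (p . p0), provided p . p0 > 0. A minimal sketch (the function names are ours, and the result is returned as a 3D point on the plane rather than in 2D plane coordinates):

```python
import math


def to_unit_vector(lon_deg, lat_deg):
    """Point on the unit globe with the given longitude and latitude (degrees)."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def central_projection(p, p0):
    """Intersect the ray from the globe's center through p with the plane tangent at p0.

    Returns None when p lies in the hemisphere facing away from p0,
    where the ray never meets the plane.
    """
    dot = sum(a * b for a, b in zip(p, p0))
    if dot <= 0.0:
        return None
    return tuple(a / dot for a in p)


p0 = to_unit_vector(-73.995, 40.729)   # tangent point
p = to_unit_vector(-74.5, 41.0)        # a nearby made-up point
print(central_projection(p, p0))
```

The None result makes explicit that the map is only partial.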
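Another mapping is to map each point of $G$, except for the Poles, onto the rectangular region
$$R = [-\pi, \pi) \times (-\pi/2, \pi/2) \qquad (1)$$
where a point is represented by its longitude $\theta$ and latitude $\phi$, in radians.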
Next consider projection onto the cylindrical surface $C$ that touches $G$ at the equator. The axis of $C$ is the z-axis. We note two possibilities:
(a) We can map $G$ onto $C$ by the ``central cylindrical projection'': a point $p$ in $G$ is mapped to the point where the ray $r(p)$ intersects $C$. Note that this map is undefined at two points, the North and South Poles. Alternatively, this mapping identifies $G$ with the infinite strip $[-\pi,\pi) \times \mathbb{R}$ of the Euclidean plane. The Greenwich Meridian is mapped to the y-axis.
(b) A more useful projection onto $C$ is the following: for any point $p$ in $G$, represented by $p=(\theta, \phi) \in R$, we define
|
In practice, we can further simplify the computation of $\pi(p)$ as follows: suppose we are interested in points within a certain lon/lat box, $[\theta_0,\theta_1] \times [\phi_0,\phi_1] \subseteq R$. Let $c = \cos((\phi_0+\phi_1)/2)$. Then we approximate the map $\pi(\theta,\phi)$ by computing:
| (2) |
Topological Issues. Despite its name (the "T" in TIGER stands for topological), the TIGER dataset can have topological problems. When we merge counties, there may be inconsistencies or unexpected topology. For instance, we discovered in merging Manhattan with neighboring New Jersey counties that Liberty Island belongs to Manhattan County but is surrounded by water belonging to New Jersey. How do we check for topological consistency and regularity?
Map Simplification. This is a major issue. Following [1], we split the simplification into two parallel tasks: simplification of polygons, and simplification of networks (e.g., road network). In some sense, these two maps are now considered independent layers. But of course they are not entirely independent, and we face the problem of simultaneous simplification: how to maintain some minimal consistency across these two overlays. This seems to be a new issue in simplification.
In the above description of the Tiger data, we focused on the geometric information (lines and polygons, lat/long, adjacency relations, etc). But of human interest are the names and types of various features, location of landmarks, etc. Maps would be useless without this information.
Coloring of Map Areas. We want to introduce a basic classification of TIGER polygons which will be color coded. The most basic data in any map is perhaps the distinction between water (blue) and land. For land, we want to color the park areas green, and public facilities or institutions such as airports and hospitals pink. The remaining polygons will be colored yellow. Non-map areas will be colored white.
There is a refinement of yellow that interests us: TIGER data also classifies polygons into various metropolitan or urban areas. We would like to color these ochre instead of yellow.
How do we identify these colors? Blue is easy: Record Type S has a water flag for polygons. As for green and pink, we can use the CFCC Codes described below (these are usually codes that begin with the letter D). Thus parks (D82), national forests (D83) and local parks (D85) will be green, while airports (D51) and train stations (D52) will be pink. This information can be obtained from RT7 and RT9, but to link it to polygons, you need RT8. Ochre can be obtained from RTC and RTS.
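The color rules can be collected into one small function. This is a sketch under assumptions: the inputs are taken to be already extracted from the record types named above, only the five CFCC codes listed in the text are recognized, and the order of the tests (water first, then landmark codes, then metropolitan area) is our own choice, since the text does not fix a priority.

```python
GREEN_CODES = {"D82", "D83", "D85"}   # park and forest codes listed in the text
PINK_CODES = {"D51", "D52"}           # airports and train stations


def polygon_color(is_water, landmark_cfcc=None, in_metro_area=False, in_map=True):
    """Pick a display color for one Tiger Polygon."""
    if not in_map:
        return "white"                # non-map area
    if is_water:
        return "blue"
    if landmark_cfcc in GREEN_CODES:
        return "green"
    if landmark_cfcc in PINK_CODES:
        return "pink"
    return "ochre" if in_metro_area else "yellow"


print(polygon_color(False, "D85"), polygon_color(False, None, True))
```

Extending the two code sets is the natural way to cover further landmark classes such as hospitals.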
Landmarks. Record Type 7 contains landmark features. Each landmark is given an ID (positions 11-20), a name (positions 25-54) and a longitude/latitude (positions 55-64/65-73). There is also a census feature class code (CFCC) for each landmark. This is a 3-byte code in positions 22-24. This information is critical in our map color coding.
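Using the positions just listed, one RT7 record can be read into a small structure. The class and function names are ours, the sample record is synthetic, and treating a blank longitude or latitude as "no point location" is a defensive assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Landmark:
    land_id: str
    cfcc: str
    name: str
    lon: Optional[float]
    lat: Optional[float]


def _coord(text):
    """Signed integer text with 6 implied decimals -> degrees; blank -> None."""
    text = text.strip()
    return int(text) / 1_000_000 if text else None


def parse_rt7(line):
    """Read the landmark fields from one RT7 record (1-based positions in comments)."""
    line = line.rstrip("\n").ljust(73)   # pad short lines so slices stay aligned
    return Landmark(
        land_id=line[10:20].strip(),     # positions 11-20
        cfcc=line[21:24].strip(),        # positions 22-24
        name=line[24:54].strip(),        # positions 25-54
        lon=_coord(line[54:64]),         # positions 55-64
        lat=_coord(line[64:73]),         # positions 65-73
    )


# Synthetic record with made-up values.
sample = (" " * 10 + "7".rjust(10) + " " + "D85"
          + "Example Park".ljust(30) + "-73997000".rjust(10) + "+40731000")
print(parse_rt7(sample))
```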
The landmarks in RT7 can be classified into two types: some are area landmarks (like parks and airports) and others are point landmarks (for instance, a peak or a building). For area landmarks, their associated polygon (POLYID) can be found in Record Type 8.
Key Geographical Locations (KGL). These are found in Record Type 9. Examples of KGLs are public squares and plazas. Each KGL has two associated fields, POLYID and FEAT, which can be used to locate it on the map. To get to the POLYID location, use RT I, and to get to the FEAT location, use RT 5.
Line Features. A ``line feature'' is an informal concept: typically, it is a sequence of one or more contiguous Tiger Lines that share common attributes such as feature identifiers. For instance, a street named ``Broadway'' is a line feature that may be decomposed into many Tiger Lines. The Tiger data set does not guarantee that line features can be reconstructed easily from its data.
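Each Tiger Line has a feature identifier in file RT1, stored in field positions 18 to 55. Examples of feature identifiers: N Adams Av, US Highway 1, Jefferson St, Providence St NE. Feature identifiers are subdivided into 4 subfields, which we may represent by the equation
FEAT_ID = FEDIRP + FENAME + FETYPE + FEDIRS
where FEDIRP is a direction prefix (e.g., N), FENAME is the feature name (e.g., Adams), FETYPE is the feature type (e.g., Av) and FEDIRS is a direction suffix (e.g., NE).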
However, each Tiger Line also has a census feature class code (CFCC) in positions 56-58. This code (which we explain below) is highly correlated with the Feature Type (FETYPE). However, they are independently assigned.
Each line feature can be represented by many TLID's and conversely, a single TLID may represent several features. For instance, the FEAT_ID US Highway 1 is clearly extended over many TLID's, but any TLID in US Highway 1 might have other local designations (i.e., other FEAT_ID's).
Zip Code and Address Range. The zip codes of the polygons to the left and right of a Tiger Line are stored in positions 107-116 of the RT1 records. The address ranges on the left and right sides of each street are recorded in positions 59-106.
CFCC Code. Such codes are found in record types RT1 (for Tiger Lines), RT7 (for point or area landmarks), RT9 (for key geographic location). Here is an illustrative list of the codes.
Class A code is for roads:
Class E is for physical features:
Class F is for nonvisible features such as property areas, legal and administrative entities. These are only identified if they do not follow visible features such as roads or streams.
Class G is for US Census Bureau internal usage. Class H is hydrography.
Feature Class P, Provisional Features, is a new class that may only appear on street features. These features are treated exactly like Class A.
No. | FIPS | Name |
1. | 36103 | Suffolk |
2. | 36059 | Nassau |
3. | 36081 | Queens |
4. | 36047 | Kings (Brooklyn) |
5. | 36061 | New York (Manhattan) |
6. | 36085 | Richmond (Staten Is) |
7. | 36005 | Bronx |
8. | 36087 | Rockland |
9. | 36071 | Orange |
10. | 36119 | Westchester |
11. | 34003 | Bergen |
12. | 34017 | Hudson |
13. | 34013 | Essex |
14. | 34039 | Union |
15. | 34023 | Middlesex |
16. | 34025 | Monmouth |