Chee Yap
Geometric and Geographic Simplification
This lecture introduces two related and very important topics. In visualization, the issue is how to simplify geometric models. In GIS, the issue is how to simplify map data. They are clearly related, but the relationship is not simple to describe.
We have already seen in our homework that displaying a map in all its details is too much for a dynamic zooming display. What is needed is the ability to adjust the amount of details to be displayed at different zoom levels.
In visualization, we are given a geometric model M to be displayed. This model can be a map (2-D) or it can be a 3-D model. We assume that there is some measure of size of the model, denoted |M|. The main issues of visualizing M arises becase |M| can be very large.
The next idea is that M might be approximated by other models M¢ where |M¢| < |M|. We assume there is some measure of their difference, d(M,M¢) ³ 0. Let d ³ 0 and s ³ 1. We will call M¢ a (d,s)-simplification of M if d(M,M¢) £ d and |M| £ s|M¢|. We will also say M¢ is a d-simplification of M.
There are two ways to go defining M¢ for a given M: we could specify d and then seek the smallest s such that M¢ is a (d,s)-simplification of M. Or we could specify s and seek the smallest d such that this relation holds. Of course, finding the ``smallest'' is going to be a hard optimization problem. So we may be content with some good heuristics.
Suppose d > 0 is fixed, and we have an algorithm
that for each Mi, create a
new Mi+1 that is a d-simplification.
This will allow us to construct a hierarchy of models,
|
But how do we do simplification? We begin with two basic paradigms, called decimation and clustering, respectively. Let us illustrate them with the simplest kind of geometric model, where M is a polygonal path. So M=(v0, v1 ,¼, vn) where vi are points in the plane.
(1) Decimation: we can define M¢=(v0, v2, v4 ,¼, v2 ë n/2 û). Thus, we simply drop every other point in the path.
(2) Clustering: let us fix a grid G in the plane. View the grid as a discrete set of points, G Í \mathbb R2. we can define C(vi) to be the closest grid point to vi (breaking ties in some systematic way). Then we can replace M by (C(v0), C(v1) ,¼, C(vn)), and further dropping any C(vi) in case C(vi)=C(vi+1). The result is M¢.
The difference in these two points of views is that decimation guarantees reduction in the size |M|=n while clustering guarantees a bound in the distance (or distortion) d(M,M¢).
The Douglas Peucker (DP) Algorithm [2] is a method to simplify a polygonal path, originally designed for simplifying map data. Psychological studies have suggested that it produces very good perceptual representations of the original lines. It is a simple and natural algorithm, and so have been re-discovered several times in other literature. See [3] for this and related literature.
The DP Algorithm solves the following problem: given a polygonal path p=(v0, v1 ,¼, vn), n ³ 2, and an error bound e > 0, compute a subsequence p¢=(v¢0, v¢1 ,¼, v¢m) such that d(p,p¢) £ e, and v¢0=v0, v¢m = vn.
Note that ``subsequence'' means that p¢ is obtained
from p be omitting 0 or more vertices (so m £ n).
We define d(p,p¢) as follows:
if |p¢|=2, then d(p,p¢) is just the maximum
distance of any point vi from the line [`(v0vn)].
If |p¢| > 2, then there is ome 0 < k < m
such that p¢ = p1;p2 (where ; denotes path concatenation)
where p1=(v¢0 ,¼, v¢k) and p2=(v¢k ,¼, v¢m)
and recursively,
|
|
A simple implementation of the DP Algorithm is as follows
|
The DP Algorithm outputs a path p¢ that may not preserve simplicity of the input path p. There are more sophisticated algorithms that can preserve simplicity and achieve other additional properties. For instance, we can compute the simplified p¢ of minimum length (subject to d(p,p¢) £ e). Conversely, given a target m < n, there are algorithms to give the best subsequence p¢ of length m such that d(p,p¢) is minimized. Moreover, alternative measures of distortion that are different from d(p,p¢) can be investigated.
The DP line simplification algorithm can be used to to generate an ``integrated hierarchy'' that can be used in visualization. What we mean by integrated hierarchy is this: if (P0, P1 ,¼, Ph) is a seqence of simplifications of P0 (so Pi+1 is a simplification of Pi, then we call this a ``simplification hierarchy''. But this is not integrated yet. Here, we are assuming that simplication is by decimation.
Let P0=(v1 ,¼, vn) be a path with n vertices. A DP hierarchy for a path P0 is a binary tree H with 2n-2 nodes. Each node u stores (1) a subpath of P0 denoted span(u) and (2) a split vertex v(u). If u is a leaf, then v(u) is undefined and span(u) is a pair (vi, vi+1) (i=1 ,¼, n-1) of consecutive vertices of P0. If u is an internal node with uL, uR as left and right children, then span(u) = span(uL);span(uR), and v(u) is the endpoint of span(uL) and the startpoint of span(uR). Note that P;Q denotes concatenation of paths P and Q and this is defined provided the endpoint of P equals the startpoint of Q and the resulting path is just obtained by concatenating the correspond two sequences, after removing the startpoint of Q (to avoid the obvious duplication). It is clear that the DP algorithm produces such a hierarchy.
An antipath is a sequence (u1 ,¼, um) of nodes in T such that every path from the root of T to a leaf must intersect this sequence. We say that H contains the hierarchy (P0 ,¼, Ph) if each Pi can be extracted as an antipath of H.
We can construct a DP hierarchy by calling the DP algorithm with
e
= 0. The modifications to DP algorithm are
(1) We assume he DP Algorithm now returns the root of
a DP Hierarchy on the input (sub)path.
We construct the DP hierarchy as we go. Each time we find
a split point vf, we create a new node u with v(u)=vf,
and make its results of the two recursive calls the children of u.
(2) When j-i=1 we simply create a leaf node with no split nodes
and with span (vi, vj).
How do we use a DP hierarchy H? Given a e > 0, we can recursively compute the antipath Q in H such that the d(P0,Q) £ e.
Suppose this algorithm is called by Q = DPH(H,e). This is easy: we decide to go below a node u iff d(v(u), span(u)) > e. If we do not go below u, we return Q=span(u). Otherwise, we return Q = DPH(uL,e); DPH(uR,e) where uL, uR are the left and right children of u.
Extension. We can use this Hierarchy in other ways - as in ``view dependent simplifications''. Suppose p Î \mathbb R is point and we want to simplify the path proportionally to its distance from p. In this case, we can make our decision to go below a node u based on the distance of p from span(u) and v(u).
Simplifying real data is much more complex than line simplification. We want to address the TIGER data set. Ken Been's thesis [1] addresss this problem. But first we address a more general question: what really is the mathematical model of a TIGER map? The answer to this question is important if we want to understand simplification of such data. We propose to view the TIGER data as three overlaid geometric structures:
Henceforth, when we talk about a map (TIGER) M, we will mean such a triple M=(S, G, L). The fact that we have three overlaid structures imposes considerable constraints on our map simplification. The first two structures S, G may share common line features (e.g., a road can serve as a boundary between a park and non-park areas).
You may ask: what is there to simplify in L? Well, in the context of zooming, we must have a means to thinning out the set. The simplest method is just to drop landmarks as we zoom out. Alternatively, we can cluster them into ``super landmarks'' as we zoom out. Actually, such a list can be further generalized as follows: we can associate a zoom range with each point landmark. These points are only to be displayed within its zoom range. As an example, suppose certain points represents cities: we will only display a city-point when sufficiently zoomed out, but the city-point will vanish when we are too zoomed out.
Now that we know what we want (abstractly), we must proceed to extract this data from the real TIGER data. We proceed in two steps:
DECOUPLING STEP:
We must first decouple the S structure from the G structure, as this is mixed up in the TIGER representation.
INITIAL MERGE STEP: (1) We want to merge polygons with the same type information. What constitute "type"? Assuming that each polygon is part of a geographic entity, we must go to RTC to get the FIPS 55 Class Code and name of the geographic entity. We want to merge all polygons with the same code and same name. Other information may also be needed - for instance, RTS tells you whether a polygon is a water feature (lake, river, sea, etc).
The resulting (merged) polygons is given their own unique identifiers (it is normally inherited from one of its constituent POLYID's.
(2) Similarly, each Tline has a CFCC code and a feature name. Again, we merge all Tlines with the same code and name.
Note that (1) and (2) may now have diverged. In Been's thesis, step (2) is made dependent on the merge in (1) - the rule to merge two lines is simply based on their left- and right-polygons. This results in a simpler model than what we propose. In particular, Been's merging will ensuer that every merged line from (2) is either a boundary between two merged polygons, or is contained inside a unique polygon.
Further Discussion. Call the results of the above merging the S-polygons and the G-lines. Among the list of S-polygons, we need to determine their inter-relation. Polygons has containment structure. We say polygon P is the parent of another polygon Q if P contains Q. So we want to store parent pointers as well as a list of children polygons. The problem is that the polygon containment hierarchy may have depth more than one. This is illustrated by the following example from Ken's thesis.
Prince William County (PWC) in Virginia encloses two other counties, Manassas and Manassas Park. Moreover, Manassas contains three ïslands" of land that belongs to Prince William County.
Let us consider in detail how this can be done in the context of a database system like Postgresql. We will focus here on extracting the geometric information related to nesting of polygons. This is the task:
Let us emphasize that a polygon in the following discussion refers to a connected region of the plane. But it need not be simply connected. Hence the boundary of a polygon can have several connected components that we call loops. There is a unique loop that is called the outer loop. All other loops (if any) are inner loops. Two polygons are adjacent if they share a common portion of a boundary (sharing isolated points on the boundary do not count). We are given this adjacency information.
The task before us is to determine, for each polygon Q, its parent, i.e., the polygon that contains Q. It is possible that Q has no parent. Why is this task necessary? That is because in our map display, it is necessary to draw the parent of a polygon before we draw Q (otherwise the parent will cover up Q when drawn).
Why is this task non-trivial? Because this information may not be directly obtainable from the adjacency information - the parent of polygon Q need not be one of the polygons that are adjacent to Q. This is because of the phenomenon of clusters, i.e., a maximally connected set of polygons that are external to each other.
We assume that Tiger information is given to us in the form
of two relations (tables):
(i) Tlines(Tlid, StartPt, EndPt, Detail, LPolyId, RPolyId, ¼)
There is an entry for Tiger line.
Besides the shape of each line, it tells us the left and right
polygons of each Tiger line. The ¼ indicate other information
(mainly to do with the "type information" of the line.
This will be important later for simplication.
(ii) TPolys(PolyId, BoundaryList, ¼)
The boundary list is just an unordered list of Tlids, corresponding to all the Tiger lines that bound this polygon.
We shall construct the following intermediate relations in our processing:
The LoopId is a unique identifier we can generate using Postgresql (see below)
For each polygon, we want to remember its parent, its outer loop and list of inner loops.
We also have its cluster information - this is the field ClusterRep, which is basically another PolyId. (Clusters are basically represented by a "compressed tree" in the Union-Find data structure.) Each cluster has a representative - the ClusterRep field is only a pointer to will ultimately lead us to the representative. The ClusterRep field of the representative polygon is null - that is how we know we have reached the representative. Initially, each polygon is its own cluster (so its ClusterRep field is initially null).
In Postgresql, we can create a ßequence object". This object can give generate a sequence of unique identifiers for use in, say, the LoopIds:
=> CREATE SEQUENCE MySeq; => INSERT INTO Loops VALUES (nextval('MySeq'), (123, 456, 789), (true, false, false));The values returned by a sequence object is BIGINT. The sequence counter is automatically advanced. If you do not want to advance the sequence counter, just use currval('MySeq') instead.
Clustering and Parent Algorithm.
Note that this will populate the Loops and PolyBoundary tables described above. The parent of each polygon is itself at this point.
[Previous Section] [Next Section] Go to Top of Page