Correspondence Mapping Algorithms

Lidia Cassier, Robert Lucito, Vivek Mittal, Joseph West, Michael Wigler, William Casey and Bud Mishra

Abstract

In principle, microarrays are an ideal tool for mapping genomes because a large amount of information can be gathered in parallel from a single hybridization. In this abstract, we explain the algorithmic and the related mathematical tools needed to utilize microarray hybridization data to BAC pools and establish correspondence maps, the assignment of probes to individual BACs. The applications of correspondence mapping include

finding transcribed regions of the genome,
filling gaps in assembled genomic sequence,
integrating the annotations of two probe sets,
finding orthologs of human genes in other mammals, and

combining linear and genetic maps into integrated genomic maps.

Our central strategy involves a series of hybridizations to carefully constructed pools of BACs. In brief, we issue to a unique binary code to each BAC in a collection. We then construct BAC pools, making binary partitions according to the binary codes. After a series of array hybridizations with the BAC pools, each arrayed probe has an associated hybridization pattern. We convert the pattern of each probe to a numerical code, and then match the probe code to the closest BAC code, using a Hamming distance. From the statistical distribution of the Hamming distances, we can determine with confidence which probes have been correctly mapped to a homologous BAC.

Error correcting algorithms.
We hardly expect unambiguous data for all probe hybridizations, and even one ambiguous digit in a non-redundant code is enough to render probe assignment infeasible. The solution is to make extra poolings, by assigning extra (i.e., "redundant distinguishing") random bits of 0 or 1 to each BAC, and carry out additional partitions and hybridizations of the resulting pools. In order to distinguish one BAC from another, we will need hybridization experiments to yield distinguishable results (e.g., color) on those bits where the BAC addresses differ. As the signal-to-noise for hybridization deteriorates, increasing the number of distinguishing bits improves our ability to tell two BACs apart.

For each outcome for each probe, the ratio between channels is mapped to the real interval from 1.0 (red) to 0.0 (green), using reasonable linear projections with thresholds, creating a probe vector in N-space, where N is the total number of hybridizations. We can then associate the closest BAC vector to each probe vector, using a Hamming metric. This Hamming metric is an unbiased statistical estimator of the true Hamming distance, with a variance that depends on the noise inherent to hybridization. Thus as long as the variance in the Hamming metric is reasonably small, the closest BAC associated with a probe under the Hamming metric is almost surely also the closest BAC under the Hamming distance. In particular, we infer that if a probe belongs to exactly one BAC then this is the only BAC with a Hamming distance of zero and hence, the association between these probes and the BACs are correct with very high probability.

We have simulated probe-BAC pool hybridizations using parameters derived from the experiments. Using the error correcting algorithm described above, the results are clear and gratifying. With the noise levels and signal to background ratios consistent with our experimental data, and using 24 partitions/hybridizations to encode 4096 BACS, we obtain about a 92% yield of all possible correct assignments, and make on the order of 2% incorrect assignments.