Multimarker Genetic Analysis Methods for High Throughput Array Data
Candidate: Iuliana Ionita
Advisor: Bud Mishra

Abstract

In this thesis, we focus on multi-marker (multi-locus) statistical methods for analyzing high-throughput array data in order to detect genes implicated in complex disorders. The thesis has two main parts: the first concerns the localization of cancer genes from copy number variation data, with an application to lung cancer; the second concerns the localization of disease genes using an affected-sib-pair design, with an application to inflammatory bowel disease. A third part addresses an important issue in the design of such disease-gene-detection studies. More details follow:

1. Detection of Oncogenes and Tumor Suppressor Genes using Multipoint Statistics from Copy Number Variation Data

ArrayCGH is a microarray-based comparative genomic hybridization technique that has been used to compare a tumor genome against a normal genome, thus providing a rapid genomic assay of the tumor genome in terms of copy number variations, that is, the chromosomal segments that have been gained or lost. When properly interpreted, these assays are likely to shed important light on the genes and mechanisms involved in the initiation and progression of cancer. Specifically, chromosomal segments that are amplified or deleted in a group of cancer patients point to the locations of cancer genes. We describe a statistical method to estimate the location of such genes by analyzing segmental amplifications and deletions in the genomes of cancer patients and the spatial relation of these segments to any specific genomic interval. The algorithm assigns to each genomic segment a score that parsimoniously captures the underlying biology. It computes a p-value for every putative disease gene by using results from the theory of scan statistics. We have validated our method using simulated datasets, as well as a real dataset on lung cancer.

2. Multi-locus Linkage Analysis of Affected-Sib-Pairs

The affected-sib-pair (ASP) design is a simple and popular design in the linkage analysis of complex traits. Traditional ASP methods evaluate the linkage information at a locus by considering only the marginal linkage information present at that locus. However, complex traits are influenced by multiple genes that interact to increase the risk of disease. We describe a multi-locus linkage method that uses both the marginal information and information derived from the possible interactions among several disease loci, thereby increasing the significance of loci with modest marginal effects. Our method is based on a statistic that quantifies the linkage information contained in a set of markers. By a marker selection-reduction process, we screen a set of polymorphisms and select a few that seem linked to disease. We test our approach on simulated data and on genome-scan data for inflammatory bowel disease. We show that our method is expected to be more powerful than single-locus methods in detecting disease loci responsible for complex traits.
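For background, the kind of single-locus (marginal) analysis that traditional ASP methods perform can be sketched with the classical mean test on identity-by-descent (IBD) sharing. This is a minimal illustration of the single-locus baseline, not the multi-locus statistic developed in the thesis. It assumes a fully informative marker, so that the number of alleles each sib pair shares IBD (0, 1, or 2) is known exactly; the function name `asp_mean_test` and its input format are hypothetical.

```python
import math


def asp_mean_test(ibd_counts):
    """Classical single-locus mean test for affected sib pairs.

    ibd_counts: one value per sib pair, the number of alleles (0, 1, or 2)
    shared identical by descent at the marker. Assumes IBD status is known
    exactly (fully informative marker), which is a simplification.

    Returns (pi_hat, z, p): the mean proportion of alleles shared IBD,
    the z-score against the no-linkage value 1/2, and the one-sided p-value.
    """
    n = len(ibd_counts)
    if n == 0:
        raise ValueError("at least one sib pair is required")
    if any(c not in (0, 1, 2) for c in ibd_counts):
        raise ValueError("IBD counts must be 0, 1, or 2")

    # Mean proportion of alleles shared IBD across pairs.
    pi_hat = sum(ibd_counts) / (2 * n)

    # Without linkage, IBD counts 0/1/2 occur with probabilities 1/4, 1/2, 1/4,
    # so the per-pair sharing proportion has mean 1/2 and variance 1/8.
    z = (pi_hat - 0.5) / math.sqrt(1.0 / (8 * n))

    # One-sided normal tail probability: excess sharing suggests linkage.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return pi_hat, z, p


if __name__ == "__main__":
    # Toy data: 40 pairs with more sharing than expected by chance.
    toy = [2] * 16 + [1] * 18 + [0] * 6
    print(asp_mean_test(toy))
```

Because this test looks at one locus at a time, a locus whose effect shows up mainly through interaction with other loci can produce sharing close to 1/2 and go undetected, which is the limitation the multi-locus method addresses.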

3. A Practical Haplotype Inference Algorithm

We consider the problem of designing efficient algorithms to infer haplotypes and their distribution from the genotypes of unrelated individuals.

With the currently available catalogue of single-nucleotide polymorphisms (SNPs), and given their abundance throughout the genome (about one in every 500 bp) and low mutation rates, scientists hope to significantly improve their ability to discover genetic variants associated with a particular complex trait. We present a solution to a key intermediate step by devising a practical algorithm that infers the haplotype variants of a particular individual from that individual's SNP genotype data in relation to population data. The algorithm we present is simple to describe and implement; it makes no assumptions such as perfect phylogeny or the availability of parental genomes (as in trio studies); it exploits locality in linkage and low diversity in haplotype blocks to achieve a time complexity that is linear in the number of markers; it combines many of the advantageous properties and concepts of existing statistical algorithms for this problem; and finally, it outperforms competing algorithms in computational complexity and accuracy, as demonstrated by studies performed on real and synthetic data.
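To make the underlying inference problem concrete, the sketch below enumerates every pair of haplotypes consistent with a single unphased genotype. It illustrates only why haplotypes are ambiguous given genotype data; it is not the inference algorithm of the thesis. The encoding (each SNP recorded as 0, 1, or 2 copies of one allele) and the function name `compatible_haplotype_pairs` are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with one genotype.

    genotype: sequence with one entry per SNP, giving the number of copies
    (0, 1, or 2) of a chosen allele. This encoding is an assumption of the
    sketch. Heterozygous sites (value 1) are the ambiguous ones.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype entries must be 0, 1, or 2")

    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes: 0 -> allele 0, 2 -> allele 1.
    base = [g // 2 for g in genotype]

    pairs = set()
    # Try every assignment of alleles at the heterozygous sites.
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        # Sort within the pair so that (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs((1, 1, 0, 2)):
        print(pair)
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs for k >= 1, so exhaustive enumeration grows exponentially with the number of markers. This is why a practical method has to draw on population data and local structure such as haplotype blocks.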