DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

**Organizers:**

**Jaume Bertranpetit**, Universitat Pompeu Fabra, jaume.bertranpetit at upf.edu

**Dan Gusfield**, University of California, gusfield@cs.ucdavis.edu

**Laxmi Parida**, IBM T J Watson Research, parida at us.ibm.com

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).

Title: Recombinations-based Population Genomics

We will summarize a list of hot problems in (human) population genetics, a field that underlies wide developments in the genomic sciences, biomedicine and biology at large. These problems range from using population genetics to redraw population history to reconstructing genome sequences, with many implications for disease genetics. One of them is the inference of recombinations from SNP data (or, in the future, sequence data) in a set of chromosomes. This inference has been implemented in a system, IRiS, and the results yield a new set of information for population genetics.

In particular, we will discuss our study of human population diversity using evidence of past recombinations (termed recotypes) as genetic markers. Our inferred recombinations agree strongly with past in vitro and in silico recombination rate estimates. The correlation between traditional allele-frequency-based distances and recombinational distances lends further credence to the study of population structure using recotypes. Furthermore, our results indicate that recotypes are more representative of the underlying population structure than the haplotypes they are derived from.

Title: Dimensionality reduction in the analysis of human genetics data

Dimensionality reduction algorithms have been widely used for data analysis in numerous application domains including the study of human genetics. For instance, linear dimensionality reduction techniques (such as Principal Components Analysis) have been extensively applied in population genetics. In this talk we will discuss such applications and their implications for human genetics, as well as the potential of applying non-linear dimensionality reduction techniques in this area.

Title: More powerful genome-wide association methods for case-control data

In case-control Single Nucleotide Polymorphism (SNP) data, there are three distinct sources of information about genetic association, and correspondingly three different tests: the Allele frequency, Hardy-Weinberg Disequilibrium (HWD) and Linkage Disequilibrium (LD) contrast tests. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for dominant, recessive and heterozygote disadvantage diseases, the genotypic test has the most power. Among the six two-marker tests, the Allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our result supports incorporating an interaction term as well as linear terms in multi-marker tests.
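To make the distinction between the single-marker allelic and genotypic tests concrete, here is a minimal sketch of their classical retrospective (contingency-table) forms. It is not the prospective logistic regression framework described in the abstract. The function names, the genotype ordering `(AA, Aa, aa)` and the counts are hypothetical, chosen only for illustration, and the sketch assumes every row and column total in the table is positive.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row (status) and column.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def allelic_table(cases, controls):
    """Collapse (AA, Aa, aa) genotype counts into a 2 x 2 allele-count table."""
    def allele_counts(genotypes):
        n_AA, n_Aa, n_aa = genotypes
        return [2 * n_AA + n_Aa, 2 * n_aa + n_Aa]
    return [allele_counts(cases), allele_counts(controls)]


def genotypic_table(cases, controls):
    """Keep the three genotype classes separate: a 2 x 3 table."""
    return [list(cases), list(controls)]


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, Aa, aa); not data from the talk.
    cases = (30, 50, 20)
    controls = (20, 50, 30)
    print("allelic statistic (1 df):", pearson_chi_square(allelic_table(cases, controls)))
    print("genotypic statistic (2 df):", pearson_chi_square(genotypic_table(cases, controls)))
```

The allelic statistic is compared with a chi-square distribution on 1 degree of freedom and the genotypic statistic with one on 2 degrees of freedom. The extra degree of freedom is what lets the genotypic test pick up non-additive effects, at some cost in power when the disease model is additive.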

Title: Two (population genetics and phylogenetics) Solutions in Search of Killer Apps

Title: Statistical Alignment, Footprinting and Transfer of Knowledge

Methods of sequence analysis based on stochastic models of the insertion-deletion process (Statistical Alignment) have been a major and surprising success story of the last decade, after three decades of dominance of non-statistical optimization (similarity maximization) approaches. There are still many challenges to statistical alignment, both in terms of biological realism and in the computational demands of increasingly large data sets. Combining statistical alignment with annotation techniques will have clear advantages and has only recently begun to be explored. This talk will present recent advances in combining statistical alignment with the search for regulatory signals that unambiguously outperform competing approaches.

Title: Efficient algorithms for ascertaining markers for controlling for population substructure

Human population substructure has traditionally been studied in human population genetics to make inferences about past evolutionary events, related either to demographic factors or to selective pressures. In recent years, however, it has acquired additional relevance because of its role as a putative confounding factor in epidemiological case-control studies, as well as its putative importance in the forensic field. So far, different statistical approaches have been proposed for controlling for this confounding effect, including the use of markers ascertained specifically for their informativeness in detecting population substructure (AIMs, also called ancestry sensitive markers or ASMs). This has been quite an active field of research, and several algorithms have been developed for ascertaining sets of AIMs; they differ mainly in whether pre-defined clusters of individuals are considered and in the metrics used to estimate the amount of population substructure, among other factors. Nevertheless, it has been shown that population substructure is a confounding factor only in quite particular scenarios, namely when the genetic variation covaries with the phenotypic variation. It therefore seems reasonable that markers ascertained for further use in case-control studies should include this information. We have implemented an algorithm that ascertains markers that are associated, at the population level, with particular phenotypes, either discrete or continuous; results suggest that markers that covary geographically with the phenotype at a population level could indeed be associated with it.

Title: Imputation-based local ancestry inference in admixed populations

This is joint work with Bogdan Pasaniuc (ICSI Berkeley) and Justin Kennedy (UCONN).

Title: High-dimensional data-sets and the problems they cause

Perhaps more than any other scientific discipline, the biological sciences are currently in the midst of a golden era of technological advances. These advances are allowing us to collect data that are a quantum leap better (i.e. more detailed) and bigger (i.e. genomewide) than has hitherto been available. These data represent a gold-mine in our efforts to understand the relationship between our genetic and phenotypic make-ups, but they also introduce problems. Models that were tractable for smaller data-sets become intractable in the new era, explicit calculation often becomes impossible, and many analysis methods begin to break down. We discuss these problems and illustrate proposed solutions using examples drawn from applications in population genetics.

Title: Human Population Genomics: Man, Woman, Birth, Death, Infinity, Plus Altruism, Cheap Talks, Bad Behavior, Money, God and Diversity on Steroids

Our ancestors became almost extinct twice, most recently about 40,000 to 60,000 years ago. At one point, the population had shrunk to as few as 4,000 individuals, but it expanded rapidly as humans migrated to other parts of the world and learned to farm and domesticate animals. The genomes of the current human population record this history as it has been molded by mutations (polymorphisms), migration, genetic drift and selection. The statistical distributions of genes and other genomic elements are hard to decipher, since they mix a huge amount of diversity fueled by genetic drift, resulting from small populations and non-random mating, with significant differences that contribute to each individual's overall traits.

However, as we prepare to usher in the age of individualized medicine, we have to attack the underlying statistical analysis problem on several fronts: (1) Technology, (2) Systems Biology and Genetics, (3) Statistical Algorithms, and (4) Large-Scale System Building. My group has been engaged in developing a single-molecule sequencing technology (SMASH) and sequence assembly algorithms (SUTTA) to collect very high-quality haplotypic sequencing data from a large number of individuals. Using this data, we aim to catalog and understand how different polymorphisms (SNP, CNV, segmental rearrangements and possibly many others) originate and diffuse through the population. This will then lead to various novel non-parametric algorithms to model the stochastic processes that are modulated by population sizes, migration and mating patterns. This integrated technology can then be used to discover and exploit groups of genetic markers to drive the core recommender engine of individualized medicine. I will discuss various open problems related to this strategy and their possible solutions.

Title: Population genetic analyses of next-generation sequencing data

Next-generation sequencing data pose special problems for population genetic analyses because of low coverage, missing data, and sequencing errors. We present some new methods for addressing these problems and show applications to the estimation of inbreeding coefficients, population-scaled mutation rates, frequency spectra and other statistics of interest to population geneticists. We illustrate with a number of applications in humans and other organisms.

Title: RECOMBINOMICS: Myth or Reality?

The talk is in two parts. In the first part we explore the general problem of reconstructability of pedigree history using a random graphs framework. How plausible is it to unravel the history of a complete unit (chromosome) of inheritance? In the second part of the talk we discuss our approach to reconstructing the recombinational history of a sample of individuals. I will describe the underlying algorithms in a system called IRiS that we have used in studying population diversity.

Title: Haplotype clusters and imputed genotypes in diverse human populations

Shared descent of similar haplotypes from a common ancestor enables the inference of haplotype phase from diploid genotypes and the imputation of unmeasured alleles. This talk will examine a series of problems that arise in human population genetics from the consideration of phasing and imputation. Topics that will be discussed include (1) the development of an encoding of haplotypes pointwise along the genome for use in population-genetic analysis, (2) the measurement of the accuracy of genotype imputation in diverse human populations, and (3) the evaluation of the connection between imputation error and the power of association-mapping studies.

Title: Genome shrinkage by elimination of duplicates

Over evolutionary time scales, genomes may expand and shrink considerably. A variety of environmental and functional selective forces have been adduced to account for these changes. One well-known mechanism for sudden expansion is whole genome doubling (WGD). Following an episode of WGD, gene duplicates are lost at a high rate through processes such as pseudogenization and deletion of chromosomal segments containing one or more genes, while intra- and interchromosomal rearrangement mechanisms redistribute chromosomal segments both large and small across the genome. The genome of the present-day descendant can be largely decomposed into a set of duplicated DNA segments dispersed among the chromosomes, with all the duplicate pairs exhibiting a similar degree of sequence divergence, and with single-copy segments interspersed among them. In this paper, we introduce approaches to analyzing the evolution of doubled genomes, based entirely on gene order evidence, in order to explain aspects of the gene loss process and to reconstruct the rearrangement steps leading from the doubled ancestral genome to the present-day descendant. This is based on the recently developed "Guided halving" algorithm and statistical analysis of "Conserved frames". We apply our methods to yeast, cereal and poplar genomes.

Title: Half-Sibs Reconstruction Problem: A Theoretical Analysis

Kinship analysis from microsatellite markers is an important area of population genetics with applications in conservation biology, kin selection, evolutionary biology and agriculture. Computationally, this area poses a number of interesting problems, especially when modeled with combinatorial optimization. The goal is, given genotypic data for a cohort of individuals, to reconstruct kinship information, including sibling groups and parental genotypes. We will discuss the combinatorial problem of reconstructing the minimum number of half-sibling groups (i.e. groups of individuals that share one parent) necessary to explain a population. In addition to the applications of such information, we will discuss an exact algorithm, complexity, and finally the relation of the problem to a formulation of Full-Sibs Reconstruction and, consequently, to Raz's Parallel Repetition Theorem.

Title: Forensic DNA analysis and multi-locus match probability in finite populations: a fundamental difference between the Moran and Wright-Fisher models

A classical problem in population genetics, which is also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11 to 13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the US, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than 5 loci has hitherto remained an open problem.

In this talk, I will describe how a flexible graphical framework can be employed to compute multi-locus MPs analytically. I will consider two standard models of random mating, namely the Wright-Fisher and Moran models, and describe the computation of MPs for up to 10 loci in the Wright-Fisher model and up to 13 loci in the Moran model. For a finite population, I will show that the MPs for a large number of loci predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true multi-locus MPs are not. Furthermore, I will show that the Wright-Fisher and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, I will demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the Wright-Fisher model. Hence, our work reveals a striking fundamental difference between the two standard models of random mating.
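The product rule mentioned in this abstract can be sketched in a few lines. This is only an illustration of the independence assumption itself, not of the graphical framework or of the Wright-Fisher and Moran computations in the talk. The sketch additionally assumes Hardy-Weinberg genotype proportions at each locus, so that the one-locus MP is the sum of squared genotype frequencies. The function names and allele frequencies are hypothetical.

```python
from itertools import combinations
from math import prod


def one_locus_match_probability(allele_freqs):
    """Probability that two random individuals carry the same unordered
    genotype at one locus, assuming Hardy-Weinberg proportions."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Homozygote i/i has frequency p_i^2; both individuals match with p_i^4.
    homozygous = sum(p ** 4 for p in allele_freqs)
    # Heterozygote i/j has frequency 2 p_i p_j; both match with (2 p_i p_j)^2.
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(allele_freqs, 2))
    return homozygous + heterozygous


def product_rule_match_probability(loci):
    """Multiply one-locus MPs, i.e. treat the loci as statistically independent."""
    return prod(one_locus_match_probability(freqs) for freqs in loci)


if __name__ == "__main__":
    # Hypothetical allele frequencies at three unlinked loci.
    loci = [
        [0.5, 0.5],
        [0.2, 0.3, 0.5],
        [0.1, 0.2, 0.3, 0.4],
    ]
    print(product_rule_match_probability(loci))
```

Because the result is a plain product, it cannot capture the non-independence across loci that genealogical relationships create in a finite population, which is the effect the talk quantifies.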

Title: Estimating human demographic parameters from DNA sequence data

We present a composite-likelihood framework for estimating demographic parameters from DNA resequencing data. We then apply this method to estimate split times and migration rates between different populations. Our results suggest that population structure within Africa is quite old, and likely predates the expansion of modern humans to other continents. We also outline how this method can be used to detect ancient admixture events (e.g., between modern humans and 'archaic' human groups such as Neandertals), and speculate that ancient admixture may be a common feature in the history of many extant populations.

Title: A fatgraph model of protein structure

The CATH database is a hierarchical classification of protein domain structures with four main levels. Classification on the top level is relatively easy, but already at the second level manual work is needed. We present a novel method for describing domain structures based on concepts from algebraic topology. Using the locations of the backbone atoms and the hydrogen bonds we create a combinatorial object -- a so-called fatgraph -- which is then transformed into a topological object. The topological object of our method does not depend on any particular embedding in a Euclidean space, and this leads to defining intrinsic quantities -- topological invariants -- of protein domains. We have implemented algorithms to calculate these quantities and other quantities of interest. We show some results for classification of domain structures using topological invariants; even simple classification schemes perform remarkably well. Apart from the model's use in protein classification, it might eventually be used to guide structure prediction and structural annotation of proteins.

Document last modified on April 14, 2009.