This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).
Title: Recombinations-based Population Genomics
We will give a summary of a list of open problems in (human) population genetics that underpin broad developments in genomic sciences, biomedicine and biology at large. These problems range from using population genetics to redraw population history to reconstructing genome sequences, with many implications for disease genetics. One of the points is the inference of recombinations from SNP data (or, in the future, sequence data) in a set of chromosomes. This has been implemented in a system, IRiS, and the results yield a new source of information for population genetics.
In particular, we will discuss our study of human population diversity using evidence of past recombinations (termed recotypes) as genetic markers. Our inferred recombinations show strong agreement with past in vitro and in silico recombination rate estimates. The correlation between traditional allele-frequency-based distances and recombinational distances lends further credence to the study of population structure using recotypes. Furthermore, our results indicate that recotypes are more representative of the underlying population structure than the haplotypes they are derived from.
Title: Dimensionality reduction in the analysis of human genetics data
Dimensionality reduction algorithms have been widely used for data analysis in numerous application domains including the study of human genetics. For instance, linear dimensionality reduction techniques (such as Principal Components Analysis) have been extensively applied in population genetics. In this talk we will discuss such applications and their implications for human genetics, as well as the potential of applying non-linear dimensionality reduction techniques in this area.
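As a minimal sketch of the kind of linear dimensionality reduction discussed here, PCA coordinates for a genotype matrix can be computed from an SVD of the column-centered matrix (the data below are random stand-ins, not real genotypes):

```python
import numpy as np

# Illustrative sketch only: random 0/1/2 genotype dosages for 20 individuals
# at 100 SNPs stand in for real data.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 100)).astype(float)

X = genotypes - genotypes.mean(axis=0)       # center each SNP column
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]                       # top-2 PC coordinates per individual
print(pcs.shape)                             # (20, 2)
```

On real SNP data, the leading principal components typically separate individuals by ancestry, which is what makes PCA useful for studying population structure.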
Title: More powerful genome-wide association methods for case-control data
In case-control Single Nucleotide Polymorphism (SNP) data, there are three distinct sources of information about genetic association, and correspondingly three different tests: the allele frequency, Hardy-Weinberg disequilibrium (HWD) and linkage disequilibrium (LD) contrast tests. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single-stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived, and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for dominant, recessive and heterozygote-disadvantage diseases, the genotypic test has the most power. Among the six two-marker tests, the allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our results support incorporating an interaction term as well as linear terms in multi-marker tests.
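As an illustration of the simplest of these tests, the single-marker allelic (allele-frequency contrast) test can be computed as a 1-df chi-square on collapsed allele counts. A minimal sketch; the genotype counts below are made up, not from the talk:

```python
def allelic_chi2(case_gt, control_gt):
    """case_gt / control_gt: (n_AA, n_Aa, n_aa) genotype counts.
    Returns the 1-df chi-square statistic contrasting case vs
    control allele frequencies (illustrative sketch only)."""
    def allele_counts(gt):
        n_AA, n_Aa, n_aa = gt
        return (2 * n_AA + n_Aa, 2 * n_aa + n_Aa)  # (#A, #a)

    a1, a2 = allele_counts(case_gt)      # case A, a counts
    b1, b2 = allele_counts(control_gt)   # control A, a counts
    n = a1 + a2 + b1 + b2
    chi2 = 0.0
    # Compare each observed cell of the 2x2 allele table to its
    # expected count under no association (row x column margins).
    for obs, row, col in [(a1, a1 + a2, a1 + b1),
                          (a2, a1 + a2, a2 + b2),
                          (b1, b1 + b2, a1 + b1),
                          (b2, b1 + b2, a2 + b2)]:
        exp = row * col / n
        chi2 += (obs - exp) ** 2 / exp
    return chi2

stat = allelic_chi2((30, 50, 20), (20, 50, 30))
print(round(stat, 3))  # 4.0
```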
Title: Two (population genetics and phylogenetics) Solutions in Search of Killer Apps
Title: Statistical Alignment, Footprinting and Transfer of Knowledge
Methods of sequence analysis based on stochastic models of the insertion-deletion process (Statistical Alignment) have been a major and surprising success story of the last decade, after three decades of dominance by non-statistical optimization (similarity maximization) approaches. Many challenges remain for statistical alignment, both in terms of biological realism and the computational demands of increasingly large data sets. Combining statistical alignment with annotation techniques has clear advantages and is only now being explored. This talk will present recent advances in combining statistical alignment with the search for regulatory signals, an approach that unambiguously outperforms competing methods.
Title: Efficient algorithms for ascertaining markers for controlling for population substructure
Human population substructure has traditionally been studied in human population genetics to make inferences about past evolutionary events, related either to demographic factors or to selective pressures. Nevertheless, human population substructure has acquired additional relevance in recent years due to its role as a putative confounding factor in epidemiological case-control studies, as well as its putative importance in the forensic field. So far, different statistical approaches have been proposed for controlling for this confounding effect, including the use of markers ascertained specifically for their informativeness in detecting population substructure (AIMs, also called ancestry sensitive markers or ASMs). This has been a quite active field of research, and several algorithms have been developed for ascertaining sets of AIMs, differing mainly in whether pre-defined clusters of individuals are considered and in the kind of metric used to estimate the amount of population substructure, among other factors. Nevertheless, it has been shown that population substructure is a confounding factor only in quite particular scenarios, namely when the genetic variation covaries with the phenotypic variation. Therefore, it seems reasonable that markers ascertained for further use in case-control studies should incorporate this information. We have implemented an algorithm that allows one to ascertain markers that are associated, at the population level, with particular phenotypes, either discrete or continuous; results suggest that markers that covary geographically with the phenotype at a population level could indeed be associated with it.
Title: Imputation-based local ancestry inference in admixed populations
This is joint work with Bogdan Pasaniuc (ICSI Berkeley) and Justin Kennedy
(UCONN).
Title: High-dimensional data-sets and the problems they cause
Perhaps more than any other scientific discipline, the biological sciences
are currently in the midst of a golden era of technological advances. These
advances are allowing us to collect data that are a quantum leap better (i.e.
more detailed) and bigger (i.e. genomewide) than has hitherto been available.
These data represent a gold-mine in our efforts to understand the relationship
between our genetic and phenotypic make-ups, but they also introduce problems.
Models that were tractable for smaller data-sets become intractable in the new
era, explicit calculation often becomes impossible, and many analysis methods
begin to break down. We discuss these problems and illustrate proposed solutions
using examples drawn from applications in population genetics.
Title: Human Population Genomics: Man, Woman, Birth, Death, Infinity, Plus
Altruism, Cheap Talks, Bad Behavior, Money, God and Diversity on Steroids
Our ancestors became almost extinct twice, the most recent being about 40,000
to 60,000 years ago. At one point, the population had shrunk to as few as 4,000
individuals, but expanded rapidly as humans migrated to other parts of the world
and learned to farm and domesticate animals. The genomes of the current human
population record this history as it has been molded by mutation
(polymorphisms), migration, genetic drift and selection. The statistical
distributions of genes and other genomic elements are hard to decipher, since
they mix a huge amount of diversity fueled by genetic drift, resulting from
small populations and non-random mating, with significant differences that
contribute to each individual's overall traits.
However, as we prepare to usher in the age of individualized medicine, we
have to attack the underlying statistical analysis problem on several fronts:
(1) Technology, (2) Systems Biology and Genetics, (3) Statistical Algorithms,
and (4) Large-Scale System Building. My group has been engaged in developing a
single-molecule sequencing technology (SMASH) and sequence assembly algorithms
(SUTTA) to collect very high-quality haplotypic sequencing data from a large
number of individuals. Using this data, we aim to catalog and understand how
different polymorphisms (SNP, CNV, segmental rearrangements and possibly many
others) originate and diffuse through the population. This will then lead to
various novel non-parametric algorithms to model the stochastic processes that
are modulated by population sizes, migration and mating patterns. This
integrated technology can then be used to discover and exploit groups of genetic
markers to drive the core recommender engine of individualized medicine. I will
discuss various open problems related to this strategy and their possible
solutions.
Title: Population genetic analyses of next-generation sequencing data
Low-coverage next-generation sequencing data pose special problems for
population genetic analyses because of missing data and sequencing errors. We
present some new methods for addressing these problems and show applications to
the estimation of inbreeding coefficients, population-scaled mutation rates,
frequency spectra and other statistics of interest to population geneticists.
We illustrate with a number of applications in humans and other organisms.
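Methods of this kind typically work from per-site genotype likelihoods rather than hard genotype calls. A minimal sketch under a simple per-read error model; the alleles and error rate here are assumptions for illustration, not the speakers' implementation:

```python
import math

def genotype_log_likelihoods(reads, error_rate=0.01):
    """reads: observed bases at a diallelic A/G site, e.g. ['A', 'A', 'G'].
    Returns log P(reads | genotype) for reference-allele doses 2 (AA),
    1 (AG) and 0 (GG), assuming each read is wrong with prob error_rate."""
    e = error_rate
    ref = 'A'                         # assumed reference allele for this sketch
    p_ref = {2: 1 - e, 1: 0.5, 0: e}  # P(read shows ref | dose)
    ll = {}
    for dose in (2, 1, 0):
        p = p_ref[dose]
        ll[dose] = sum(math.log(p if b == ref else 1 - p) for b in reads)
    return ll

ll = genotype_log_likelihoods(['A', 'A', 'G'])
best = max(ll, key=ll.get)
print(best)  # 1: the heterozygote is most likely for these three reads
```

Downstream estimators (frequency spectra, inbreeding coefficients) then integrate over these likelihoods instead of committing to a single called genotype, which is what makes them robust at low coverage.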
Title: RECOMBINOMICS: Myth or Reality?
The talk is in two parts. In the first part we explore the general problem of
reconstructability of pedigree history using a random graphs framework. How
plausible is it to unravel the history of a complete unit (chromosome) of
inheritance? In the second part of the talk we discuss our approach to
reconstructing the recombinational history of a sample of individuals. I will
describe the underlying algorithms in a system called IRiS that we have used in
studying population diversity.
Title: Haplotype clusters and imputed genotypes in diverse human populations
Shared descent of similar haplotypes from a common ancestor enables the
inference of haplotype phase from diploid genotypes and the imputation of
unmeasured alleles. This talk will examine a series of problems that arise in
human population genetics from the consideration of phasing and imputation.
Topics that will be discussed include (1) the development of an encoding of
haplotypes pointwise along the genome for use in population-genetic analysis,
(2) the measurement of the accuracy of genotype imputation in diverse human
populations, and (3) the evaluation of the connection between imputation error
and the power of association-mapping studies.
Title: Genome shrinkage by elimination of duplicates
Over evolutionary time scales, genomes may expand and shrink considerably. A
variety of environmental and functional selective forces have been adduced to
account for these changes. One well-known mechanism for sudden expansion is
whole genome doubling (WGD). Following an episode of WGD, gene duplicates are
lost at a high rate through processes such as pseudogenization and deletion of
chromosomal segments containing one or more genes, while intra- and
interchromosomal rearrangement mechanisms redistribute chromosomal segments both
large and small across the genome. The genome of the present-day descendant can
be largely decomposed into a set of duplicated DNA segments dispersed among the
chromosomes, with all the duplicate pairs exhibiting a similar degree of
sequence divergence, and with single-copy segments interspersed among them. In
this paper, we introduce approaches to analyzing the evolution of doubled
genomes, based entirely on gene order evidence, in order to explain aspects of
the gene loss process and to reconstruct the rearrangement steps leading from
the doubled ancestral genome to the present-day descendant. This is based on the
recently developed "guided halving" algorithm and statistical analysis of
"conserved frames". We apply our methods to yeast, cereal and poplar genomes.
Title: Half-Sibs Reconstruction Problem: A Theoretical Analysis
Kinship analysis from microsatellite markers is an important area of population genetics with applications in conservation biology, kin selection, evolutionary biology and agriculture. Computationally, this area poses a number of interesting problems, especially when modeled with combinatorial optimization. The goal is, given genotypic data for a cohort of individuals, to reconstruct kinship information including sibling groups and parental genotypes. We will discuss the combinatorial problem of reconstructing the minimum number of half-sibling groups (i.e. groups of individuals that share one parent) necessary to explain a population. In addition to the applications of such information, we will discuss an exact algorithm, complexity results, and finally the relation of the problem to a formulation of Full-Sibs Reconstruction and, consequently, to Raz's Parallel Repetition Theorem.
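As a small illustration of the combinatorial flavor of the problem: at a single locus, a necessary (not sufficient) condition for a group to be half-sibs is that some candidate parental genotype {a, b} contributes an allele to every individual, i.e. a two-allele cover of the genotypes. A hedged brute-force sketch, not the exact algorithm discussed in the talk:

```python
from itertools import combinations_with_replacement

def compatible_half_sibs(genotypes):
    """genotypes: list of (allele1, allele2) pairs at one microsatellite
    locus. True if a single parental genotype {a, b} (possibly a == b)
    could have passed one allele to every individual. Sketch only."""
    alleles = sorted({a for g in genotypes for a in g})
    for a, b in combinations_with_replacement(alleles, 2):
        # Every child must carry at least one of the parent's alleles.
        if all(a in g or b in g for g in genotypes):
            return True
    return False

print(compatible_half_sibs([(1, 2), (2, 3), (3, 4)]))  # True: parent {2, 3}
print(compatible_half_sibs([(1, 2), (3, 4), (5, 6)]))  # False: needs 3 alleles
```

The optimization problem in the talk, partitioning a cohort into the fewest such groups consistently across loci, is where the hardness and the connection to parallel repetition arise.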
Title: Forensic DNA analysis and multi-locus match probability in finite
populations: a fundamental difference between the Moran and Wright-Fisher models
A classical problem in population genetics, which is also of importance to
forensic science, is to compute the match probability (MP) that two individuals
randomly chosen from a population have identical alleles at a collection of
loci. At present, 11 to 13 unlinked autosomal microsatellite loci are typed for
forensic use. In a finite population, the genealogical relationships of
individuals can create statistical non-independence of alleles at unlinked loci.
However, the so-called product rule, which is used in courts in the US, computes
the MP for multiple unlinked loci by assuming statistical independence,
multiplying the one-locus MPs at those loci. Analytically testing the accuracy
of the product rule for more than 5 loci has hitherto remained an open problem.
In this talk, I will describe how a flexible graphical framework can be
employed to compute multi-locus MPs analytically. I will consider two standard
models of random mating, namely the Wright-Fisher and Moran models, and describe
the computation of MPs for up to 10 loci in the Wright-Fisher model and up to 13
loci in the Moran model. For a finite population, I will show that the MPs for a
large number of loci predicted by the product rule are highly sensitive to
mutation rates in the range of interest, while the true multi-locus MPs are not.
Furthermore, I will show that the Wright-Fisher and Moran models may produce
drastically different MPs for a finite population, and that this difference
grows with the number of loci and mutation rates. Although the two models
converge to the same coalescent or diffusion limit, in which the population size
approaches infinity, I will demonstrate that, when multiple loci are considered,
the rate of convergence in the Moran model is significantly slower than that in
the Wright-Fisher model. Hence, our work reveals a striking fundamental
difference between the two standard models of random mating.
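For concreteness, the product rule the abstract refers to multiplies one-locus match probabilities across unlinked loci, i.e. it assumes statistical independence. A minimal numerical sketch under Hardy-Weinberg proportions; the allele frequencies below are made up:

```python
def one_locus_match_prob(freqs):
    """freqs: allele frequencies at one locus (summing to 1).
    Returns P(two random individuals share a genotype), assuming
    Hardy-Weinberg proportions: sum over genotypes g of P(g)^2."""
    genotype_probs = []
    n = len(freqs)
    for i in range(n):
        for j in range(i, n):
            p = freqs[i] ** 2 if i == j else 2 * freqs[i] * freqs[j]
            genotype_probs.append(p)
    return sum(p * p for p in genotype_probs)

# Product rule: multiply one-locus MPs across loci. This is exactly the
# independence assumption whose accuracy the talk analyzes.
loci = [[0.2, 0.3, 0.5], [0.1, 0.4, 0.5]]
mp = 1.0
for f in loci:
    mp *= one_locus_match_prob(f)
print(mp)
```

The talk's point is that in a finite population, shared genealogy makes alleles at unlinked loci statistically dependent, so this product can deviate substantially from the true multi-locus MP as the number of loci grows.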
Title: Estimating human demographic parameters from DNA sequence data
We present a composite-likelihood framework for estimating demographic
parameters from DNA resequencing data. We then apply this method to estimate
split times and migration rates between different populations. Our results
suggest that population structure within Africa is quite old, and likely
predates the expansion of modern humans to other continents. We also outline how
this method can be used to detect ancient admixture events (e.g., between modern
humans and 'archaic' human groups such as Neandertals), and speculate that
ancient admixture may be a common feature in the history of many extant
populations.
Title: A fatgraph model of protein structure
The CATH database is a hierarchical classification of protein domain
structures with four main levels. Classification on the top level is relatively
easy but already at the second level, manual work is needed. We present a novel
method for describing domain structures based on concepts from algebraic
topology. Using the locations of the backbone atoms and the hydrogen bonds we
create a combinatorial object -- a so-called fatgraph -- which is then
transformed into a topological object. The topological object of our method does
not depend on any particular embedding in a Euclidean space, and this leads to
defining intrinsic quantities -- topological invariants -- of protein domains.
We have implemented algorithms to calculate these quantities and other
quantities of interest. We show some results for classification of domain
structures using topological invariants; even simple classification schemes
perform remarkably well. Apart from the model's use in protein classification,
it might eventually be used to guide structure prediction and structural
annotation of proteins.
Paul Marjoram, Keck School of Medicine, USC, USA
Bud Mishra, NYU, USA
Rasmus Nielsen, UC Berkeley, USA
Laxmi Parida, IBM T J Watson Research, USA
Noah Rosenberg, University of Michigan, USA
David Sankoff, University of Ottawa, Canada
Saad Sheikh, University of Illinois at Chicago, USA
Yun S Song, UC Berkeley, USA
Jeff Wall, UC San Francisco, USA
Carsten Wiuf, Aarhus University, Denmark
Document last modified on April 14, 2009.