Theses & Reports
You can find technical reports published prior to 1990 archived here.
-
TR2006-880
2006
Title: A Unified Construction of the Glushkov, Follow, and Antimirov Automata
Author(s): Allauzen, Cyril; Mohri, Mehryar
Abstract:
Many techniques have been introduced in the last few decades to create ε-free automata representing regular expressions: Glushkov automata, the so-called follow automata, and Antimirov automata. This paper presents a simple and unified view of all these ε-free automata for both unweighted and weighted regular expressions. It describes simple and general algorithms with running time complexities at least as good as those of the best previously known techniques, and provides concise proofs. The construction methods are all based on two standard automata algorithms: epsilon-removal and minimization. This contrasts with the multitude of complicated and special-purpose techniques and proofs put forward by others to construct these automata. Our analysis provides a better understanding of ε-free automata representing regular expressions: they all result from applying some combination of epsilon-removal and minimization to the classical Thompson automata. This makes it straightforward to generalize these algorithms to the weighted case, which also results in much simpler algorithms than existing ones. For weighted regular expressions over a closed semiring, we extend the notion of follow automata to the weighted case. We also present the first algorithm to compute the Antimirov automata in the weighted case.
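To make the epsilon-removal step concrete, the following is a minimal Python sketch, not the paper's algorithm: a generic epsilon-removal pass applied to a hand-built Thompson-style automaton for the expression (a|b)a*. The state numbering and the particular removal variant are assumptions made for the example; for this small expression the reachable result coincides with the Glushkov automaton, in line with the unified view described above.

    from collections import defaultdict

    # Hand-built Thompson automaton for (a|b)a*: start state 0, accepting state 9.
    eps = defaultdict(set, {0: {1, 3}, 2: {5}, 4: {5}, 5: {6}, 6: {7, 9}, 8: {7, 9}})
    delta = {(1, 'a'): 2, (3, 'b'): 4, (7, 'a'): 8}
    start, accept = 0, 9

    def eclosure(q):
        # All states reachable from q by epsilon-transitions alone.
        stack, seen = [q], {q}
        while stack:
            for r in eps[stack.pop()]:
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def remove_epsilons():
        # q --c--> r in the new automaton iff some p in eclosure(q) has p --c--> r;
        # q accepts iff its closure contains the old accepting state.
        new_delta, accepting = defaultdict(set), set()
        states, todo = {start}, [start]
        while todo:
            q = todo.pop()
            cl = eclosure(q)
            if accept in cl:
                accepting.add(q)
            for (p, c), r in delta.items():
                if p in cl:
                    new_delta[(q, c)].add(r)
                    if r not in states:
                        states.add(r)
                        todo.append(r)
        return states, dict(new_delta), accepting

    # -> states {0, 2, 4, 8}: one state per symbol occurrence plus the start state,
    #    i.e. the Glushkov automaton of (a|b)a*.
    print(remove_epsilons())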
-
TR2006-884
2006
Title: Invisible Safety of Distributed Protocols
Author(s): Balaban, Ittai; Pnueli, Amir; Zuck, Lenore
Abstract:
The method of ``Invisible Invariants'' has been applied successfully to protocols that assume a ``symmetric'' underlying topology, be it cliques, stars, or rings. In this paper we show how the method can be applied to proving safety properties of distributed protocols running under arbitrary topologies. Many safety properties of such protocols involve reachability predicates, which, at first glance, are beyond the scope of the Invisible Invariants method. To overcome this difficulty, we present a technique, called ``coloring,'' that allows us, in many instances, to replace the second-order reachability predicates by first-order predicates, resulting in properties that are amenable to Invisible Invariants, where ``reachable'' is replaced by ``colored.'' We demonstrate our techniques on several distributed protocols, including a variant of Luby's Maximal Independent Set protocol, the Leader Election protocol used in the IEEE 1394 (FireWire) distributed bus protocol, and various distributed spanning tree algorithms. All examples have been tested using the symbolic model checker TLV.
-
TR2006-885
2006
Title: Shape Analysis of Single-Parent Heaps
Author(s): Balaban, Ittai; Pnueli, Amir; Zuck, Lenore
Abstract:
We define the class of single-parent heap systems, which rely on a singly-linked heap in order to model destructive updates on tree structures. This encoding has the advantage of relying on a relatively simple theory of linked lists in order to support abstraction computation. To facilitate the application of this encoding, we provide a program transformation that, given a program operating on a multi-linked heap without sharing, transforms it into one over a single-parent heap. It is then possible to apply shape analysis by predicate and ranking abstraction as in [BPZ05]. The technique has been successfully applied to examples with trees of fixed arity (balancing of and insertion into a binary sort tree).
-
M.S. Thesis
2006
Title: TimeIn: A temporal visualization for file access
Candidate: Borden, Jeffrey
Advisor(s): Shasha, Dennis
Abstract:
TimeIn unifies a given set of file objects, drawn from a local file system or from flickr.com, into a single browsing experience, providing mechanisms to cluster visually similar objects and to display objects in a timeline view. To navigate this content, users are provided with a variety of mechanisms for filtering the set of objects presented.
For text-based objects, TimeIn will analyze the content of the file and attempt to extract a set of descriptive phrases. For image-based objects, TimeIn will annotate the object with the most frequently used colors of the image. Users have the option of augmenting these automatically generated tags by defining their own descriptive tags.
While providing novel features for browsing and searching content, TimeIn retains many of the original organizational features of existing systems. When content is imported from a hierarchical file system, users can still browse by the original hierarchical structure.
TimeIn also retains the PhotoSet structures associated with content imported from flickr.com. Users can also organize content into user-defined "albums" of objects. These albums can then be used to filter the set of objects on the timeline.
-
TR2006-883
2006
Title: On Transductive Regression
Author(s): Cortes, Corinna; Mohri, Mehryar
Abstract:
In many modern large-scale learning applications, the amount of unlabeled data far exceeds that of labeled data. A common instance of this problem is the 'transductive' setting where the unlabeled test points are known to the learning algorithm. This paper presents a study of regression problems in that setting. It presents 'explicit' VC-dimension error bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik when applied to classification. It also presents a new transductive regression algorithm inspired by our bound that admits a primal and kernelized closed-form solution and deals efficiently with large amounts of unlabeled data. The algorithm exploits the position of unlabeled points to locally estimate their labels and then uses a global optimization to ensure robust predictions. Our study also includes the results of experiments with several publicly available regression data sets with up to 20,000 unlabeled examples. The comparison with other transductive regression algorithms shows that it performs well and that it can scale to large data sets.
-
Ph.D. Thesis
2006
Title: Guaranteed Precision for Transcendental and Algebraic Computation Made Easy
Candidate: Du, Zilin
Advisor(s): Yap, Chee
Abstract:
Numerical non-robustness is a well-known phenomenon when implementing geometric algorithms. A general approach to achieve geometric robustness is Exact Geometric Computation (EGC). This dissertation explores the redesign and extension of Core Library, a C++ library which embraces the EGC approach. The contributions of this thesis are organized into three parts.
In the first part, we discuss the redesign of the Core Library, especially the expression class "Expr" and the bigfloat class "BigFloat". Our new design emphasizes extensibility in a clean and modular way. The three facilities in "Expr" (filter, root bound, and bigfloat) are separated into independent modules, which allows new filters, root bounds, and bigfloat substitutes to be plugged in. The key approximate evaluation and precision propagation algorithms have been greatly improved. A new bigfloat system based on MPFR and interval arithmetic has been incorporated. Our benchmark shows that the redesigned Core Library typically achieves a 5-10 times speedup. We also provide tools to facilitate extensions of "Expr" with new types of nodes, especially transcendental nodes.
Although the Core Library was originally designed for algebraic applications, transcendental functions are needed in many applications. In the second part, we present a complete algorithm for the absolute approximation of general hypergeometric functions, together with its complexity. The extension of this algorithm to ``blackbox numbers'' is provided. A general hypergeometric function package based on our algorithm is implemented and integrated into the Core Library following our new design.
Brent has shown that many elementary functions, such as $\exp, \log, \sin$, etc., can be efficiently computed using algorithms based on the Arithmetic-Geometric Mean (AGM). However, he only gave an asymptotic error analysis; the constants hidden in the Big $O(\cdot)$ notation, which are required for an implementation, are unknown. We provide a non-asymptotic error analysis of the AGM algorithm and the related algorithms for the logarithm and exponential functions. These algorithms have been implemented and incorporated into the Core Library.
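As a flavor of the AGM approach, here is a small double-precision sketch (an illustration only, unrelated to the bigfloat implementation described above; the scaling constants and iteration cap are assumptions made for the example). It uses Brent's identity ln(s) ≈ π / (2·AGM(1, 4/s)) for large s, scaling the argument up by a power of two and obtaining ln 2 from the same identity.

    import math

    def agm(a, b):
        # Arithmetic-geometric mean; each step roughly doubles the correct digits.
        for _ in range(60):
            if abs(a - b) <= 1e-16 * abs(a):
                break
            a, b = (a + b) / 2.0, math.sqrt(a * b)
        return (a + b) / 2.0

    def ln_large(s):
        # Brent: ln(s) = pi / (2 * AGM(1, 4/s)) + O(log(s) / s^2) for large s.
        return math.pi / (2.0 * agm(1.0, 4.0 / s))

    def ln_agm(x, shift_bits=60):
        # Scale x up by 2^m so the asymptotic identity applies, then remove m*ln(2),
        # with ln(2) itself taken from the same identity applied to 2^64.
        m = max(0, shift_bits - int(math.floor(math.log2(x))))
        ln2 = ln_large(2.0 ** 64) / 64
        return ln_large(x * 2.0 ** m) - m * ln2

    print(ln_agm(10.0), math.log(10.0))   # both close to 2.302585...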
-
Ph.D. Thesis
2006
Title: On Cryptographic Techniques for Digital Rights Management
Candidate: Fazio, Nelly
Advisor(s): Dodis, Yevgeniy
Abstract:
With more and more content being produced, distributed, and ultimately rendered and consumed in digital form, devising effective Content Protection mechanisms and building satisfactory Digital Rights Management (DRM) systems have become top priorities for the Publishing and Entertainment Industries.
To help tackle this challenge, several cryptographic primitives and constructions have been proposed, including mechanisms to securely distribute data over a unidirectional insecure channel (Broadcast Encryption), schemes in which leakage of cryptographic keys can be traced back to the leaker (Traitor Tracing), and techniques to combine revocation and tracing capabilities (Trace-and-Revoke schemes).
In this thesis, we present several original constructions of the above primitives, which improve upon existing DRM-enabling cryptographic primitives along the following two directions:
- Widening their scope of applicability, e.g., by considering models that take into account usability issues typical of the DRM setting; and
- Strengthening their security guarantees to the higher levels that are standard, for example, for stand-alone encryption.
Our results along the first line of work include the following:
- An efficient public-key broadcast encryption scheme, which allows mutually mistrusting content providers to leverage a common delivery infrastructure, and can cope with low-end, stateless receivers;
- A traitor tracing scheme with optimal transmission rate, in which encryption does not cause a blow-up in the size of the content, thus allowing for optimal utilization of the broadcast channel;
- A public-key tracing and revoking scheme that can deal with both server-side and client-side scalability issues, while preserving traceability.
As for the second direction, our contribution can be divided as follows:
- A forward-secure public-key broadcast encryption scheme, in which the unauthorized access resulting from cracking a user-key is constrained to a minimal time frame which is delimited, in the future, by the revocation mechanism, and in the past, by forward secrecy;
- A precise formalization of the notion of adaptive chosen-ciphertext security for public-key broadcast encryption schemes, along with a modular and efficient construction.
Overall, the cryptographic tools developed in this thesis provide more flexibility and more security than existing solutions, and thus offer a better match for the challenges of the DRM setting.
-
Ph.D. Thesis
2006
Title: Finding Your Match: Techniques for Improving Sequence Alignment in DNA and RNA
Candidate: Gill, Ofer Hirsch
Advisor(s): Mishra, Bud
Abstract:
In Bioinformatics, finding correlations between species allows us to better understand the important biological functions of those species and to trace their evolution. This thesis considers sequence alignment, a method for obtaining these correlations. We improve upon sequence alignment tools designed for DNA with Plains, an algorithm that uses piecewise-linear gap functions and parameter optimization to obtain correlations in remotely related species pairs such as human and fugu, using reasonable amounts of time and space on an ordinary computer. We then discuss Planar, which is similar to Plains, but is designed for aligning RNA and accounts for secondary structure. We also explore SEPA, a tool that uses p-value estimation based on exhaustive empirical data to better emphasize key results from an alignment with a measure of reliability. Using SEPA to measure the quality of an alignment, we proceed to compare Plains and Planar against similar alignment tools, emphasizing the interesting correlations caught in the process.
-
M.S. Thesis
2006
Title: Kronosphere: A temporal visualization for file access
Candidate: Harrison, Chris
Advisor(s): Shasha, Dennis
Abstract:
Hierarchical file systems mirror the way people organize data in the real world. However, this method of organization is often inadequate in managing the immense number of files that populate hard drives. Kronosphere provides a novel time and content-based navigational paradigm for managing and accessing media. This allows users to browse their documents by time, content, history, metadata, and relationships with other files.
-
Ph.D. Thesis
2006
Title: DataSlicer: A Hosting Platform for Data-Centric Network Services
Candidate: He, Congchun
Advisor(s): Karamcheti, Vijay
Abstract:
As the Web evolves, the number of network services deployed on the Internet has been growing at a dramatic pace. Such services usually involve a massive volume of data stored in physical or virtual back-end databases, and access the data to dynamically generate responses for client requests. These characteristics restrict use of traditional mechanisms for improving service performance and scalability: large volumes prevent replication of the service data at multiple sites required by content distribution schemes, while dynamic responses do not support the reuse required by web caching schemes.
However, many deployed data-centric network services share other properties that can help alleviate this situation: (1) service usage patterns exhibit locality of various forms, and (2) services are accessed using standard protocols and publicly known message structures. When properly exploited, these characteristics enable the design of alternative caching infrastructures, which leverage distributed network intermediaries to inspect traffic flowing between clients and services, infer locality information dynamically, and potentially improve service performance by taking actions such as partial service replication, request redirection, or admission control.
This dissertation investigates the nature of locality in service usage patterns for two well-known web services, and reports on the design, implementation, and evaluation of such a network intermediary architecture, named DataSlicer. DataSlicer incorporates four main techniques: (1) Service-neutral request inspection and locality detection on distributed network intermediaries; (2) Construction of oriented overlays for clustering client requests; (3) Integrated load-balancing and service replication mechanisms that improve service performance and scalability by either redistributing the underlying traffic in the network or creating partial service replicas on demand at appropriate network locations; and (4) Robustness mechanisms to maintain system stability in a wide-area network environment.
DataSlicer has been successfully deployed on the PlanetLab network. Extensive experiments using synthetic workloads show that our approach can: (1) create appropriate oriented overlays to cluster client requests according to multiple application metrics; (2) detect locality information across multiple dimensions and granularity levels; (3) leverage the detected locality information to perform appropriate load-balancing and service replication actions with minimal cost; and (4) ensure robust behavior in the face of dynamically changing network conditions.
-
Ph.D. Thesis
2006
Title: Multimarker Genetic Analysis Methods for High Throughput Array Data
Candidate: Ionita, Iuliana
Advisor(s): Mishra, Bud
Abstract:
In this thesis, we focus on multi-marker/-locus statistical methods for analyzing high-throughput array data used for the detection of genes implicated in complex disorders. There are two main parts: the first part concerns the localization of cancer genes from copy number variation data, with an application to lung cancer; the second part concerns the localization of disease genes using an affected-sib-pair design, with an application to inflammatory bowel disease. A third part addresses an important issue involved in the design of these disease-gene-detection studies. More details follow:
1. Detection of Oncogenes and Tumor Suppressor Genes using Multipoint Statistics from Copy Number Variation Data
ArrayCGH is a microarray-based comparative genomic hybridization technique that has been used to compare a tumor genome against a normal genome, thus providing rapid genomic assays of tumor genomes in terms of copy number variations of chromosomal segments that have been gained or lost. When properly interpreted, these assays are likely to shed important light on genes and mechanisms involved in the initiation and progression of cancer. Specifically, chromosomal segments amplified or deleted in a group of cancer patients point to locations of cancer genes. We describe a statistical method to estimate the location of such genes by analyzing segmental amplifications and deletions in the genomes of cancer patients and the spatial relation of these segments to any specific genomic interval. The algorithm assigns to a genomic segment a score that parsimoniously captures the underlying biology. It computes a p-value for every putative disease gene by using results from the theory of scan statistics. We have validated our method using simulated datasets, as well as a real dataset on lung cancer.
2. Multi-locus Linkage Analysis of Affected-Sib-Pairs
The affected-sib-pair (ASP) design is a simple and popular design in the linkage analysis of complex traits. Traditional ASP methods evaluate the linkage information at a locus by considering only the marginal linkage information present at that locus. However, complex traits are influenced by multiple genes that interact to increase the risk of disease. We describe a multi-locus linkage method that uses both the marginal information and information derived from the possible interactions among several disease loci, thereby increasing the significance of loci with modest marginal effects. Our method is based on a statistic that quantifies the linkage information contained in a set of markers. By a marker selection-reduction process, we screen a set of polymorphisms and select a few that seem linked to disease. We test our approach on simulated data and on genome-scan data for inflammatory bowel disease. We show that our method is expected to be more powerful than single-locus methods in detecting disease loci responsible for complex traits.
3. A Practical Haplotype Inference Algorithm
We consider the problem of efficient inference algorithms to determine the haplotypes and their distribution from a dataset of unrelated genotypes.
With the currently available catalogue of single-nucleotide polymorphisms (SNPs) and given their abundance throughout the genome (one in about $500$ bps) and low mutation rates, scientists hope to significantly improve their ability to discover genetic variants associated with a particular complex trait. We present a solution to a key intermediate step by devising a practical algorithm that has the ability to infer the haplotype variants for a particular individual from its own genotype SNP data in relation to population data. The algorithm we present is simple to describe and implement; it makes no assumption such as perfect phylogeny or the availability of parental genomes (as in trio-studies); it exploits locality in linkages and low diversity in haplotype blocks to achieve a linear time complexity in the number of markers; it combines many of the advantageous properties and concepts of other existing statistical algorithms for this problem; and finally, it outperforms competing algorithms in computational complexity and accuracy, as demonstrated by the studies performed on real data and synthetic data.
-
TR2006-882
2006
Title: A FETI-DP algorithm for elasticity problems with mortar discretization on geometrically non-conforming partitions
Author(s): Kim, Hyea Hyun
Abstract:
In this paper, a FETI-DP formulation for three dimensional elasticity on non-matching grids over geometrically non-conforming subdomain partitions is considered. To resolve the nonconformity of the finite elements, a mortar matching condition is imposed on the subdomain interfaces (faces). A FETI-DP algorithm is then built by enforcing the mortar matching condition in dual and primal ways. In order to make the FETI-DP algorithm scalable, a set of primal constraints, which include average and momentum constraints over interfaces, are selected from the mortar matching condition. A condition number bound, $C(1+\log(H/h))^2$, is then proved for the FETI-DP formulation for the elasticity problems with discontinuous material parameters. Only some faces need to be chosen as primal faces on which the average and momentum constraints are imposed.
-
Ph.D. Thesis
2006
Title: Expressive Motion
Candidate: Lees, Alyssa
Advisor(s): Bregler, Christopher; Geiger, Davi
Abstract:
Since the advent of motion capture animation, attempts have been made to extract the seemingly nebulously defined attributes of 'content' and 'style' from the motion data. Motion capture offers abundant benefits for animation, enabling quick access to highly precise data. Yet manipulating the expressive attributes of the motion data in a comprehensive manner has proved elusive. This dissertation proposes practical solutions based on insights from the dance community and on attributes learned from the motion data itself. The culminating project is a system which learns the deformations of the human body and reapplies them in exaggerated form for enhanced expressivity.
The result, developed alongside efficient and usable tools for animators, is a three-pronged technique to enhance the expressive qualities of motion capture animation. The key aspect is the creation of a deformable skeleton representation of the human body using a unique machine learning approach. The deformable skeleton is modeled by replicating the actual movements of the human spine. The second step relies on exploiting subtle aspects of motion, such as hand movement, to create an emotional effect visually. Both of these approaches involve exaggerating the movements in the same vein as the traditional 2-D animation technique of 'squash and stretch'. Finally, a novel technique for the application of style to a baseline motion capture sequence is developed.
All of these approaches are rooted in machine learning techniques. Linear discriminant analysis was initially applied to a single phrase of motion demonstrating various style characteristics in Laban notation. A variety of methods, including nonlinear PCA and LLE, were used to learn the underlying manifold of spine movements. Nonlinear dynamic models were learned in attempts to describe motion segments versus single phrases. In addition, the dissertation focuses on the variety of obstacles in learning with motion data. These include the correct parameterization of angles, applying statistical analysis to quaternions, and appropriate distance measures between postures.
-
Ph.D. Thesis
2006
Title: Building Trustworthy Storage Services out of Untrusted Infrastructure
Candidate: Li, Jinyuan
Advisor(s): Mazieres, David
Abstract:
As the Internet has become increasingly ubiquitous, it has seen tremendous growth in the popularity of online services. These services range from online CVS repositories like SourceForge and shopping sites to online financial and administrative systems. It is critical for these services to provide correct and reliable execution for clients. However, given their attractiveness as targets and their ubiquitous accessibility, online servers also have a significant chance of being compromised, leading to Byzantine failures.
Designing and implementing a service to run on a machine that may be compromised is not an easy task, since infrastructure under malicious control may behave arbitrarily. Even worse, as any monitoring facility may also be subverted at the same time, there is no easy way for system behavior to be audited, or for malicious attacks to be detected.
We propose a solution to this problem by reducing the trust needed on the server side in the first place. In other words, our system is designed specifically for running on untrusted hosts. In this thesis, we realize this principle through two different approaches. First, we design and implement a new network file system -- SUNDR. In SUNDR, malicious servers cannot forge users' operations or tamper with their data without being detected. In the worst case, attackers can only conceal users' operations from each other. Still, SUNDR is able to detect this misbehavior whenever users communicate with each other directly.
The limitation of the approach above is that the system cannot guarantee ideal consistency in the presence of even a single failure. In the second approach, we use replicated state machines to tolerate some fraction of malicious server failures, which is termed Byzantine Fault Tolerance (BFT) in the literature. Classical BFT systems assume that fewer than 1/3 of the replicas are malicious in order to provide ideal consistency. In this thesis, we push the boundary from 1/3 to 2/3. With fewer than 1/3 of replicas faulty, we provide the same guarantees as classical BFT systems. Additionally, we guarantee weaker consistency, instead of arbitrary behavior, when between 1/3 and 2/3 of replicas fail.
-
Ph.D. Thesis
2006
Title: Measures for Robust Stability and Controllability
Candidate: Mengi, Emre
Advisor(s): Overton, Michael
Abstract:
A linear time-invariant dynamical system is robustly stable if the system as well as all of its nearby systems in a neighborhood of interest are stable. An important property of robustly stable systems is that they decay asymptotically without exhibiting significant transient behavior. The first part of this thesis work focuses on measures revealing the degree of robust stability of a dynamical system. We put special emphasis on pseudospectral measures, those based on the eigenvalues of nearby matrices for a first-order system or matrix polynomials for a higher-order system. We present algorithms for the computation of pseudospectral measures for continuous and discrete systems with quadratic rate of convergence and analyze their accuracy in the presence of rounding errors. We also provide an efficient algorithm for the numerical radius of a matrix, the modulus of the outermost point in the field of values (the set of Rayleigh quotients) of the matrix. These algorithms are inspired by algorithms of Byers, Boyd-Balakrishnan and Burke-Lewis-Overton.
The second part is devoted to indicators of robust controllability. We call a system robustly controllable if it is controllable and remains controllable under perturbations of interest. We describe efficient methods for the computation of the distance to the closest uncontrollable system. Our first algorithm for the first-order distance to uncontrollability depends on a grid and is well-suited for low precision approximation. We then discuss algorithms for high precision approximation of the first-order distance to uncontrollability. These are based on the bisection method of Gu and the trisection variant of Burke-Lewis-Overton.
These algorithms require the extraction of the real eigenvalues of matrices of size $O(n^2)$ typically at a cost of $O(n^6)$, where $n$ is the dimension of the state space. We propose a new divide-and-conquer algorithm that reduces the cost to $O(n^4)$ on average in both theory and practice and $O(n^5)$ in the worst case. The new iterative approach to the extraction of real eigenvalues may also be useful in other contexts. For higher-order systems we derive a singular value characterization and exploit this characterization for the computation of the higher-order distance to uncontrollability to low precision. The algorithms in this thesis assume arbitrary complex perturbations are applicable to the input system and usually require the extraction of the imaginary eigenvalues of Hamiltonian matrices (or even matrix polynomials) or the unit eigenvalues of symplectic pencils (or palindromic matrix polynomials).
-
Ph.D. Thesis
2006
Title: Algorithmic Algebraic Model Checking: Hybrid Automata & Systems Biology
Candidate: Mysore, Venkatesh Pranesh
Advisor(s): Mishra, Bud
Abstract:
Systems Biology strives to hasten our understanding of the fundamental principles of life by adopting a systems-level approach for the analysis of cellular function and behavior. One popular framework for capturing the chemical kinetics of interacting biochemicals is Hybrid Automata. Our goal in this thesis is to aid Systems Biology research by improving the current understanding of hybrid automata, by developing techniques for symbolic rather than numerical analysis of the dynamics of biochemical networks modeled as hybrid automata, and by honing the theory to two classes of problems: kinetic mass action based simulation in genetic regulatory & signal transduction pathways, and pseudo-equilibrium simulation in metabolic networks.
We first provide new constructions that prove that the "open" Hierarchical Piecewise Constant Derivative (HPCD) subclass is closer to the decidability and undecidability frontiers than was previously understood. After concluding that the HPCD-like classes are unsuitable for modeling chemical reactions, our quest for semi-decidable subclasses leads us to define the "semi-algebraic" subclass. This is the most expressive hybrid automaton subclass amenable to rigorous symbolic temporal reasoning. We begin with the bounded reachability problem, and then show how the dense-time temporal logic Timed Computation Tree Logic (TCTL) can be model-checked by exploiting techniques from real algebraic geometry, primarily real quantifier elimination. We also prove the undecidability of reachability in the Blum-Shub-Smale Turing Machine formalism. We then develop efficient approximation strategies by extending bisimulation partitioning, rectangular grid-based approximation, polytopal approximation and time discretization. We then develop a uniform algebraic framework for modeling biochemical and metabolic networks, also extending flux balance analysis. We present some preliminary results using a prototype tool, Tolque, a symbolic algebraic dense-time model checker for semi-algebraic hybrid automata that uses Qepcad for quantifier elimination.
The "Algorithmic Algebraic Model Checking" techniques developed in this thesis present a theoretically-grounded mathematically-sound platform for powerful symbolic temporal reasoning over biochemical networks and other semi-algebraic hybrid automata. It is our hope that by building upon this thesis, along with the development of computationally efficient parallelizable quantifier elimination algorithms and the integration of different computer algebra tools, scientific software systems will emerge that fundamentally transform the way biochemical networks (and other hybrid automata) are analyzed.
-
Ph.D. Thesis
2006
Title: Building an Automatic Phenotyping System of Developing Embryos
Candidate: Ning, Feng
Advisor(s): LeCun, Yann
Abstract:
This dissertation presents a learning-based system for the detection, identification, localization, and measurement of various sub-cellular structures in microscopic images of developing embryos. The system analyzes sequences of images obtained through DIC microscopy and detects cell nuclei, cytoplasm, and cell walls automatically. The system described in this dissertation is the key initial component of a fully automated phenotype analysis system.
Our study primarily concerns the early stages of development of C. elegans nematode embryos, from fertilization to the four-cell stage. The method proposed in this dissertation consists in learning the entire processing chain {\em from end to end}, from raw pixels to ultimate object categories.
The system contains three modules: (1) a convolutional network trained to classify each pixel into five categories: cell wall, cytoplasm, nuclear membrane, nucleus, outside medium; (2) an Energy-Based Model which cleans up the output of the convolutional network by learning local consistency constraints that must be satisfied by label images; (3) A set of elastic models of the embryo at various stages of development that are matched to the label images.
When observing normal (wild type) embryos it is possible to visualize important cellular functions such as nuclear movements and fusions, cytokinesis and the setting up of crucial cell-cell contacts. These events are highly reproducible from embryo to embryo. The events will deviate from normal behaviors when the function of a specific gene is perturbed, therefore allowing the detection of correlations between genes activities and specific early embryonic events. One important goal of the system is to automatically detect whether the development is normal (and therefore, not particularly interesting), or abnormal and worth investigating. Another important goal is to automatically extract quantitative measurements such as the migration speed of the nuclei and the precise time of cell divisions.
-
Ph.D. Thesis
2006
Title: A Polymorphic Type System and Compilation Scheme for Record Concatenation
Candidate: Osinski, Edward
Advisor(s): Goldberg, Benjamin
Abstract:
The notion of records, which are used to organize closely related groups of data so that the group can be treated as a unit while the data within remain accessible by name, is almost universally supported in programming languages. However, in virtually all cases, the operations permitted on records in statically typed languages are extremely limited. Providing greater flexibility in dealing with records, while simultaneously retaining the benefits of static type checking, is a desirable goal.
This problem has generated considerable interest, and a number of type systems dealing with records have appeared in the literature. In this work, we present the first polymorphic type system that is expressive enough to type a number of complex operations on records, including three forms of concatenation and natural join. In addition, the precise types of the records involved are inferred, to eliminate the burden of explicit type declarations. Another aspect of this problem is an efficient implementation of records and their associated operations. We also present a compilation method which accomplishes this goal.
-
Ph.D. Thesis
2006
A Probabilistic Learning Approach to Attribute Value Inconsistency Resolution
Pevzner, Ilya
Abstract
|
PDF
Title: A Probabilistic Learning Approach to Attribute Value Inconsistency Resolution
Candidate: Pevzner, Ilya
Advisor(s): Goldberg, Arthur
Abstract:
Resolving inconsistencies in data is a problem of critical practical importance. Inconsistent data arises whenever an attribute takes on multiple, inconsistent, values. This may occur when a particular entity is stored multiple times in one database, or in multiple databases that are combined.
We investigate Attribute Value Inconsistency Resolution (AVIR), the problem of semi-automatically resolving data inconsistencies among multiple database records that describe the same person or thing.
Our survey of the area shows that existing solutions are either limited in scope or impose a significant burden on their users. Either they do not cover all types of inconsistencies and attributes, or they require users to write or choose attribute resolution functions for each potentially conflicting attribute.
Our machine-learning-based approach applies to all types of inconsistencies and attributes, and automatically selects appropriate resolution functions based on the conflicting data. We have invented and developed a system that uses a set of binary features, which detect data properties and relationships, and a set of resolution functions, which merge data. Many such features and resolution functions have been written. The system uses supervised learning with maximum likelihood estimation to determine which function(s) to apply, based on which feature(s) fire.
We have validated our system by comparing its error rate, decision rate and decision accuracy on a test data set to baseline values determined by a clairvoyant application of a standard approach where each potentially conflicting attribute is resolved by the best resolution function for the attribute.
-
TR2006-881
2006
Title: PSL Model Checking and Run-time Verification via Testers
Author(s): Pnueli, Amir; Zaks, Aleksandr
Abstract:
The paper introduces the construct of \emph{temporal testers} as a compositional basis for the construction of automata corresponding to temporal formulas in the PSL logic. Temporal testers can be viewed as (non-deterministic) transducers that, at any point, output a boolean value which is 1 iff the corresponding temporal formula holds starting at the current position.
The main advantage of testers, compared to acceptors (such as Büchi automata), is that they are compositional. Namely, a tester for a compound formula can be constructed out of the testers for its sub-formulas. In this paper, we extend the application of the testers method from LTL to the logic PSL.
Besides providing the construction of testers for PSL, we indicate how the symbolic representation of the testers can be directly utilized for efficient model checking and run-time monitoring.
-
Ph.D. Thesis
2006
Title: Animating Autonomous Pedestrians
Candidate: Shao, Wei
Advisor(s): Terzopoulos, Demetri
Abstract:
This thesis addresses the difficult open problem in computer graphics of autonomous human modeling and animation, specifically of emulating the rich complexity of real pedestrians in urban environments.
We pursue an artificial life approach that integrates motor, perceptual, behavioral, and cognitive components within a model of pedestrians as highly capable individuals. Our comprehensive model features innovations in these components, as well as in their combination, yielding results of unprecedented fidelity and complexity for fully autonomous multi-human simulation in large urban environments. Our pedestrian model is entirely autonomous and requires no centralized, global control whatsoever.
To animate a variety of natural interactions between numerous pedestrians and their environment, we represent the environment using hierarchical data structures, which efficiently support the perceptual queries of the autonomous pedestrians that drive their behavioral responses and sustain their ability to plan their actions on local and global scales.
The animation system that we implement using the above models enables us to run long-term simulations of pedestrians in large urban environments without manual intervention. Real-time simulation can be achieved for well over a thousand autonomous pedestrians. With each pedestrian under his/her own autonomous control, the self-animated characters imbue the virtual world with liveliness, social (dis)order, and a realistically complex dynamic.
We demonstrate the automated animation of human activity in a virtual train station, and we employ our pedestrian simulator in the context of virtual archaeology for visualizing urban social life in reconstructed archaeological sites. Our pedestrian simulator is also serving as the basis of a testbed for designing and experimenting with visual sensor networks in the field of computer vision.
-
Ph.D. Thesis
2006
Title: Complexity Analysis of Algorithms in Algebraic Computation
Candidate: Sharma, Vikram
Advisor(s): Yap, Chee
Abstract:
Numerical computations with real algebraic numbers require algorithms for approximating and isolating real roots of polynomials. A classical choice for root approximation is Newton's method. For an analytic function on a Banach space, Smale introduced the concept of approximate zeros, i.e., points from which Newton's method for the function converges quadratically. To identify these approximate zeros he gave computationally verifiable convergence criteria called point estimates. However, in developing these results Smale assumed that Newton's method is computed exactly. For a system of $n$ homogeneous polynomials in $n+1$ variables, Malajovich developed point estimates for a different definition of approximate zero, assuming that all operations in Newton's method are computed with fixed precision. In the first half of this dissertation, we develop point estimates for these two different definitions of approximate zeros of an analytic function on a Banach space, but assume the strong bigfloat computational model of Brent, i.e., where all operations involve bigfloats with varying precision. In this model, we derive a uniform complexity bound for approximating a root of a zero-dimensional system of $n$ integer polynomials in $n$ variables. We also derive a non-asymptotic bound, in terms of the condition number of the system, on the precision required to implement the robust Newton method.
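As a reminder of the behavior that point estimates certify, the following is a tiny floating-point illustration of quadratic convergence of Newton's method (double precision only, unrelated to the bigfloat model used in the thesis; the function and starting point are arbitrary choices):

    import math

    def newton_steps(f, df, x, steps=5):
        # One Newton update per step: x <- x - f(x)/df(x).
        for _ in range(steps):
            x = x - f(x) / df(x)
            yield x

    target = math.sqrt(2.0)
    for x in newton_steps(lambda t: t * t - 2.0, lambda t: 2.0 * t, 1.5):
        print(f"{x:.16f}   error {abs(x - target):.1e}")
    # The number of correct digits roughly doubles at every step.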
The second part of the dissertation analyses the worst-case complexity of two algorithms for isolating real roots of a square-free polynomial with real coefficients: the Descartes method and Akritas' continued fractions algorithm. The analysis of both algorithms is based upon amortization bounds such as the Davenport-Mahler bound. For the Descartes method, we give a unified framework that encompasses both the power basis and the Bernstein basis variant of the method; we derive an $O(n(L+\log n))$ bound on the size of the recursion tree obtained by applying the method to a square-free polynomial of degree $n$ with integer coefficients of bit-length $L$; the bound is tight for $L=\Omega(\log n)$. Based upon this result we readily obtain the best known bit-complexity bound of $\wt{O}(n^4L^2)$ for the Descartes method, where $\wt{O}$ means we ignore logarithmic factors. Similar worst-case bounds on the bit-complexity of Akritas' algorithm were not known in the literature. We provide the first such bound, $\wt{O}(n^{12}L^3)$, for a square-free integer polynomial of degree $n$ and coefficients of bit-length $L$.
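For illustration, here is a small exact-arithmetic sketch of the Descartes method in the power basis (a simplified rendering of the classical algorithm, not the thesis's framework or implementation; the example polynomial is an assumption chosen so that all roots lie in (0, 1)):

    from fractions import Fraction

    def sign_variations(coeffs):
        signs = [c for c in coeffs if c != 0]
        return sum(1 for u, v in zip(signs, signs[1:]) if u * v < 0)

    def taylor_shift(coeffs, a):
        # Coefficients of p(x + a); coefficients are listed in increasing degree.
        c = list(coeffs)
        for i in range(len(c)):
            for j in range(len(c) - 2, i - 1, -1):
                c[j] += a * c[j + 1]
        return c

    def descartes_bound(coeffs):
        # Sign variations of (1+x)^n p(1/(1+x)): an upper bound (exact when it
        # is 0 or 1) on the number of roots of p in the open interval (0, 1).
        return sign_variations(taylor_shift(coeffs[::-1], 1))

    def isolate(coeffs, lo=Fraction(0), hi=Fraction(1)):
        # Isolating intervals for the roots in (lo, hi); coeffs describe p after
        # rescaling so that (lo, hi) corresponds to (0, 1).
        v = descartes_bound(coeffs)
        if v == 0:
            return []
        if v == 1:
            return [(lo, hi)]
        mid = (lo + hi) / 2
        left = [c / Fraction(2) ** i for i, c in enumerate(coeffs)]   # p(x/2)
        right = taylor_shift(left, 1)                                 # p((x+1)/2)
        found = isolate(left, lo, mid) + isolate(right, mid, hi)
        if right[0] == 0:                      # exact root at the midpoint
            found.append((mid, mid))
        return found

    # p(x) = (x - 1/4)(x - 1/3)(x - 3/4), expanded in the power basis.
    p = [Fraction(-1, 16), Fraction(25, 48), Fraction(-4, 3), Fraction(1)]
    print(isolate(p))   # intervals isolating 1/3 and 3/4, plus the exact root 1/4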
-
Ph.D. Thesis
2006
Title: Pairwise Comparison between Genomic Sequences and Optical-Maps
Candidate: Sun, Bing
Advisor(s): Mishra, Bud
Abstract:
With the development and improvement of high-throughput experimental technologies, massive amounts of biological data, including genomic sequences and optical maps, have been collected for various species. Comparative techniques play a central role in investigating the adaptive significance of organismal traits and revealing evolutionary relations among organisms by comparing these biological data. This dissertation presents two efficient comparative analysis tools used in comparative genomics and comparative optical-map study, respectively.
A complete genome sequence of an organism can be viewed as its ultimate genetic map, in the sense that the heritable information are encoded within the DNA and the order of nucleotides along chromosomes is known. Comparative genomics can be applied to find functional sites by comparing genetic maps. Comparing vertebrate genomes requires efficient cross-species sequence alignment programs. The first tool introduced in this thesis is COMBAT (Clean Ordered Mer-Based Alignment Tool), a new mer-based method which can search rapidly for highly similar translated genomic sequences using the stable-marriage algorithm (SM) as an alignment filter. In experiments COMBAT is applied to comparative analysis between yeast genomes, and between the human genome and the recently published bovine genome. The homologous blocks identified by COMBAT are comparable with the alignments produced by BLASTP and BLASTZ.
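For readers unfamiliar with the stable-marriage primitive used as the alignment filter, the following is a textbook Gale-Shapley sketch in Python; how COMBAT derives preference lists from mer matches is not shown, and the toy preferences below are invented for illustration.

    def stable_marriage(proposer_prefs, reviewer_prefs):
        # proposer_prefs[p] and reviewer_prefs[r] are preference lists (best first).
        # Returns a stable matching as a dict {proposer: reviewer}.
        rank = {r: {p: i for i, p in enumerate(prefs)}
                for r, prefs in reviewer_prefs.items()}
        next_choice = {p: 0 for p in proposer_prefs}
        engaged = {}                       # reviewer -> proposer
        free = list(proposer_prefs)
        while free:
            p = free.pop()
            r = proposer_prefs[p][next_choice[p]]
            next_choice[p] += 1
            if r not in engaged:
                engaged[r] = p
            elif rank[r][p] < rank[r][engaged[r]]:
                free.append(engaged[r])    # r trades up; old partner is free again
                engaged[r] = p
            else:
                free.append(p)             # r rejects p; p will propose again
        return {p: r for r, p in engaged.items()}

    # Toy example with three "mers" on each side of the alignment.
    print(stable_marriage(
        {'a': ['x', 'y', 'z'], 'b': ['y', 'x', 'z'], 'c': ['x', 'z', 'y']},
        {'x': ['b', 'a', 'c'], 'y': ['a', 'b', 'c'], 'z': ['c', 'b', 'a']}))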
When genetic maps are not available, other genomic maps, including optical-maps, can be constructed. An optical map is an ordered enumeration of the restriction sites along with the estimated lengths of the restriction fragments between consecutive restriction sites. CAPO (Comparative Analysis and Phylogeny with Optical-Maps), introduced as a second technique in this thesis, is a tool for inferring phylogeny based on pairwise optical map comparison and bipartite graph matching. CAPO combines the stable matching algorithm with either the Unweighted Pair Group Method with Arithmetic Averaging (UPGMA) or the Neighbor-Joining (NJ) method for constructing phylogenetic trees. This new algorithm is capable of constructing phylogenetic trees in logarithmic steps and performs well in practice. Using optical maps constructed in silico and in vivo, our work shows that both UPGMA-flavored trees and the NJ-flavored trees produced by CAPO share substantial overlapping tree topology and are biologically meaningful.
-
Ph.D. Thesis
2006
Title: Exploiting Service Usage Information for Optimizing Server Resource Management
Candidate: Totok, Alexander
Advisor(s): Karamcheti, Vijay
Abstract:
It is difficult to provision and manage modern component-based Internet services so that they provide stable quality-of-service (QoS) guarantees to their clients, because: (1) component middleware are complex software systems that expose several independently tuned configurable application runtime policies and server resource management mechanisms; (2) session-oriented client behavior with complex data access patterns makes it hard to predict what impact tuning these policies and mechanisms has on application behavior; (3) component-based Internet services exhibit complex structural organization with requests of different types accessing different components and data sources, which could be distributed and/or replicated for failover, performance, or business purposes.
This dissertation attempts to alleviate this situation by targeting three interconnected goals: (1) providing improved QoS guarantees to the service clients, (2) optimizing server resource utilization, and (3) providing application developers with guidelines for natural application structuring, which enable efficient use of the proposed mechanisms for improving service performance. Specifically, we explore the thesis that exposing and using detailed information about how clients use component-based Internet services enables mechanisms that achieve the range of goals listed above. To validate this thesis we show its applicability to the following four problems: (1) maximizing reward brought by Internet services, (2) optimizing utilization of server resource pools, (3) providing session data integrity guarantees, and (4) enabling service distribution in wide-area environments.
The techniques that we propose for the identified problems are applicable at both the application structuring stage and the application operation stage, and range from automatic (i.e., performed by middleware in real time) to manual (i.e., involve the programmer, or the service provider). These techniques take into account service usage information exposed at different levels, ranging from high-level structure of user sessions to low level information about data access patterns and resource utilization by requests of different types. To show the benefits of the proposed techniques, we implement various middleware mechanisms in the JBoss application server, which utilizes the J2EE component model, and comprehensively evaluate them on several publicly-available sample J2EE applications - Java Pet Store, RUBiS, and our own implementation of the TPC-W web transactional benchmark. Our experimental results show that the proposed techniques achieve optimal utilization of server resources and improve application performance by up to two times for centralized Internet services and by up to 6 times for distributed ones.
-
Ph.D. Thesis
2006
Title: Time Series Matching: A Multi-Filter Approach
Candidate: Wang, Zhihua
Advisor(s): Shasha, Dennis
Abstract:
Data arriving in time order (time series) arises in disciplines ranging from music to meteorology to finance to motion capture, to name a few. In many cases, a natural way to query the data is what we call time series matching - a user enters a time series by hand, keyboard or voice and the system finds "similar" time series.
Existing time series similarity measures, such as DTW (Dynamic Time Warping), can accommodate certain timing errors in the query and perform with high accuracy on small databases. However, they all have high computational complexity, and their accuracy drops dramatically as the data set grows. More importantly, there are types of errors that cannot be captured by a single similarity measure.
Here we present a general time series matching framework. This framework can easily optimize, combine and test different features to execute a fast similarity search based on the application's requirements. We use a multi-filter chain and boosting algorithms to compose a ranking algorithm. Each filter is a classifier which removes bad candidates by comparing certain features of the time series data. Some filters use a boosting algorithm to combine a few different weak classifiers into a strong classifier. The final filter gives a ranked list of candidates in the reference data that match the query data.
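A minimal sketch of the filter-chain idea follows (illustrative only; the actual filters, features, and boosted combinations are those described above, not these toy ones):

    def filter_chain(query, candidates, filters):
        # Each filter keeps only the candidates it accepts, so cheap coarse
        # filters should come first and expensive, precise ones last.
        for keep in filters:
            candidates = [c for c in candidates if keep(query, c)]
        return candidates

    # Toy filters over numeric time series represented as lists of floats.
    def similar_length(q, c):
        return abs(len(q) - len(c)) <= 2

    def similar_mean(q, c):
        return abs(sum(q) / len(q) - sum(c) / len(c)) <= 1.0

    query = [1.0, 2.0, 3.0, 2.0, 1.0]
    candidates = [[1.0, 2.1, 2.9, 2.2, 1.1], [9.0, 9.0, 9.0, 9.0, 9.0], [1.0, 2.0]]
    print(filter_chain(query, candidates, [similar_length, similar_mean]))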
The framework is applied to build query algorithms for a Query-by-Humming system. Experiments show that the algorithm has a more accurate similarity measure and its response time increases much slower than the pure DTW algorithm when the number of songs in the database increases from 60 to 1400.
-
Ph.D. Thesis
2006
Title: Incremental Web Search: Tracking Changes in the Web
Candidate: Wang, Ziyang
Advisor(s): Davis, Ernest
Abstract:
A large amount of new information is posted on the Web every day. Large-scale web search engines often update their indexes slowly and are unable to present such information in a timely manner. Here we present our solutions for searching new information on the web by tracking changes to web documents.
First, we present the algorithms and techniques useful for solving the following problems: detecting web pages that have changed, extracting changes from different versions of a web page, and evaluating the significance of web changes. We propose a two-level change detector: MetaDetector and ContentDetector. The combined detector successfully reduces network traffic by about 67%. Our algorithm for extracting web changes consists of three steps: document tree construction, document tree encoding and tree matching. It has linear time complexity and effectively extracts the changed content from different versions of a web page. In order to evaluate web changes, we propose a unified ranking framework combining three metrics: popularity ranking, content-based ranking and evolution ranking. Our methods can identify and deliver important new information in a timely manner.
Second, we present an application using the techniques and algorithms we developed, named "Web Daily News Assistant (WebDNA): finding what's new on Your Web". It is a search tool that helps community users search new information on their community web. Currently WebDNA is deployed on the New York University web site.
Third, we model the changes of web documents using survival analysis. Modeling web changes is useful for web crawler scheduling and web caching. Currently people model changes to web pages as a Poisson process, and use a necessarily incomplete detection history to estimate the true frequencies of changes. However, other features that can be used to predict change frequency have not previously been studied. Our analysis shows that the PageRank value is a good predictor. Statistically, the change frequency is proportional to $\exp[0.36\cdot (\ln(PageRank)+C)]$. We further study the problem of combining the predictor and change history into a unified framework. An improved estimator of change frequency is presented, which successfully reduces the error by 27.3% when the change history is short.
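Written out, the stated relation is simply a power law in PageRank (this is an algebraic restatement of the formula above, up to the proportionality constant, not an additional empirical claim):

    $\exp[0.36\cdot(\ln(PageRank)+C)] = e^{0.36C}\cdot PageRank^{0.36}$

so, for example, doubling a page's PageRank multiplies the estimated change frequency by about $2^{0.36}\approx 1.28$.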
-
Ph.D. Thesis
2006
Title: Fast Algorithms for Burst Detection
Candidate: Zhang, Xin
Advisor(s): Shasha, Dennis
Abstract:
Events occur in every aspect of our lives.
An unexpectedly large number of events occurring within some window of measurement (e.g. within some time duration or a spatial region) is called a {\em burst}, suggesting unusual behaviors or activities. Bursts come up in many natural and social processes. It is a challenging task to monitor, in a fast data stream environment, the occurrence of bursts whose duration is not known in advance.
This work describes efficient data structures and algorithms for high performance burst detection under different settings. Our view is that bursts, as an unusual phenomenon, constitute a useful preliminary primitive in a knowledge discovery hierarchy. Our intent is to build a high performance primitive detection algorithm to support high-level data mining tasks.
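As a point of reference for the primitive itself, here is a brute-force sliding-window sketch in Python (an illustration only; the data structures developed in the thesis avoid examining every window, and the window sizes and thresholds below are made-up numbers):

    from itertools import accumulate

    def detect_bursts(counts, window_sizes, thresholds):
        # Report every (start, window_size, total) whose total count exceeds
        # the threshold chosen for that window size.
        prefix = [0] + list(accumulate(counts))
        bursts = []
        for w, threshold in zip(window_sizes, thresholds):
            for start in range(len(counts) - w + 1):
                total = prefix[start + w] - prefix[start]
                if total > threshold:
                    bursts.append((start, w, total))
        return bursts

    # A quiet stream of counts with a short surge of activity around position 50.
    stream = [1] * 100
    stream[48:53] = [9] * 5
    print(detect_bursts(stream, window_sizes=[5, 20], thresholds=[20, 40]))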
The work starts with an algorithmic framework including a family of data structures and a heuristic optimization algorithm that chooses an efficient data structure given the inputs. The advantage of this framework is that it is adaptive to different inputs. Experiments on both synthetic data and real world data show that the new framework significantly outperforms existing techniques over a variety of inputs.
Furthermore, we present a greedy dynamic detection algorithm which handles the changing data. It evolves the structure to adapt to the incoming data. It achieves better performance in both synthetic and real data streams than a static algorithm in most cases.
We have applied this framework to different real world applications in physics, stock trading and website traffic monitoring. All the case studies show our framework has superb performance.
We extend this framework to multi-dimensional data and use it in an epidemiology simulation to detect infectious disease outbreak and spread.
-
Ph.D. Thesis
2006
Title: High Performance Algorithms for Multiple Streaming Time Series
Candidate: Zhao, Xiaojian
Advisor(s): Shasha, Dennis
Abstract:
Data arriving in time order (a data stream) arises in fields ranging from physics to finance to medicine to music, to name a few. Often the data comes from sensors (in physics and medicine for example) whose data rates continue to improve dramatically as sensor technology improves. Furthermore, the number of sensors is increasing, so analyzing data between sensors becomes ever more critical in order to distill knowledge from the data. Fast response is desirable in many applications (e.g. to aim a telescope at an activity of interest or to perform a stock trade). In applications such as finance, recent information, e.g. correlation, is of far more interest than older information, so analysis over sliding windows is a desired operation.
These three factors -- huge data size, fast response, and windowed computation -- motivated this work. Our intent is to build a foundational library of primitives to perform online or near online statistical analysis, e.g. windowed correlation, incremental matching pursuit, burst detection, on thousands or even millions of time series. Besides the algorithms, we also propose the concept of ``uncooperative'' time series, whose power spectra are spread over all frequencies without any regularity.
Previous work showed how to do windowed correlation with Fast Fourier Transforms and Wavelet Transforms, but such techniques don't work for uncooperative time series. This thesis will show how to use sketches (random projections) in a way that combines several simple techniques -- sketches, convolution, structured random vectors, grid structures, combinatorial design, and bootstrapping -- to achieve high performance, windowed correlation over a variety of data sets. Experiments confirm the asymptotic analysis.
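The following NumPy fragment sketches the core idea in its simplest form: plain Gaussian random projections approximately preserve inner products, so correlations between normalized windows can be estimated from short sketches. The structured random vectors, grid structures, combinatorial design, and bootstrapping of the actual method are omitted, and the window length and sketch size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def normalize(window):
        # Mean-center and scale to unit norm, so the inner product of two
        # normalized windows equals their Pearson correlation.
        w = np.asarray(window, dtype=float) - np.mean(window)
        return w / np.linalg.norm(w)

    n, k = 256, 32                                  # window length, sketch size
    R = rng.standard_normal((k, n)) / np.sqrt(k)    # random projection matrix

    x = np.cumsum(rng.standard_normal(n))           # a random-walk-like window
    y = x + 0.2 * rng.standard_normal(n)            # a noisy copy of it

    exact = normalize(x) @ normalize(y)               # true correlation
    approx = (R @ normalize(x)) @ (R @ normalize(y))  # estimate from sketches
    print(round(float(exact), 3), round(float(approx), 3))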
To conduct matching pursuit (MP) over time series windows, an incremental scheme is designed to reduce the computational effort. Our empirical study demonstrates a substantial improvement in speed.
In previous work, Zhu and Shasha introduced an efficient algorithm to monitor bursts within windows of multiple sizes. We implemented it in a physical system by overcoming several practical challenges. Experimental results support the authors' linear running time analysis.
-
Ph.D. Thesis
2006
Title: Distribution of Route-Impacting Control Information in a Publish/Subscribe System with Delivery Guarantees
Candidate: Zhao, Yuanyuan
Advisor(s): Kedem, Zvi
Abstract:
Event-driven middleware is a popular infrastructure for building large-scale asynchronous distributed systems. Content-based publish/subscribe systems are a type of event-driven middleware that provides service flexibility and specification expressiveness, creating opportunities for improving reliability and efficiency of the system.
The use of route-impacting control information, such as subscription filters and access control rules, has the potential to enable efficient routing for applications that require selective and regional distribution of events. Such applications range from financial information systems to sensor networks to service-oriented architectures. However, it has been a great challenge to design correct and efficient protocols for distributing control information and exploiting it to achieve efficient and highly available message routing.
In this dissertation, we study the problem of distributing and utilizing route-impacting control information. We present an abstract model of content-based routing and reliable delivery in redundant broker networks. Based on this model, we design a generic algorithm that propagates control information, performs content-based routing, and delivers events reliably. The algorithm is efficient and lightweight in that it does not require heavyweight consensus protocols between redundant brokers. We extend this generic algorithm to support consolidation and merging of control information. Existing protocols can be viewed as particular encodings and optimizations of the generic algorithm. We show an encoding using virtual time vectors that supports reliable delivery and deterministic dynamic access control in redundant broker networks. In our system, the semantics of reliable delivery is clearly defined even if subscription information and access control policies change dynamically. That is, one or more subscribers of the same principal will receive exactly the same sequence of messages (modulo subscription filter differences) regardless of where they are connected and of the network latency and failure conditions in their parts of the network.
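As a tiny illustration of the content-based primitive involved (a sketch only; the broker names, attributes, and filters are invented, and none of the reliability, access-control, or virtual-time machinery described above is shown):

    def matches(event, subscription):
        # A subscription filter maps attribute names to predicates; the event
        # matches when every predicate holds on the corresponding attribute.
        return all(attr in event and pred(event[attr])
                   for attr, pred in subscription.items())

    def route(event, broker_subscriptions):
        # Forward the event only toward brokers holding at least one matching filter.
        return [broker for broker, subs in broker_subscriptions.items()
                if any(matches(event, s) for s in subs)]

    quote = {"symbol": "IBM", "price": 83.5}
    subscriptions = {
        "broker1": [{"symbol": lambda v: v == "IBM", "price": lambda v: v > 80}],
        "broker2": [{"symbol": lambda v: v == "MSFT"}],
    }
    print(route(quote, subscriptions))   # ['broker1']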
We have implemented these protocols in a fully-functioning content-based publish/subscribe system - Gryphon. We evaluate its efficiency, scalability and high availability.