Theses & Reports
-
M.S. Thesis
2023
On Certified Isotopic Approximation of Space Curves
Dogan, Caglar
Title: On Certified Isotopic Approximation of Space Curves
Candidate: Dogan, Caglar
Advisor(s): Chee Yap
Abstract:
The approximation of implicitly defined curves or surfaces is a problem of interest for many fields, and it has been explored using algebraic, geometric, and numerical methods. Amongst these, a numerical method called the Marching Cubes algorithm [4] has been the primary choice in implementations because of its efficiency and ease of implementation, even though it generally comes with no guarantee of topological correctness.
Research in this area has largely focused on approximations of (𝑛 − 1)-dimensional manifolds in 𝑛-dimensional Euclidean space. These are called co-dimension 1 manifolds, defined as the zero sets of single equations in 𝑛 variables. Plantinga and Vegter (2004) [8] derived the first algorithms with guaranteed topological correctness using interval arithmetic and adaptive subdivision for 𝑛 = 2, 3. Faster variants of such algorithms were described by Yap et al. (2009, 2014) [10] [11]. Galehouse (2008) [9] succeeded in producing such algorithms for all 𝑛.
This thesis addresses the problem of computing isotopic approximations of co-dimension 2 manifolds, i.e., (𝑛 − 2)-dimensional manifolds in 𝑛-dimensional Euclidean space. Such manifolds are the intersection of the zero sets of two equations in 𝑛 variables. The first interesting case is 𝑛 = 3, i.e., the problem of computing an isotopic approximation of a space curve in 3D. We work on devising new algorithms by extending the previous interval techniques in co-dimension 1. Moreover, we implement and visualize such curves in order to verify their practical efficiency.
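As a rough illustration of the interval-style subdivision such algorithms build on, a minimal sketch (not code from the thesis: corner-plus-center sampling stands in for certified interval arithmetic, and nothing here checks topology, which is exactly what certified predicates must add):

    # Adaptive subdivision with an exclusion test for the space curve
    # {f = 0, g = 0} in 3D.
    import itertools

    def range_estimate(fn, box):
        # Placeholder range enclosure: sample the box corners and center.
        pts = list(itertools.product(*box))
        pts.append(tuple((lo + hi) / 2 for lo, hi in box))
        vals = [fn(*p) for p in pts]
        return min(vals), max(vals)

    def split(box):
        # Bisect the box along its widest dimension.
        i = max(range(len(box)), key=lambda j: box[j][1] - box[j][0])
        lo, hi = box[i]
        mid = (lo + hi) / 2
        return ([*box[:i], (lo, mid), *box[i + 1:]],
                [*box[:i], (mid, hi), *box[i + 1:]])

    def candidate_boxes(f, g, box, min_width=0.1):
        # Discard a box once either f or g is (heuristically) sign-definite.
        for fn in (f, g):
            lo, hi = range_estimate(fn, box)
            if lo > 0 or hi < 0:
                return []
        if max(hi - lo for lo, hi in box) < min_width:
            return [box]
        b1, b2 = split(box)
        return (candidate_boxes(f, g, b1, min_width)
                + candidate_boxes(f, g, b2, min_width))

    # Example: the unit circle x^2 + y^2 = 1 in the plane z = 0.
    f = lambda x, y, z: x * x + y * y - 1
    g = lambda x, y, z: z
    print(len(candidate_boxes(f, g, [(-2, 2), (-2, 2), (-1, 1)])),
          "candidate boxes cover the curve")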
-
M.S. Thesis
2022
Symbolic Execution of GRASShopper Programs
Cox, Eric
Title: Symbolic Execution of GRASShopper Programs
Candidate: Cox, Eric
Advisor(s): Thomas Wies
Abstract:
Symbolic execution is an efficient and viable alternative approach for implementing deductive verification tools that fully automate the formal verification of programs. In many cases, symbolic execution provides performance gains over tools based on verification condition generation (VCG), because it directly manipulates in-memory data structures.
This thesis presents the design and implementation of a symbolic execution engine for the GRASShopper programming language, which already supports verification using VCG. The goal of this work was to adapt ideas from the symbolic execution engine of the Viper verification infrastructure to the semantics of GRASShopper and to demonstrate its utility on sample programs. We present a rigorous description of the operational semantics of the symbolic interpreter, discuss implementation details, and illustrate the symbolic execution behavior on a set of sample programs.
In order to explore interesting details of implementing a symbolic execution backend for GRASShopper, this work introduces a method for encoding snapshots at the struct-field level using injective functions. In addition, several language extensions were added to the GRASShopper user-facing language and the intermediate representation. These extensions add support for finer-grained permissions on individual fields rather than granting permissions to all fields of a structure, for unfolding and folding of recursive predicates, and for if-then-else expressions in predicates and heap-dependent functions.
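As background on the core technique (a generic sketch with an invented toy statement language, not GRASShopper's or Viper's semantics): a symbolic executor runs the program on symbolic inputs, keeping a store of symbolic terms and forking at branches, so each explored path carries a path condition.

    def subst(expr, store):
        # Replace program variables by the symbolic terms they hold.
        if isinstance(expr, str):
            return store.get(expr, expr)
        if isinstance(expr, tuple):
            return (expr[0],) + tuple(subst(a, store) for a in expr[1:])
        return expr

    def execute(stmts, store, pc, results):
        if not stmts:
            results.append((store, pc))        # one fully explored path
            return
        op, *args = stmts[0]
        rest = stmts[1:]
        if op == "assign":                     # x := e
            var, expr = args
            execute(rest, {**store, var: subst(expr, store)}, pc, results)
        elif op == "if":                       # fork on both outcomes
            cond, then_s, else_s = args
            c = subst(cond, store)
            execute(then_s + rest, store, pc + [c], results)
            execute(else_s + rest, store, pc + [("not", c)], results)

    prog = [("if", (">", "x", 0),
             [("assign", "y", ("+", "x", 1))],    # then: y := x + 1
             [("assign", "y", 0)])]               # else: y := 0
    results = []
    execute(prog, {"x": "x0"}, [], results)
    for store, pc in results:
        print("path condition:", pc, "store:", store)

A real engine discharges each path condition with an SMT solver and, for a heap-manipulating language like GRASShopper, also tracks heap permissions and snapshots in the symbolic state.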
-
M.S. Thesis
2022
Program Unrolling by Abstract Interpretation for Probabilistic Proofs
Feldan, Daniel
Title: Program Unrolling by Abstract Interpretation for Probabilistic Proofs
Candidate: Feldan, Daniel
Advisor(s): Patrick Cousot
Abstract:
Zero-knowledge proofs are cryptographic protocols that enable one party to prove the validity of a statement to another party while revealing no additional information beyond the statement’s truth. These protocols have a myriad of applications, especially within the realm of cloud computing. Zero-knowledge protocols can be used to probabilistically verify a cloud-computed program by first converting the input program into a boolean circuit, then using this circuit in a zero-knowledge proof system to show the correctness of the computed output. This work focuses on enhancing zero-knowledge proofs for practical implementation, as many current protocols are very computationally expensive.
The primary contribution of this thesis is a formalization of a program transformation technique that combines abstract interpretation and program unrolling to analyze, transform and optimize a program before transforming it into a boolean circuit. By analyzing a program with an abstract interpreter while simultaneously unrolling it, we achieve a significantly more precise static analysis without the need for traditional widening and narrowing operations. This approach enables aggressive optimization of the unrolled program, reducing both the cost of transforming the program into a circuit and the resulting circuit’s size.
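A toy illustration of the unroll-and-interpret idea (a sketch under assumptions: the loop, the interval domain, and the transformer below are invented for illustration, not taken from the thesis):

    # Hypothetical loop: x := [0, 1]; repeat n times: x := x + 2.
    # Interpreting each unrolled iteration separately keeps every interval
    # exact, so no widening or narrowing is needed.
    def analyze_unrolled(n_iters, x0):
        x = x0                        # abstract value: interval (lo, hi)
        states = [x]
        for _ in range(n_iters):
            x = (x[0] + 2, x[1] + 2)  # abstract transformer for x := x + 2
            states.append(x)
        return states

    print(analyze_unrolled(5, (0.0, 1.0)))
    # [(0.0, 1.0), (2.0, 3.0), ..., (10.0, 11.0)]; a widened fixpoint
    # analysis of the same loop would typically only report x >= 0.

The tight per-iteration facts are what allow aggressive simplification of the unrolled program before circuit generation.
-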
M.S. Thesis
2020
Cooperation and Deception in multi-agent signaling
Enaganti, Inavamsi
Title: Cooperation and Deception in multi-agent signaling
Candidate: Enaganti, Inavamsi
Advisor(s): Bhubaneshwar Mishra
Abstract:
We aim to study cooperation and deception in a system with multiple agents through utility and signaling. We start with the classic standard for cooperation, namely the ‘Prisoner’s Dilemma’, and then move on to the ‘Iterated Prisoner’s Dilemma’, which we treat as an iterated version of a signaling game, because the previous actions of an agent are a signal to the opponent about the agent’s type. We then move on to biomimicry and deception, where we study dynamics and interesting phenomena that arise due to signaling between predator and prey. Cooperation and deception are two sides of the same coin, and it is imperative to understand both as we develop better and more efficient artificial intelligence systems.
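For reference, a minimal sketch of an iterated Prisoner's Dilemma match with the standard payoffs (the strategies are textbook examples, not the thesis's agents); each player's past moves act as the signal of its type:

    # C = cooperate, D = defect; PAYOFF[(my move, your move)] = (me, you).
    PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

    def tit_for_tat(opp_history):        # cooperate first, then mirror
        return "C" if not opp_history else opp_history[-1]

    def always_defect(opp_history):
        return "D"

    def play(p1, p2, rounds=10):
        m1, m2, s1, s2 = [], [], 0, 0
        for _ in range(rounds):
            a, b = p1(m2), p2(m1)        # each sees the opponent's history
            u1, u2 = PAYOFF[(a, b)]
            s1, s2 = s1 + u1, s2 + u2
            m1.append(a)
            m2.append(b)
        return s1, s2

    print(play(tit_for_tat, always_defect))   # (9, 14): one exploitation,
                                              # then mutual defection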
-
M.S. Thesis
2020
Static Responsibility Analysis of Floating-Point Programs
Saatcioglu, Goktug
Title: Static Responsibility Analysis of Floating-Point Programs
Candidate: Saatcioglu, Goktug
Advisor(s): Thomas Wies
Abstract:
The last decade has seen considerable progress in the analysis of floating-point programs. There now exist frameworks to verify both the total amount of round-off error a program accrues and the robustness of floating-point programs. However, there is a lack of static analysis frameworks to identify causes of erroneous behaviors due to the use of floating-point arithmetic. Such errors are both sporadic and triggered by specific inputs or numbers computed by programs. In this work, we introduce a new static analysis by abstract interpretation to define and detect responsible entities for such behaviors in finite precision implementations. Our focus is on identifying causes of test discontinuity, where small differences in inputs may lead to large differences in the control flow of programs, causing the computed finite precision path to differ from the same ideal computation carried out in real numbers. However, the analysis is not limited to discontinuity, as any type of error cause can be identified by the framework. We propose to carry out the analysis by a combination of over-approximating forward partitioning semantics and under-approximating backward semantics of programs, which leads to a forward-backward static analysis with iterated intermediate reduction. This paves the way for the design of a tool that helps programmers identify and fix numerical bugs in their programs due to the use of finite-precision numbers. The implementation of this tool is the next step for this work.
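As a concrete instance of the behaviors targeted above (illustrative examples, not drawn from the thesis):

    # A test discontinuity: a tiny input change flips a branch and changes
    # the result by a large amount.
    def f(x):
        return 0.0 if x < 1.0 else 100.0

    eps = 2 ** -52
    print(f(1.0 - eps), f(1.0))   # 0.0 vs 100.0 for nearly identical inputs

    # A float path that differs from the ideal real-number path:
    x, y = 0.1, 0.2
    print(x + y <= 0.3)   # False in floats, though 0.1 + 0.2 <= 0.3 in the reals

-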
M.S. Thesis
2020
Pointer-Generator Transformers for Morphological Inflection
Singer, Assaf
Title: Pointer-Generator Transformers for Morphological Inflection
Candidate: Singer, Assaf
Advisor(s): Kyunghyun Cho
Abstract:
In morphologically rich languages, a word's surface form reflects syntactic and semantic properties such as gender, tense or number. For example, most English nouns have both singular and plural forms (e.g., robot/robots, process/processes), which are known as the inflected forms of the noun. The vocabularies of morphologically rich languages, e.g., German or Spanish, are larger than those of morphologically poor languages, e.g., Chinese, if every surface form is considered an independent token. This motivates the development of models that can deal with inflections by either analyzing or generating them and, thus, alleviate the sparsity problem.
This thesis presents approaches to generate morphological inflections. We cast morphological inflection as a sequence-to-sequence problem and apply different versions of the transformer, a state-of-the-art deep learning model, to the task. However, for many languages, the availability of morphological lexicons, and thus of training data for the task, is a big challenge. In our work, we explore different ways to overcome this: (1) we propose a pointer-generator transformer model to allow easy copying of input characters, which is known to improve the performance of neural models in the low-resource setting; (2) we implement a system for the task of unsupervised morphological paradigm completion, where systems produce inflections from raw text alone, without relying on morphological information; and (3) we explore multitask training and data hallucination pretraining, two methods which yield more training examples. With our formulated models and data augmentation methods, we participate in the SIGMORPHON 2020 shared task, and describe the NYU-CUBoulder systems for Task 0 on typologically diverse morphological inflection and Task 2 on unsupervised morphological paradigm completion. Finally, we design a low-resource experiment to show the effectiveness of our proposed approaches for low-resource languages.
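A minimal sketch of the copy mechanism behind a pointer-generator output layer (an assumed textbook formulation for illustration; the thesis defines the actual model): the output distribution mixes a generation distribution with attention-weighted copying of input characters.

    def pointer_generator_step(p_gen, vocab_probs, attention, input_chars):
        # p_gen in [0, 1] balances generating vs. copying; vocab_probs maps
        # each character to P(generate it); attention sums to 1 over inputs.
        out = {c: p_gen * p for c, p in vocab_probs.items()}
        for weight, c in zip(attention, input_chars):
            out[c] = out.get(c, 0.0) + (1 - p_gen) * weight   # copy mass
        return out

    probs = pointer_generator_step(
        p_gen=0.7,
        vocab_probs={"a": 0.5, "b": 0.3, "c": 0.2},
        attention=[0.9, 0.1],
        input_chars=["z", "a"],
    )
    print(probs)   # the rare input character "z" gains mass via copying

Copying lets the model reuse rare input characters directly, which is why it helps in low-resource settings.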
-
M.S. Thesis
2020
Data Flow Refinement Type Inference Tool Drift²
Su, Yusen
Title: Data Flow Refinement Type Inference Tool Drift²
Candidate: Su, Yusen
Advisor(s): Thomas Wies
Abstract:
Refinement types use logical predicates to capture run-time properties of programs, which can be used for program verification. Traditionally, SMT-based checking tools for refinement types, such as the implementation of Liquid Types [1], require either heuristics or random sampling of logical qualifiers to find the relevant logical predicates.
In this thesis, we describe the implementation of a novel algorithm proposed in Zvonimir Pavlinovic’s PhD thesis "Leveraging Program Analysis for Type Inference" [2], which is based on the framework of abstract interpretation and infers refinement types in functional programs. The analysis generalizes Liquid type inference and is parametric in the abstract domain used to express type refinements. The main contribution of this thesis is to instantiate this parametric type analysis and to evaluate the algorithm’s precision and efficiency. Moreover, we describe a tool, called DRIFT², which allows users to select an abstract domain for expressing type refinements and to control the degree to which context-sensitive information is tracked by the analysis.
Finally, our work compares the precision and efficiency of DRIFT² for different configurations of numerical abstract domains and widening operations [3]. In addition, we compare DRIFT² with existing refinement type inference tools. The experimental results show that our method is both effective and efficient in automatically inferring refinement types.
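For readers new to refinement types, a small illustration (hypothetical syntax, not DRIFT²'s input language):

    # A refinement type pairs a base type with a logical predicate, e.g.
    #     abs : x:int -> {v:int | v >= 0}
    # read as: abs returns an integer v satisfying v >= 0. Inference with
    # a numerical abstract domain (e.g. intervals) aims to discover such
    # predicates automatically from code like the following.
    def abs_(x: int) -> int:
        return x if x >= 0 else -x

-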
M.S. Thesis
2020
Are the proposed similarity metrics also a measure of functional similarity?
Yellapragada, Manikanta Srikar
Title: Are the proposed similarity metrics also a measure of functional similarity?
Candidate: Yellapragada, Manikanta Srikar
Advisor(s): Kyunghyun Cho
Abstract:
A recent body of work attempts to understand the behavior and training dynamics of neural networks by analyzing intermediate representations and designing metrics to define the similarity between those representations. We observe that the representations of the last layer could be thought of as the functional output of the model up to that point. In this work, we investigate whether the similarity between these representations can be considered a stand-in for the similarity of the networks' output functions. This can have an impact on many downstream tasks, but we specifically analyze it in the context of transfer learning. Consequently, we perform a series of experiments to understand the relationship between the representational similarity and the functional similarity of neural networks. We show in two ways that the leading metric for representational similarity, CKA, does not bear a strict relationship with functional similarity.
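For reference, a common formulation of linear CKA (the standard definition from the literature; this sketch is not necessarily the thesis's implementation):

    # Linear CKA between two representations of the same n inputs.
    import numpy as np

    def linear_cka(X, Y):
        # X: (n, d1), Y: (n, d2); center each feature, then compare.
        X = X - X.mean(0)
        Y = Y - Y.mean(0)
        num = np.linalg.norm(Y.T @ X, "fro") ** 2
        den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
        return num / den

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 32))
    print(linear_cka(X, X))                            # 1.0: identical
    print(linear_cka(X, rng.normal(size=(100, 32))))   # much lower: unrelated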
-
M.S. Thesis
2019
End-to-End Hierarchical Clustering with Graph Neural Networks
Choma, Nicholas
Title: End-to-End Hierarchical Clustering with Graph Neural Networks
Candidate: Choma, Nicholas
Advisor(s): Bruna, Joan
Abstract:
The objective of this thesis is to develop a data-driven, hierarchical clustering method which is capable of operating on large point cloud datasets, necessitating a runtime which is sub-quadratic. Hierarchical clustering is noteworthy for its ability to produce multiscale views of data, allowing for rich and interpretable representations, and for its ability to cluster when the number of clusters is not specified a priori. To date, deep learning methods for clustering have primarily focused on a narrower class of models which cluster using partitioning strategies and require as input the number of clusters to produce. In this work, we introduce the clustering graph neural network, extending previous research into graph neural networks to handle large clustering tasks where the number of clusters is variable and not pre-specified. Our architecture is fast, operating with O(n log n) time complexity, and we note its amenability to high levels of parallelization. Because each stage is differentiable, we emphasize that our architecture is capable of end-to-end training, leveraging signal throughout the learning pipeline as part of a multi-objective loss function. Finally, we demonstrate the clustering graph neural network on a challenging particle tracking task, which, while unable to outperform highly-tuned and domain-specific baselines, nevertheless achieves high performance while remaining flexible to a wide array of clustering tasks.
-
M.S. Thesis
2019
On Zero-Shot Transfer Learning for Event Extraction
Haroon, Shaheer
Title: On Zero-Shot Transfer Learning for Event Extraction
Candidate: Haroon, Shaheer
Advisor(s): Grishman, Ralph
Abstract:
Event extraction normally requires large amounts of annotated data for each event type. Each event consists of trigger words and arguments that fulfill certain roles. This limits the ability to add new event types to an existing ontology, or to build a new one, because of the massive effort involved in manually annotating a corpus. Recent methods have proposed using zero-shot transfer learning to minimize the amount of annotated data required for a classifier to predict new event types. The zero-shot classifier relies on several components, including a preexisting event ontology, to be successful. Our goal was to explore factors, including the choice of role names, event type names, and the definitions of event mention and event type structures, that could influence the results of a zero-shot classifier. We found that the use of paradigmatic role names and characteristic event type names in an event ontology has an especially significant impact on the success of the classifier. As a result, a fair amount of effort is still required when adding new event types to an ontology in order to promote the success of a zero-shot approach.
-
M.S. Thesis
2019
Leveraging Communication for Efficient Sampling
Kapoor, Sanyam
Title: Leveraging Communication for Efficient Sampling
Candidate: Kapoor, Sanyam
Advisor(s): Bruna, Joan
Abstract:
Machine learning has shown promising success in tasks like classification, regression and, more recently, generation. However, long-term planning remains a challenge for real-world deployment, and one of the key components of long-term planning is exploration. In this work, we discuss how communication can be leveraged to improve exploration. We study this problem from the perspective of sampling from un-normalized density functions.
Hamiltonian Monte Carlo (HMC) struggles to sample from highly separated multimodal distributions, and parallel chains can be wasteful by the nature of Markov chain sampling. We show how replica exchange induces a weak form of communication. This is contrasted with a particle-based approach, Stein Variational Gradient Descent (SVGD), which induces a stronger form of communication via kernel evaluations. The quality of samples from both HMC and SVGD is evaluated with Maximum Mean Discrepancy. We finally propose Graph Neural Networks with stronger inductive biases to amortize the dynamics of SVGD for fast generation of representative samples.
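For reference, a sketch of the (biased) squared-MMD estimator with an RBF kernel, the quantity used above to score sample quality (the kernel and bandwidth choices are illustrative assumptions):

    import numpy as np

    def mmd2(X, Y, bandwidth=1.0):
        # Squared MMD between sample sets X: (n, d) and Y: (m, d).
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth ** 2))
        return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

    rng = np.random.default_rng(0)
    samples = rng.normal(0, 1, size=(200, 1))   # a sampler's output
    target  = rng.normal(0, 1, size=(200, 1))   # reference draws
    shifted = rng.normal(3, 1, size=(200, 1))
    print(mmd2(samples, target))    # small: same distribution
    print(mmd2(samples, shifted))   # much larger: distributions differ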
-
M.S. Thesis
2019
Machine Learning Applications to Protein Variant Effect Prediction
Soules, Jeffrey
Title: Machine Learning Applications to Protein Variant Effect Prediction
Candidate: Soules, Jeffrey
Advisor(s): Bonneau, Richard
Abstract:
Proteins are microscopic machines whose activity forms the basis of all life processes. If a mutation causes variation in the typical amino acid sequence of a protein, the protein’s normal biological function may be compromised. Variant Interpretation and Prediction Using Rosetta (VIPUR) uses sequence and structural data features to predict whether a mutation is deleterious to the protein’s function. VIPUR was originally released with a curated set of protein variants as its training data. As released, it achieved 80% accuracy on this data set. However, the original design was tightly coupled to a logistic regression classifier, so other machine learning techniques could not be easily tested. The reimplementation of VIPUR presented in this work offers a modular design that can be extended with classifiers built on any machine learning approach. It establishes a methodologically sound basis for experimentation with new classifiers, data features, and data sets. This work examines the predictive power of the data features in the original VIPUR training set, and establishes a high baseline for classification performance based on one strongly predictive feature category. The present work includes classifier modules built with four different machine learning approaches—logistic regression, support vector machines, gradient-boosted forests, and neural networks. These represent the two model types considered in the original VIPUR work, and two more recent classifier types. The modules are trained with automated hyperparameter cross-validation and rigorously evaluated with k-fold cross validation, establishing a baseline of performance for future experiments. Results show very slight improvement over the original logistic regression method, consistent with the dominance of a small handful of features in determining classification results. Potential new data features and sources are discussed, which can be used in the new VIPUR design without modification while maintaining backwards compatibility with previously trained classifiers.
-
M.S. Thesis
2019
Cold Case: The Lost MNIST Digits
Yadav, Chhavi
Title: Cold Case: The Lost MNIST Digits
Candidate: Yadav, Chhavi
Advisor(s): Fergus, Rob
Abstract:
Although the popular MNIST dataset (LeCun, Cortes, and Burges 1994) is derived from the NIST database (Grother and Hanaoka 1995), the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by (Recht et al. 2018; Recht et al. 2019): although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
-
M.S. Thesis
2018
Detecting Dead Weights and Units in Neural Networks
Evci, Utku
Title: Detecting Dead Weights and Units in Neural Networks
Candidate: Evci, Utku
Advisor(s): Fergus, Rob
Abstract:
Deep neural networks are highly over-parameterized, and their size can be reduced significantly after training without any decrease in performance. This phenomenon is clearly visible in a wide range of architectures trained for various problems. Weight/channel pruning, distillation, quantization, and matrix factorization are some of the main methods one can use to remove the redundancy and arrive at smaller and faster models.
This work starts with a short informative chapter, where we motivate the pruning idea and introduce the necessary notation. In the second chapter, we compare various saliency scores in the context of parameter pruning. Using the insights obtained from this comparison, and the problems it exposes, we motivate why pruning units instead of individual parameters might be a better idea. We propose a set of definitions to quantify and analyze units that don't learn and don't create any useful information. We propose an efficient way of detecting dead units and use it to select which units to prune. We obtain a 5x model size reduction through unit-wise pruning on MNIST.
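As a generic illustration of magnitude-based pruning, one common baseline saliency (a sketch, not the thesis's dead-unit criterion):

    # Zero out the smallest-magnitude weights of a layer.
    import numpy as np

    def prune_smallest(weights, fraction=0.8):
        threshold = np.quantile(np.abs(weights), fraction)
        mask = np.abs(weights) >= threshold      # keep only large weights
        return weights * mask, mask

    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64))
    W_pruned, mask = prune_smallest(W, 0.8)
    print(f"{(~mask).mean():.0%} of weights removed")   # ~80%

A unit all of whose incoming or outgoing weights are removed can then be deleted outright, which is in the spirit of the unit-wise pruning the thesis argues for.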
-
M.S. Thesis
2018
Classifying the Quality of Movement via Motion Capture and Machine Learning
Saxe, Ryan
Title: Classifying the Quality of Movement via Motion Capture and Machine Learning
Candidate: Saxe, Ryan
Advisor(s): Shasha, Dennis
Abstract:
With the recent surge of machine vision technology and available video data, computational methods that utilize this data are becoming increasingly important. This thesis shows that, with the proper application of skeletal tracking, it is possible to discern whether or not a physical task — a squat — is performed well. The skeletal tracking software employed is provided by Optitrack’s motion capture client, Motive:Body. The data generated from Optitrack was used to extract features related to the proper execution of a squat. This thesis uses a variety of machine learning techniques to evaluate the quality of physical performance. Support Vector Machines, Random Forests, and Decision Tree algorithms were tested with ten-fold cross validation and compared to a baseline of logistic regression, given the binary nature of the problem. While logistic regression performed at 66% accuracy, the three other algorithms performed substantially better, with decision trees performing best at 80%.
-
M.S. Thesis
2017
Atypical: A type system for live performances
Nunes, Gabriel Barbosa
Title: Atypical: A type system for live performances
Candidate: Nunes, Gabriel Barbosa
Advisor(s): Panozzo, Daniele; Perlin, Ken
Abstract:
Chalktalk is a visual language based around real-time interaction with virtual objects in a blackboard-style environment. Its aim is to be a presentation and communication tool, using animation and interactivity to allow easy illustration of complex topics or ideas. Among many of the capabilities of these virtual objects is the ability to send data from one object to another via a visual linking system. In this paper, we describe a way of making the link system more robust by adding type information to these links, and compare and contrast the requirements of a presentation-oriented visual language with a more traditional programming language.
-
M.S. Thesis
2017
Inducing Cooperation Through Virtual Reality
Zhang, Daniel W.
Title: Inducing Cooperation Through Virtual Reality
Candidate: Zhang, Daniel W.
Advisor(s): Perlin, Ken
Abstract:
There has been a recent resurgence in Virtual Reality (VR) as a new medium for entertainment and communication. With these potentially exciting developments, I decided to create an experiment to test how people can potentially be influenced by virtual reality interfaces. I had hoped that I could induce people to cooperate better in a virtual reality version of a task compared to an un-augmented version of the task in the regular world. After conducting 16 separate trials, with half in VR and the other half in the regular world, there is no conclusive evidence to completely confirm or deny this hypothesis. I have found evidence to suggest that there can be such an influence, as there were more successes in the VR trials than the regular trials, but they can potentially be explained away by the sample size and the attitudes of participants before starting the experiment. This data suggests that further research in this field can lead to interesting discoveries regarding human behavior in virtual reality environments, and that the Holojam framework invented by the Future Reality Lab at New York University can be very helpful in designing experiments for this research.
-
M.S. Thesis
2016
A New Strongly Polynomial Algorithm for Computing Fisher Market Equilibria with Spending Constraint Utilities
Wang, Zi
Title: A New Strongly Polynomial Algorithm for Computing Fisher Market Equilibria with Spending Constraint Utilities
Candidate: Wang, Zi
Advisor(s): Cole, Richard
Abstract:
This thesis develops and analyzes an algorithm to compute equilibrium prices for a Fisher market in which the buyer utilities are given by spending constraint functions, utility functions originally defined by Devanur and Vazirani.
Vazirani gave a weakly polynomial time algorithm to compute the equilibrium prices. More recently, Végh gave a strongly polynomial algorithm. Here we provide another strongly polynomial algorithm, which is arguably conceptually simpler, although its running time is not always better.
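For background, the linear Fisher market, which spending-constraint utilities generalize, is captured by the Eisenberg-Gale convex program (standard material, sketched for orientation; B_i is buyer i's budget, u_{ij} the utility of a unit of good j to buyer i, and x_{ij} the allocation):

    \max \sum_i B_i \log u_i
    \quad \text{subject to} \quad
    u_i = \sum_j u_{ij} x_{ij}, \qquad
    \sum_i x_{ij} \le 1 \;\; \forall j, \qquad
    x_{ij} \ge 0 .

The equilibrium prices arise as the optimal dual variables of the supply constraints \sum_i x_{ij} \le 1.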
-
M.S. Thesis
2015
Responsive Visualization of Points in Space: Sampling, Clustering, Partitioning
Jain, Akshay
-
M.S. Thesis
2014
Resolution-Exact Planner for a 2-link Planar Robot using Soft Predicates
Luo, Zhongdi
Abstract
|
PDF
Title: Resolution-Exact Planner for a 2-link Planar Robot using Soft Predicates
Candidate: Luo, Zhongdi
Advisor(s): Yap, Chee
Abstract:
Motion planning is a major topic in robotics. It frequently refers to the motion of a robot in an ℝ² or ℝ³ world that contains obstacles. Our goal is to produce algorithms that are practical and have strong theoretical guarantees. Recently, a new framework, Soft Subdivision Search (SSS), was introduced to solve various motion planning problems. It is based on soft predicates and a new notion of correctness called resolution-exactness. Unlike most theoretical algorithms, such algorithms can be implemented without exact computation. In this thesis we describe a detailed, realized algorithm of SSS for a 2-link robot in ℝ². We prove the correctness of our predicates and conduct an experimental study of several strategies to enhance the basic SSS algorithm. In particular, we introduce a technique called T/R Splitting, in which the splittings of the rotational degrees of freedom are deferred to the end. The results give strong evidence of the practicability of SSS.
-
M.S. Thesis
2013
An Efficient Active Learning Framework for New Relation Types
Fu, Lisheng
Title: An Efficient Active Learning Framework for New Relation Types
Candidate: Fu, Lisheng
Advisor(s): Grishman, Ralph; Davis, Ernest
Abstract:
Relation extraction is a fundamental task in information extraction. Different methods have been studied for building relation extraction systems. Supervised training of models for this task has yielded good performance, but at a substantial cost for the annotation of large training corpora (about 40K same-sentence entity pairs). Semi-supervised methods require only a seed set, but their performance is very limited when the seed set is very small, which is not satisfactory for real relation extraction applications. The trade-off between annotation and performance is also hard to decide in practice. Active learning strategies allow users to gradually improve the model and to achieve performance comparable to supervised methods with limited annotation. Recent studies show that active learning on this task needs far fewer labels per type to build a useful relation extraction application. We believe active learning is a promising direction for relation extraction, and this thesis presents a more efficient active learning framework. This framework starts from a better balance between positive and negative samples, and is boosted by interleaving self-training and co-testing. We also studied the reduction of annotation cost by enforcing argument type constraints. Experiments show a substantial speed-up in comparison to the previous state-of-the-art pure co-testing active learning framework. We obtain reasonable performance with only a hundred labels for individual ACE 2004 relation types. We also developed a GUI tool for real human-in-the-loop active learning trials. The goal of building relation extraction systems in a very short time thus seems promising.
-
M.S. Thesis
2013
Parsing and Analyzing POSIX API behavior on different platforms
Savvides, Savvas
Title: Parsing and Analyzing POSIX API behavior on different platforms
Candidate: Savvides, Savvas
Advisor(s): Cappos, Justin; Li, Jinyang
Abstract:
Because of the increased variety of operating systems and architectures, developing applications that are supported by multiple platforms has become a cumbersome task. To mitigate this problem, many portable APIs were created that hide the details of the underlying layers of the system and provide a universal interface on top of which applications can be built. It is often necessary to examine the interactions between an application and an API, either to check that the behavior of these interactions is the expected one or to confirm that this behavior is the same across platforms. In this thesis, the POSIX Omni Tracer tool is described. POSIX Omni Tracer provides an easy way to analyze the behavior of the POSIX API under various platforms. The behavior of the POSIX API can be captured in traces during an application's execution using various tracing utilities. These traces are then parsed into a uniform representation. Since the captured behavior from different platforms shares the same format, traces can easily be analyzed or compared with each other.
-
M.S. Thesis
2013
PhyloBrowser: A visual tool to explore phylogenetic trees
Tershakovec, Tamara
Title: PhyloBrowser: A visual tool to explore phylogenetic trees
Candidate: Tershakovec, Tamara
Advisor(s): Shasha, Dennis; Coruzzi, Gloria
Abstract:
Primary acknowledgements go to my research advisor, Dennis Shasha, for his patient and unwavering support. I would also like to thank my second reader, Gloria Coruzzi, for encouraging and showcasing my work. Kranthi Varala helped immensely in explaining the biological and statistical concepts involved in this project. The Virtual Plant team got me started in biological data visualization and was a joy to work with. And finally, thanks to Chris Poultney, who put me in touch with Professor Shasha and started me on my way back to NYU.
-
M.S. Thesis
2013
PAC-Learning for Energy-based Models
Zhang, Xiang
Title: PAC-Learning for Energy-based Models
Candidate: Zhang, Xiang
Advisor(s): LeCun, Yann; Sontag, David
Abstract:
In this thesis we prove that probably approximately correct (PAC) learning is guaranteed for the framework of energy-based models. Starting from very basic inequalities, we establish our theory based on the existence of a metric between hypotheses, with respect to which the energy function is Lipschitz continuous. The theory yields a new regularization scheme called central regularization, which puts the effects of deep learning and feature learning in a new perspective. Experiments with this scheme show that it achieves both good generalization error and good testing error.
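For orientation, the flavor of guarantee that PAC learning provides (the standard finite-hypothesis-class bound, stated here as background rather than as the thesis's theorem): with probability at least 1 − δ over m i.i.d. samples,

    R(h) \;\le\; \widehat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}}
    \qquad \text{for all } h \in \mathcal{H},

where R is the true risk, \widehat{R} the empirical risk, and \mathcal{H} the hypothesis class.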
-
M.S. Thesis
2012
A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia
Morton-Owens, Emily
Title: A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia
Candidate: Morton-Owens, Emily
Advisor(s): Davis, Ernest
Abstract:
The Kivrin program, consisting of a crawler, a data collection, and a front-end interface, attempts to extract biographical information from Wikipedia, specifically, spatio-temporal information--who was where when--and make it easily searchable. Some of the considerations standard to moving object databases do not apply in this context, because the texts by their nature discuss a discontinuous series of notable moments. The paper discusses different methods of arranging the crawler queue priority to find more important figures and of disambiguating locations when the same place name (toponym) is shared among several places. When lifespan information is not available, it is estimated to exclude sightings outside the person's plausible lifetime.
The results are grouped by the number of sightings in the user's search range to minimize the visibility of false drops when they occur. Erroneous results are more visible in times and places where fewer legitimate sightings are recorded; the data is skewed, like Wikipedia itself, towards the U.S. and Western Europe and relatively recent history. The system could be most improved by using statistical methods to predict which terms are more likely personal names than place names and to identify verbs that precede location information rather than personal names. It could also be improved by incorporating the times as a third dimension in the geospatial index, which would allow "near" queries to include that dimension rather than a strict range.
The program can be used at http://linserv1.cims.nyu.edu:48866/cgi-bin/index.cgi
-
M.S. Thesis
2010
DTAC: A method for planning to claim in Bridge
Bethe, Paul
Title: DTAC: A method for planning to claim in Bridge
Candidate: Bethe, Paul
Advisor(s): Davis, Ernest
Abstract:
The DTAC program uses depth-first search to find an unconditional claim in bridge; that is, a line of play that is guaranteed to succeed whatever the distribution of the outstanding cards among the defenders. It can also find claims that are guaranteed to succeed under specified assumptions about the distribution of the defenders' cards. Lastly, DTAC can find a claim which requires losing a trick at some point. Using transposition tables to detect repeated positions, DTAC can carry out a complete DFS to find an unconditional ordered claim in less than 0.001 seconds on average, and in less than 1 second for claims which lose a trick. The source code for DTAC is available from: http://cs.nyu.edu/~pmb309/DTAC.html
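The core search technique, sketched on a toy game rather than bridge (the subtraction game, moves, and win test below are purely illustrative; DTAC's bridge-specific state and claim logic are its own):

    # Depth-first game search with a transposition table: repeated
    # positions are solved once and memoized.
    from functools import lru_cache

    def successors(n):          # toy game: remove 1 or 2 from a pile of n
        return [n - k for k in (1, 2) if n - k >= 0]

    @lru_cache(maxsize=None)    # the cache plays the transposition table's role
    def forced_win(n):
        if n == 0:
            return False        # no move available: the player to move loses
        return any(not forced_win(s) for s in successors(n))

    print([n for n in range(10) if forced_win(n)])  # all n except multiples of 3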
-
M.S. Thesis
2010
TestRig: A Platform independent system testing tool
Kaul, Vaibhav
Title: TestRig: A Platform independent system testing tool
Candidate: Kaul, Vaibhav
Advisor(s): Shasha, Dennis
Abstract:
The goal of the TestRig software is to give a test engineer a fixed interface to help with system/integration testing of software systems. TestRig is platform independent and can be used to test software systems written in any programming language. In addition, it provides templates and examples of using various open source testing tools to help users design their test cases. TestRig has been designed with the current software development scenario in mind, where complex systems are often created using multiple programming languages across different platforms. The challenge is to have a defined set of rules that can test any such system. The software makes use of various open source testing tools to run tests and verify results, which enables a user to test a system at different levels, such as performance testing, black-box testing, and user acceptance testing. TestRig is open source and draws on a programmer’s creativity to test across multiple scenarios. The thesis shows how different software systems have been tested using TestRig.
-
M.S. Thesis
2009
Plinkr: an Application of Semantic Search
Scott, John
Title: Plinkr: an Application of Semantic Search
Candidate: Scott, John
Advisor(s): Shasha, Dennis
Abstract:
Plinkr extends and enriches traditional keyword search with semantic search technology. Specifically, Plinkr facilitates the process of discovering the intersection of information between two subjects. This intersection represents what the subjects have in common and thus effectively captures the relationships between them. This is accomplished by semantically tagging and scoring entities that are contained within various keyword searches. The most relevant entities are thus abstracted and presented as metadata which can be explored to highlight the most pertinent content.
-
M.S. Thesis
2008
Friendshare: A decentralized, consistent storage repository for collaborative file sharing
Chiang, Frank
Title: Friendshare: A decentralized, consistent storage repository for collaborative file sharing
Candidate: Chiang, Frank
Advisor(s): Li, Jinyang
Abstract:
Data sharing has become more and more collaborative with the rise of Web 2.0, where multiple writers jointly write and organize the content in a repository. Current solutions use a centralized entity, such as Wikipedia or Google Groups, to serve the data. However, centralized solutions may be undesirable due to privacy concerns and censorship, which are problems that can be alleviated by switching to decentralized solutions.
The challenge of building a decentralized collaborative repository is achieving high data availability, durability, and consistency. Attaining these goals is difficult because peer nodes have limited bandwidth and storage space, low availability, and the repository has high membership churn.
This thesis presents Friendshare, a decentralized multiple-writer data repository. Separating the metadata from the data allows for efficient metadata replication across privileged admin nodes, thus increasing availability and durability. The primary commit scheme, where a primary node is responsible for determining the total order of writes in the repository, is employed to ensure eventual consistency. If the primary leaves the system unexpectedly, the remaining admin nodes run Paxos, a consensus protocol, to elect a new primary.
The Paxos protocol requires high node availability in order to run efficiently, a criterion that is rarely met in typical peer-to-peer networks. To rectify this problem, we offer two optimizations that improve Paxos performance in low-availability environments.
Friendshare has been implemented and deployed to gather real-world statistics. To offer theoretical predictions, we built a simulator to demonstrate the performance and service availability of Friendshare at various node online percentages. In addition, we show the performance improvements of our Paxos optimizations in comparison with the basic Paxos protocol.
-
M.S. Thesis
2008
STUMP: Stereo Correspondence in the Cyclopean Eye under Belief Propagation
Distler, George
Title: STUMP: Stereo Correspondence in the Cyclopean Eye under Belief Propagation
Candidate: Distler, George
Advisor(s): Geiger, Davi
Abstract:
The human visual system sees at any moment a static scene in three dimensions. This 3D view of the world is acquired from two images, one from the left eye and the other from the right eye. Fusing the left and right stereo pair of images yields a single cyclopean view portraying depth. Stereo vision can be applied to the field of computer vision via calibrated stereo cameras that capture the left and right images. Given a stereo pair of images, one can compute the field of depth via a stereo correspondence algorithm. We present a new approach to computing the disparity (depth) by means of the STUMP algorithm.
The STUMP algorithm presents a solution to the stereo correspondence problem. We propose to solve the problem of discontinuities in disparity within epipolar lines by modeling geometric constraints of smooth, tilted, and occluded surfaces, as well as unicity and opaqueness. Our algorithm runs on a framework built upon BP-TwoGraphs belief propagation estimation [17]. As a result, we provide a disparity map in the cyclopean coordinate system, determined by a probability distribution computed in polynomial time.
-
M.S. Thesis
2008
Measuring biomolecules: an image processing and length estimation pipeline using atomic force microscopy to measure DNA and RNA with high precision
Sundstrom, Andrew
Title: Measuring biomolecules: an image processing and length estimation pipeline using atomic force microscopy to measure DNA and RNA with high precision
Candidate: Sundstrom, Andrew
Advisor(s): Mishra, Bud
Abstract:
Background. An important problem in molecular biology is to determine the complete transcription profile of a single cell, a snapshot that shows which genes are being expressed and to what degree. Seen in series as a movie, these snapshots would give direct, specific observation of the cell's regulation behavior. Taking a snapshot amounts to correctly classifying the cell's ~300 000 mRNA molecules into ~30 000 species, and keeping accurate count of each species. The cell's transcription profile may be affected by low abundances (1-5 copies) of certain mRNAs; thus, a sufficiently sensitive technique must be employed. A natural choice is to use atomic force microscopy (AFM) to perform single-molecule analysis. Reed et al. ("Single molecule transcription profiling with AFM", Nanotechnology, 18:4, 2007) developed such an analysis that classifies each mRNA by first multiply cleaving its corresponding synthesized cDNA with a restriction enzyme, then constructing its classification label from ratios of the lengths of its resulting fragments. Thus, they showed the transcription profiling problem reduces to making high-precision measurements of cDNA backbone lengths, correct to within 20-25 bp (6-7.5 nm).
Contribution. We developed an image processing and length estimation pipeline using AFM that can achieve these measurement tolerances. In particular, we developed a biased length estimator using James-Stein shrinkage on trained coefficients of a simple linear regression model, a formulation that subsumes the models we studied.
Methods. First, AFM images were processed to extract molecular objects, skeletonize them, select proper backbone objects from the skeletons, and then compute initial lengths of the backbones. Second, a linear regression model was trained on a subset of molecules of known length, namely on their computed image feature quantities. Third, the model's coefficients underwent James-Stein shrinkage to create a biased estimator. Fourth, the trained and tuned model was applied to the image feature quantities computed for each test molecule, giving its final, corrected backbone length.
Results. Training data: one monodisperse set of cDNA molecules of theoretical length 75 nm. Test data: two monodisperse sets of cDNA molecules of unknown length. Corrected distributions of molecular backbone lengths were within 6-7.5 nm from the theoretical lengths of the unknowns, once revealed.
Conclusions. The results suggest our pipeline can be employed in the framework specified by Reed et al. to render single-molecule transcription profiles. The results reveal a high degree of systematic error in AFM measurements that suggests image processing alone is insufficient to achieve a much higher measurement accuracy.
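As background on the James-Stein step in the Methods above, a generic shrinkage sketch (the shapes, noise variance, and coefficient values are illustrative assumptions, not the pipeline's):

    # Generic James-Stein-style shrinkage of estimated coefficients toward
    # zero: trades a little bias for lower total squared error when three
    # or more coefficients are estimated jointly.
    import numpy as np

    def james_stein_shrink(beta, sigma2):
        # beta: estimated coefficients (p >= 3); sigma2: their noise variance.
        p = beta.size
        factor = max(0.0, 1.0 - (p - 2) * sigma2 / float(beta @ beta))
        return factor * beta

    beta_hat = np.array([0.4, -0.2, 0.7, 0.1])
    print(james_stein_shrink(beta_hat, sigma2=0.05))  # uniformly shrunk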
-
M.S. Thesis
2007
Degeneracy Proof Predicates for the Additively Weighted Voronoi Diagram
Millman, David
Title: Degeneracy Proof Predicates for the Additively Weighted Voronoi Diagram
Candidate: Millman, David
Advisor(s): Yap, Chee
Abstract:
This thesis focuses on the Additively Weighted (AW) Voronoi diagram. It begins by presenting the history of the diagram and some of the early algorithms used for its generation [OBSC00, Aur91]. It then addresses the more recent incremental approach to calculating the diagram, as seen in the 2D Apollonius Graphs (Delaunay Graphs of Disks) package of CGAL [KY06]. Next, the algorithm of Boissonnat et al. [BD05] for calculating convex hulls is presented. We then apply the predicates presented by Boissonnat to the CGAL implementation of the AW-Voronoi diagram, and the results are discussed. The main contribution of this thesis is a set of predicates for the AW-Voronoi diagram with lower algebraic degree, which also handle degeneracies in such a way as to produce a canonical result.
-
M.S. Thesis
2007
Cellstorm: A bioinformatics software system to visualize subcellular networks
Neves, Ana
Title: Cellstorm: A bioinformatics software system to visualize subcellular networks
Candidate: Neves, Ana
Advisor(s): Shasha, Dennis
Abstract:
Cellstorm is a software system for rapid visualization of genes and subcellular networks. Given a set of genes, expression levels, a structural hierarchy, and network data, Cellstorm displays the networks at any level of the hierarchy and provides a set of user options such as zooming, network selection, and list filtering.
Although Cellstorm is mainly aimed at biological applications, it can be used in any field that needs to display networks. Cellstorm achieves this by avoiding application-specific assumptions.
-
M.S. Thesis
2006
TimeIn: A temporal visualization for file access
Borden, Jeffrey
Title: TimeIn: A temporal visualization for file access
Candidate: Borden, Jeffrey
Advisor(s): Shasha, Dennis
Abstract:
TimeIn brings a given set of file objects, whether from a local file system or flickr.com, into a unified browsing experience, providing mechanisms to cluster visually similar objects and to display objects in a timeline view. To navigate this content, users are provided with a variety of mechanisms for filtering the set of objects presented.
For text-based objects, TimeIn analyzes the content of the file and attempts to extract a set of descriptive phrases. For image-based objects, TimeIn annotates the object with the most frequently used colors of the image. Users have the option of augmenting these automatically generated tags by defining their own descriptive tags.
While providing novel features for browsing and searching content, TimeIn retains many of the original organizational features of existing systems. When content is imported from a hierarchical file system, users can still browse by the original hierarchical structure.
TimeIn also retains the PhotoSet structures associated with content imported from flickr.com. Users can also organize content into user-defined "albums" of objects. These albums can then be used to filter the set of objects on the timeline.
-
M.S. Thesis
2006
Kronosphere: A temporal visualization for file access
Harrison, Chris
Title: Kronosphere: A temporal visualization for file access
Candidate: Harrison, Chris
Advisor(s): Shasha, Dennis
Abstract:
Hierarchical file systems mirror the way people organize data in the real world. However, this method of organization is often inadequate in managing the immense number of files that populate hard drives. Kronosphere provides a novel time and content-based navigational paradigm for managing and accessing media. This allows users to browse their documents by time, content, history, metadata, and relationships with other files.
-
M.S. Thesis
1997
Real/Expr: Implementation of an Exact Computation Package
Ouchi, Kouji
Title: Real/Expr: Implementation of an Exact Computation Package
Candidate: Ouchi, Kouji
Advisor(s): Yap, Chee
Abstract:
The Real/Expr package is a C++ project to support the precision-driven approach to exact computation of geometric algorithms. The package is built on top of the class Real that encompasses a variety of numerical representations. The class Expr captures a set of algebraic expressions on which any comparison can be done precisely.
The software libraries described here are available via the Web page http://simulation.nyu.edu/projects/exact/.