Lecture 13: Open Source Code Search

Commercial Code Search Engines

Paul Bethe tells me that, in his experience with ordinary Google web search, if you search on a class name that is not very well known, many of the results are likely to be forums where people have posed questions about the class and not gotten any answers. Moreover, popular blogs with no answer often rank higher than more obscure blogs that do have the answer. Clearly it would be possible to do better.

Almost all the research work I have found on web-based code search is exclusively Java-based. (The one exception is Blueprint.) Many of the general characteristics could presumably be carried over to any language with type declarations, though the details are obviously language-specific.

General problem acknowledged by everyone: What the field needs is an accepted set of benchmark problems. Instead, people tend to test their code on a dozen problems of their own choosing --- quite likely, the same problems they had in mind when they wrote their system; almost certainly problems of the same flavor --- so, not surprisingly, their own system works well. Many of these systems are evaluated in a way that does not support statistically significant conclusions (e.g. a half-dozen programmers are each asked to write four programs). It is a particularly difficult area in which to do evaluation.

The range of examples considered in the research literature tends to be narrow, focussing on format-fiddling and user-interfaces. (Reiss's paper is very much an exception.) It would be interesting to study broader classes of programming.

Also, the documentation tends to be surprisingly slight: there are projects that certainly look large for which there exists only a single 2-page paper.

***********************************************************

Sourcerer --- An Infrastructure for Large-scale Collection and Analysis of Open-source Code, Sushil Bajracharya, Joel Ossher, and Cristina Lopes, Workshop on Academic Software Development Tools and Techniques 2010.

Sourcerer Project Home Page

Collects and analyzes a large repository of open-source code and supports applications.

Crawler: Crawls a large number of known open-source repositories.

Relational model: Project, file, entity, comment, relation.
Entity: package, class, interface, enum, annotation, initializer, field, enum constant, constructor, method, annotation element, parameter, local variable, primitive, array, type variable, wildcard, parameterized type, unknown.
Relations: inside, extends, implements, holds, receives, calls, throws, returns, accesses.

Keywords: Text associated with entities. Extracted from FQNs (fully qualified names) and comments. ComputerDeutch (camel-case) identifiers are split; e.g. "QuickSort" gives rise to two keywords, "Quick" and "Sort". Words in comments are associated with nearby entities.
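
The identifier splitting described above can be sketched as follows. This is a minimal illustration, not Sourcerer's code: the class name KeywordSplitter and the exact splitting rule (split on dots, then at each lowercase-or-digit to uppercase boundary) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; Sourcerer's real keyword extractor may differ.
class KeywordSplitter {
    // Split a fully qualified name on dots, then split each part
    // wherever a lowercase letter or digit is followed by an uppercase letter.
    static List<String> keywords(String fqn) {
        List<String> out = new ArrayList<>();
        for (String part : fqn.split("\\.")) {
            for (String word : part.split("(?<=[a-z0-9])(?=[A-Z])")) {
                if (!word.isEmpty()) {
                    out.add(word);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keywords("QuickSort"));                       // [Quick, Sort]
        System.out.println(keywords("org.example.QuickSort.sortArray")); // [org, example, Quick, Sort, sort, Array]
    }
}
```

A rule this simple leaves acronym runs such as "XMLParser" as a single keyword; a real extractor would presumably need more cases.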

Fingerprints: Quantifiable features associated with entities.

Services:

Other applications: Data mining, Inter-project structural analysis.

Sample queries: (from OOPSLA '06 paper)

Multiple pages of results for all. In all cases the top result was relevant.

A Study of Ranking Schemes in Internet-Scale Code Search, Bajracharya et al., ISR Tech. Report, #UCI-ISR-07-8, 2007.

CodeGenie: using test-cases to search and reuse source code, Lemos et al., ASE 2007.

As far as I can tell, this two-page article is all that has been written about it. It uses test cases, presumably to filter candidate results retrieved by keyword search, though even that is not spectacularly clear. However, it has some interesting examples:

arabic to roman
arabic to ordinals
arabic to alpha
quicksort
complementary DNA
reverse complementary DNA
filtering folder contents
unzip
zip
etc.
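
To make the idea of filtering by test cases concrete, here is a minimal sketch for the "arabic to roman" example. Everything in it is an assumption made for illustration (the class TestCaseFilter, the two candidate methods, the chosen test inputs); it is not CodeGenie's design, which the paper does not spell out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: keep the candidates that pass every user-supplied test case.
class TestCaseFilter {
    // Greedy conversion shared by both candidates.
    static String convert(int n, int[] values, String[] symbols) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Candidate 1: handles the subtractive forms (IV, IX, XL, ...).
    static String toRoman(int n) {
        return convert(n,
            new int[] {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1},
            new String[] {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"});
    }

    // Candidate 2: plausible but wrong; it omits the subtractive forms.
    static String naiveRoman(int n) {
        return convert(n,
            new int[] {1000, 500, 100, 50, 10, 5, 1},
            new String[] {"M", "D", "C", "L", "X", "V", "I"});
    }

    // Return the names of the candidates whose output matches every test case.
    static List<String> passing(Map<String, Function<Integer, String>> candidates,
                                Map<Integer, String> tests) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Function<Integer, String>> c : candidates.entrySet()) {
            boolean ok = true;
            for (Map.Entry<Integer, String> t : tests.entrySet()) {
                String actual;
                try {
                    actual = c.getValue().apply(t.getKey());
                } catch (RuntimeException e) {
                    actual = null; // a crashing candidate fails the test
                }
                if (!t.getValue().equals(actual)) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                kept.add(c.getKey());
            }
        }
        return kept;
    }

    // Hypothetical candidate set, standing in for keyword-search results.
    static Map<String, Function<Integer, String>> demoCandidates() {
        Map<String, Function<Integer, String>> m = new LinkedHashMap<>();
        m.put("naiveRoman", TestCaseFilter::naiveRoman);
        m.put("toRoman", TestCaseFilter::toRoman);
        return m;
    }

    // Hypothetical test cases supplied by the user.
    static Map<Integer, String> demoTests() {
        Map<Integer, String> t = new LinkedHashMap<>();
        t.put(4, "IV");
        t.put(9, "IX");
        t.put(14, "XIV");
        t.put(1994, "MCMXCIV");
        return t;
    }

    public static void main(String[] args) {
        System.out.println(passing(demoCandidates(), demoTests())); // [toRoman]
    }
}
```

The wrong candidate is rejected on the first test (it returns "IIII" for 4), which shows why a handful of well-chosen inputs can discriminate among keyword matches.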

**************************************************************

Jungloid mining: helping to navigate the API jungle, David Mandelin et al., ACM SIGPLAN Notices 2005.

Consider the programming task of parsing a Java source code file using the Eclipse IDE framework version 2.1, assuming the file is represented by an IFile object. Two of the authors encountered this problem independently, and in each case it took several hours to arrive at the desired code, which is shown here:
    IFile file = ...;
    ICompilationUnit cu = JavaCore.createCompilationUnitFrom(file);
    ASTNode ast = AST.parseCompilationUnit(cu, false);

This example illustrates some of the difficulties programmers often encounter. First, a programmer unfamiliar with the framework would not think to look at class JavaCore, yet JavaCore is a crucial link: one of its static methods converts one type of file handle, IFile, to another, ICompilationUnit, that can be used by the parser. Second, the programmer may look for clues using the class browsing features provided by object-oriented IDEs, but that will not help here: the IDE can easily show members of IFile, but the first step is a static method of a different class. Finally, although a programmer might know that the result of parsing is an ASTNode and might therefore grep for methods returning ASTNode, the method parseCompilationUnit would not be found because its return type is actually CompilationUnit, a subclass of ASTNode.

PROSPECTOR:
Input to Prospector is a pair [InClass, OutClass].
Output: A list of jungloids. Each jungloid is a short sequence of essentially unary operations that converts an object of InClass into one of OutClass. (The name comes from the metaphor of a monkey swinging from vine to vine in the jungle.)

Problem of finding the correct InClass, OutClass not addressed.

Easy case: No downcasts. Signature graph. Vertices are classes. Arcs are

Ranking. Prefer jungloids that (a) have few instructions; (b) cross fewer package boundaries; (c) return as unspecific a subclass of OutClass as possible.
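
A minimal sketch of the search in the easy case, under assumptions of my own: vertices are type names, each arc is a unary operation from its input type to its result type, and widening to a superclass is also modeled as an arc. The class SignatureGraph and the hand-built demo graph are hypothetical; only ranking criterion (a), fewest instructions, is implemented, by breadth-first search.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not PROSPECTOR's implementation.
class SignatureGraph {
    // An arc: applying `label` to a value of type `from` yields a value of type `to`.
    static final class Arc {
        final String from, to, label;
        Arc(String from, String to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }
    }

    private final Map<String, List<Arc>> outgoing = new HashMap<>();

    void addArc(String from, String to, String label) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new Arc(from, to, label));
    }

    // Breadth-first search: returns the labels along one shortest path from
    // inClass to outClass, or an empty list if outClass is unreachable.
    List<String> shortestJungloid(String inClass, String outClass) {
        Map<String, Arc> cameFrom = new HashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(inClass);
        queue.add(inClass);
        while (!queue.isEmpty()) {
            String type = queue.remove();
            if (type.equals(outClass)) {
                break;
            }
            for (Arc a : outgoing.getOrDefault(type, new ArrayList<>())) {
                if (seen.add(a.to)) {
                    cameFrom.put(a.to, a);
                    queue.add(a.to);
                }
            }
        }
        // Walk the predecessor arcs back from outClass to rebuild the path.
        LinkedList<String> path = new LinkedList<>();
        for (String t = outClass; cameFrom.containsKey(t); t = cameFrom.get(t).from) {
            path.addFirst(cameFrom.get(t).label);
        }
        return path;
    }

    // Hand-built fragment based on the Eclipse example in these notes,
    // plus one distractor arc.
    static SignatureGraph demo() {
        SignatureGraph g = new SignatureGraph();
        g.addArc("IFile", "IPath", "getFullPath");
        g.addArc("IFile", "ICompilationUnit", "JavaCore.createCompilationUnitFrom");
        g.addArc("ICompilationUnit", "CompilationUnit", "AST.parseCompilationUnit");
        g.addArc("CompilationUnit", "ASTNode", "(widening)");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(demo().shortestJungloid("IFile", "ASTNode"));
    }
}
```

On the demo graph the query [IFile, ASTNode] recovers the two calls from the Eclipse example followed by the widening step, which is exactly the chain that grep on return types would miss.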

Hard case: Downcasts. If S is a superclass of U, then you can write

    S s = ...;
    U u = (U) s;
downcasting S to U. A run-time test is done to make sure that object s is actually of type U; otherwise, an error is thrown (in Java, a ClassCastException). Static analysis can't tell you whether this code is reasonable to insert in yours, because you have to know whether object s is actually a U.
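
A small runnable illustration of the point: the same cast compiles in both cases, and only the run-time type of the object decides whether it succeeds. The classes S, U and Other here are toy stand-ins, not taken from the paper.

```java
// Toy hierarchy: U and Other are both subclasses of S.
class DowncastDemo {
    static class S {}
    static class U extends S {}
    static class Other extends S {}

    // The cast below always compiles; whether it succeeds depends on
    // the run-time class of the object that s refers to.
    static boolean castSucceeds(S s) {
        try {
            U u = (U) s;
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(castSucceeds(new U()));     // true
        System.out.println(castSucceeds(new Other())); // false
    }
}
```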

To incorporate downcasts, PROSPECTOR does jungloid mining. That is:

Experiments:

General comment: The semantics of a programming problem is being squeezed down to an input class and an output class. It is remarkable that this should be sufficient to determine what the task to be carried out is. (Of course that depends on the domain; it would probably not be sufficient in numerical computing.) So what does this say about programming problems and classes?

********************************************************

ParseWeb: A programmer assistant for reusing open source code on the web, Suresh Thummalapenta and Tao Xie, ASE 2007.

As with PROSPECTOR, specify input and output classes.

********************************************************

Semantics-based code search Steven P. Reiss, ICSE 2009.

Specifying what to search for, Steven P. Reiss, SUITE 2009.

S6 Semantics-Based Code Search

Static specifications: Keywords, signature.

Dynamic specification: Test cases. Contracts: preconditions and postconditions expressed in JML (Java Modeling Language).

Other: Security constraints.

Issue query to general code search engine. Collect pages. Collect candidate methods.

Transformations of method to fit query specs:

Testing: Run tests, check correctness of answer. Run jmlc, see if contract can be authenticated.

Examples: See paper pp. 249-250.

Interesting but unconvincing examples in SUITE 2009 paper.

Find an HTML parser that preserves white space. Comment: You're not going to find it. The best you can do is to find a well-written HTML parser, and then unless you're really lucky you've got a lot of work, because white space deletion is done by a low-level tokenizer.

Find code to do a topological sort of my own graph structure. Comment: Why would you suppose that it's easier to find this and adapt it than to write topological sort (10 lines of code) from scratch?

Least squares solution to a system of linear equations, with the additional constraint that resultant values are non-negative. This becomes quadratic programming (I don't know why, rather than linear programming) and Reiss had to write significant code to translate the linear equations into a form that the quadratic programming solver could accept. Comment: Again, what do you expect?

********************************************************

Example-Centric Programming: Integrating Web Search into the Development Environment, Joel Brandt et al., CHI 2010.

Blueprint.

Integrates code search into an IDE (to be exact, it is a plugin for Adobe Flex Builder, which in turn is a plugin for the Eclipse development environment).

Functionality:

Uses both code pages and other pages (tutorials, help pages, forums, etc.). For other pages, makes sure that the code is complete and is not posted as an example of buggy code.

Maintains cache of code examples.

On receiving user query:

(Similar project: Euklas from CMU.)

Two studies of opportunistic programming: interleaving web foraging, learning, and writing code, Joel Brandt et al., CHI 2009.

User study of the ways the web is used in programming:

Features of these three categories in table 2, p. 1593.

********************************************************

SNIFF: A Search Engine for Java Using Free-Form Queries, Shaunak Chatterjee, Sundeep Juvekar, and Koushik Sen.

Idea: Collect a corpus of open-source Java code. Annotate each method call with the significant words in the Javadoc documentation for that method. Use standard IR techniques on NL queries. Cluster results by similarity. Intersect the snippets in a single cluster. Rank by number of instances of code in the clusters.

********************************************************

Mica: A Web-Search Tool for Finding API Components and Examples, Jeffrey Stylos and Brad Myers, VL/HCC, 2006.

Code-specific "snippets". Issues the query to Google Web search and downloads the top 10 result pages. Extracts suggestive keywords from all 10 pages combined. Excludes common keywords, then ranks by TF first and then by IDF. Organizes keywords by the class containing them. Mousing over a keyword highlights the results containing that keyword.
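
One possible reading of "rank by TF first and then by IDF" is a two-key sort: term frequency over the combined pages as the primary key, with IDF breaking ties. The sketch below assumes that reading. The class KeywordRanker, the tokenization, the stop-word set and the IDF table passed in are all hypothetical; Mica's actual computation may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of keyword ranking over the downloaded result pages.
class KeywordRanker {
    // pages: text of the downloaded results; stopWords: common keywords to exclude
    // (lowercase); idf: assumed precomputed inverse document frequencies.
    static List<String> rank(List<String> pages, Set<String> stopWords, Map<String, Double> idf) {
        // Term frequency over all pages combined.
        Map<String, Integer> tf = new HashMap<>();
        for (String page : pages) {
            for (String token : page.split("\\W+")) {
                if (token.isEmpty() || stopWords.contains(token.toLowerCase())) {
                    continue;
                }
                tf.merge(token, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(tf.keySet());
        ranked.sort((a, b) -> {
            // Primary key: higher term frequency first.
            int byTf = Integer.compare(tf.get(b), tf.get(a));
            if (byTf != 0) {
                return byTf;
            }
            // Secondary key: higher IDF (rarer term) first.
            int byIdf = Double.compare(idf.getOrDefault(b, 0.0), idf.getOrDefault(a, 0.0));
            if (byIdf != 0) {
                return byIdf;
            }
            // Final tie-break: alphabetical, to keep the order deterministic.
            return a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
            "JFileChooser showOpenDialog JFileChooser",
            "the JFileChooser getSelectedFile",
            "the File getSelectedFile");
        Map<String, Double> idf = Map.of("File", 0.5, "showOpenDialog", 2.0);
        System.out.println(rank(pages, Set.of("the"), idf));
        // [JFileChooser, getSelectedFile, showOpenDialog, File]
    }
}
```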

(Not currently working, at least at the URL in the paper.)