Your first assignment is to put together a small suite of tools that can be used for simple data analysis. For straightforward "eyeballing," we want to be able to take data extracted from a database, put it into a standard flat format, and then analyze the data values extracted from that flat file.
To do this, we will need the following tools:
1) Format Standardizer: This tool will take a file with rows of data, each of which contains comma-separated data fields. The first line of each file contains the names of the columns in the file. The format standardizer will read this file and generate an output file. The output file is the same data, except that any quotes that surround data values are removed, and the data in each row is separated by a specified delimiter.
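As a rough sketch of this step, Python's csv module already handles quoted, comma-separated fields; the function name `standardize` and the "|" output delimiter here are illustrative choices, not requirements:

```python
import csv
import io

def standardize(infile, outfile, delimiter="|"):
    """Read comma-separated rows (csv.reader strips surrounding quotes)
    and rewrite them with the given output delimiter."""
    for row in csv.reader(infile):
        outfile.write(delimiter.join(row) + "\n")

# Example on an in-memory file:
src = io.StringIO('name,city\n"Smith, J.",Boston\n')
dst = io.StringIO()
standardize(src, dst, delimiter="|")
print(dst.getvalue())
```

Note that a quoted field may itself contain commas ("Smith, J." above), which is exactly why the output delimiter should be something unlikely to appear in the data.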
2) Column Extractor: This tool will take as arguments a transformed data file and a configuration file. The configuration file indicates which columns, by name, are to be extracted, and the order in which they are to be output. The output file contains only the columns specified in the configuration file, with the columns ordered accordingly.
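One possible shape for the extraction step, assuming the standardizer's output format (header row first, "|"-delimited); the function name and delimiter are illustrative:

```python
import io

def extract_columns(data_lines, wanted, delimiter="|"):
    """Yield rows containing only the columns named in `wanted`,
    in the order given; the first input row holds the column names."""
    rows = (line.rstrip("\n").split(delimiter) for line in data_lines)
    header = next(rows)
    idx = [header.index(name) for name in wanted]  # positions to keep
    yield list(wanted)
    for row in rows:
        yield [row[i] for i in idx]

data = io.StringIO("id|name|city\n1|Smith|Boston\n2|Jones|Austin\n")
for row in extract_columns(data, ["city", "id"]):
    print("|".join(row))
```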
3) Sort: This tool will sort the data by a specified field, either numerically or lexicographically, and in ascending or descending order. The UNIX sort will work fine for this, but if you are using a non-UNIX machine, it is worthwhile to have a version of this as well.
4) Uniq: This tool will eliminate exact duplicates from a sorted file. An option is to have uniq count the number of occurrences of each data instance. The UNIX uniq will work fine, but if you are not using a UNIX machine, it is worthwhile to have a version of this as well.
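For non-UNIX machines, minimal stand-ins for sort and uniq might look like this; rows are assumed to be lists of string fields, and the function names are illustrative:

```python
def sort_rows(rows, field, numeric=False, descending=False):
    """Sort rows (lists of string fields) by the given field index,
    numerically or lexicographically, ascending or descending."""
    key = (lambda r: float(r[field])) if numeric else (lambda r: r[field])
    return sorted(rows, key=key, reverse=descending)

def uniq(rows, count=False):
    """Collapse exact duplicates in an already-sorted list of rows;
    with count=True, prefix each row with its number of occurrences."""
    out, prev, n = [], None, 0
    for row in rows:
        if row == prev:
            n += 1
        else:
            if prev is not None:
                out.append([str(n)] + prev if count else prev)
            prev, n = row, 1
    if prev is not None:
        out.append([str(n)] + prev if count else prev)
    return out

rows = [["b", "2"], ["a", "10"], ["a", "10"]]
print(sort_rows(rows, 1, numeric=True))
print(uniq(sort_rows(rows, 0), count=True))
```

The numeric/lexicographic switch matters: as strings, "10" sorts before "2", while as numbers 2 comes first.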
I will provide test data.
The assignments and test runs are to be zipped and emailed directly to the TA by May 30th.
Your second assignment is to make use of the simple tools you developed in assignment 1 as components of a data analysis tool. This new tool will be a type inferencer and pattern analyzer.
This tool will do the following:
1) Perform type inferencing: Assume the strictest type for each column, then determine the data's conformance to that type. Iteratively loosen the restrictions and re-check conformance. When the conformance reaches a specified threshold, output the types and their levels of conformance, then propose a data type for the column.
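One way the loosening loop could be organized; the int → float → date → string hierarchy, the single date format, and the 0.95 default threshold are all assumptions for illustration:

```python
from datetime import datetime

def _is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def _is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def _is_date(s):
    try:
        datetime.strptime(s, "%Y-%m-%d")  # one assumed date format
        return True
    except ValueError:
        return False

# Strictest first; "string" matches anything.
CHECKS = [("int", _is_int), ("float", _is_float),
          ("date", _is_date), ("string", lambda s: True)]

def infer_type(values, threshold=0.95):
    """Walk from strictest to loosest type, recording the conformance
    of each, and propose the first type that meets the threshold."""
    report = []
    for name, check in CHECKS:
        conformance = sum(check(v) for v in values) / len(values)
        report.append((name, conformance))
        if conformance >= threshold:
            return name, report
    return "string", report

proposed, report = infer_type(["1", "2", "3.5", "4"])
print(proposed, report)  # float [('int', 0.75), ('float', 1.0)]
```

The threshold is what lets a column with a few dirty values still be proposed as, say, float rather than falling all the way back to string.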
2) Perform pattern analysis: Transform each character string into a pattern in which each character is represented as a digit, a letter, punctuation, etc. Use your sorting and counting tools to see whether there is an overwhelming pattern evident in the data.
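A minimal pattern mapping might look like this; the choice of '9' for digits and 'A' for letters is just one common convention:

```python
from collections import Counter

def to_pattern(s):
    """Map each character to a pattern symbol: '9' for a digit,
    'A' for a letter; punctuation and spaces are kept as-is."""
    return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch
                   for ch in s)

values = ["555-1234", "867-5309", "call me"]
counts = Counter(to_pattern(v) for v in values)
print(counts.most_common())  # [('999-9999', 2), ('AAAA AA', 1)]
```

Counting pattern frequencies this way is essentially your sort + uniq -c pipeline from assignment 1 in one step.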
3) Create a file to catalog your discovered patterns. We will transfer these to a database table during our next assignment.
Use the test data from assignment 1.
The assignments and test runs are to be zipped and emailed directly to the TA by June 22.
Your third assignment is to demonstrate what we have learned about similarity scoring and matching, as well as fast matching techniques for comparing strings for set membership.
This assignment comprises three smaller tools, of which the third is for "extra credit":
You should program three different kinds of string similarity measures from the ones we discussed in class. The first uses Soundex, which we discussed a few weeks back. The Soundex comparator will take two strings, transform each into its Soundex form, and compare the two for equality in the Soundex system. The transformation should cover the entire string, not just the first few consonant sounds, and I expect that you should be able to transform the strings going both forward and backward.
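A full-string variant of Soundex could be sketched as follows; the reverse=True option encodes the string back-to-front, which is one reading of the forward/backward requirement:

```python
# Standard Soundex consonant codes; vowels, h, w, and y get no code.
_CODE = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        _CODE[ch] = digit

def soundex_full(s, reverse=False):
    """Full-string Soundex: keep the first letter, encode every later
    consonant, drop vowels, and collapse adjacent duplicate codes."""
    s = s.lower()
    if reverse:
        s = s[::-1]
    if not s:
        return ""
    result = s[0].upper()
    prev = _CODE.get(s[0], "")
    for ch in s[1:]:
        code = _CODE.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not break a run of equal codes
            prev = code
    return result

def soundex_equal(a, b, reverse=False):
    """Compare two strings for equality in the Soundex system."""
    return soundex_full(a, reverse) == soundex_full(b, reverse)

print(soundex_full("Robert"), soundex_full("Rupert"))  # R163 R163
```

Unlike classic four-character Soundex, this version does not truncate, so longer names keep their full phonetic signature.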
The second flavor of string similarity is edit distance. We discussed this measure in class: the distance between two strings is the minimum number of edits that must be applied to the first string to transform it into the second.
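The unit-cost (Levenshtein) version of edit distance can be computed row by row with dynamic programming:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))    # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                     # deleting i characters from a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Keeping only the previous row makes this O(len(a) * len(b)) time but only O(len(b)) space.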
The third flavor uses n-gramming. N-grams are the "sliding window" substrings extracted from each string, and the sets of these substrings are compared. This function will take two strings, break each into its component n-grams, and then generate similarity scores based on the different metrics covered.
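A sketch with two common set-overlap metrics, Jaccard and Dice, over bigrams; n = 2 is just a default, and other metrics from class could be added the same way:

```python
def ngrams(s, n=2):
    """The set of sliding-window substrings of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Jaccard similarity of the two strings' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def dice(a, b, n=2):
    """Dice coefficient over the same n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 1.0

print(sorted(ngrams("night")))     # ['gh', 'ht', 'ig', 'ni']
print(jaccard("night", "nacht"))   # 1/7: they share only 'ht'
```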
The next part of the assignment is set matching. I would like you to program a Bloom filter as a string set-membership function. This function will instantiate a Bloom filter, load a large set of strings, and answer yes or no for each query string, depending on whether it is in the set or not.
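One simple construction derives the k hash positions from salted MD5 digests; the filter size m, the hash count k, and the hashing scheme are all assumptions here, and any family of good hash functions would do:

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter with k salted-MD5 hash positions."""
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # "yes" may be a false positive; "no" is always correct
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for word in ["alice", "bob", "carol"]:
    bf.add(word)
print("bob" in bf)        # True
print("mallory" in bf)    # False with very high probability
```

The asymmetry in the membership answer is the key property to demonstrate: a Bloom filter never misses a string that was added, but may occasionally say yes to one that was not.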
The extra-credit component is programming the K-means clustering algorithm we discussed at the end of class 8. This will take a set of objects and a number k, and divide the set of objects into k clusters using a similarity measure.
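A plain Lloyd's-algorithm sketch on 2-D points, with squared Euclidean distance standing in for the similarity measure; the point representation, seeded random initialization, and fixed iteration count are assumptions for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assigning points to their nearest
    center and recomputing each center as its cluster's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # pick k distinct starting centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                        (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        centers = [(sum(p[0] for p in cl) / len(cl),
                    sum(p[1] for p in cl) / len(cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```

An empty cluster keeps its old center here; a production version would re-seed it instead, and would stop early once assignments stop changing.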
This will be due July 31st. Please email your work to Jack. Test sets will be provided.