Due: April 9, 2001, 11AM by email
This programming assignment will give you practise with binary search trees and tree traversal algorithms. Along the way we will measure various properties of trees.
The assignment is to count frequency of word occurrences in text by inserting each new word in a tree, or incrementing the count associated with a word if it occurs more than once. Words will be taken from text that you can download from this programming assignment web page. Note that you will have to parse each word in the text string, prior to inserting it in the tree (Please write your own string tokenizer). Punctuation, and capitalization should be ignored. Words that differ by "s" or other suffixes should be treated as different, and inserted into the tree as a new word.
When you have finished inserting all the tokens in a tree, two operations should be done. First, find the most common word in the text. If more than one word occurs the maximal number of times, print them all. Secondly, for each of the texts on the web that are to be used as input, measure the height of the binary tree that results from the word insertions. Graph the results of each text input experiment with the number of nodes on one axis and the depth on the other. Comment on this relationship (you can do this in your program comments). (For extra credit, download more texts of large and varying sizes, so you can have more cases to gauge the behavior of the tree height versus number of nodes).
Here are some texts to use as input:
gettyb.txt (264 words)
monroe.txt (957 words)
kennedy.txt (1342 words)
magna.txt (4920 words)
war+peace.txt (564263 words)
Here are two sample string readers for