Due: April 1, 2002, email by midnight
This programming assignment will give you practise with binary search trees and tree traversal algorithms. Along the way we will experimentally measure some properties of trees.
The assignment is to build a binary
search tree of words from an input text.
Each time a new word occurs it is inserted into the tree; if the word has previously occurrred, a count associated with the word is incremented. Note that you will have to parse the text string to find the words before inserting them into the tree. This will require you to replace capital letters with lower case and remove punctuation. You should treat hyphens as word separators, so that a word like self-confident is treated as two words. You can use the java string tokenizer if you want for this assignment.
When you have finished inserting all the words into a tree, several experiments should be done. First, compute the average length of a search path in the tree. Second, find the ten most common words in the text. (You should output both the word and the number of times it occurred.) Delete the ten most common words in the tree (using the delete algorithm discussed in class) and re-compute the average search path. How much has it changed? Third, for each of the texts on the web page that are to be used as input, measure the height of the binary tree that results from the word insertions. Graph the results of each input text experiment with the number of nodes (distinct words) on one axis and the height on the other. What is the relationship? You should hand in (or mail in if the graph is done on-line) both the graph and a paragraph commenting on it. If the relationship is unclear, you could download more sample input texts of varying sizes, so you can have more cases to gauge the behavior of the tree height versus number of nodes.
Here are some texts to use as input:
lincoln.txt (703 words)
kennedy.txt (1342 words)
magna.txt (4920 words)
gulliver.txt (104135 words)
karamazov.txt (349362 words)
war+peace.txt (564263 words)
Here are two sample string readers
for file input: