Trees v. Tables
The main purpose of this project is to give you the opportunity to examine the efficiency of your Hashmap and your BSTMap when counting words in the Reddit comment files.
- Modify your WordCounter class from last week. Instead of having a field of type BSTMap<String, Integer>, use a field of type MapSet<String, Integer>. In the constructor, require a parameter that selects whether the WordCounter will use a BSTMap or a Hashtable for storing the data.
- Make sure your WordCounter main function is able to calculate the amount of time it takes to analyze a file using System.currentTimeMillis(). For example:
    long startTime = System.currentTimeMillis();
    wc.analyze( inputFilename );
    long finishTime = System.currentTimeMillis();
    System.out.println("It took: " + (finishTime - startTime) / 1000.0 + " seconds to read in the file of words");
- Use code like that above to make the WordCounter analyze the input file five times. Record all five run times, drop the lowest, drop the highest, and compute the mean of the remaining three (your WordCounter should do all of this; do not do it by hand). If you wish, automate the process for all eight of the Reddit files.
- Run the analysis using the BSTMap on all eight Reddit files and record the times in a spreadsheet or CSV file.
- Repeat the process using the Hashtable data structure. Record the times in a spreadsheet or CSV file.
- Use a spreadsheet program to report the numbers with a line graph or bar graph. The x-axis should be the file/year and the y-axis should be the runtime. If there are any trends, describe them. Which data structure appears to be faster on this data set? Include the graph or graphs in your report.
- Analyze the performance of the BSTMap and Hashtable to see how close to ideal they perform. For the Hashtable, that will involve counting collisions. For the BSTMap, that will involve determining the height of the tree. You may need to modify the BSTMap to calculate height. Include this analysis in your report.
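The five-run timing procedure in the task list above (drop the lowest and highest times, average the remaining three) could be sketched roughly as follows. This is only an illustration: the class name TimingSketch, the helpers trimmedMean and timeRuns, and the stand-in workload are hypothetical, and in the project the timed task would be a call such as wc.analyze( inputFilename ).

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the assignment's provided code.
class TimingSketch {

    // Drops the single lowest and single highest value, then averages the rest.
    static double trimmedMean(double[] times) {
        if (times.length < 3) {
            throw new IllegalArgumentException("need at least three timings");
        }
        double[] sorted = times.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        double sum = 0.0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    // Runs the task the given number of times and returns each elapsed time in seconds.
    static double[] timeRuns(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long startTime = System.currentTimeMillis();
            task.run();
            long finishTime = System.currentTimeMillis();
            seconds[i] = (finishTime - startTime) / 1000.0;
        }
        return seconds;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption); replace with the WordCounter analysis call.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100000; i++) {
                sb.append(i % 10);
            }
        };
        double[] times = timeRuns(task, 5);
        System.out.println("Times (s): " + Arrays.toString(times));
        System.out.println("Trimmed mean (s): " + trimmedMean(times));
    }
}
```

If your analyze method adds to an existing map, you will probably want each timed run to start from an empty map (or a fresh WordCounter), since otherwise later runs update existing keys instead of inserting new ones and the five times are not comparable.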
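For the tree-height part of the analysis, one possible approach is a recursive height method compared against the smallest height a binary tree with the same number of nodes could have. The sketch below uses a hypothetical, stripped-down Node class with only a key; your BSTMap nodes presumably also hold a value, and your method and field names may differ. Height here is counted in nodes along the longest root-to-leaf path, which is one of two common conventions.

```java
// Hypothetical sketch; the Node class and method names are assumptions, not the course's BSTMap.
class HeightSketch {

    static class Node {
        String key;
        Node left, right;
        Node(String key) { this.key = key; }
    }

    // Plain (unbalanced) BST insert; duplicate keys are ignored.
    static Node insert(Node root, String key) {
        if (root == null) {
            return new Node(key);
        }
        int cmp = key.compareTo(root.key);
        if (cmp < 0) {
            root.left = insert(root.left, key);
        } else if (cmp > 0) {
            root.right = insert(root.right, key);
        }
        return root;
    }

    // Number of nodes on the longest root-to-leaf path; an empty tree has height 0.
    static int height(Node node) {
        if (node == null) {
            return 0;
        }
        return 1 + Math.max(height(node.left), height(node.right));
    }

    // Smallest possible height for n nodes: floor(log2(n)) + 1, or 0 when n is 0.
    static int idealHeight(int n) {
        return n == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        String[] words = {"d", "b", "f", "a", "c", "e", "g"};
        Node root = null;
        for (String w : words) {
            root = insert(root, w);
        }
        System.out.println("actual height: " + height(root));
        System.out.println("ideal height:  " + idealHeight(words.length));
    }
}
```

The ratio of actual to ideal height gives a single number to report for each file. A recursive height method can overflow the call stack on a very deep, nearly degenerate tree, so an iterative version may be worth considering if that happens.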
Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake at least one extension. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One significant extension, or 2-3 smaller ones, done well, is typical.
- Use additional files to test the (time and/or space) efficiency of your implementations.
- Implement your own hash function. How does its performance compare to the built-in string hash function?
- Try implementing more than one collision handling method. For example, (1) use a linked list instead of a BSTMap at each table entry, or (2) use a closed hash table (no extra data structures). Compare performance.
- Examine the space efficiency of your implementations. One way to do this would be to refrain from using the -Xmx512m flag. Count words in files of increasing size. Find the smallest file size that crashes your program. Does a smaller file crash the Hashtable code or the BSTMap code? Why?
- Improve the time-efficiency of one of your data structures. Explain what you did and report the improvement in speed.
- For any assignment, a good extension will be to implement a Java class yourself and demonstrate that it has the same functionality as the Java class. For example, you could implement your own ArrayList class for this assignment.
- For any assignment, a good extension will be to annotate your code to indicate all places where memory is "lost" (in other words, each place where the last remaining reference to an object is either destroyed or is given a new value). If you do this extension, please indicate so in your write-up.
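For the hash-function extension above, one way to get started is to write a simple polynomial hash and compare how many collisions it produces against String.hashCode() on the same set of distinct words. The sketch below is only an illustration under stated assumptions: the class HashSketch, the multiplier 37, the table capacity, and the collision definition (a key landing in a bucket that already holds an earlier key) are all choices made for the example, not requirements of the assignment.

```java
// Hypothetical sketch; names and constants are assumptions made for illustration.
class HashSketch {

    // Polynomial hash with multiplier 37 (an arbitrary choice); int overflow wraps, which is fine here.
    static int polyHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);
        }
        return h;
    }

    // Maps a possibly negative hash code to a valid bucket index in [0, capacity).
    static int bucketIndex(int hash, int capacity) {
        return Math.floorMod(hash, capacity);
    }

    // Counts keys that land in a bucket already occupied by an earlier key.
    // Assumes the keys are distinct, as they would be for the unique words in a map.
    static int countCollisions(String[] keys, int capacity, boolean useBuiltIn) {
        boolean[] occupied = new boolean[capacity];
        int collisions = 0;
        for (String key : keys) {
            int hash = useBuiltIn ? key.hashCode() : polyHash(key);
            int idx = bucketIndex(hash, capacity);
            if (occupied[idx]) {
                collisions++;
            } else {
                occupied[idx] = true;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        String[] words = {"tree", "table", "hash", "map", "word", "count", "reddit", "comment"};
        int capacity = 11; // small table so collisions are likely to show up
        System.out.println("built-in hashCode collisions: " + countCollisions(words, capacity, true));
        System.out.println("polynomial hash collisions:   " + countCollisions(words, capacity, false));
    }
}
```

Running the same comparison on the unique words from one of the Reddit files, with the capacity your Hashtable actually uses, would give numbers you can report alongside the timing results.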
Your report should have a simple format.
- A brief description of the overall project, in your own words. Identify both the data structure used and the task solved by using it. Finish by indicating whether your analysis worked as expected and what you discovered.
- An explanation of your solution, focusing on the interesting bits. In this assignment, for example, the interesting parts are the hash function and how your code handles collisions.
- Printouts, pictures, or results to show what you did. For this assignment, you should include the graph of run times.
- Other results to demonstrate extensions you undertook.
- A brief conclusion and description of what you learned.
- A list of people you worked with, including TAs and instructors. Include in that list anyone whose code you may have seen, such as friends who have taken the course in a previous semester.
Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.
Once you have written your report, give the page the label:
You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.
Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.