Trees v. Hashing
The main purpose of this project is to give you the opportunity to examine the efficiency of your Hashmap and your BSTMap when counting words in the Reddit comment files.
Note: please do not hand in the Reddit comment files with your code. If everyone hands them in with their code, it takes up a lot of unnecessary space on the server.
- Enable the WordCounter to use either a BSTMap or a Hashmap
Modify the WordCounter class from last week. Instead of having a field of type BSTMap<String, Integer>, use a field of type MapSet<String, Integer>. In the constructor, require a parameter that selects whether the WordCounter will use a BSTMap or a Hashmap for storing the data.
- Have the WordCounter main function time itself
Update the WordCounter main function so it calculates the amount of time it takes to analyze a file. Use
System.currentTimeMillis()to get the current time.
+ (more detail)
The function currentTimeMillis returns the number of milliseconds since the epoch (00:00:00 January 1, 1970) as a long integer. Taking the difference of two times provides the number of milliseconds between the two calls.
long startTime = System.currentTimeMillis(); wc.analyze( inputFilename ); long finishTime = System.currentTimeMillis(); System.out.println("It took: "+(finishTime-startTime)/1000.0+ "s to read the file");
- Calculate an average computation time for WordCounter
Have the WordCounter main function analyze the input file five times. Record each of the 5 times, drop the low time, drop the high time, and compute the mean of the remaining three (your WordCounter should do all of this, don't do this by hand). If you wish, automate the process for all eight of the Reddit files. At the end of the analysis, have your program print out the average run time, the total number of words, and the number of unique words (for each file).
- Compute statistics for the BSTMap for all of the Reddit files
Run the analysis using the BSTMap on all eight Reddit files and record the run-times, total words, total unique words, and the depth of your tree in a spreadsheet or CSV file. Add the file size for each year as a separate column. You can use the Terminal command
ls -lhto get the file sizes in human-readable format.
- Compute statistics for the Hashmap for all of the Reddit files
Repeat the process using the Hashmap data structure. Add the hash map times as another column in your spreadsheet or CSV file. To measure the effectiveness of the hash function (how well it distributed keys), have your hash table version also print the number of collisions.
A collision occurs when a key is inserted into the hash table for the first time and another key already exists at that hash location. This means that simply updating the value in a key-value pair does not cause a collision. Also, since the goal is to understand how well the hash function distributes the data across the table, reset the collision count to 0 whenever the table size increases. As the key-value pairs are re-inserted into the larger table, increment the collision count appropriately.
- Plot your results
Use a spreadsheet program (or python and matplotlib) to report the numbers with an XY plot or bar graph. Make three plots.
- Year versus run-time: the x-axis should be the year, and the y-axis should be the run-time. Show both the BST and Hash map run times.
- Total number of words versus run-time: the x-axis should be the total number of words. Show both the BST and Hash map run times.
- Unique words versus run-time: the x-axis should be the number of unique words. Show both the BST and Hash map run times.
Do your plots make sense? (If not, figure out why not and fix your code). If there are any trends, describe them. Which data structure appears to be faster on this data set? Include the graphs in your report.
- Analyze the performance of the BSTMap and Hashmap as data structures
Analyze the performance of the BSTMap and Hashtable to see how close to ideal they perform using either total words or file sie versus run-time. For the Hashtable, does the number of collisions affect performance? For the BSTMap, does the depth/height of the tree affect performance? Include this analysis in your report.
The following are some suggested extensions. You should feel free to pursue your own custom extensions to your project that you find interesting. Please clearly identify your extensions in your report. In your code, make sure the baseline simulations run properly. Making a separate sub-folder for major extensions is recommended.
- Use additional files to test the (time and/or space) efficiency of your implementations.
- Implement your own hash function. How does its performance compare to the built-in string hash function?
- Try implementing more than one collision handling method. For example, (1) use a linked list instead of a BSTMap at each table entry, or (2) use a closed hash table (no extra data structures). Compare the performance of the methods.
Examine the space efficiency of your implementations. One way to do this would be to refrain from using the
-Xmx512mflag. Count words of files of increasing sizes. See what is the size of the smallest size that crashes your program. Does a smaller file crash the Hashtable code or the BSTMap code? Why?
- Improve the time-efficiency of one of your data structures. Explain what you did and report the improvement in speed.
- For any assignment, a good extension will be to annotate your code to indicate all places where memory is "lost" (in other words, each place where the last remaining reference to an object is either destroyed or is given a new value). If you do this extension, please indicate so in your report.
- Abstract: a brief summary (200 words or less) of
the task, in your own words. give the reader context and
identify the key purpose(s) of the project. You can assume the
reader has read your prior assignments.
Writing an effective abstract is an important skill. Consider the following questions while writing it.
- Does it describe the CS concepts of the project?
- Does it describe the specific project application?
- Does it describe your the solution or how it was developed?
- Does it describe the results or outputs?
- Is it concise?
- Are all of the terms well-defined?
- Does it read logically and in the proper order?
- An explanation of your solution, focusing on the interesting bits. In this assignment, for example, the interesting parts are the hash function and how your code handles collisions.
- Printouts, pictures, or results to show what you did. For this assignment, you should include the various performance graphs.
- Other results to demonstrate extensions you undertook. If you tried to reconstruct a tree as a complete tree, for example, show that you succeeded.
- A brief conclusion and description of what you learned.
- A list of people you worked with, including TAs, and instructors. Include in that list anyone whose you may have seen, such as those of friends who have taken the course in a previous semester.
Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.
Once you have written your report, give the page the label:
You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.
Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.
Please do not submit the reddit text files along with your code.