CS 231: Project #8

Title image Spring 2020

Trees v. Hashing

The main purpose of this project is to give you the opportunity to examine the efficiency of your Hashmap and your BSTMap when counting words in the Reddit comment files.

Note: please do not hand in the Reddit comment files with your code. If everyone hands them in with their code, it takes up a lot of unnecessary space on the server.

It looks like Java takes at least 4G of memory to put the words into an ArrayList. Use the flag -Xmx4096m when running your program or set the default memory size to at least 4G.

Tasks

  1. Write a WordCounter2 class

    In order to properly compare the times for building a BSTMap versus a Hashmap, we need to separate the file I/O from the process of building the data structure. Your WordCounter2 should have a field of type MapSet to hold the data structure and a field of type int to hold the total word count.

    Write the following methods. Many of them can re-use code from last week.

    1. public WordCounter2( String data_structure ) - constructor, where data_structure is either "bst" or "hashmap". It should create the appropriate data structure and store it in the map field.
    2. public ArrayList<String> readWords( String filename ) - given the filename of a text file, read the text file and return an ArrayList list of all the words in the file.
    3. public double buildMap( ArrayList<String> words ) - given an ArrayList of words, put the words into the map data structure. Return the time taken in ms. Record the time using System.nanoTime().
    4. public void clearMap() - clear the map data structure.
    5. public int totalWordCount() - return the total word count from the last time readWords was called.
    6. public int uniqueWordCount() - return the unique word count, which should be the size of the map data structure.
    7. public int getCount( String word ) - return the number of times the word occurred in the list of words. Query the data structure to get the information. Return 0 if the word does not exist in the data structure.
    8. public double getFrequency( String word ) - return the frequency of the word in the list of words. Query the data structure to get the word count and then divide by the total number of words to get the frequency. Return 0 if the word does not exist in the data structure.
    9. public boolean writeWordCount( String filename ) - write a word count file given the current set of words in the data structure. The first line of the file should contain the total number of words. Each subsequent line should contain a word and its frequency.
    10. public boolean readWordCount( String filename ) - read a word count file given the filename. The function should clear the map and then put all of the words, with their counts, into the map data structure.
  2. Robustly calculate the time it takes to build the map

    The WordCounter2 main function should first read the words from the file into an ArrayList using the readWords method. Then it should calculate a robust average of the time it takes to build the map. For example, loop five times, each time through the loop clearing the map and then building the map, storing the run times. Then calculate a robust average by dropping the low and high values and computing the average of the remaining times.

    If you wish, automate the process for all eight of the Reddit files. At the end of the analysis, have your program print out the average run time in ms, the total number of words, and the number of unique words (for each file).

    Remember to clear your MapSet data structure prior to running the buildMap method each time through the loop.

  3. Compute statistics for the BSTMap for all of the Reddit files

    Run the analysis using your BSTMap method from last week on all eight Reddit files and record the run-times, total words, total unique words, and the depth of your tree in a spreadsheet or CSV file. Add the file size for each year as a separate column. You can use the Terminal command ls -lh to get the file sizes in human-readable format.

  4. Compute statistics for the Hashmap for all of the Reddit files

    Repeat the timing process using the Hashmap data structure. Add the hash map times as another column in your spreadsheet or CSV file. To measure the effectiveness of the hash function (how well it distributed keys), have your hash table version also print the number of collisions.

    A collision occurs when a key is inserted into the hash table for the first time and another key already exists at that hash location.

    Simply updating the value in a key-value pair does not cause a collision. Also, since the goal is to understand how well the hash function distributes the data across the table, reset the collision count to 0 whenever the table size increases. As the key-value pairs are re-inserted into the larger table, increment the collision count appropriately.

  5. Plot your results

    Use a spreadsheet program (or python and matplotlib) to report the numbers with an XY scatter plot. Make three plots.

    1. Year versus run-time: the x-axis should be the year, and the y-axis should be the run-time. Show both the BST and Hash map run times.
    2. Total number of words versus run-time: the x-axis should be the total number of words. Show both the BST and Hash map run times.
    3. Unique words versus run-time: the x-axis should be the number of unique words. Show both the BST and Hash map run times.

    Do your plots make sense? (If not, figure out why not and fix your code). If there are any trends, describe them. Which data structure appears to be faster on this data set? Include the graphs in your report. Wny do you think one is faster than the other?

  6. Analyze the performance of the BSTMap and Hashmap as data structures

    Analyze the performance of the BSTMap and Hashtable to see how close to ideal they perform using either total words or file size versus run-time. Start by stating what ideal means in terms of big-O notation. For the Hashtable, does the number of collisions affect performance? For the BSTMap, does the depth/height of the tree affect performance? Include this analysis in your report.

  7. Experiment with the Hashmap

    Change one aspect of your Hashmap to try and make it faster. For example, change the hash function, change when you expand the table, or change the data structure you are using at each position in the table.

    Analyze the effect of your change on performance. Did it make the Hashmp faster, slower, or have no discernable effect? Does the result make sense?

    Good extensions are to explore more than one modification.


Extensions

The following are some suggested extensions. You should feel free to pursue your own custom extensions to your project that you find interesting. Please clearly identify your extensions in your report. In your code, make sure the baseline simulations run properly. Making a separate sub-folder for major extensions is recommended.

  1. Test multiple modifications of your Hashmp (not just one) and compare and contrast. Many of the suggested extensions fall in this category.
  2. Use additional files to test the (time and/or space) efficiency of your implementations. You want to cover a much bigger range of values (in a logarithmic sense) so you can get an accurate curve.
  3. Implement your own hash function. How does its performance compare to the built-in string hash function?
  4. Try implementing more than one collision handling method. For example, (1) use a linked list instead of a BSTMap at each table entry, or (2) use a closed hash table (no extra data structures). Compare the performance of the methods.
  5. Examine the space efficiency of your implementations. One way to do this would be to refrain from using the -Xmx4096m flag. Count words of files of increasing sizes. See what is the size of the smallest size that crashes your program. Does a smaller file crash the Hashtable code or the BSTMap code? Why?
  6. Improve the time-efficiency of one of your data structures. Explain what you did and report the improvement in speed.
  7. For any assignment, a good extension will be to annotate your code to indicate all places where memory is "lost" (in other words, each place where the last remaining reference to an object is either destroyed or is given a new value). If you do this extension, please indicate so in your report.

Report

Your intended audience for your report are your peers not in the class, but who have taken a CS course. Your report should explain the core CS concept used, present the results of your program, and discuss the meaning of the results: what did you discover and does it make sense?

Your project report should contain the following elements. Please include a header for each section.

Handin

Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written your report, give the page the label:

cs231s20project8

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.

Please do not submit the reddit text files along with your code.