A Note on Process for this Two-Week Project:

This project will be completed over two weeks. However, to ensure that you make progress during the first week, there is a required checkpoint submission on April 16th, worth 5 of the project's 50 points. Submit your code, including the BSTMap.java file, to the assignment titled "Project 6 CHECKPOINT: BSTMap."

Late submissions: You can use your late days for this checkpoint (if you think it's worthwhile). If you submit it more than 3 days late, you won't receive credit for the checkpoint, though you can still earn credit for the BSTMap part of the main project rubric under the usual system.


Word Frequency

The main purpose of this project is to provide you with the opportunity to use your different implementations of MapSet to determine the frequencies of all the words in a given text document. The ability to identify word frequencies in a large corpus has been the basis for a number of research projects in the digital social sciences. In particular, looking at word usage over time in books and articles has revealed how culture and language change.

In particular, we will analyze the top words in a collection of Reddit comments from 2015 and compare them to the complete works of William Shakespeare. Download the data files here. The text files are large enough that using a list or an otherwise inefficient implementation will take a prohibitively long time to do the analysis.

Note: please do not hand in the data files with your code. If everyone hands them in with their code, it takes up a lot of unnecessary space on the server.

Tasks

WordCounter Class

The WordCounter is the class that creates and manages a MapSet of word-count pairs KeyValuePair<String, Integer>. You will not implement it yourself, since the data structures themselves require significant implementation effort for this project. Instead, please download the WordCounter class from this link. However, to use this class in your analysis, please familiarize yourself with the methods it implements. Also note that you will need to change the name of the text file it reads in. At a high level, the WordCounter class has methods for doing two things.

First, the WordCounter needs to be able to build a MapSet from a text document. The method reads in the text and separates it into words. For each word, it checks whether the word is already a key in the MapSet. If it isn't, it adds the key-value pair to the MapSet with the word as the key and 1 as the value. If the word is an existing key, it increments the value associated with that word.
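The counting logic above can be sketched as follows. This is only an illustration of the algorithm, not the provided WordCounter code: it uses java.util.HashMap as a stand-in for your MapSet, and its tokenization (splitting on whitespace and stripping punctuation) may differ from what WordCounter actually does.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    /** Builds a word -> count map from a string of text. */
    public static Map<String, Integer> buildMap(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            // Keep letters and apostrophes; the real WordCounter may tokenize differently.
            String word = token.replaceAll("[^a-z']", "");
            if (word.isEmpty()) continue;
            if (!counts.containsKey(word)) {
                counts.put(word, 1);              // new word: start its count at 1
            } else {
                counts.put(word, counts.get(word) + 1); // existing word: increment
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = buildMap("to be or not to be");
        System.out.println(counts.get("to")); // 2
        System.out.println(counts.get("be")); // 2
        System.out.println(counts.get("or")); // 1
    }
}
```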

Second, the WordCounter needs to be able to write out a file containing the key-value pairs of the MapSet. The file has the total number of words on the first line, and each word and its count on the subsequent lines (one key-value pair per line). We will refer to a file formatted like this as a word count file.
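For concreteness, a word count file for the tiny (made-up) text "to be or not to be" might look like the following. The exact separator between a word and its count should match whatever WordCounter writes; the layout here is just the format described above.

```
6
to 2
be 2
or 1
not 1
```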

The class needs a field to hold the MapSet with types <String, Integer>. It also needs a field to hold the total word count.


Exploration

The WordCounter class does the majority of the analysis work for you, but you may need to write some additional code to conduct the exploration. You may choose to do this in the main method of the WordCounter class, but how you implement it is your choice. Make sure you tell us how to replicate your results in the report!

  1. Required analysis 1: Make a table of the top 10 words for the Reddit comments vs. Shakespeare. Do the results match your expectations? Why or why not?
  2. Required analysis 2: Analyze the total time it takes for your different data structures in the buildMap method of your WordCounter class in both datasets. You might want to present this as a table with the results averaged over several runs.
  3. Required analysis 3: Analyze the maxDepth() of your data structures after calling the buildMap method of your WordCounter class in both datasets. Does this help to explain your results from analysis 2? Why or why not?
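For analysis 2, one simple way to measure elapsed time is System.nanoTime() around the work you want to measure, averaged over several trials. The sketch below times a placeholder task; in your exploration you would replace the lambda body with a call like counter.buildMap(...) on your own WordCounter (that method name and usage are assumptions to adapt to your code).

```java
public class TimingSketch {

    /** Runs a task several times and returns the average elapsed time in milliseconds. */
    public static double averageMillis(Runnable task, int trials) {
        long total = 0;
        for (int i = 0; i < trials; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return (total / (double) trials) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Placeholder workload; swap in your own buildMap call here.
        double avg = averageMillis(() -> {
            long sum = 0;
            for (int j = 0; j < 1_000_000; j++) sum += j;
        }, 5);
        System.out.println("average ms: " + avg);
    }
}
```

Averaging several runs (and ignoring the first "warm-up" run, if you like) helps smooth out JVM startup and caching effects.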

Reflections

1: Expected Heights for BSTs

We've seen that BSTs can have really bad (linear) height when the data given to them is nearly sorted. But maybe this is somewhat contrived - should we really expect this in practice? Suppose instead that the data were given in randomized order. Show that when the data is inserted in random order, the height of the tree is O(log(n)) on average (I'm expecting a plot of some sort here).
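As a starting point for the experiment (the plot and the analysis are still up to you), here is one way to measure the height of a plain, non-balancing BST built from a random permutation of n distinct keys. The class and method names are just for this sketch; you could equally well call maxDepth() on your own BSTMap.

```java
import java.util.Random;

public class RandomBstHeight {

    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Standard (non-balancing) BST insert.
    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root;
    }

    static int height(Node root) {
        if (root == null) return 0;
        return 1 + Math.max(height(root.left), height(root.right));
    }

    /** Builds a BST from a random permutation of 0..n-1 and returns its height. */
    static int randomTreeHeight(int n, Random rng) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = i;
        // Fisher-Yates shuffle to get a uniformly random insertion order.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = keys[i]; keys[i] = keys[j]; keys[j] = tmp;
        }
        Node root = null;
        for (int k : keys) root = insert(root, k);
        return height(root);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int n = 1_000; n <= 100_000; n *= 10) {
            System.out.println("n=" + n + "  height=" + randomTreeHeight(n, rng)
                    + "  log2(n)=" + (Math.log(n) / Math.log(2)));
        }
    }
}
```

Running this for a range of n values (averaged over several trials per n) gives you the data points for the plot; plotting height against log2(n) should make the claimed O(log(n)) behavior visible.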

2: A HashedListSet

So far this semester, we've implemented Stacks and Queues using our two list-like structures (ArrayLists and LinkedLists). Both of these structures (with some mild modifications) work well for these purposes, but the downside is that searching in them takes linear time. Describe (but no need for explicit implementation) how one could use the ideas of hashing to implement a LinkedList with an O(1) contains method.

Extensions

The following are some suggested extensions. You should feel free to pursue your own custom extensions that you find interesting. Please clearly identify your extensions in your report. In your code, make sure the baseline analyses run properly. Making a separate sub-folder for major extensions is recommended.

Extension Ideas Particular to BSTs

  1. Write out your tree to a word count file in an order that will ensure it is a balanced tree when you read it in. Demonstrate that it works.
  2. Explore how badly unbalanced your BSTMap gets without doing self-balancing. First, you'd have to decide on a metric for what it means to be unbalanced. One reasonable approach would be to calculate the balance factor of every node in the tree and count the number of nodes with a balance factor greater than 1 or less than -1.
  3. Implement a self-balancing tree data structure (such as an AVL Tree or a Red-Black Tree) and demonstrate that it works properly.

Extension Ideas Particular to HashMaps

  1. Implement your own hash function. How does its performance compare to the built-in string hash function?
  2. Try implementing more than one collision handling method. For example, (1) use whichever (chaining or probing) you haven't implemented yet, or (2) use a BSTMap instead of a linked list at each table entry for chaining. Compare the performance of the methods.
  3. Try implementing different probing methods, such as linear probing, quadratic probing, and double hashing. Compare the performance of the methods.

Extension Ideas Relating to the Project Application

  1. Use an ArrayList or your own LinkedList instead of the BST/HashMap to implement the map and compare processing times.
  2. Use additional files to test the (time and/or space) efficiency of your implementations. You want to cover a much bigger range of values (in a logarithmic sense) so you can get an accurate curve. There are several datasets of Wikipedia pages available online that would work well.

Congrats, you're done with everything except for your project report, where you will communicate your findings from all that code! Make sure to write up your report in the project report file in the Google Classroom assignment. The project report is an important part of the work in this project, so don't forget to do it!