CS 231: Project #6

Title image Fall 2019

Word Frequency

The main purpose of this project is to provide you the opportunity to use your BSTMap to determine the frequencies of all the words in a given text document. The ability to identify word frequencies in a large corpus has been the basis for a number of different research projects in the digital social sciences. In particular, looking at word usage over time in books and articles has revealed how culture and language change.

This project is the first of a series of 3 projects that will allow you to analyze the topics in Reddit comments over a period of 8 years. The text files are large enough that using a list or an inefficient implementation will take a prohibitively long time to do the analysis.

Note: please do not hand in the Reddit comment files with your code. If everyone hands them in with their code, it takes up a lot of unnecessary space on the server.

Tasks

There is only one additional class you need to write this week - WordCounter. It is the class that creates and manages a BSTMap of word-count pairs KeyValuePair<String, Integer>. The WordCounter class needs methods for doing three things.

First, the WordCounter needs to be able to Build a BSTMap from a text document. The method should read in the text and separate it into words. For each word, it should check to see if the word is already a key in the BSTMap. If it isn't, then it should add the Key-Value pair to the dictionary with the word as the key and 1 as the value. If word is an existing key, then it should increment the value associated with that word.

Second, the WordCounter needs to be able to write out file containing the key-value pairs of the Map. The file should have the total number of words on the first line, and each word and its count on the subsequent lines (one key-value pair per line). We will refer to a file formatted like this as a word count file.

Finally, the WordCounter needs to be able to Read a word count file, recreating the BSTMap so it can be used for further analysis. This will be important becaus the original text files may be much larger than the word count files.

WordCounter Class

  1. Setup

    The WordCounter class will use the following Java packages:

    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
  2. Begin the WordCounter class

    The class needs a field to hold the BSTMap with types <String, Integer>. It also needs a field to hold the total word count.

  3. Create the WordCounter methods
    • public WordCounter() - constructor that makes an empty BSTMap and sets the total word count to zero.
    • public void analyze(String filename) - generates the word counts from a file of words. This method should use the BufferedReader to read in the file one line at a time. It should also use the String's split method to separate each line into words. Your code will be similar to the Board class file reader in the Sudoku project, including the need to use a try-catch structure to enclose your code. See lab 3 for the code template or look at your own code.

      Given a String line read from a file, you can use the following code template to split the line into words and then process each word.

          // split line into words. The regular expression can be interpreted
          // as split on anything that is not (^) (a-z or A-Z or 0-9 or ').
          String[] words = line.split("[^a-zA-Z0-9']");
          for (int i = 0; i < words.length; i++) {
              String word = words[i].trim().toLowerCase();
              // Might want to check for a word of length 0 and not process it
              // Write code to update the map
          }

      Each valid word should increase the total word count, whether or not it already exists in the Map.

    • public int getTotalWordCount() - returns the total word count (note that this is not the number of unique words, which would be the size of the BSTMap - it is the total number of words in the document that was originally read in).
    • public int getUniqueWordCount() - returns the number of unique words, which should be the size of the BSTMap.
    • public int getCount( String word ) - returns the frequency value associated with this word.
    • public double getFrequency( String word ) - returns the value associated with this word divided by the total word count. Use a cast to ensure this is a floating point calculation.
  4. Write a main method to test your code

    In your main method, create a WordCounter object and then have it analyze the file counttest.txt. It should generate the tree shown below. (You might want to enable your main function to get the filename of the document to analyze as a command-line parameter.)

    You probably want to write a toString function for your BSTMap that gives you an idea of the structure of your tree. The following is one possible kind of output created using a pre-order traversal.

    root    it:4
    left       best:1
    left          age:2
    right         foolishness:1
    right      was:4
    left          the:4
    left             of:4
    right            times:2
    right         worst:1
    left             wisdom:1
    						
  5. Define a method to write a BSTMap to a file

    Create a method that writes the contents of the BSTMap to a word count file.

    public void writeWordCountFile( String filename )

    On the first line of the file, the method should write the string "totalWordCount : " and the value of the total word count. For each subsequent line, the method should write each key value pair separated by spaces. For example, the word count file for counttest.txt is the following.

    totalWordCount : 24
    it 4
    best 1
    age 2
    foolishness 1
    was 4
    the 4
    of 4
    times 2
    worst 1
    wisdom 1

    Write the word count file using a pre-order traversal of the tree. The entrySet method should provide the KeyValuePairs in that order.

    Use the FileWriter class to write the file. It has a write method that writes a string to the active file. Be sure to close the file when you are done with it.

  6. Define a method to read the contents of a BSTMap file

    Write a method that reads the contents of a word count file and reconstructs the fields of the WordCount object, including the BSTMap.

    public void readWordCountFile( String filename )

    Read in the file, using the FileReader and BufferedReader classes. Use the split method to separate the values on a line. Make sure you read in the first line separately, since it contains the total word count. To reconstruct the tree, call the BSTMap's put method every time you parse a new key-value pair.

  7. Test your ability to reconstruct the tree. You may want to create a separate WCTester class to do this to avoid having too many tasks for the WordCounter main class.

    • Use analyze to load counttest.txt.
    • Use writeWordCount to write the results to counts_ct.txt.
    • Use readWordCount to load counts_ct.txt.
    • Use writeWordCount to write counts_ct_v2.txt.
    • On the command line, using the Unix utility diff to determine if counts_ct.txt and counts_ct_v2.txt are identical. If they are, then diff will output nothing. If they aren't, it will identify the differences.
    • diff counts_ct.txt counts_ct_v2.txt

      Debug your code until diff doesn't tell you about any differences between your files. Show the results of this test in your wiki report.

  8. Add a command-line parameter to the WordCounter main method

    Update the WordCounter main method be able to take the input and/or output filenames from the command line. For each input filename, your program should generate the word count files. You may want to have your program be able to take in multiple files to process. Use the function System.currentTimeMillis() to get the time before and after you analyze a file and print out the amount of time each file takes to process. Test this with the counttest.txt.

    On the Courses server in the Course Materials folder, are 8 files (reddit_comments_2008.txt through reddit_comments_2015.txt) containing a random selection of user comments from the month of May during each of the years from 2008 to 2015. Process these files and generate a word count file for each one. Each file is approximately 1 GB, so it will take some time both to download the files and to run the code. Keep track of the time it takes to process each file.

    The files are large, and Java may not be set up to allow you to use that much memory. You can increase the amount of memory allocated to java with the -Xmx512m flag.

    Use a straight-forward file-naming scheme when generating the word counts files. In the next two projects you will be looking at word frequency trends across multiple years, so you want the year from which the comments were extracted to be easy to parse from the filename.

  9. Execute an analysis of word frequency

    Generate three graphs (using your favorite software, e.g. a spreadsheet). The first two graphs should show the total word count and unique word count for each year. The third graph should show total word count (x-axis) and processing time (y-axis). What relationship do you expect to exist on the last graph? Does the graph fit your expectations?


Extensions

The following are some suggested extensions. You should feel free to pursue your own custom extensions to your project that you find interesting. Please clearly identify your extensions in your report. In your code, make sure the baseline simulations run properly. Making a separate sub-folder for major extensions is recommended.

  1. Develop a file with common words in it and have your WordCounter ignore those words. In your report, discuss how you developed and use the list. Does it modify the analysis time?
  2. Write out your tree to a word count file in an order that will ensure it is a complete tree (balanced) when you read it in. (Hint: you could sort your pairs by keys, then traverse the list as if you were traversing a complete binary tree.) Then demonstrate that it works.
  3. For any assignment, a good extension will be to implement a Java class that you have't implemented in the past projects and demonstrate that it has the same functionality as the Java class.
  4. Use an ArrayList or your own LinkedList instead of a Binary Tree to implement the map and compare processing times.
  5. Write a toString method that prints the tree level by level from top to bottom. You may need a Queue.
  6. Write a remove method for your BSTMap and demonstrate that it works properly.
  7. Have your WordCounter main function finish by printing out the 20 (or N) most frequent words in the document.
  8. Implement a balanced tree data structure and demonstrate that it works properly.

Report

Your report should have a simple format.

Handin

Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written your report, give the page the label:

cs231s20project6

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.

Please do not submit the reddit text files along with your code.