CS 231: Project #6

Title image Spring 2018

Word Frequency

The main purpose of this project is to provide you the opportunity to use your BSTMap to determine the frequencies of all the words in a given text document. The ability to identify word frequencies in a large corpus has been the basis for a number of different research projects in the digital social sciences. In particular, looking at word usage over time in books and articles has revealed how culture and language change.

This project is the first of a series of 3 projects that will allow you to analyze the topics in Reddit comments over a period of 8 years. The text files are large enough that using a list or an inefficient implementation will take a prohibitively long time to do the analysis.

Tasks

There is only one additional class you need to write this week - WordCounter. It is the class that creates and manages a BSTMap of word-count pairs KeyValuePair<String, Integer>. The WordCounter class needs methods for doing the following tasks.

WordCounter

  1. You will need the following Java packages:
    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    										
  2. Create the WordCounter class. The class needs a field to hold the BSTMap with types <String, Integer>. It also needs a field to hold the total word count.
  3. Create the following methods.
    • public WordCounter() - constructor that makes an empty map and sets the total word count to zero.
    • public boolean analyze(String filename) - generates the word counts from a file of words. This method should use the BufferedReader to read in the file on line at a time. It should also use the String's split method to separate each line into words. Your code will be similar to the Board class file reader in the Sudoku project, including the need to use a try-catch structure to enclose your code. See lab 3 for the code template or look at your own code.

      Given a String line read from a file, you can use the following code template to split the line into words and then process each word.

          // split line into words. The regular expression can be interpreted
          // as split on anything that is not (^) (a-z or A-Z or 0-9 or ').
          String[] words = line.split("[^a-zA-Z0-9']");
          for (int i = 0; i < words.length; i++) {
              String word = words[i].trim().toLowerCase();
              // Might want to check for a word of length 0
              // Write code to update the map
          }

      Each valid word should increase the total word count, whether or not it already exists in the Map.

    • public int getTotalWordCount() - returns the total word count (note that this is not the number of unique words, which would be the size of the BSTMap - it is the total number of words in the document that was originally read in).
    • public int getCount( String word ) - returns the value associated with this word.
    • public double getFrequency( String word ) - returns the value associated with this word divided by the total word count. Use a cast to ensure this is a floating point calculation.
  4. Write a main method to test your code. Create a WordCounter object and then have it analyze the file counttest.txt. It should generate the tree shown below.

    You probably want to write a toString function for your BSTMap that gives you an idea of the structure of your tree. The following is one possible kind of output created using a pre-order traversal.

    root    it:4
    left       best:1
    left          age:2
    right         foolishness:1
    right      was:4
    left          the:4
    left             of:4
    right            times:2
    right         worst:1
    left             wisdom:1
    						
  5. Write a method that writes the contents of the BSTMap to aword count file.

    public void writeWordCountFile( String filename )

    On the first line of the file, the method should write the string "totalWordCount : " and the value of the total word count. For each subsequent line, the method should write each key value pair separated by spaces. For example, the word count file for counttest.txt is the following.

    totalWordCount : 24
    it 4
    best 1
    age 2
    foolishness 1
    was 4
    the 4
    of 4
    times 2
    worst 1
    wisdom 1

    Write the word count file using a pre-order traversal of the tree. Your entrySet method should provide the KeyValuePairs in that order.

    Use the FileWriter class to write the file. It has a write method that you can use to write a string to the file. Be sure to close the file when you are done with it.

  6. Write a method that reads the contents of a word count file and reconstructs the fields of the WordCount object.

    public void readWordCountFile( String filename )

    Read in the file, using the FileReader and BufferedReader classes. Use the split method to separate the values on a line. Make sure you read in the first line separately, since it contains the total word count. To reconstruct the tree, call the BSTMap's put method every time you parse a new key-value pair.

  7. Test your ability to reconstruct the tree. You may want to create a separate WCTester class to do this to avoid having too many tasks for the WordCounter main class.

    • Use analyze to load counttest.txt.
    • Use writeWordCount to write the results to counts_ct.txt.
    • Use readWordCount to load counts_ct.txt.
    • Use writeWordCount to write counts_ct_v2.txt.
    • On the command line, using the Unix utility diff to determine if counts_ct.txt and counts_ct_v2.txt are identical. If they are, then diff will output nothing. If they aren't, it will identify the differences.
    • diff counts_short.txt redo_counts_short.txt

      Debug your code until diff doesn't tell you about any differences between your files. Show the results of this test in your wiki report.

  8. Update the WordCounter main method be able to take the input and/or output filenames from the command line. For each input filename, your program should generate the word count files. You may want to have your program be able to take in multiple files to process. Use the function System.currentTimeMillis() to get the time before and after you analyze a file and print out the amount of time each file takes to process. Test this with the counttest.txt.

    On the Courses server in the Course Materials folder, are 8 files (reddit_comments_2008.txt through reddit_comments_2015.txt) containing a random selection of user comments from the month of May during each of the years from 2008 to 2015. Process these files and generate a word count file for each one. Each file is approximately 1 GB, so it will take some time both to download the files and to run the code. Keep track of the time it takes to process each file.

    The files are large, and Java may not be set up to allow you to use that much memory. You can increase the amount of memory allocated to java with the-Xmx512m flag.

    Use a straight-forward file-naming scheme when generating the word counts files. In the next two projects you will be looking at word frequency trends across multiple years, so you want the year from which the comments were extracted to be easy to parse.

  9. Generate three graphs (using your favorite software, e.g. a spreadsheet). The first two graphs should show the total word count and unique word count for each year. The third graph should show total word count (x-axis) and processing time (y-axis). What relationship do you expect to exist on the last graph? Does the graph fit your expectations?

Extensions

  1. Develop a file with common words in it and have your WordCounter ignore those words. In your report, discuss how you developed and use the list.
  2. Write out your tree to a word count file in an order that will ensure it is a complete tree (balanced) when you read it in. (Hint: you could sort your pairs by keys, then traverse the list as if you were traversing a complete binary tree.) Then demonstrate that it works.
  3. For any assignment, a good extension will be to implement a Java class that you have't implemented in the past projects and demonstrate that it has the same functionality as the Java class.
  4. Use an ArrayList or your own LinkedList instead of a Binary Tree to implement the map and compare processing times.
  5. Write a toString method that prints the tree level by level from top to bottom.
  6. Write a remove method for your BSTMap and demonstrate that it works properly.
  7. Have your WordCounter main function finish by printing out the 20 (or N) most frequent words in the document.

Report

Your report should have a simple format.

Handin

Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written your report, give the page the label:

cs231s18project6

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.