CS 231: Data Structures and Algorithms (Lab Page)

Title image Project 7
Fall 2016

Mapping Words to their Frequencies

The main purpose of this project is to provide you the opportunity to use your BSTMap to determine the frequencies of all the words in a given text document.

This project is the first of a series of 3 projects that will allow you to analyze the topics in Reddit comments over a period of 8 years.


Tasks

There is only one class you need to write this week - WordCounter. It is the class that manages a BSTMap of word-count pairs (KeyValuePair<String><Integer>). It should be able to
  1. WordCounter - create the WordCounter class. We will begin with its fields and the first few methods. It should store a map (BSTMap<String,Integer>) and the total word count (int). Begin by implementing the following methods:
    • public WordCounter() - constructor that makes an empty map and sets the total word count to zero.
    • public void loadFromOriginalWordsFile( String filename ) - generate the word counts tree from a file of words. This method should use the BufferedReader to read in the file on line at a time. It should also use the String's split method to separate each line into words.

      You will need to import IO packages:

      import java.io.BufferedReader;
      import java.io.FileNotFoundException;
      import java.io.FileReader;
      import java.io.IOException;
      

      To split the line into words, you should use the split method. Before adding words to the map, you should make them lowercase, so that your word-counting is not case sensitive:

              // split line into words. Any character other than a letter (a-z and A-Z) or a single quote
              // is considered a non-word character and is used to split words.
              // You may want to include numbers in the list of characters that are consider 
              // parts of words.
              String[] parse = line.split("[^a-zA-Z']");// Regular expressions!
              for (int i = 0; i < parse.length; i++) {
                  String word = parse[i].trim().toLowerCase();
                  // Write code to update the map
              }
          

      As you encounter words, you should increment the total word count field of the WordCounter. We will be using this to determine the word frequency (the-number-of-times-we-have-seen-a-particular-word/the-total-number-of-nonunique-words-we-have-seen).

    • public int getTotalWordCount() - return the total word count (note that this is not the number of unique words, which would be the size of the BSTMap - it is the total number of words in the document that was originally read in).
    • public int getCount( String word ) - return the value associated with this word.
    • public double getFrequency( String word ) - return the value associated with this word divided by the total word count. Don't forget to cast the ints to doubles before you do the division. :)

    Test the code using short.txt. It should generate the tree shown on the left panel. Stephanie wrote code that would print out the structure of the tree. Its output is

    key = cs, value = 1
    left      key = a, value = 1
    right        key = class, value = 3
    right            key = classy, value = 1
    right    key = is, value = 1
    left          key = happy, value = 1
    right        key = wonderful, value = 2
    left              key = so, value = 1
    

    We strongly suggest you write a similar method so that you can debug your code and so that you can be sure it is storing your values the way you think it is.

  2. Next, write a method that will write the contents of the fields to a word count file.

    public void writeWordCountFile( String filename )

    On the first line of the file, the method should write the string "total_word_count : " and the value of the total word count. For each subsequent line, the method should write each key value pair (separated by spaces). For example, the word count file for short.txt is

    total_word_count : 11
    cs 1
    a 1
    class 3
    classy 1
    is 1
    happy 1
    wonderful 2
    so 1
        

    Notice that these pairs are in a particular order - in a pre-order traversal of the tree. Be sure to use your getPairs method. Now is a good time to double-check the order of your pairs.

    Note that you can use the FileWriter and BufferedWriter to write to the file.

  3. Write a method that will read the contents of a word count file and reconstruct the fields of the WordCount object.

    public void readWordCountFile( String filename )

    Read in the file, using the BufferedReader and the split method like you did above. To reconstruct the tree, simply call the BSTMap's put method every time you parse a new key-value pair.

  4. Test your ability to reconstruct the tree by
    • Use loadFromOriginalWordsFile to load short.txt
    • Use writeWordCountFile to write counts_short.txt
    • Use readWordCountFile to load counts_short.txt
    • Use writeWordCountFile to write redo_counts_short.txt
    • On the command line, using the Unix utility diff to determine if counts_short.txt and redo_counts_short.txt are identical. If they are, then diff will output nothing. If they aren't, it will identify the differences.
      diff counts_short.txt redo_counts_short.txt

      Debug your code until diff doesn't tell you about any differences between your files.

  5. Now, generate word count files for files containing a large sample of reddit comments. On the Courses server in the Course Materials folder, are 8 files (reddit_comments_2008.txt through reddit_comments_2015.txt) containing a random selection of user comments from the month of May during each of the years from 2008 to 2015. Each file is approximately 1 GB, so it will take some time both to download the files and to run the code.

    Also, java may not be set up to allow you to use that much memory. You can increase the amount of memory allocated to java with the -Xmx512m flag. Also, I strongly suggest you use a straight-forward file-naming scheme. For example, here is the command I used to generate the counts for 2010 (Notice I prepended the word "counts" rather than appended it - that was because I knew that the next project would be easier to implement if I left the year as the last part of the filename.)

    java -Xmx512m WordCounter reddit_comments_2010.txt counts_reddit_comments_2010.txt
    
  6. Report on your outuput. Next week, we will be looking at the frequencies of particular words and will be developing lists of the most frequent words. This week, we will simply look at our output. Generate a graph (using your favorite software, e.g. a spreadsheet) showing how the total word count changed from sample to sample. Then generate another showing how the number of unique words changed from sample to sample. Put these graphs in your write-up.

Extensions

Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake at least one extension. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One significant extension, or 2-3 smaller ones, done well, is typical.

  1. Develop a file with common words in it and have your WordCounter ignore those words. In your write-up, you should discuss how you developed and use the list.
  2. Write out your tree to a word count file in an order that will ensure it is a complete tree (balanced) when you read it in. (Hint: you could sort your pairs by keys, then traverse the list as if you were traversing a complete binary tree.) Then demonstrate that it works.
  3. For any assignment, a good extension will be to implement a Java class yourself and demonstrate that it has the same functionality as the Java class. For example, you could implement your own ArrayList class for this assignment.
  4. For any assignment, a good extension will be to annotate your code to indicate all places where memory is "lost" (in other words, each place where the last remaining reference to an object is either destroyed or is given a new value). If you do this extension, please indicate so in your write-up.

Handin

Make your writeup for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Your writeup should have a simple format.

Once you have written up your assignment, give the page the label:

cs231f16project7

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.