Due: Monday, April 17, 2017, 11:59 pm

Mapping Words to their Frequencies

The main purpose of this project is to provide you the opportunity to use your BSTMap to determine the frequencies of all the words in a given text document.

This project is the first of a series of 2 projects that will allow you to analyze the topics in Reddit comments over a period of 8 years.


There is only one class you need to write this week - WordCounter. It is the class that manages a BSTMap of word-count pairs KeyValuePair<String><Integer>. It should be able to

  1. WordCounter: create the WordCounter class. We will begin with its fields and the first few methods. It should store a map BSTMap<String,Integer> and the total word count (int). Begin by implementing the following methods:
    • public WordCounter() constructor that makes an empty map and sets the total word count to zero.
    • public void loadFromOriginalWordsFile( String filename ) generates the word counts tree from a file of words. This method should use the BufferedReader to read in the file on line at a time. It should also use the String's split method to separate each line into words.

      You will need to import IO packages:

      import java.io.BufferedReader;
      import java.io.FileNotFoundException;
      import java.io.FileReader;
      import java.io.IOException;

      To split the line into words, you should use the split method. Before adding words to the map, you should make them lowercase, so that your word-counting is not case sensitive:

          // split line into words. Any character other than a letter (a-z and A-Z) or a single quote
          // is considered a non-word character and is used to split words.
          // You may want to include numbers in the list of characters that are consider 
          // parts of words.
          String[] parse = line.split("[^a-zA-Z0-9']");// Regular expressions!
          for (int i = 0; i < parse.length; i++) {
              String word = parse[i].trim().toLowerCase();
              // Write code to update the map

      As you encounter words, you should increment the total word count field of the WordCounter. We will be using this to determine the word frequency (the-number-of-times-we-have-seen-a-particular-word/the-total-numb er-of-nonunique-words-we-have-seen).

    • public int getTotalWordCount() returns the total word count (note that this is not the number of unique words, which would be the size of the BSTMap - it is the total number of words in the document that was originally read in).
    • public int getCount( String word ) returns the value associated with this word.
    • public double getFrequency( String word ) returns the value associated with this word divided by the total word count. Don't forget to cast the ints to doubles before you do the division. :)

    Test the code using short.txt. It should generate the tree shown as follow.

    We strongly suggest you write a method that can print out the tree information similar as follow, so that you can debug your code and so that you can be sure it is storing your values the way you think it is.

    key = cs231, value = 1
    left      key = a, value = 1
    right        key = class, value = 3
    right            key = classy, value = 1
    right    key = is, value = 1
    left          key = happy, value = 1
    right        key = wonderful, value = 2
    left              key = so, value = 1
  2. Next, write a method that will write the contents of the fields to a word count file.

    public void writeWordCountFile( String filename )

    On the first line of the file, the method should write the string "total_word_count : " and the value of the total word count. For each subsequent line, the method should write each key value pair (separated by spaces). For example, the word count file for short.txt is

    total_word_count : 11
    cs231 1
    a 1
    class 3
    classy 1
    is 1
    happy 1
    wonderful 2
    so 1

    Notice that these pairs are in a particular order - in a pre-order traversal of the tree. Be sure to use your getPairs method. Now is a good time to double-check the order of your pairs.

    Note that you can use the FileWriter and BufferedWriter to write to the file.

  3. Write a method that will read the contents of a word count file and reconstruct the fields of the WordCount object.

    public void readWordCountFile( String filename )

    Read in the file, using the BufferedReader and the split method like you did above. To reconstruct the tree, simply call the BSTMap's put method every time you parse a new key-value pair.

  4. Test your ability to reconstruct the tree by
    • Use loadFromOriginalWordsFile to load short.txt.
    • Use writeWordCountFile to write counts_short.txt.
    • Use readWordCountFile to load counts_short.txt.
    • Use writeWordCountFile to write redo_counts_short.txt.
    • On the command line, using the Unix utility diff to determine if counts_short.txt and redo_counts_short.txt are identical. If they are, then diff will output nothing. If they aren't, it will identify the differences.
    • diff counts_short.txt redo_counts_short.txt

      Debug your code until diff doesn't tell you about any differences between your files.

  5. Now, make the main method be able to take the input and/or output filenames from the command line. Then, enerate word count files for files containing a large sample of reddit comments. On the Courses server in the Course Materials folder, are 8 files (reddit_comments_2008.txt through reddit_comments_2015.txt) containing a random selection of user comments from the month of May during each of the years from 2008 to 2015. Each file is approximately 1 GB, so it will take some time both to download the files and to run the code.

    Also, java may not be set up to allow you to use that much memory. You can increase the amount of memory allocated to java with the -Xmx512m flag. Also, I strongly suggest you use a straight-forward file-naming scheme. For example, here is the command I used to generate the counts for 2010 (Notice I prepended the word "counts" rather than appended it - that was because I knew that the next project would be easier to implement if I left the year as the last part of the filename.)

    java -Xmx512m WordCounter reddit_comments_2010.txt counts_reddit_comments_2010.txt

  6. Report on your outuput. Next week, we will be looking at the frequencies of particular words and will be developing lists of the most frequent words. This week, we will simply look at our output. Generate a graph (using your favorite software, e.g. a spreadsheet) showing how the total word count changed from sample to sample. Then generate another showing how the number of unique words changed from sample to sample. Put these graphs in your write-up.


  1. Develop a file with common words in it and have your WordCounter ignore those words. In your write-up, you should discuss how you developed and use the list.
  2. Write out your tree to a word count file in an order that will ensure it is a complete tree (balanced) when you read it in. (Hint: you could sort your pairs by keys, then traverse the list as if you were traversing a complete binary tree.) Then demonstrate that it works.
  3. For any assignment, a good extension will be to implement a Java class that you have't implemented in the past projects and demonstrate that it has the same functionality as the Java class.


Your writeup should have a simple format.


Make your writeup for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written up your assignment, give the page the label:


You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.

© 2015 Ying Li. Page last modified: .