CS 231: Data Structures and Algorithms (Lab Page)

Project 8
Fall 2016

Trends in Word Usage

The main purpose of this project is to give you an opportunity to use your priority queue to determine which words are the most frequently used in Reddit posts. A second purpose is to give you the opportunity to find trends in the usage of particular words over 8 years of Reddit posts.


Tasks

  1. FindCommonWords - create a FindCommonWords class that allows you to read in a word count file and report the most common words. There are two ways to design this class.
    1. You could use a WordCounter to read the word count file and store the word-value pairs so that you can look them up by key. Then use your PriorityQueue (PQHeap<KeyValuePair<String,Integer>>) to store the pairs and retrieve them in order from highest count to lowest count (i.e. the Comparator will need to operate on KeyValuePairs, but use the Value to do the comparison).
    2. Write code in FindCommonWords that reads in a word count file and puts the key-value pairs directly into a PQHeap.

    Test your class using counts_short.txt. Debug it until it works perfectly.

    Run your code on at least one set of Reddit comments and report the 10 most frequent words.
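
    As a rough illustration of a Comparator that operates on KeyValuePairs but compares by Value, here is a minimal sketch. It is not the required design: the KeyValuePair class below is a hypothetical stand-in for your own, and java.util.PriorityQueue stands in for your PQHeap so that the example is self-contained. Java's PriorityQueue removes the element that sorts first, so the comparator puts larger counts first; your PQHeap may use the opposite convention, in which case the comparison would be flipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class CommonWordsSketch {
    // Hypothetical stand-in for the course's KeyValuePair class.
    static class KeyValuePair<K, V> {
        private final K key;
        private final V value;
        KeyValuePair(K key, V value) { this.key = key; this.value = value; }
        K getKey() { return key; }
        V getValue() { return value; }
    }

    // Compares pairs using only the Value (the count); larger counts sort first.
    static class CountDescending implements Comparator<KeyValuePair<String, Integer>> {
        @Override
        public int compare(KeyValuePair<String, Integer> a, KeyValuePair<String, Integer> b) {
            return Integer.compare(b.getValue(), a.getValue());
        }
    }

    // Returns up to n words, ordered from highest count to lowest count.
    // java.util.PriorityQueue is an assumption here; it stands in for PQHeap.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        PriorityQueue<KeyValuePair<String, Integer>> pq =
            new PriorityQueue<>(new CountDescending());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            pq.add(new KeyValuePair<>(e.getKey(), e.getValue()));
        }
        List<String> result = new ArrayList<>();
        while (result.size() < n && !pq.isEmpty()) {
            result.add(pq.poll().getKey());
        }
        return result;
    }
}
```

    Calling topWords(counts, 10) on the counts read from one file would give the 10 most frequent words. Words with equal counts come out in no particular order.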

  2. FindTrends - create a FindTrends class that allows you to determine the frequency of a specific set of words in a set of word count files. For example, you could use command line arguments to specify the names of multiple word count files and then a set of words to search for in those files.

    Report the frequency for each word in each file. Note that it is important to report the frequency, rather than the count, because each file has a different number of words.
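
    The difference between a count and a frequency can be sketched as follows. This is only an illustration: it assumes the counts for one file are already in a java.util.Map (your WordCounter may store them differently, and may already keep the total), and the class and method names are hypothetical.

```java
import java.util.Map;

class FrequencySketch {
    // Frequency of a word = its count divided by the total number of words in the file.
    // Returns 0.0 if the word does not appear or if the file has no words.
    static double frequency(Map<String, Integer> counts, String word) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        Integer count = counts.get(word);
        // Cast to double before dividing to avoid integer division.
        return count == null ? 0.0 : (double) count / total;
    }
}
```

    A word that appears 30 times in a file of 1,000 words and 30 times in a file of 10,000 words has the same count in both, but its frequency drops from 0.03 to 0.003.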

    For example, you could design your main program to require inputs formatted according to this usage statement:

    USAGE java FindTrends <WordCountBaseFilename> <WordCountNumberBegin> <WordCountNumberEnd> <interestingWord1> <interestingWord2> ...
     where <WordCountBaseFilename> contains the text part of the name of 
     each WordCount file you want to analyze.
     and <WordCountNumberBegin> refers to the first number 
     and <WordCountNumberEnd> refers to the last number  in the range of word files you want to analyze.
     <interestingWord1> <interestingWord2> ... is the list of words you want to analyze.
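
    One possible way to unpack arguments in this order is sketched below. The class name, the helper methods, and the ".txt" suffix are assumptions made for illustration; use whatever naming your word count files actually follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrendArgsSketch {
    // Builds one file name per number in [begin, end], e.g. base + 2008 + suffix.
    static List<String> fileNames(String base, int begin, int end, String suffix) {
        List<String> names = new ArrayList<>();
        for (int n = begin; n <= end; n++) {
            names.add(base + n + suffix);
        }
        return names;
    }

    // Everything after the first three arguments is a word to analyze.
    // Assumes args.length >= 3; main checks this before calling.
    static List<String> interestingWords(String[] args) {
        return new ArrayList<>(Arrays.asList(args).subList(3, args.length));
    }

    public static void main(String[] args) {
        if (args.length < 4) {
            System.out.println("USAGE java FindTrends <WordCountBaseFilename> "
                + "<WordCountNumberBegin> <WordCountNumberEnd> "
                + "<interestingWord1> <interestingWord2> ...");
            return;
        }
        int begin = Integer.parseInt(args[1]);
        int end = Integer.parseInt(args[2]);
        // The ".txt" suffix is an assumption about how the files are named.
        List<String> files = fileNames(args[0], begin, end, ".txt");
        List<String> words = interestingWords(args);
        System.out.println("Files: " + files);
        System.out.println("Words: " + words);
    }
}
```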
    
    To generate an example graph of these trends, I used this command:

    java -Xmx512m  FindTrends ../proj07/counts_reddit_comments_ 2008 2015 snapchat uber tesla microsoft apple yahoo
    

    I then copy-pasted the output to an Excel spreadsheet, added years as column headers, and plotted the results as lines. I tried this for several files and found that sometimes I needed to press the "Switch Plot" button to switch which were rows and which were columns.
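
    If you want output that pastes cleanly into a spreadsheet, one option is to print one tab-separated row per word with one column per file. The sketch below assumes that layout; it is not necessarily the format of the output described above, and the class and method names are hypothetical.

```java
import java.util.Locale;

class TrendRowSketch {
    // Formats one row: the word, then one frequency per file, separated by tabs.
    // Locale.ROOT keeps the decimal separator a period regardless of system settings.
    static String row(String word, double[] frequencies) {
        StringBuilder sb = new StringBuilder(word);
        for (double f : frequencies) {
            sb.append('\t').append(String.format(Locale.ROOT, "%.8f", f));
        }
        return sb.toString();
    }
}
```

    Printing row(word, frequencies) once per interesting word gives a block of text in which each row is a word and each column is a year.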

    Call FindTrends with one or more lists of approximately 6-10 words. Choose a theme for the words that you think may trend over 8 years, keeping in mind that the comments are all collected during the month of May. It is a good idea to use words that are not particularly common, such as proper names. Some lists that you may use include

    • sony portal ipad syntax facebook friend
    • phelps usain raisman arod peyton
    • clinton sanders rubio trump obama romney cruz palin

    but you are encouraged to develop your own list.

    Generate a line graph with your results and include it in your write-up. Your write-up should also include an analysis of the output. Are these trends expected or unexpected? What events at the time could explain the trends?


Extensions

Each assignment will have a set of suggested extensions. The required tasks constitute about 85% of the assignment, and if you do only the required tasks and do them well you will earn a B+. To earn a higher grade, you need to undertake at least one extension. The difficulty and quality of the extension or extensions will determine your final grade for the assignment. One significant extension, or 2-3 smaller ones, done well, is typical.

  1. Use more than one list of interesting words and report the trends, including an excellent analysis.
  2. Implement your map using a hashtable instead of a BST.
  3. For any assignment, a good extension will be to implement a Java class yourself and demonstrate that it has the same functionality as the Java class. For example, you could implement your own ArrayList class for this assignment.
  4. For any assignment, a good extension will be to annotate your code to indicate all places where memory is "lost" (in other words, each place where the last remaining reference to an object is either destroyed or is given a new value). If you do this extension, please indicate so in your write-up.
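
For the memory annotation extension, the comments might look something like the sketch below. The Node class and the demo method are hypothetical and exist only to show the two situations named above: a last reference being given a new value, and a last reference being destroyed.

```java
class LostMemorySketch {
    // Hypothetical linked-list node used only for this illustration.
    static class Node {
        String word;
        Node next;
        Node(String word, Node next) { this.word = word; this.next = next; }
    }

    static Node demo() {
        Node head = new Node("uber", null);
        // Nothing lost here: the "uber" node is still reachable through head.next.
        head = new Node("tesla", head);
        // LOST: head held the last reference to the "tesla" node,
        // and this assignment gives head a new value.
        head = head.next;
        Node temp = new Node("apple", null);
        // LOST: temp is the only reference to the "apple" node,
        // and temp is destroyed when the method returns.
        return head;
    }
}
```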

Handin

Make your writeup for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Your writeup should have a simple format.

Once you have written up your assignment, give the page the label:

cs231f16project8

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.