Objectives

The main purpose of this project is to give you an opportunity to use your priority queue to determine which words are used most frequently in Reddit posts. A second purpose is to give you the opportunity to find trends in the usage of particular words over 8 years of Reddit posts.

Tasks

  1. FindCommonWords: create a FindCommonWords class that allows you to read in a word count file and report the most common words. There are two ways to design this class.
    1. You could use a WordCounter to read the word count file and store the word-value pairs so that you can look them up by key. Then use your priority queue, a PQHeap<KeyValuePair<String,Integer>>, to store the pairs and retrieve them in order from highest count to lowest count (i.e. the Comparator will need to operate on KeyValuePairs, but use the Value to do the comparison).
    2. Write code in FindCommonWords that reads in a word count file and puts the key-value pairs directly into a PQHeap.

    Test your class using counts_short.txt. Debug it until it works perfectly.

    Run your code on at least one set of Reddit comments and report the 10 most frequent words.
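
    The comparison described in the first design can be sketched as follows. This is only an illustration under stated assumptions: java.util.PriorityQueue and Map.Entry stand in for your own PQHeap and KeyValuePair classes, whose method names may differ, and the words and counts are made up.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: java.util.PriorityQueue and Map.Entry are stand-ins
// for the course's PQHeap and KeyValuePair classes.
class CommonWordsSketch {

    // Compares pairs by their value (the count), largest count first.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC =
        (a, b) -> Integer.compare(b.getValue(), a.getValue());

    // Returns up to n words, ordered from highest count to lowest.
    static List<String> topWords(List<Map.Entry<String, Integer>> pairs, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(BY_COUNT_DESC);
        heap.addAll(pairs);

        List<String> result = new ArrayList<>();
        while (result.size() < n && !heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("the", 40));
        pairs.add(new SimpleEntry<>("reddit", 7));
        pairs.add(new SimpleEntry<>("heap", 12));
        System.out.println(topWords(pairs, 2));
    }
}
```

    Because the Comparator looks only at the value, the same queue works no matter how the pairs were read in, so it fits either design.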

  2. FindTrends: create a FindTrends class that allows you to determine the frequency of a specific set of words in a set of word count files. For example, you could use command line arguments to specify the names of multiple word count files and then a set of words to search for in those files.

    Report the frequency for each word in each file. Note that it is important to report the frequency, rather than the count, because each file has a different number of words.
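
    To see why the frequency matters, here is a minimal sketch that assumes you already know a word's count and the total number of words in its file. The class name, method name, and numbers are hypothetical and not part of the assignment.

```java
// Hypothetical sketch: converts a raw count into a frequency.
class FrequencySketch {

    // Frequency is the word's count divided by the total number of words
    // counted in the same file.
    static double frequency(int count, long totalWords) {
        if (totalWords <= 0) {
            throw new IllegalArgumentException("totalWords must be positive");
        }
        return (double) count / totalWords;
    }

    public static void main(String[] args) {
        // Made-up numbers: the same count of 50 in a small and a large file.
        System.out.println(frequency(50, 1000));
        System.out.println(frequency(50, 100000));
    }
}
```

    The two calls use the same count, but the word makes up a much larger share of the smaller file. Comparing raw counts across files would hide that difference.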

    For example, you could design your main program to require inputs formatted according to this usage statement:

    USAGE java FindTrends <WordCountBaseFilename> <WordCountNumberBegin> <WordCountNumberEnd> <interestingWord1> <interestingWord2> ...
     where <WordCountBaseFilename> contains the text part of the name of
     each WordCount file you want to analyze,
     <WordCountNumberBegin> refers to the first number and
     <WordCountNumberEnd> refers to the last number in the range of word files you want to analyze, and
     <interestingWord1> <interestingWord2> ... is the list of words you want to analyze.
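
    One way to unpack arguments in this format is sketched below. The class name, the helper names, and the ".txt" suffix appended to each filename are assumptions made for illustration; adjust them to match your own files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of parsing the arguments in the usage statement.
class TrendArgsSketch {

    // Builds one filename per number in the range [begin, end].
    // The ".txt" suffix is an assumption about how the files are named.
    static List<String> filenames(String base, int begin, int end) {
        List<String> names = new ArrayList<>();
        for (int n = begin; n <= end; n++) {
            names.add(base + n + ".txt");
        }
        return names;
    }

    // Everything after the first three arguments is a word to analyze.
    static List<String> wordsOfInterest(String[] args) {
        return Arrays.asList(Arrays.copyOfRange(args, 3, args.length));
    }

    public static void main(String[] args) {
        if (args.length < 4) {
            System.out.println("USAGE java TrendArgsSketch <WordCountBaseFilename> "
                + "<WordCountNumberBegin> <WordCountNumberEnd> <interestingWord1> ...");
            return;
        }
        int begin = Integer.parseInt(args[1]);
        int end = Integer.parseInt(args[2]);
        System.out.println("Files: " + filenames(args[0], begin, end));
        System.out.println("Words: " + wordsOfInterest(args));
    }
}
```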
    

    To generate a line graph of word trends, you can start with a command like the following:

    java -Xmx512m FindTrends ../proj07/counts_reddit_comments_ 2008 2015 snapchat uber tesla microsoft apple yahoo

    Then copy and paste the output into an Excel spreadsheet, add the years as column headers, and plot the results as lines. Sometimes you may need to press the "Switch Plot" button to switch the rows and columns.

    Call FindTrends with one or more lists of approximately 6-10 words. Choose a theme for the words that you think may trend over 8 years, keeping in mind that the comments were all collected during the month of May. It is a good idea to use words that are not particularly common, such as proper names. Here are some lists that you may use:

    • sony portal ipad syntax facebook friend,
    • phelps usain raisman arod peyton,
    • clinton sanders rubio trump obama romney cruz palin.

    But you are encouraged to develop your own list.

  3. Generate a line graph with your results and include it in your write-up. Also include an analysis of the output in your write-up. Are these trends expected or unexpected? What events at the time could explain the trends?

Extensions

  1. Use more than one list of interesting words and report the trends, including an excellent analysis.
  2. For any assignment, a good extension is to implement your own version of a Java library class that you haven't implemented in past projects and demonstrate that it has the same functionality as the library class.

Writeup

Handin

Make your writeup for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written up your assignment, give the page the label:

cs231f17project8

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.


When you are done with the lab exercises, you may start on the rest of the project.


© 2017 Caitrin Eaton.