CS 231: Project #8

Fall 2019

Word Trends

The main purpose of this project is to analyze trends in 8 years of Reddit posts. The first analysis will be to find the most common words in each year. A heap is an efficient way to store the word count information for this task, because it gives direct access to the highest-count word without fully sorting the data. The second analysis will be to look at the trends of specific words of your choice over the time period of the data.


Tasks

  1. Write a FindCommonWords class

    Create a FindCommonWords class that is able to read a word count file and report the most common words.

    The FindCommonWords class can either (1) use a WordCounter class to read the file into a BSTMap or HashMap and then insert all of the words into a PQHeap, or (2) write a method that reads the word count file and dumps the words straight into a PQHeap. In both cases, the comparator should use the Value field (frequency).
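
As a rough illustration of comparing on the Value field, the sketch below pulls the N highest-count entries out of a heap. It is only a sketch under assumptions: java.util.PriorityQueue stands in for your own PQHeap, a plain HashMap stands in for the map your WordCounter builds, and the names TopWordsSketch, BY_COUNT_DESC, and topN are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: java.util.PriorityQueue stands in for the course's PQHeap.
class TopWordsSketch {

    // Compares on the value (count) so that the largest count is removed first.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC =
        (a, b) -> Integer.compare(b.getValue(), a.getValue());

    // Returns up to n entries with the highest counts, most frequent first.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(BY_COUNT_DESC);
        heap.addAll(counts.entrySet());
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        while (result.size() < n && !heap.isEmpty()) {
            result.add(heap.poll());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts, used only to exercise the method.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 50);
        counts.put("heap", 7);
        counts.put("java", 12);
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints "the 50" then "java 12"
    }
}
```

Your PQHeap will have its own method names for adding and removing; the point here is only that the comparator looks at the count, not the word.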

    Test your class using counts_ct.txt. Debug it until it works properly.

    Set up your main function so that it can take in a number N followed by one or more filenames of word count files on the command line. The function should loop over the filenames and for each one print out the N most frequent words along with their frequency (count / total word count). When comparing different years, it is important to report the frequency, rather than the count, because each file has a different number of words. Think about how you can structure the printout to make it easy to create a table (e.g. use CSV formatting).

    Usage: FindCommonWords <N> <WC file 1> <...>
    Reports the N most common words in each provided Word Count file.
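
One possible way to report frequency rather than raw count, formatted as a CSV line, is sketched below. The class and method names (FrequencySketch, frequency, csvLine) and the example filename are hypothetical, and the column layout is just one choice among many.

```java
import java.util.Locale;

// Hypothetical sketch of turning a count into a frequency and a CSV line.
class FrequencySketch {

    // Relative frequency; the cast to double avoids integer division.
    static double frequency(int count, long totalWords) {
        return (double) count / totalWords;
    }

    // Formats one CSV line as: filename,word,frequency
    static String csvLine(String filename, String word, int count, long totalWords) {
        return String.format(Locale.ROOT, "%s,%s,%.6f",
                             filename, word, frequency(count, totalWords));
    }

    public static void main(String[] args) {
        // Made-up values: a word seen 5 times in a file of 200 words.
        System.out.println(csvLine("counts_2008.txt", "the", 5, 200));
        // prints "counts_2008.txt,the,0.025000"
    }
}
```

Real frequencies in the Reddit data may be very small, so you may want more decimal places than the six used here.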

    Run your code on the Reddit comments for all of the years 2008 through 2015 and report the 10 most frequent words for each year in a table. Do the most frequent words change much?

  2. Write a FindTrends class

    Create a FindTrends class that determines the frequency of a specific set of words across many different word count files. For example, you could use command line arguments to specify the names of multiple word count files and then a set of words to extract from those files. The purpose is to track word frequency across several years of Reddit comments.

    For this task you do not need a heap; use the BSTMap or HashMap generated by readWordCount in your WordCounter class to access the count/frequency of the desired words.

    For example, you could design your main program to require inputs formatted according to this usage statement:

    USAGE java FindTrends <BaseFilename> <FileNumberBegin> <FileNumberEnd> <Word1> <Word2> ...
     where <BaseFilename> contains the text part of the name of each WordCount file to analyze.
     and <FileNumberBegin> specifies the first file's number suffix
     and <FileNumberEnd> specifies the last number suffix in the range of word files to analyze.
     <Word1> <Word2> ... is the list of words to analyze.
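
A minimal sketch of how arguments in that order might be unpacked is shown below. The ".txt" suffix appended to each filename is an assumption about how your word count files are named, and TrendArgsSketch, buildFilenames, and wordsFrom are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of unpacking the FindTrends-style command line.
class TrendArgsSketch {

    // Builds one filename per number in [begin, end].
    // Assumption: each file is named <base><number>.txt
    static List<String> buildFilenames(String base, int begin, int end) {
        List<String> names = new ArrayList<>();
        for (int n = begin; n <= end; n++) {
            names.add(base + n + ".txt");
        }
        return names;
    }

    // The words to analyze are every argument after the first three.
    static List<String> wordsFrom(String[] args) {
        return Arrays.asList(args).subList(3, args.length);
    }

    public static void main(String[] args) {
        if (args.length < 4) {
            System.out.println("USAGE java TrendArgsSketch <BaseFilename> "
                + "<FileNumberBegin> <FileNumberEnd> <Word1> <Word2> ...");
            return;
        }
        int begin = Integer.parseInt(args[1]);
        int end = Integer.parseInt(args[2]);
        System.out.println("Files: " + buildFilenames(args[0], begin, end));
        System.out.println("Words: " + wordsFrom(args));
    }
}
```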

    Using this command line ordering, a command like the following should produce the data for a graph of the chosen words' frequencies over time.

    java -Xmx512m FindTrends ../proj07/counts_reddit_comments_ 2008 2015 snapchat uber tesla microsoft apple yahoo

    You can copy-paste the output to a spreadsheet, add years as column headers, and plot the results as lines. Sometimes you may need to press the "Switch Plot" button to switch the rows and columns.

    If you want to save yourself time, output the data as a comma-separated values (CSV) file, including the headers, which can be read directly by a spreadsheet program.
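
As one possible layout, the sketch below writes a header row of years followed by one row per word. It assumes the frequencies have already been computed into a 2D array indexed by [word][year]; the names TrendCsvSketch and toCsv, and the values in main, are made up for illustration.

```java
import java.util.Locale;

// Hypothetical sketch of formatting trend data as CSV with a header row.
class TrendCsvSketch {

    // freq[w][y] is the frequency of words[w] in years[y].
    static String toCsv(int[] years, String[] words, double[][] freq) {
        StringBuilder sb = new StringBuilder("word");
        for (int year : years) {
            sb.append(',').append(year);
        }
        sb.append('\n');
        for (int w = 0; w < words.length; w++) {
            sb.append(words[w]);
            for (int y = 0; y < years.length; y++) {
                sb.append(',').append(String.format(Locale.ROOT, "%.8f", freq[w][y]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up frequencies, not real Reddit data.
        int[] years = {2008, 2009};
        String[] words = {"uber", "tesla"};
        double[][] freq = {{0.0, 0.5}, {0.25, 0.125}};
        System.out.print(toCsv(years, words, freq));
    }
}
```

Redirecting the output to a file (for example with `> trends.csv` on the command line) gives something a spreadsheet program can open directly.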

    Call FindTrends with a list of approximately 6-10 words. Choose a theme for the words that you think may trend over 8 years, keeping in mind that the comments are all collected during the month of May. It is a good idea to use words that are not particularly common, such as proper names. The following are some example lists you could use:

    1. sony portal ipad syntax facebook friend,
    2. phelps usain raisman arod peyton,
    3. clinton sanders rubio trump obama romney cruz palin.

    While you are welcome to use these lists, you are encouraged to develop your own.

  3. Generate a graph of the word frequency trends

    Generate a line graph with your results and include it in your report. Your report should also include an analysis of the output. Are these trends expected? unexpected? What events at the time could explain the trends? Think about these questions when you make your lists.


Extensions

The following are some suggested extensions. You should feel free to pursue your own custom extensions to your project that you find interesting. Please clearly identify your extensions in your report. In your code, make sure the baseline simulations run properly. Making a separate sub-folder for major extensions is recommended.

  1. Use more than one list of interesting words and report the trends, including an analysis of the trends and what might explain them.
  2. An alternative method of identifying the top N words is to read them into an ArrayList and sort them. Do a time comparison of this method with using the heap.
  3. You can build a node-based or an array-based heap. Whichever one you choose, implement the other and do a time comparison (try to make it as fair as possible). Which one is faster? Which one is more memory efficient?
  4. Write a balanced tree implementation (e.g. AVL trees or Red-Black trees) and compare it with your basic BSTMap class and the HashMap.
  5. For any assignment, a good extension will be to implement a Java class that you haven't implemented in past projects and demonstrate that it has the same functionality as the Java class.
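
For the time comparison in extension 2, a very rough starting point might look like the sketch below. It works on bare counts rather than word/count pairs, uses java.util.PriorityQueue in place of your own heap, and generates random data instead of reading a word count file; all names (TimingSketch, topBySort, topByHeap) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical sketch: timing "sort everything" against "heap, then remove N".
class TimingSketch {

    // Sorts a copy in descending order and keeps the first n values.
    static List<Integer> topBySort(List<Integer> counts, int n) {
        List<Integer> copy = new ArrayList<>(counts);
        copy.sort(Comparator.reverseOrder());
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    // Puts everything in a max-heap and removes the n largest values.
    static List<Integer> topByHeap(List<Integer> counts, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(Comparator.reverseOrder());
        heap.addAll(counts);
        List<Integer> out = new ArrayList<>();
        while (out.size() < n && !heap.isEmpty()) {
            out.add(heap.poll());
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            counts.add(rng.nextInt(1_000_000));
        }

        long t0 = System.nanoTime();
        List<Integer> bySort = topBySort(counts, 10);
        long t1 = System.nanoTime();
        List<Integer> byHeap = topByHeap(counts, 10);
        long t2 = System.nanoTime();

        System.out.println("sort: " + (t1 - t0) / 1_000_000.0 + " ms");
        System.out.println("heap: " + (t2 - t1) / 1_000_000.0 + " ms");
        System.out.println("same result: " + bySort.equals(byHeap));
    }
}
```

A single timed run like this is noisy, since the JVM compiles code as it runs. For a fairer comparison, repeat each measurement several times and report an average or a median.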

Report

Handin

Make your report for the project a wiki page in your personal space. If you have questions about making a wiki page, stop by my office or ask in lab.

Once you have written your report, give the page the label:

cs231f19project8

You can give any wiki page a label using the label field at the bottom of the page. The label is different from the title.

Do not put code on your writeup page or anywhere it can be publicly accessed. To hand in code, put it in your folder on the Courses fileserver. Create a directory for each project inside the private folder inside your username folder.

Please do not submit the Reddit text files along with your code.