Spring 2018

Single and Multiple Linear Regression

Due 23 March 2018 (before you leave for spring break)

The goal of this project is to integrate simple linear regression into your application and then implement multiple linear regression as a function in your analysis file. Incorporating multiple linear regression into your full GUI is the obvious extension.


Tasks

  1. Implement an updateFits method, similar to updateAxes. The updateFits method should make the linear fit move along with the data. Make sure updateFits is called wherever updateAxes and updatePoints are called in the Display class.
  2. Test your implementation. Make sure everything works cleanly if you run a second linear regression, open a new file, go back and forth between plotting data and linear regressions, and translate/scale/rotate the screen. Make sure cancelling the linear regression dialog actually cancels the process and does not change the existing visualization.

    The first required result is a plot of your regression line on the data-simple.csv data.
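
    One way to think about updateFits is as the same transformation you apply to the axes, applied to the two endpoints of the fit line. The block below is a minimal sketch under stated assumptions, not the required design: it assumes your display keeps a 4x4 view transformation matrix (called vtm here) and stores the fit endpoints as homogeneous points in normalized data coordinates. The function name fit_screen_coords and the sample matrix are hypothetical.

```python
import numpy as np


def fit_screen_coords(vtm, endpoints):
    """Map fit-line endpoints to screen (x, y) pairs.

    ASSUMPTION: vtm is a 4x4 view transformation matrix and endpoints is a
    2x4 array holding one homogeneous point (x, y, z, 1) per row.
    """
    pts = (vtm @ endpoints.T).T  # transform both endpoints at once
    return [(float(px), float(py)) for px, py in pts[:, :2]]


# Hypothetical view: scale by 100 and shift by 50 pixels in x and y.
vtm = np.array([[100.0, 0.0, 0.0, 50.0],
                [0.0, 100.0, 0.0, 50.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])

# Endpoints of a fit line in normalized data space.
endpoints = np.array([[0.0, 0.25, 0.0, 1.0],
                      [1.0, 0.75, 0.0, 1.0]])

coords = fit_screen_coords(vtm, endpoints)

# If the line is drawn on a Tkinter canvas (an assumption), updateFits could
# then move the existing line item with:
#   canvas.coords(line_id, coords[0][0], coords[0][1], coords[1][0], coords[1][1])
```

    Recomputing the screen coordinates from the stored endpoints each time the view changes keeps the line attached to the data under translation, scaling, and rotation.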

  3. In your analysis class, create a new function linear_regression that takes in the data set, a list of headers for the independent variables, and a single header (not in a list) for the dependent variable. The function should implement linear regression for one or more independent variables. The algorithm is as follows. It's not a long function. Each step identified below includes a description of what you are computing.

    def linear_regression(d, ind, dep):
      # assign to y the column of data for the dependent variable
      # assign to A the columns of data for the independent variables
      #    It's best if both y and A are numpy matrices
      # add a column of 1's to A to represent the constant term in the 
      #    regression equation.  Remember, this is just y = mx + b (even 
      #    if m and x are vectors).
      
      # assign to AAinv the result of calling numpy.linalg.inv( numpy.dot(A.T, A) )
      #    The matrix A.T * A is the covariance matrix of the independent
      #    data, and we will use it for computing the standard error of the 
      #    linear regression fit below.
    
      # assign to x the result of calling numpy.linalg.lstsq( A, y )
      #    This solves the equation y = Ab, where A is a matrix of the 
      #    independent data, b is the set of unknowns as a column vector, 
      #    and y is the dependent column of data.  The return value x 
      #    contains the solution for b.
    
      # assign to b the first element of x.
      #    This is the solution that provides the best fit regression
      # assign to N the number of data points (rows in y)
      # assign to C the number of coefficients (rows in b)
      # assign to df_e the value N-C, 
      #    This is the number of degrees of freedom of the error
      # assign to df_r the value C-1
      #    This is the number of degrees of freedom of the model fit
      #    It means if you have C-1 of the values of b you can find the last one.
    
      # assign to error, the error of the model prediction.  Do this by 
      #    taking the difference between the value to be predicted and
      #    the prediction. These are the vertical differences between the
      #    regression line and the data.
      #    y - numpy.dot(A, b)
    
      # assign to sse, the sum squared error, which is the sum of the
      #    squares of the errors computed in the prior step, divided by the
      #    number of degrees of freedom of the error.  The result is a 1x1 matrix.
      #    numpy.dot(error.T, error) / df_e
    
      # assign to stderr, the standard error, which is the square root
      #    of the diagonals of the sum-squared error multiplied by the
      #    inverse covariance matrix of the data. This will be a Cx1 vector.
      #    numpy.sqrt( numpy.diagonal( sse[0, 0] * AAinv ) )
    
      # assign to t, the t-statistic for each independent variable by dividing 
      #    each coefficient of the fit by the standard error.
      #    t = b.T / stderr
    
      # assign to p, the probability of the coefficient indicating a
      #    random relationship (slope = 0). To do this we use the 
      #    cumulative distribution function of the student-t distribution.  
      #    Multiply by 2 to get the 2-sided tail.
      #    2*(1 - scipy.stats.t.cdf(abs(t), df_e))
    
      # assign to r2, the r^2 coefficient indicating the quality of the fit.
      #    1 - error.var() / y.var()
    
      # Return the values of the fit (b), the sum-squared error, the
      #     R^2 fit quality, the t-statistic, and the probability of a
      #     random relationship.
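
    The steps above can be sketched in NumPy and SciPy roughly as follows. This is one possible reading of the outline, not a reference solution. It assumes d is a plain dictionary mapping each header to a column of numbers, which stands in for your own data object; replace the column lookups with whatever accessor your Data class provides. The synthetic data at the bottom is made up for illustration.

```python
import numpy as np
from scipy import stats


def linear_regression(d, ind, dep):
    """Least-squares fit of dep on ind; the last coefficient is the intercept."""
    # ASSUMPTION: d is a dict of header -> 1-D numeric sequence.
    y = np.asarray(d[dep], dtype=float).reshape(-1, 1)
    A = np.column_stack([np.asarray(d[h], dtype=float) for h in ind])
    A = np.hstack([A, np.ones((A.shape[0], 1))])  # column of 1's for the constant

    AAinv = np.linalg.inv(A.T @ A)

    x = np.linalg.lstsq(A, y, rcond=None)
    b = x[0]  # C x 1 column of fitted coefficients

    N = y.shape[0]  # number of data points
    C = b.shape[0]  # number of coefficients
    df_e = N - C    # degrees of freedom of the error
    df_r = C - 1    # degrees of freedom of the model fit (not used below)

    error = y - A @ b
    sse = (error.T @ error) / df_e                    # 1 x 1 matrix
    stderr = np.sqrt(np.diagonal(sse[0, 0] * AAinv))  # one value per coefficient
    t = b.T / stderr
    p = 2 * (1 - stats.t.cdf(np.abs(t), df_e))
    r2 = 1 - error.var() / y.var()

    return b, sse[0, 0], r2, t, p


# Synthetic check: y = 2*x0 - 1*x1 + 0.5 plus a little noise.
rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, 50)
x1 = rng.uniform(0, 1, 50)
y = 2.0 * x0 - 1.0 * x1 + 0.5 + rng.normal(0, 0.01, 50)

b, sse, r2, t, p = linear_regression({"x0": x0, "x1": x1, "y": y},
                                     ["x0", "x1"], "y")
```

    Because the column of 1's is appended last, the intercept is the last entry of b, which matches the order of the expected t and p values listed in the next task.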
    

  4. Write a simple test function in your analysis.py file that reads in a data set and then does a multiple linear regression fit. Test it on the following three data files.
    1. data-clean.csv
      m0 = 0.984, m1 = 2.088, b = -0.035, sse = 0.002,
      R2 = 0.996, t = [8.6, 18.9, -0.88], p = [5.6e-5, 2.9e-7, 0.405]
    2. data-good.csv
      m0 = 0.885, m1 = 1.880, b = 0.146, sse = 0.090,
      R2 = 0.885, t = [2.34, 5.22, 0.568], p = [0.052, 0.001, 0.588]
    3. data-noisy.csv
      m0 = -0.336, m1 = 3.335, b = -0.263, sse = 1.03,
      R2 = 0.611, t = [-0.28, 3.08, -0.255], p = [0.787, 0.018, 0.806]

    In your writeup, show the results of running your function on these three data sets and confirm that it is working properly.
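
    To make the comparison with the expected values easier, it can help to print your results in the same layout. The helper below is a hypothetical convenience, not part of the assignment: format_fit is an invented name, and it assumes the five return values are ordered as in the function outline, with the intercept as the last coefficient. The sample call simply feeds in the expected data-clean.csv numbers listed above.

```python
import numpy as np


def format_fit(b, sse, r2, t, p):
    """Return the fit results as text in the layout used for the expected values."""
    # Flatten everything so column vectors, row vectors, and lists all work.
    b = np.ravel(b)
    t = np.ravel(t)
    p = np.ravel(p)
    sse = np.ravel(sse)[0]
    r2 = np.ravel(r2)[0]

    # All coefficients except the last are slopes; the last is the intercept.
    slopes = ", ".join("m%d = %.3f" % (i, m) for i, m in enumerate(b[:-1]))
    t_text = ", ".join("%.3g" % v for v in t)
    p_text = ", ".join("%.3g" % v for v in p)

    line1 = "%s, b = %.3f, sse = %.3f" % (slopes, b[-1], sse)
    line2 = "R2 = %.3f, t = [%s], p = [%s]" % (r2, t_text, p_text)
    return line1 + "\n" + line2


# Expected values for data-clean.csv, used here only to show the layout.
text = format_fit([0.984, 2.088, -0.035], 0.002, 0.996,
                  [8.6, 18.9, -0.88], [5.6e-5, 2.9e-7, 0.405])
print(text)
```

    In your test function you would pass the values returned by linear_regression instead of the literals shown here.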

  5. Find a data set where you think there is a relationship between two variables. Minimum and maximum daily temperature, for example, is one possibility. You could also try year versus average yearly temperature for the past 30 years, or carbon dioxide levels versus average yearly temperature over the same time period. Look on the main course page for data set options.

    Using the data set you selected, execute a linear regression using the GUI interface you completed in lab with one independent variable and one dependent variable. Include the results in your writeup and explain whether they make sense. Also include a picture of the linear regression plotted over your data using your GUI.

  6. Using the data set you selected, execute a multiple linear regression using the analysis function you wrote. Include the numerical results in your writeup. Also include a picture of the data plotted in your GUI (this picture does not have to include the regression line, just the data). It is an extension to have the multiple linear regression line plotted in your GUI.


Extensions


Report

Make a wiki page for the project report.

  • Write a brief summary of your project, separate from the body of your report, that describes the purpose, the task, the key parts of your solution, and the result of your work (did it work, and what can you do with your GUI?). The summary should be 200 words or fewer.
  • Write a brief explanation of how to run a linear regression, with screen shots, in your application. Include any extensions or enhancements you implemented. Explain the meaning of a linear regression plot.
  • Include the required screen shots for the provided data sets and for your own. In the text of your writeup, note what axes are being plotted in any images you show. Please also include a description of what the plot means in terms of the relationship between the two variables.

  • Describe any extensions or enhancements you implemented. Include pictures as appropriate.
  • Acknowledgements: a list of people you worked with, including TAs and instructors. Include in that list anyone whose code you may have seen, such as friends who have taken the course in a previous semester.

Handin

    Once you have written up your assignment, give the page the label:

    cs251s18project5

    Put your code in the Private subdirectory of your folder on Courses. Please make sure you are organizing your code by project. Your handin code should include all the Python files necessary to run your program as well as the data files you used to test your code.