CS 251: Assignment #5

Single and Multiple Linear Regression

Due 20 March 2015 (before you leave for spring break)

The goal of this project is to integrate simple linear regression into your project and then implement a multiple linear regression function in your analysis file. Incorporating multiple linear regression into your full GUI is the obvious big extension.


  1. Implement an updateFits method, similar to updateAxes, that makes the linear fit move along with the data. Make sure updateFits is called wherever updateAxes and updatePoints are called.
  2. Test your implementation. Make sure everything works cleanly if you run a second linear regression, open a new file, go back and forth between plotting data and linear regressions, and translate/scale/rotate the screen. Make sure cancelling the linear regression dialog actually cancels the process.
  3. In your analysis class, create a new function linear_regression that takes in the data set, a list of headers for the independent variables, and a single header (not in a list) for the dependent variable. The function should implement linear regression for one or more independent variables. The algorithm is as follows. It's not a long function, but each step includes a description of what you are computing.

    def linear_regression(d, ind, dep):
      # assign to y the column of data for the dependent variable
      # assign to A the columns of data for the independent variables
      # add a column of 1's to A to represent the constant term in the 
      #    regression equation.  Remember, this is just y = mx + b (even 
      #    if m and x are vectors).
      # assign to AAinv the result of calling numpy.linalg.inv( numpy.dot(A.T, A) )
      #    The matrix A.T * A is the covariance matrix of the independent
      #    data, and we will use it for computing the standard error of the 
      #    linear regression fit below.
      # assign to x the result of calling numpy.linalg.lstsq( A, y )
      #    This solves the equation y = Ab, where A is a matrix of the 
      #    independent data, b is the set of unknowns as a column vector, 
      #    and y is the dependent column of data.  The return value x 
      #    contains the solution for b.
      # assign to b the first element of x.
      # assign to N the number of data points (rows in y)
      # assign to C the number of coefficients (rows in b)
      # assign to df_e the value N-C.
      #    This is the number of degrees of freedom of the error
      # assign to df_r the value C-1
      #    This is the number of degrees of freedom of the model fit
      # assign to error, the error of the model prediction.  Do this by 
      #    taking the difference between the value to be predicted and
      #    the prediction. 
      #    y - numpy.dot(A, b)
      # assign to sse, the sum squared error, which is the sum of the
      #    squares of the errors computed in the prior step, divided by the
      #    number of degrees of freedom of the error.  The result is a 1x1 matrix.
      #    numpy.dot(error.T, error) / df_e
      # assign to stderr, the standard error, which is the square root
      #    of the diagonals of the sum-squared error multiplied by the
      #    inverse covariance matrix of the data. This will be a Cx1 vector.
      #    numpy.sqrt( numpy.diagonal( sse[0, 0] * AAinv ) )
      # assign to t, the t-statistic for each independent variable by dividing 
      #    each coefficient of the fit by the standard error.
      #    t = b.T / stderr
      # assign to p, the probability of the coefficient indicating a
      #    random relationship. To do this we use the cumulative distribution
      #    function of the student-t distribution.  Multiply by 2 to get the
      #    2-sided tail.
      #    2*(1 - scipy.stats.t.cdf(abs(t), df_e))
      # assign to r2, the r^2 coefficient indicating the quality of the fit.
      #    1 - error.var() / y.var()
      # Return the values of the fit (b), the sum-squared error, the
      #     R^2 fit quality, the t-statistic, and the probability of a
      #     random relationship.

  4. Write a simple test function in your analysis.py file that reads in a data set and then does a multiple linear regression fit. Test it on the following three data files.
    1. data-clean.csv
      m0 = 0.984, m1 = 2.088, b = -0.035, sse = 0.002,
      R2 = 0.996, t = [8.6, 18.9, -0.88], p = [2.8e-5, 1.4e-7, 2.0e-1]
    2. data-good.csv
      m0 = 0.885, m1 = 1.880, b = -0.146, sse = 0.090,
      R2 = 0.885, t = [2.34, 5.22, 0.568], p = [0.026, 0.0006, 0.294]
    3. data-noisy.csv
      m0 = -0.336, m1 = 3.335, b = -0.263, sse = 1.03,
      R2 = 0.611, t = [-0.28, 3.08, -0.255], p = [0.393, 0.0089, 0.403]
  5. Find a data set where you think there is a relationship between two variables. Minimum and maximum daily temperature, for example, is one possibility. You could also try year versus average yearly temperature for the past 30 years, or carbon dioxide levels versus average yearly temperature over the same time period.
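
The steps listed for linear_regression in item 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference solution: the function name linear_regression_arrays is hypothetical, and it takes plain NumPy arrays rather than a data object and headers, because the interface of your own data class is not described here. The column of 1's is appended last, so the coefficients come back as slopes first and intercept last, which matches the m0, m1, b ordering of the expected results in item 4.

```python
import numpy as np
import scipy.stats


def linear_regression_arrays(ind_cols, dep_col):
    """Multiple linear regression on plain arrays (hypothetical helper).

    ind_cols: (N, k) array of independent variables.
    dep_col:  N values of the dependent variable.
    Returns (b, sse, r2, t, p) as described in the steps above.
    """
    ind_cols = np.asarray(ind_cols, dtype=float)
    y = np.asarray(dep_col, dtype=float).reshape(-1, 1)

    # Append a column of 1's for the constant term (intercept comes last).
    A = np.hstack((ind_cols, np.ones((ind_cols.shape[0], 1))))

    # Inverse of A.T * A, used below for the standard error.
    AAinv = np.linalg.inv(np.dot(A.T, A))

    # Least-squares solution of y = Ab; the first return value is b.
    x = np.linalg.lstsq(A, y, rcond=None)
    b = x[0]

    N = y.shape[0]   # number of data points
    C = b.shape[0]   # number of coefficients
    df_e = N - C     # degrees of freedom of the error
    df_r = C - 1     # degrees of freedom of the model fit (not used further)

    # Residuals and the error sum of squares divided by df_e (1x1 array).
    error = y - np.dot(A, b)
    sse = np.dot(error.T, error) / df_e

    # Standard error of each coefficient, then t-statistics and 2-sided p.
    stderr = np.sqrt(np.diagonal(sse[0, 0] * AAinv))
    t = b.T / stderr
    p = 2 * (1 - scipy.stats.t.cdf(np.abs(t), df_e))

    # R^2 fit quality.
    r2 = 1 - error.var() / y.var()

    return b, sse, r2, t, p


if __name__ == "__main__":
    # Synthetic data (made up for illustration): y = 1*x0 + 2*x1 + 0.5 + noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(50, 2))
    y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 + rng.normal(0, 0.01, size=50)
    b, sse, r2, t, p = linear_regression_arrays(X, y)
    print("coefficients (m0, m1, b):", b.ravel())
    print("sse:", sse[0, 0], "R2:", r2)
    print("t:", t.ravel(), "p:", p.ravel())
```

To check against the three provided files, you would pull the independent and dependent columns out of your own data object and pass them in the same way. How you select columns by header depends on your data class.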



Write a brief explanation, with screen shots, of how to run a linear regression in your application. Include any extensions or enhancements you implemented.

Include the screen shots for the provided data sets and for your own. Be sure to note what axes are being plotted in any images you show.


Once you have written up your assignment, give the page the label:


Put your code in the Private subdirectory of your folder on Courses. Please make sure you are organizing your code by project. Your handin code should include all the Python files necessary to run your program as well as the data files you used to test your code.