Single and Multiple Linear Regression
Due 23 March 2018 (before you leave for spring break)
The goal of this project is to integrate simple linear regression into your project and then implement multiple linear regression as a function in your analysis file. Incorporating multiple linear regression into your full GUI is the obvious extension.
Tasks
- Implement an updateFits method, similar to updateAxes. The updateFits method should make the linear fit move along with the data. Make sure updateFits is called wherever updateAxes and updatePoints are called in the Display class.
-
Test your implementation. Make sure everything works cleanly if you
run a second linear regression, open a new file, go back and forth
between plotting data and linear regressions, and
translate/scale/rotate the screen. Make sure cancelling the linear
regression dialog actually cancels the process and does not change the
existing visualization.
The first required result is a plot of your regression line on the data-simple.csv data.
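As a starting point for updateFits, the sketch below shows one way to recompute the screen coordinates of a fit line from the current view. It rests on assumptions that are not stated in this handout: the data are plotted in a normalized [0, 1] space, the view is a 4x4 transformation matrix applied to homogeneous column vectors, and the helper names fit_line_endpoints and to_screen are hypothetical.

```python
import numpy as np


def fit_line_endpoints(slope, intercept, x_range, y_range):
    """Endpoints of y = slope * x + intercept in normalized data space.

    x_range and y_range are (min, max) tuples for the plotted columns.
    Returns a 2x4 array holding one homogeneous point per row.
    Assumption: the plot maps each column's [min, max] onto [0, 1].
    """
    xmin, xmax = x_range
    ymin, ymax = y_range
    # Evaluate the line at the left and right edges of the data, then
    # normalize the resulting y values the same way the data were normalized.
    y_left = ((xmin * slope + intercept) - ymin) / (ymax - ymin)
    y_right = ((xmax * slope + intercept) - ymin) / (ymax - ymin)
    return np.array([[0.0, y_left, 0.0, 1.0],
                     [1.0, y_right, 0.0, 1.0]])


def to_screen(vtm, endpoints):
    """Apply a 4x4 view matrix to the endpoints; return (x0, y0, x1, y1)."""
    pts = (vtm @ endpoints.T).T
    return pts[0, 0], pts[0, 1], pts[1, 0], pts[1, 1]
```

Inside updateFits you would rebuild the view matrix, call to_screen on the stored endpoints, and pass the four values to whatever call moves the line on your canvas. Storing the normalized endpoints once, when the regression is run, keeps updateFits cheap enough to call on every translate, scale, or rotate event.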
-
In your analysis class, create a new function linear_regression that
takes in the data set, a list of headers for the independent
variables, and a single header (not in a list) for the dependent
variable. The function should implement linear regression for one or
more independent variables. The algorithm is as follows. It's not a
long function. Each step identified below includes a description of
what you are computing.
def linear_regression(d, ind, dep):
    # assign to y the column of data for the dependent variable
    # assign to A the columns of data for the independent variables
    #    It's best if both y and A are numpy matrices
    # add a column of 1's to A to represent the constant term in the
    #    regression equation. Remember, this is just y = mx + b (even
    #    if m and x are vectors).

    # assign to AAinv the result of calling numpy.linalg.inv( numpy.dot(A.T, A) )
    #    The matrix A.T * A is the covariance matrix of the independent
    #    data, and we will use it for computing the standard error of the
    #    linear regression fit below.

    # assign to x the result of calling numpy.linalg.lstsq( A, y )
    #    This solves the equation y = Ab, where A is a matrix of the
    #    independent data, b is the set of unknowns as a column vector,
    #    and y is the dependent column of data. The return value x
    #    contains the solution for b.

    # assign to b the first element of x.
    #    This is the solution that provides the best fit regression
    # assign to N the number of data points (rows in y)
    # assign to C the number of coefficients (rows in b)
    # assign to df_e the value N-C,
    #    This is the number of degrees of freedom of the error
    # assign to df_r the value C-1
    #    This is the number of degrees of freedom of the model fit
    #    It means if you have C-1 of the values of b you can find the last one.

    # assign to error, the error of the model prediction. Do this by
    #    taking the difference between the value to be predicted and
    #    the prediction. These are the vertical differences between the
    #    regression line and the data.
    #    y - numpy.dot(A, b)

    # assign to sse, the sum squared error, which is the sum of the
    #    squares of the errors computed in the prior step, divided by the
    #    number of degrees of freedom of the error. The result is a 1x1 matrix.
    #    numpy.dot(error.T, error) / df_e

    # assign to stderr, the standard error, which is the square root
    #    of the diagonals of the sum-squared error multiplied by the
    #    inverse covariance matrix of the data. This will be a Cx1 vector.
    #    numpy.sqrt( numpy.diagonal( sse[0, 0] * AAinv ) )

    # assign to t, the t-statistic for each independent variable by dividing
    #    each coefficient of the fit by the standard error.
    #    t = b.T / stderr

    # assign to p, the probability of the coefficient indicating a
    #    random relationship (slope = 0). To do this we use the
    #    cumulative distribution function of the student-t distribution.
    #    Multiply by 2 to get the 2-sided tail.
    #    2*(1 - scipy.stats.t.cdf(abs(t), df_e))

    # assign to r2, the r^2 coefficient indicating the quality of the fit.
    #    1 - error.var() / y.var()

    # Return the values of the fit (b), the sum-squared error, the
    #    R^2 fit quality, the t-statistic, and the probability of a
    #    random relationship.
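The steps above can be sketched directly in numpy. This is a minimal illustration, not the required solution: it takes plain arrays rather than your data object and header names (the lookup from headers to columns depends on your own data class), and the function name linear_regression_sketch is hypothetical. It assumes numpy and scipy are installed.

```python
import numpy as np
import scipy.stats


def linear_regression_sketch(A_raw, y):
    """Least-squares fit of y = A b, following the commented steps.

    A_raw: N x k array of independent variables (assumed already extracted).
    y: N values of the dependent variable.
    Returns (b, sse, r2, t, p); the intercept is the last entry of b.
    """
    A_raw = np.asarray(A_raw, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    # Column of ones for the constant term of the regression equation.
    A = np.hstack((A_raw, np.ones((A_raw.shape[0], 1))))

    AAinv = np.linalg.inv(np.dot(A.T, A))

    # lstsq returns a tuple; its first element is the solution for b.
    b = np.linalg.lstsq(A, y, rcond=None)[0]

    N = y.shape[0]   # number of data points
    C = b.shape[0]   # number of coefficients
    df_e = N - C     # degrees of freedom of the error

    error = y - np.dot(A, b)
    sse = np.dot(error.T, error) / df_e                 # 1x1 array
    stderr = np.sqrt(np.diagonal(sse[0, 0] * AAinv))    # one value per coefficient
    t = b.T / stderr
    p = 2 * (1 - scipy.stats.t.cdf(np.abs(t), df_e))    # two-sided tail
    r2 = 1 - error.var() / y.var()

    return b, sse[0, 0], r2, t, p
```

Passing rcond=None to lstsq only selects numpy's current default cutoff and silences a warning on some versions; it does not change the fit for well-conditioned data. Note that t and p come back as 1 x C arrays, in the same order as the coefficients in b.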
-
Write a simple test function in your analysis.py file that reads in a
data set and then does a multiple linear regression fit. Test it on
the following three data files.
- data-clean.csv
m0 = 0.984, m1 = 2.088, b = -0.035, sse = 0.002,
R2 = 0.996, t = [8.6, 18.9, -0.88], p = [5.6e-5, 2.9e-7, 0.405]
- data-good.csv
m0 = 0.885, m1 = 1.880, b = 0.146, sse = 0.090,
R2 = 0.885, t = [2.34, 5.22, 0.568], p = [0.052, 0.001, 0.588]
- data-noisy.csv
m0 = -0.336, m1 = 3.335, b = -0.263, sse = 1.03,
R2 = 0.611, t = [-0.28, 3.08, -0.255], p = [0.787, 0.018, 0.806]
In your writeup, show the results of running your function on these three data sets and confirm that it is working properly.
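One way to confirm that your function is working is to compare its output against the reference values listed for the three files. The sketch below is only an illustration: the helper name check_fit and the tolerances are assumptions, chosen loosely because the reference values are printed to about three significant figures. The coefficients are ordered [m0, m1, b], matching the list.

```python
import math

# Reference values copied from the list of expected results, ordered [m0, m1, b].
EXPECTED = {
    "data-clean.csv": {"coeffs": [0.984, 2.088, -0.035], "sse": 0.002, "r2": 0.996},
    "data-good.csv": {"coeffs": [0.885, 1.880, 0.146], "sse": 0.090, "r2": 0.885},
    "data-noisy.csv": {"coeffs": [-0.336, 3.335, -0.263], "sse": 1.03, "r2": 0.611},
}


def check_fit(filename, coeffs, sse, r2, rel_tol=0.02, abs_tol=0.001):
    """Return the names of any values that disagree with the reference.

    coeffs is your computed [m0, m1, b]; an empty list means everything
    matched within the (assumed) tolerances.
    """
    ref = EXPECTED[filename]
    names = ["m0", "m1", "b", "sse", "r2"]
    computed = list(coeffs) + [sse, r2]
    reference = ref["coeffs"] + [ref["sse"], ref["r2"]]
    return [
        name
        for name, got, want in zip(names, computed, reference)
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
```

In your test function you could call check_fit with the values your regression returns for each file and print any names it reports. The t and p values can be checked the same way if you extend the dictionary.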
-
Find a data set where you think there is a relationship between two
variables. Minimum and maximum daily temperature, for example, is one
possibility. You could also try year versus average yearly
temperature for the past 30 years, or carbon dioxide levels versus
average yearly temperature over the same time period. Look on the
main course page for data set options.
Using the data set you selected, run a linear regression in the GUI you completed in lab, with one independent variable and one dependent variable. Include the results in your writeup and explain whether they make sense. Also include a picture of the linear regression plotted over your data using your GUI.
Using the data set you selected, execute a multiple linear regression using the analysis function you wrote. Include the numerical results in your writeup. Also include a picture of the data plotted in your GUI (this picture does not have to include the regression line, just the data). It is an extension to have the multiple linear regression line plotted in your GUI.
Extensions
- Incorporate multiple linear regression into your GUI. Start by just displaying the coefficients of the fit, then extend the GUI to display the regression line in 3D. For fits higher than 3D, you have to be careful when calculating the endpoints of the best fit line in the view space.
- Further extend your GUI in any of the directions suggested last week. Add legends, axis labels (e.g. headers and values) or other features to the GUI for plotting data.
- Do some more exploration with different data sets using your new tool.
- Give the user the ability to store and recall prior analyses. This capability can be limited to the current session.
- Give the user the ability to save the linear regression analysis to a file in a human-readable format. Extend it even further to allow the user to read an analysis back in and replot it over the correct data.
- Figure out how to save a picture of a plot to a file.
- Be creative and add useful features to your GUI.
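For the extension on saving an analysis in a human-readable format, JSON is one reasonable choice because it can be read back without a custom parser. The sketch below assumes the analysis is held in a plain dictionary; the key names, the header names X0, X1 and Y, and the function names are all hypothetical. Numpy arrays would need to be converted with tolist() before saving.

```python
import json
import os
import tempfile


def save_analysis(path, analysis):
    """Write an analysis dictionary to path as indented JSON."""
    with open(path, "w") as f:
        json.dump(analysis, f, indent=2)


def load_analysis(path):
    """Read an analysis dictionary back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Example round trip. The numbers are the data-clean.csv reference results;
# the header names are placeholders, not the real column names.
example = {
    "data_file": "data-clean.csv",
    "independent": ["X0", "X1"],
    "dependent": "Y",
    "coefficients": [0.984, 2.088, -0.035],
    "sse": 0.002,
    "r2": 0.996,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "analysis.json")
    save_analysis(path, example)
    restored = load_analysis(path)
```

Storing the data file name and headers alongside the numbers is what makes it possible to replot a saved analysis over the correct data later.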
Report
Make a wiki page for the project report.
Include the required screen shots for the provided data sets and for your own. In the text of your writeup, note what axes are being plotted in any images you show. Please also include a description of what the plot means in terms of the relationship between the two variables.
Handin
Once you have written up your assignment, give the page the label:
cs251s18project5
Put your code in the Private subdirectory of your folder on Courses. Please make sure you are organizing your code by project. Your handin code should include all the Python files necessary to run your program as well as the data files you used to test your code.