Spring 2018

Fitting Models

The goal of this lab is to start the process of adding a new data analysis capability to your GUI. In particular, you will be implementing simple linear regression of two variables--one independent and one dependent--and plotting the results.


In the lab, your goal should be to execute a linear regression on two variables and then plot the result on the screen.

  1. Give yourself a new working directory and copy your display, viewing, analysis and data Python files into it. It's best to start with copies and modify them from there.
  2. Create a single_linear_regression function in your analysis function library.
    def single_linear_regression(data_obj, ind_var, dep_var):

    First, get the requested columns of data from your Data object. It is easiest to get both columns of data at once, with your independent variable in the first column and your dependent variable in the second.

    Second, use the scipy.stats.linregress function (import scipy.stats) to calculate the linear regression of the dependent variable on the independent variable. The regression must be computed in the original data space, not the normalized data space. Store all of the outputs of linregress; you may want a separate variable for each returned value: slope, y-intercept, r-value, p-value, and standard error. See the linregress documentation.

    Return a tuple containing all of the outputs of linregress plus the min and max of both the dependent and independent variables.

    You can test your function using this test program on this data. If you execute a regression using columns D0 and D1, you should get the following.

    $ python3 test_slr.py testdata52.csv D0 D1
    Executing linear regression using D0 (independent) and D1 (dependent)
    Model:    y = 1.0334x + 0.0874
    R-value:  0.811
    P-value:  0.000
    Stderr:   0.176

    If you run it using columns D2 and D3, you should get the following.

    $ python3 test_slr.py testdata52.csv D2 D3
    Executing linear regression using D2 (independent) and D3 (dependent)
    Model:    y = -0.1039x + -0.0467
    R-value:  -0.115
    P-value:  0.630
    Stderr:   0.212

    Is either set of variables correlated? What is the quality of the fit?
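
The description in step 2 can be sketched roughly as follows. This is a minimal sketch, not the required solution. It assumes your Data object has a get_data(headers) method that returns the requested columns in the requested order; TinyData is a hypothetical stand-in so the example runs on its own, and the order of the min/max values in the returned tuple is an assumption, not something the lab specifies.

```python
import numpy as np
import scipy.stats


class TinyData:
    """Hypothetical stand-in for the lab's Data class (assumed API)."""

    def __init__(self, headers, rows):
        self.headers = list(headers)
        self.matrix = np.array(rows, dtype=float)

    def get_data(self, headers):
        # Return the requested columns, in the requested order.
        cols = [self.headers.index(h) for h in headers]
        return self.matrix[:, cols]


def single_linear_regression(data_obj, ind_var, dep_var):
    # Column 0 is the independent variable, column 1 the dependent one.
    # np.asarray also copes with a get_data that returns a np.matrix.
    cols = np.asarray(data_obj.get_data([ind_var, dep_var]), dtype=float)
    x = cols[:, 0]
    y = cols[:, 1]

    # Fit in the original (unnormalized) data space.
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

    # Assumed ordering: regression outputs first, then x range, then y range.
    return (slope, intercept, r_value, p_value, std_err,
            x.min(), x.max(), y.min(), y.max())


# Small made-up data set, roughly y = 2x + 1.
d = TinyData(["D0", "D1"], [[0, 1], [1, 3.1], [2, 4.9], [3, 7]])
result = single_linear_regression(d, "D0", "D1")
print("slope = %.3f, intercept = %.3f" % (result[0], result[1]))
```

Whatever ordering you choose for the tuple, use the same ordering when you unpack it in your Display class.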

  3. In your Display class, add variables for holding (a) the graphical objects associated with a linear regression (i.e. one or more tk Line objects), and (b) the endpoints of the regression line in normalized data space (e.g. a numpy matrix or a 2-D list). You can initialize the first field as an empty list and the field holding the initial endpoints as None.
  4. In your Display class, add a menu item to the Command menu that calls the handleLinearRegression function, which you will create in the next step.
  5. In your Display class, create a method (e.g. handleLinearRegression). The method should let the user select the variables to fit and then display the data and the fit on the main screen. It should have the following steps, each of which you may want to do in a separate function.
    1. Create a dialog class that lets the user select an independent (x) variable and a dependent (y) variable. If you want to let the user also pick variables for color and size, that is up to you. The dialog window needs to return at least two headers from your numeric data: the independent and dependent variables for analysis. If the user selects Cancel, the process should terminate and the existing display should not change.
    2. Clear the existing points from the window.
    3. Clear any existing data fits or models from the window. This should delete any objects in your linear regression objects list.
    4. Reset the view to the default position.
    5. Update the axes.
    6. Call a buildLinearRegression function that creates the canvas line object to show the linear regression fit graphically.

    Start by creating the dialog window and make sure it returns two headers. Then write the buildLinearRegression function (see next step for details), then go back and deal with clearing the points of any existing plots, clearing any prior linear regression fit, and resetting the view.

  6. Create the buildLinearRegression function. This function should do the following.
    1. Extract the two columns selected by the user from the Data object. Make the independent variable the X column and the dependent variable the Y column. Normalize the columns separately. Use your function from your analysis library to get the normalized data.
    2. Add a third column of zeros to the matrix. Use np.zeros and np.hstack.
    3. Add a fourth column of ones to the matrix. You need to store this matrix in your self.datapts field, or whatever field you used to store the data in your buildPoints function from last week.
    4. Build the vtm, multiply it by the data points, and then create the ovals to plot the data on the screen. This should make a 2-D plot of the two variables, with the independent variable along the x-axis. At this point, you should be able to test your function to see if it makes the 2D data plot. If you did it right, the translations, rotations, and scales should all still work as expected.
    5. Call your analysis.single_linear_regression function and save the results. In addition to the linear regression results, the function should return the min value and max value for both your independent and dependent variables, and you will need those for the next step.
    6. Compute the endpoints of the linear regression line. Note that these endpoints need to end up in normalized data space, while the linear regression model is in unnormalized data space. In normalized space, the x values of the endpoints will be 0.0 and 1.0. If the slope is m and the y-intercept is b, then the y values of the endpoints in normalized data space will be:

      ((xmin * m + b) - ymin)/(ymax - ymin)
      ((xmax * m + b) - ymin)/(ymax - ymin)

    7. Multiply the line endpoints by the vtm and then make a tk line object out of the two endpoints. Make it a color that will stand out relative to the data points.
    8. Your program should somehow communicate the linear regression coefficients to the user as part of the GUI. You could do this by making a tk.Label object and putting text into it giving the slope, intercept, and R-value for the fit.
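
The numeric parts of step 6 (building the four-column point matrix and computing the line endpoints in normalized space) might look something like the sketch below. All three function names are hypothetical; normalize_columns_separately only stands in for the normalizer in your own analysis library, and it does not handle a column whose min equals its max. The identity matrix is just a placeholder for your real vtm.

```python
import numpy as np


def normalize_columns_separately(cols):
    # Hypothetical stand-in for your analysis-library normalizer:
    # scale each column to [0, 1] using that column's own min and max.
    cols = np.asarray(cols, dtype=float)
    mins = cols.min(axis=0)
    ranges = cols.max(axis=0) - mins
    return (cols - mins) / ranges


def build_homogeneous_points(raw_xy):
    # Columns: normalized x, normalized y, z = 0, homogeneous coordinate = 1.
    norm = normalize_columns_separately(raw_xy)
    n = norm.shape[0]
    return np.hstack((norm, np.zeros((n, 1)), np.ones((n, 1))))


def regression_endpoints(slope, intercept, xmin, xmax, ymin, ymax):
    # Evaluate the model at xmin and xmax in original data space,
    # then map the resulting y values into normalized space.
    y0 = ((xmin * slope + intercept) - ymin) / (ymax - ymin)
    y1 = ((xmax * slope + intercept) - ymin) / (ymax - ymin)
    return np.array([[0.0, y0, 0.0, 1.0],
                     [1.0, y1, 0.0, 1.0]])


# Made-up data lying exactly on y = 2x + 1.
pts = build_homogeneous_points([[0, 1], [2, 5], [4, 9]])
ends = regression_endpoints(2.0, 1.0, 0.0, 4.0, 1.0, 9.0)

# Placeholder view transform; substitute the vtm built by your view object.
vtm = np.identity(4)
screen_pts = (vtm @ pts.T).T
screen_ends = (vtm @ ends.T).T
print(screen_ends)
```

Note that the endpoint y values can fall outside [0, 1] when the fitted line leaves the data's bounding box, which is expected.
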
  7. In addition to testing along the way, test out your system now using this data file. It contains two variables, linearly related with some noise. Your fit should give a slope of 1.995, an intercept of 1.012, and an R^2 value of 0.792 (R value of 0.89). An example plot is shown below.

    Data fit image
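
For reporting the coefficients in the GUI, one option is to build a single string and hand it to a tk.Label. The helper below is a hypothetical sketch of the formatting only; its name and layout are assumptions, and the sample numbers are the expected values quoted in step 7.

```python
def regression_label_text(slope, intercept, r_value):
    # Show the sign of the intercept explicitly so negative values read cleanly.
    sign = "-" if intercept < 0 else "+"
    return "y = %.3fx %s %.3f, R = %.3f, R^2 = %.3f" % (
        slope, sign, abs(intercept), r_value, r_value ** 2)


text = regression_label_text(1.995, 1.012, 0.89)
print(text)  # y = 1.995x + 1.012, R = 0.890, R^2 = 0.792
```

The resulting string can be passed as the text option of a tk.Label, or set on an existing label each time a new fit is computed.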

When you are finished with the lab, go ahead and continue with the project.