Unit 4 Hypothesis Testing

9.1-9.2 Correlation and Linear Regression


Overview of Scatter Plots and Lines

Correlation

A correlation is a relationship between two variables.

The data can be represented by the ordered pairs (x, y), where x is the independent variable and y is the dependent variable.

Linear Correlation Coefficients:
Possible Correlation Coefficient Values

A correlation coefficient that is between 0.8 and 1.0, inclusive, represents a strong positive correlation.

A correlation coefficient that is between -0.8 and -1.0, inclusive, represents a strong negative correlation.
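As a rough illustration of how a correlation coefficient is obtained from ordered pairs, the sketch below uses the standard Pearson formula, which these notes do not spell out. The function names `pearson_r` and `is_strong` and the five sample pairs are hypothetical, chosen only for this example.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data (x, y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def is_strong(r):
    """Rule of thumb from the notes: |r| of 0.8 or more counts as strong."""
    return abs(r) >= 0.8

# Hypothetical illustration data (not taken from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4), is_strong(r))
```

Here r comes out a little below 0.8, so by the rule of thumb above this sample would not be called a strong correlation.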

image with 7 scatter plots all in first quadrant coordinate planes with correlation coefficients labeled underneath.  1st: dots are basically in a straight line sloping down from left to right. This graph is labeled -1.  2nd: dots are still sloping down from left to right but they are more scattered than the 1st plot.  This graph is labeled -0.8.  3rd: Random scattering of points but still largely decreasing from left to right--labeled -0.3.  4th: randomly scattered points with no clear direction--labeled 0.  5th: Random scattering of points but largely increasing from left to right--labeled 0.3.  6th: dots are still increasing from left to right but they are more linear than the previous graph.  This graph is labeled 0.8.  7th: The dots are basically in a straight line sloping up from left to right. This graph is labeled 1.

Graphing Correlation Coefficients on an interval from -1 to 1.

image with 5 scatter plots all in first quadrant coordinate planes with correlation labeled underneath.  1st: dots are basically in a straight line sloping down from left to right--strong negative correlation .  2nd: Random scattering of points but still largely decreasing from left to right--weak negative correlation  3rd: randomly scattered points with no clear direction--no correlation.  4th: Random scattering of points but  largely increasing from left to right--weak positive correlation.   5th: The dots are basically in a straight line sloping up from left to right--strong positive correlation. 
  In the scatterplots with either weak or strong correlation, there is a line of best fit drawn on the graph.
  Underneath the 5 scatterplots is a number line from -1 to 1. The far left of the number line is labeled r=-1, r=0 is in the middle, and r=1 is on the far right of the number line.  There are shaded rectangles at each end of the graph designating the area between the ends of the number line and the critical values.  The critical values are also marked with dotted vertical lines that separate the strong correlations on each end from the weak and no correlation scatterplots in the middle.

Linear correlations are not the only correlations.

A word of caution before we begin our work on Linear Regression: you can find the "line of best fit" for ANY paired data, but that line is not a good fit if the data are not linear. Always look at the scatter plot first.

graph of data points with both a curve of best fit that goes through most of the points and a line of best fit that does not.
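To make the caution concrete, here is a small sketch with made-up data that lie exactly on the curve y = x². The points and the helper `pearson_r` are assumptions for illustration only. The relationship is perfect, yet the linear correlation coefficient is zero, which is why the scatter plot has to be checked.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: every point lies exactly on the parabola y = x^2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # no linear correlation, even though y is completely determined by x
```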

Linear Regression

The regression equation expresses a relationship between x (the independent variable) and y (the dependent variable).


The graph of the regression equation is called the regression line, or line of best fit.


The regression line does not usually pass through all the sample points. That is why the “hat” is used over the y. This indicates we are calculating a predicted value and not necessarily an actual value.

Scatterplot with high school GPA on the x-axis (labeled from 2 to 4) and University GPA on the y-axis (labeled from 2 to 4)  Data are scattered across the graph with many clustered in the bottom left, but overall with a clear trend of rising from left to right.  The line of best fit is drawn starting slightly below (0,2.5) and rising to the right.


Strategy for Predicting Values of Y

Is the regression equation a good model?

IF YES: The regression equation is a good model. Substitute the given value of x into the regression equation \(\hat{y}=b + mx\)

IF NO: The regression equation is NOT a good model. Regardless of the value of x, the best predicted value of y is the mean of the y values.
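The two branches of this strategy can be sketched as a single function. The name `predict_y` and its parameters are hypothetical. The numbers come from the seal problem later in these notes (intercept -156.9, slope 40.2, and the six listed weights).

```python
def predict_y(x, intercept, slope, ys, good_model):
    """Best predicted y for a given x, following the strategy above."""
    if good_model:
        # Good model: substitute x into y-hat = b + m*x
        return intercept + slope * x
    # Not a good model: the best prediction is the mean of the y values,
    # regardless of x
    return sum(ys) / len(ys)

weights = [116, 154, 245, 200, 191, 202]  # seal weights in kg (Problem 2)

with_model = predict_y(9.0, -156.9, 40.2, weights, good_model=True)
without_model = predict_y(9.0, -156.9, 40.2, weights, good_model=False)
print(round(with_model, 1), round(without_model, 1))
```

Whether `good_model` is true is decided by the hypothesis test, not by this function.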

Steps for Regression and Correlation Test:

  1. Define \(H_0\) and \(H_A\) (the same for every regression test)

          \(H_0\):   \(\rho=0\)  There is no linear correlation in the population.

          \(H_A\):   \(\rho \neq 0\)  There is some linear correlation in the population.

    Note: A regression test is always a two-tailed test.

  2. Run the Regression and Correlation Test

    There are two ways to look at Test Statistics and Critical Values: using the T-distribution and using the Pearson Correlation Coefficient. We will look at both.

    Normal curve shaded underneath and divided into 3 areas.  The middle of the curve (approximately 50%) is labeled Fail to reject H_0 on top and Sign used in H_1 not equal to.  Both the left and the right side of the curve are shaded a different color and labeled Reject H_0

    1. Identify the information you will use to shade your graph:
      • The two Critical Values from the T-distribution
      • df=n-2, area between tails \(= 1-\alpha\)
      • The Critical Values for the correlation coefficient, from the Table of Critical Values for Correlation Coefficients on your formula sheet

    2. Identify the information about your sample:
      • Stat - Regression - Simple Linear: x-variable, y-variable, Hypothesis Test
      • Correlation Coefficient (r),
      • Test Statistic (T-Stat), and
      • the P-value for the “SLOPE”
  3. Draw the Graph:

    Shade the Critical Areas of Rejection

    Indicate the Test Statistic and Critical Values

  4. Make Your Decision About the Null Hypothesis:

    Reject \(H_0\): p-value < α; the T-stat or correlation coefficient is in the Critical Rejection Region

      (The population appears to have some linear correlation)

    Fail to Reject \(H_0\): p-value > α; the T-stat or correlation coefficient is not in the Critical Rejection Region

      (The population doesn’t appear to have a linear correlation)

  5. Make Your Statement:
    1. If you Reject \(H_0\): There is sufficient evidence to support the claim that a linear correlation exists between _______ and _________
    2. If you Fail to Reject \(H_0\): There is NOT sufficient evidence to support the claim that a linear correlation exists between _______ and _________
  6. Identify your regression equation: \(\hat y=b+mx\)
  7. Use your hypothesis test conclusions to calculate a value of y, given a value of x.

    You must have a linear correlation in order to use the regression equation to predict a y value. If you do not have a good correlation, the only option you have for predicting a y value is to find the mean of the y values in your set of points.

    If you Rejected \(H_0\), you have a linear correlation and will use your regression equation.

    If you Failed To Reject \(H_0\), you do not appear to have a linear correlation, so you will not use your regression equation. Instead, predict y with the mean of the y values.

       For any given x, average the y values in your data to predict y: \(\overline{y}=\frac{\Sigma y}{n}\)

  8. The Test Statistic (T-stat) for a regression test is calculated using: \(t=\frac{r}{\sqrt{\frac{1-r^{2}}{n-2}}}\)

    The Critical Values for the Correlation Coefficient are found on the Unit 4 Formula Sheet
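The test statistic formula in step 8 can be sketched in a few lines. Solving that same formula for r gives \(r_{cv}=\frac{t_{cv}}{\sqrt{t_{cv}^{2}+n-2}}\). This rearrangement is not stated in the notes, but it suggests why the T-distribution method and the correlation coefficient table lead to the same decision. The function names are hypothetical, and the inputs (r = 0.8319, n = 15, critical t = 3.012) are taken from Problem 1 below.

```python
import math

def t_statistic(r, n):
    """Test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

def r_critical(t_critical, n):
    """Critical r implied by a critical t, from solving the t formula for r."""
    df = n - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)

def reject_h0(r, n, t_critical):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_statistic(r, n)) > t_critical

# Values from Problem 1 (cricket chirps): r = 0.8319, n = 15, alpha = 0.01
r, n, t_cv = 0.8319, 15, 3.012
t = t_statistic(r, n)
r_cv = r_critical(t_cv, n)
print(round(t, 2), round(r_cv, 3), reject_h0(r, n, t_cv))
```

The computed t is close to the 5.404 reported in Problem 1; the small difference comes from starting with r rounded to four decimal places. The implied critical r agrees with the table value of 0.641.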

Problems

  1. Cricket Chirping Frequency and Temperature: Below is a set of chirping frequency data for the striped ground cricket at various ground temperatures. Construct a scatterplot, and run a Regression and Correlation Test using α = 0.01. Determine whether there is sufficient evidence to support a claim of a linear correlation between chirping frequency and ground temperature.
    Chirping Frequency (chirps/sec) Temperature °F
    20.2 88.6
    16 71.6
    19.8 93.3
    18.4 84.3
    17.1 80.6
    15.5 75.2
    14.7 69.7
    17.1 82
    15.4 69.4
    16.2 83.3
    15 79.6
    17.2 82.6
    16 80.6
    17 83.5
    14.4 76.3
    1. The original claim: \(\rho \neq 0\) There exists a linear correlation between chirping frequency and ground temperature.
    2. \(H_0\) : \(\rho = 0\) There is no linear correlation in the population
    3. \(H_A\) :\(\rho \neq 0\) There is some linear correlation in the population
    4. \(\alpha\) = 0.01
    5. Rejection Criteria using r: Reject if r is in shaded region.
    6. Correlation Coefficient: r = 0.8319
    7. Correlation Critical Values: \(r_{cv}= ±0.641\) (From the Table of Critical Values for Correlation Coefficients)
    8. Decision: Reject \(H_0\)
    9. Rejection Criteria using T-dist: Reject if T-Stat is in shaded region.
      1. Critical Values: \(t =±3.012\)
      2. Test Statistic: \(t = 5.404\)
    10. Rejection Criteria using p-value: Reject \(H_0\) if p-value < 0.01
      1. p-value = 0.00012
      2. Decision: \(0.00012 < 0.01\) Reject \(H_0\)
    11. Figure: the same five scatterplots and number line from r = -1 to r = 1 described earlier, with the critical values marked by dotted vertical lines. The shaded rejection regions run from r = -1 to -0.641 on the left and from +0.641 to r = 1 on the right.

    12. Concluding Statement: There is sufficient evidence to support the claim that a linear correlation exists between chirping frequency of the Striped Ground Cricket and ground temperature.
    13. Regression Equation: \(\hat{y}=26.30789+3.22393 x\)
    14. Good predictor? Yes, rejected \(H_0\).

    If you listened in the morning when you woke up and measured a striped ground cricket chirping at a rate of 18 chirps per second, how warm would you say the ground temperature is?

    Good Predictor, use the equation above with x = 18 chirps per second

    \(\hat{y}=26.30789+3.22393(18) \approx 84.3\) °F
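
The numbers reported for this problem can be rechecked from the raw data. The sketch below uses the usual least-squares formulas for slope and intercept, which the notes obtain from software instead of stating. The variable names are my own, and only the standard library is used.

```python
import math

# Data from the table above: chirps per second and ground temperature (°F)
chirps = [20.2, 16, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
          15.4, 16.2, 15, 17.2, 16, 17, 14.4]
temps = [88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82,
         69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3]

n = len(chirps)
mean_x = sum(chirps) / n
mean_y = sum(temps) / n

# Sums of squared deviations and cross products
sxx = sum((x - mean_x) ** 2 for x in chirps)
syy = sum((y - mean_y) ** 2 for y in temps)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(chirps, temps))

slope = sxy / sxx                          # m in y-hat = b + m*x
intercept = mean_y - slope * mean_x        # b in y-hat = b + m*x
r = sxy / math.sqrt(sxx * syy)             # correlation coefficient
t = r / math.sqrt((1 - r ** 2) / (n - 2))  # test statistic, df = n - 2

predicted = intercept + slope * 18         # predicted °F at 18 chirps/sec
print(round(r, 4), round(t, 3), round(intercept, 3), round(slope, 3),
      round(predicted, 1))
```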

  2. Measuring Seals from Photos: Listed below are the overhead widths (in cm) of seals measured from photographs and the weights (in kg) of the seals. The purpose of the study was to determine if weights of seals could be determined from overhead photographs. Construct a scatterplot and run a Regression and Correlation Test using α = 0.05. Determine whether there is sufficient evidence to support a claim of a linear correlation between overhead widths of seals from photographs and the weights of the seals.
    Overhead Width (cm) 7.2 7.4 9.8 8.8 8.4 9.4
    Weight (kg) 116 154 245 200 191 202
    1. The original claim: \(\rho \neq 0\) There exists a linear correlation between overhead widths of seals from photographs and the weights of the seals.
    2. \(H_0\) : \(\rho = 0\) There is no linear correlation in the population
    3. \(H_A\) : \(\rho \neq 0\) There is some linear correlation in the population
    4. \(\alpha\) = 0.05
    5. Rejection Criteria using r: Reject if r is in shaded region.
    6. Correlation Coefficient: r = 0.9485
    7. Correlation Critical Values: \(r_{cv}= ±0.811 \) (From the Table of Critical Values for Correlation Coefficients)
    8. Decision: Reject \(H_0\)
    9. Rejection Criteria using T-dist: Reject if T-Stat is in shaded region.
      1. Critical Values: \(t =±2.776\)
      2. Test Statistic: \(t = 5.986\)
    10. Rejection Criteria using p-value: Reject \(H_0\) if p-value < 0.05
      1. p-value = 0.00391
      2. Decision: \(0.00391 < 0.05 \) Reject \(H_0\)
    11. Figure: the same five scatterplots and number line from r = -1 to r = 1 described earlier, with the critical values marked by dotted vertical lines. The shaded rejection regions run from r = -1 to -0.811 on the left and from +0.811 to r = 1 on the right.

    12. Concluding Statement: There is sufficient evidence to support the claim of a linear correlation between overhead widths of seals from photographs and the weights of the seals.
    13. Regression Equation: \(\hat{y}=-156.9+40.2 x\)
    14. Good predictor? Yes, rejected \(H_0\).

    Find the best predicted weight in kg of a seal if the overhead width measured from the photograph is 9.0 cm.

    \(\hat{y}=-156.9+40.2(9.0)=204.9 \mathrm{~kg}\)

    Could the regression equation be used to predict the weight of a seal with a width measurement of 3.1 cm?

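    No. A width of 3.1 cm lies far below the x-values in the data (7.2 cm to 9.8 cm), so the prediction would be an extrapolation. The equation would give \(-156.9+40.2(3.1)\approx-32.3\) kg, a negative weight.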
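
The extrapolation concern can be sketched as a simple range check before predicting. The helper names `within_sample_range` and `predict_weight` are hypothetical. Checking against the smallest and largest sampled x is only one rough way to flag extrapolation, not a rule given in these notes.

```python
# Overhead widths (cm) from the seal data above
widths = [7.2, 7.4, 9.8, 8.8, 8.4, 9.4]

def within_sample_range(x, xs):
    """True when x lies between the smallest and largest sampled x-values."""
    return min(xs) <= x <= max(xs)

def predict_weight(width_cm):
    """Regression equation from this problem: y-hat = -156.9 + 40.2x (kg)."""
    return -156.9 + 40.2 * width_cm

for w in (9.0, 3.1):
    print(w, within_sample_range(w, widths), round(predict_weight(w), 1))
```

For 3.1 cm the check fails and the equation returns a negative weight, which is a sign that the line should not be trusted so far outside the sampled widths.
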
  3. Lemon Imports and Crash Fatality Rates: Listed below are annual data for various years. The data are weights (metric tons) of lemons imported from Mexico and US car crash fatality rates per 100,000 population. Construct a scatterplot, and run a Regression and Correlation Test using α = 0.05. Determine whether there is sufficient evidence to support a claim of a linear correlation between weights of lemon imports from Mexico and US car fatality rates. Source: “The trouble with QSAR (or How I Learned to Stop Worrying and Embrace Fallacy)” by Stephen Johnson, Journal of Chemical Information and Modeling, Vol. 48, No. 1.
    LEMON IMPORTS (metric tons) 230 266 359 480 534
    US CRASH FATALITY RATE (per 100,000 population) 15.9 15.7 15.5 15.3 14.8
    1. The original claim: \(\rho \neq 0\) There exists a linear correlation between weights of lemon imports from Mexico and US car fatality rates.
    2. \(H_0\) : \(\rho = 0\) There is no linear correlation in the population
    3. \(H_A\) :\(\rho \neq 0\) There is some linear correlation in the population
    4. \(\alpha\) = 0.05
    5. Rejection Criteria using r: Reject if r is in shaded region.
    6. Correlation Coefficient: \(r = -0.9554\)
    7. Correlation Critical Values: \(r_{cv}= ±0.878 \) (From the Table of Critical Values for Correlation Coefficients)
    8. Decision: Reject \(H_0\)
    9. Rejection Criteria using T-dist: Reject if T-Stat is in shaded region.
      1. Critical Values: \(t =±3.182\)
      2. Test Statistic: \(t = -5.6\)
    10. Rejection Criteria using p-value: Reject \(H_0\) if p-value < 0.05
      1. p-value = 0.01125
      2. Decision: \(0.01125 < 0.05 \) Reject \(H_0\)
    11. Figure: the same five scatterplots and number line from r = -1 to r = 1 described earlier, with the critical values marked by dotted vertical lines. The shaded rejection regions run from r = -1 to -0.878 on the left and from +0.878 to r = 1 on the right.

    12. Concluding Statement: There is sufficient evidence to support the claim of a linear correlation between weights of lemon imports from Mexico and US car fatality rates.
    13. Regression Equation: \(\hat{y}=16.6-0.003 x\)
    14. Good predictor? Yes, rejected \(H_0\).

    Do the results suggest that imported lemons cause car fatalities?

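    No. Correlation does NOT imply causation! Both variables are annual figures, so the strong correlation may simply reflect two unrelated trends over the same years.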