4.6 Linear Regression



Correlation


Define correlation

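A relationship between two variables. The linear correlation coefficient r, which ranges from -1 to 1, measures the strength and direction of a linear relationship.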

Figure: seven scatterplots, each drawn in the first quadrant and labeled with its correlation coefficient. r = -1: the points fall essentially on a straight line sloping down from left to right. r = -0.8: the points still slope down but are more scattered (a fairly strong negative correlation). r = -0.3: widely scattered points that still largely decrease from left to right (a weak negative correlation). r = 0: randomly scattered points with no clear direction. r = 0.3: widely scattered points that largely increase from left to right. r = 0.8: the points increase from left to right and are more nearly linear than in the previous plot. r = 1: the points fall essentially on a straight line sloping up from left to right.
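For reference, the coefficient labeled under each plot is usually Pearson's r. One standard way to write it is sketched here, assuming n paired observations \((x_i, y_i)\) with sample means \(\bar{x}\) and \(\bar{y}\) (this notation is an assumption and is not defined elsewhere in these notes):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to rise together and negative when one falls as the other rises. The denominator scales the result so that it always lies between -1 and 1.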

Linear Regression

\(\hat y=b+mx\)

The regression equation expresses a relationship between x and y.


In the regression equation, what does each of these variables represent?


x:

the independent quantity

y:

the dependent quantity (\(\hat y\) is its predicted value)

m:

the slope

b:

the y-intercept
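StatCrunch reports m and b directly. As a sketch of where those values come from, the usual least-squares formulas are given below, assuming n paired observations \((x_i, y_i)\) with sample means \(\bar{x}\) and \(\bar{y}\) (notation introduced here for illustration):

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```

The second formula says the fitted line always passes through the point \((\bar{x}, \bar{y})\).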

StatCrunch: enter the sample data in two columns, then select Stat → Regression → Simple Linear.

  1. Cricket Chirping Frequency and Temperature: Load the chirping frequency data for the striped ground cricket at various ground temperatures. Construct a scatterplot, and run a Regression and Correlation Test using α = 0.01. Determine whether there is sufficient evidence to support a claim of a linear correlation between chirping frequency and ground temperature.
    1. Claim: \(\rho \neq 0\) There is a linear correlation between chirping frequency and ground temperature.
    2. \(H_0\): \(\rho = 0\) There is no linear correlation.
    3. \(H_A\): \(\rho \neq 0\) There is a linear correlation.
    4. Significance level α: 0.01
    5. p-value (for slope or model): \(p < 0.0001\)
    6. Decision about the null hypothesis: \(p<0.0001\) is less than \(\alpha = 0.01\), so reject \(H_0\).

    7. Concluding Statement: There is sufficient sample evidence to support the claim of a linear correlation between chirping frequency of the Striped Ground Cricket and ground temperature.
    8. Regression Equation:

      \(\hat{y}=26.9 + 1.5x\)

    9. Is the regression equation a good predictor? Yes, as long as we stay within the scope of the sample data. Check each of the following:
      • Did you reject \(H_0\)?
      • Does the regression line in the scatterplot fit the points well?
      • Does the correlation coefficient r indicate a linear correlation?
      • Is your prediction within the scope of the available data, or at least not too far beyond it?

    10. If you listened in the morning when you woke up and measured a striped ground cricket chirping at a rate of 39 chirps per 15 seconds, how warm would you say the ground temperature is?

      \(\hat{y}=26.9+1.5(39) = 85.4\) °F

    11. If the ground temperature is 78 degrees Fahrenheit, how fast do you predict the crickets to be chirping?

      \(78=26.9+1.5x\), so \(x = (78-26.9)/1.5 \approx 34\) chirps per 15 seconds

    12. What is the best predicted ground temperature if crickets are chirping at a rate of 15 chirps every 15 seconds?

      Since 15 chirps per 15 seconds is not within the scope of the sample data, we cannot rely on the regression equation to make a prediction.
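The predictions in steps 10 and 11 use the same equation in two directions: forward from chirps to temperature, and backward from temperature to chirps. A minimal Python sketch follows, assuming the rounded coefficients 26.9 and 1.5 from step 8. The function names are illustrative and are not part of StatCrunch.

```python
# Rounded coefficients taken from the regression equation in step 8.
INTERCEPT = 26.9  # b, in degrees Fahrenheit
SLOPE = 1.5       # m, in degrees Fahrenheit per chirp (chirps counted over 15 s)


def predict_temperature(chirps):
    """Forward use: plug a chirp rate x into y-hat = b + m*x."""
    return INTERCEPT + SLOPE * chirps


def chirps_for_temperature(temp_f):
    """Backward use: solve temp = b + m*x for x."""
    return (temp_f - INTERCEPT) / SLOPE


print(round(predict_temperature(39), 1))   # 85.4
print(round(chirps_for_temperature(78)))   # 34
```

Neither function checks whether the input lies within the scope of the sample data, so the check in step 12 still has to be made by hand.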

  2. Exercise and GPA: A student conducts a study to determine whether there is a linear relationship between the number of hours a student exercises each week and the student’s grade point average (GPA). Test the student’s claim at a 0.10 significance level.

    Hours of Exercise 12 3 0 6 10 2 18 14 15 5
    GPA 3.6 4.0 3.9 2.5 2.4 2.2 3.7 3.0 1.8 3.1

    1. Claim in words: There is a linear correlation between hours of exercise and GPA.
    2. Null hypothesis in words: There is no linear correlation.
    3. Alternate hypothesis in words: There is a linear correlation.
    4. Significance level α: 0.10
    5. p-value (for slope or model): 0.6509
    6. Use the p-value to make a conclusion about the usefulness of the least squares regression line and its equation. The p-value 0.6509 is greater than α = 0.10, so we fail to reject the null hypothesis. We do not have sufficient evidence to support the claim of a linear correlation between hours of exercise and GPA. Therefore, the regression equation is not useful for making predictions.
    7. Find the best predicted GPA for a student who exercises 5 hours per week. Because the regression equation is not useful here, an alternate method of prediction is to average the GPAs in the sample: \(\bar{y}=3.02\)
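The p-value in step 5 can be cross-checked outside StatCrunch. The sketch below is one way to do it in plain Python. It assumes the test is the usual two-sided t test for a correlation with n - 2 degrees of freedom. The helper names and the Simpson's rule integration are choices made for this illustration, not a description of how StatCrunch computes the value.

```python
import math

hours = [12, 3, 0, 6, 10, 2, 18, 14, 15, 5]
gpa = [3.6, 4.0, 3.9, 2.5, 2.4, 2.2, 3.7, 3.0, 1.8, 3.1]


def pearson_r(xs, ys):
    """Sample linear correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area under the density on [0, |t|].

    The area is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


n = len(hours)
r = pearson_r(hours, gpa)
t_stat = r * math.sqrt((n - 2) / (1 - r * r))
p_value = two_sided_p(t_stat, n - 2)
mean_gpa = sum(gpa) / n

print(round(r, 3), round(t_stat, 2), round(p_value, 4), round(mean_gpa, 2))
```

The printed p-value should land close to the 0.6509 reported in step 5, and the last number is the sample mean GPA used as the fallback prediction in step 7.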
  3. Vocabulary Words: A child psychologist estimated the number of vocabulary words in a sample of 12 children. Each child’s age and the number of words in their vocabulary are paired in the table.

    Age 1 2 3 4 5 6 3 5 2 4 6 7
    Vocab 3 220 540 1100 1620 2600 1250 2200 260 1200 2500 1900

    1. With a significance level of 0.05, can the psychologist use this data to support the claim of a linear correlation between a child’s age and the number of words in their vocabulary? How do you know? Yes, because the p-value is less than 0.001, which is below 0.05, so we have sufficient sample evidence of a linear correlation.
    2. What is the slope in the regression equation? What does this slope tell us about the relationship between a child’s age and vocabulary? The slope is 446. According to the sample data, a child gains about 446 vocabulary words per year on average.
    3. What is the y-intercept of the regression line? Is it reasonable to interpret the y-intercept? Why or why not? The y-intercept is -503. It is not reasonable to interpret this y-intercept because a child cannot have a negative number of words in their vocabulary.
    4. What is the best prediction for the number of words in a child’s vocabulary if the child is 7 years old? \(\hat{y}=-503+446\left(7\right)=\) 2619 words
    5. How does the observed vocabulary of the 7-year-old in the sample compare with the predicted value for a 7-year-old? The observed value is 719 words below the prediction: \(1900 - 2619 = -719\) words.
    6. The difference between an observed value and the predicted value of a dependent variable is called a _________. Residual.
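The slope, intercept, prediction, and residual in this exercise can be reproduced from the table. The sketch below is a minimal Python version, assuming an ordinary least-squares fit. The function name `least_squares` is illustrative, and the coefficients are rounded to whole numbers before predicting so that the arithmetic matches the worked answers.

```python
ages = [1, 2, 3, 4, 5, 6, 3, 5, 2, 4, 6, 7]
vocab = [3, 220, 540, 1100, 1620, 2600, 1250, 2200, 260, 1200, 2500, 1900]


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares(ages, vocab)
m, b = round(slope), round(intercept)   # rounded as in the worked answers

predicted_at_7 = b + m * 7              # y-hat for a 7-year-old
observed_at_7 = 1900                    # the 7-year-old in the sample
residual = observed_at_7 - predicted_at_7   # observed minus predicted

print(m, b, predicted_at_7, residual)   # 446 -503 2619 -719
```

A negative residual means the observed point lies below the regression line.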
  4. Make a circle outlined with Cheerios. Place Cheerios along the diameter of the circle. Do not leave space between consecutive Cheerios.
    1. How many Cheerios did you use for the diameter of your circle?
    2. How many Cheerios did you use for the circumference of your circle?
    3. Compile your classroom data and run a regression test to determine if there is a linear correlation between diameter and circumference. What is the p-value for slope?
    4. What is the slope in the regression equation?
    5. If your classmate wanted to make a Cheerio circle using 20 Cheerios for the diameter, how many Cheerios do you predict they would need to outline the circle?
  5. Lemon Imports and Crash Fatality Rates: Listed below are annual data for various years. The data are weights (metric tons) of lemons imported from Mexico and US car crash fatality rates per 100,000 population. Construct a scatterplot, and run a Regression and Correlation Test using α = 0.05. Determine whether there is sufficient evidence to support a claim of a linear correlation between weights of lemon imports from Mexico and US car crash fatality rates. Source: “The Trouble with QSAR (or How I Learned to Stop Worrying and Embrace Fallacy)” by Stephen Johnson, Journal of Chemical Information and Modeling, Vol. 48, No. 1.
    LEMON IMPORTS (metric tons) 230 266 359 480 534
    US CRASH FATALITY RATE (per 100,000 population) 15.9 15.7 15.5 15.3 14.8
    1. Claim: \(\rho \neq 0\) There is a linear correlation between lemon imports and car crash fatality rates.
    2. \(H_0\): \(\rho = 0\) There is no linear correlation in the population.
    3. \(H_A\): \(\rho \neq 0\) There is some linear correlation in the population.
    4. Critical Value (from Pearson Correlation Critical Value chart): \(r_{cv} = \pm 0.878\)
    5. Figure: five scatterplots arranged from left to right and labeled strong negative correlation, weak negative correlation, no correlation, weak positive correlation, and strong positive correlation. The critical values are marked between the strong and weak plots on each side, so a correlation coefficient beyond either critical value falls in the critical region.

    6. Correlation Coefficient: \(r = -0.9554\)
    7. Is this correlation coefficient in the critical region? Yes, because \(-0.9554 < -0.878\).
    8. Decision about null hypothesis: Reject \(H_0\)
    9. Concluding Statement: There is sufficient sample evidence to support the claim of a linear correlation between weights of lemon imports from Mexico and U.S. car crash fatality rates.
    10. Do the results suggest that imported lemons cause car fatalities? No. Correlation does not imply causation!
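The critical-value method used in this exercise compares |r| with the table value. A short Python sketch of that comparison follows, using the five data pairs listed above and the critical value 0.878 from step 4. The function and variable names are illustrative.

```python
import math

lemons = [230, 266, 359, 480, 534]                 # metric tons
fatality_rate = [15.9, 15.7, 15.5, 15.3, 14.8]     # per 100,000 population
CRITICAL_R = 0.878                                 # from step 4 (n = 5, alpha = 0.05)


def pearson_r(xs, ys):
    """Sample linear correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(lemons, fatality_rate)

# r is in the critical region when it lies beyond either critical value.
in_critical_region = abs(r) > CRITICAL_R

print(round(r, 4), in_critical_region)   # -0.9554 True
```

The code can confirm that the correlation is statistically significant, but it says nothing about why the two series move together. That judgment, as in step 10, comes from outside the data.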