3.4.2 - Correlation

In this course, we will be using Pearson's \(r\) as a measure of the linear relationship between two quantitative variables. In a sample, we use the symbol \(r\). In a population, we use the Greek letter \(\rho\) ("rho"). Pearson's \(r\) can easily be computed using statistical software.

Pearson's \(r\): A measure of the direction and strength of the relationship between two variables.

Properties of Pearson's \(r\)

\(-1\leq r \leq +1\)
For a positive association, \(r>0\); for a negative association, \(r<0\)
The closer \(r\) is to \(0\) the weaker the relationship and the closer to \(+1\) or \(-1\) the stronger the relationship (e.g., \(r=-0.88\) is a stronger relationship than \(r=+0.60\)); the sign of the correlation provides direction only
Correlation is unit free; the \(x\) and \(y\) variables do NOT need to be on the same scale (e.g., it is possible to compute the correlation between height in centimeters and weight in pounds)
It does not matter which variable you label as \(x\) and which you label as \(y\). The correlation between \(x\) and \(y\) is equal to the correlation between \(y\) and \(x\).
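The unit-free and symmetry properties above can be checked numerically. The following is a minimal sketch, not part of the course materials: the helper `pearson_r` and the height and weight values are hypothetical and were invented purely for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two paired lists of equal length (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
height_cm = [150, 160, 165, 172, 180, 188]
weight_lb = [120, 135, 150, 148, 175, 190]

r_cm = pearson_r(height_cm, weight_lb)

# Changing the units of x (centimeters to inches) should not change r.
height_in = [h / 2.54 for h in height_cm]
r_in = pearson_r(height_in, weight_lb)

# Swapping which variable is called x and which is called y should not change r.
r_swapped = pearson_r(weight_lb, height_cm)

print(round(r_cm, 4), round(r_in, 4), round(r_swapped, 4))
```

All three printed values should agree, and each should fall between \(-1\) and \(+1\).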

The following table may serve as a guideline when evaluating correlation coefficients:

Absolute Value of \(r\)	Strength of the Relationship
0 - 0.2	Very weak
0.2 - 0.4	Weak
0.4 - 0.6	Moderate
0.6 - 0.8	Strong
0.8 - 1.0	Very strong
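The guideline table can be expressed as a small lookup. This is only a sketch: the function name `strength_label` is hypothetical, and because the table's ranges share their endpoints, the code assumes that a boundary value such as 0.4 belongs to the higher category.

```python
def strength_label(r):
    """Return the guideline label for a correlation, based on |r|."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)  # the sign gives direction only, not strength
    # Assumption: a boundary value (0.2, 0.4, 0.6, 0.8) goes to the higher category.
    if size < 0.2:
        return "Very weak"
    if size < 0.4:
        return "Weak"
    if size < 0.6:
        return "Moderate"
    if size < 0.8:
        return "Strong"
    return "Very strong"

for value in (-0.88, 0.60, 0.10):
    print(value, strength_label(value))
```

Because the label depends only on the absolute value, \(r=-0.88\) is rated as stronger than \(r=+0.60\).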

Correlation does NOT equal causation. A strong relationship between \(x\) and \(y\) does not necessarily mean that \(x\) causes \(y\). It is possible that \(y\) causes \(x\), or that a confounding variable causes both \(x\) and \(y\).
Pearson's \(r\) should only be used when there is a linear relationship between \(x\) and \(y\). A scatterplot should be constructed before computing Pearson's \(r\) to confirm that the relationship is approximately linear.
Pearson's \(r\) is not resistant to outliers. Figure 1 below provides an example of an influential outlier. Influential outliers are points in a data set that substantially change the correlation coefficient; in this example, the outlier increases it. In Figure 1 the correlation between \(x\) and \(y\) is strong (\(r=0.979\)). In Figure 2 below, the outlier is removed. Now, the correlation between \(x\) and \(y\) is lower (\(r=0.576\)) and the slope is less steep.

Figure 1. Fitted line with the outlier included: Exam = 4.154 + 6.661 Quiz

Note that the scale on both the x and y axes has changed. In addition to the correlation changing, the y-intercept changed from 4.154 to 70.84 and the slope changed from 6.661 to 1.632.

Figure 2. Fitted line with the outlier removed: Exam = 70.84 + 1.632 Quiz
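The effect of a single influential outlier can be reproduced with a few lines of code. The quiz and exam scores behind Figures 1 and 2 are not listed in this lesson, so the sketch below uses invented values; it will not reproduce \(r=0.979\) or \(r=0.576\), only the same pattern. The helper `pearson_r` is also hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two paired lists of equal length (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores, invented for illustration only.
quiz = [7, 8, 8, 9, 9, 10, 10]
exam = [85, 80, 90, 82, 92, 86, 91]

# One hypothetical student far below everyone else on both measures.
outlier_quiz, outlier_exam = 1, 10

r_without = pearson_r(quiz, exam)
r_with = pearson_r(quiz + [outlier_quiz], exam + [outlier_exam])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

With these made-up values, adding the single extreme point raises the correlation considerably, which is why a scatterplot should be inspected before interpreting \(r\).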

3.4.2.1 - Formulas for Computing Pearson's r

There are a number of different versions of the formula for computing Pearson's \(r\). You should get the same correlation value regardless of which formula you use. Note that you will not have to compute Pearson's \(r\) by hand in this course. These formulas are presented here to help you understand what the value means. You should always be using technology to compute this value.

First, we'll look at the conceptual formula, which uses \(z\) scores. To use this formula we would first compute the \(z\) score for every \(x\) and \(y\) value. We would then multiply each case's \(z_x\) by its \(z_y\). If a case's \(x\) and \(y\) values were both above their means, this product would be positive. If its \(x\) and \(y\) values were both below their means, this product would also be positive. If one value was above its mean and the other was below its mean, this product would be negative. Think of how this relates to the correlation being positive or negative. The sum of all of these products is divided by \(n-1\) to obtain the correlation.

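Pearson's r: Conceptual Formula

\(r=\frac{\Sigma{z_x z_y}}{n-1}\)
where \(z_x=\frac{x-\overline{x}}{s_x}\) and \(z_y=\frac{y-\overline{y}}{s_y}\)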

When we replace \(z_x\) and \(z_y\) with the \(z\) score formulas and move the \(n-1\) to a separate fraction we get the formula in your textbook: \(r=\frac{1}{n-1}\Sigma{\left(\frac{x-\overline{x}}{s_x}\right) \left( \frac{y-\overline{y}}{s_y}\right)}\)
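To see that the two versions of the formula agree, both can be written out in code. This is a minimal sketch using Python's standard `statistics` module, whose `stdev` function computes the sample standard deviation. The function names `r_conceptual` and `r_textbook` and the five data points are hypothetical.

```python
from statistics import mean, stdev

def r_conceptual(xs, ys):
    """Sum of the z_x * z_y products divided by n - 1; also returns the products."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    products = [a * b for a, b in zip(z_x, z_y)]
    return sum(products) / (len(xs) - 1), products

def r_textbook(xs, ys):
    """Same quantity with the z score formulas substituted in directly."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 1, 3, 5]

r_c, products = r_conceptual(xs, ys)
r_t = r_textbook(xs, ys)

print([round(p, 3) for p in products])
print(round(r_c, 3), round(r_t, 3))
```

In this small data set the first case is below the mean on both variables, so its product is positive, while the second case is below the mean on \(x\) and above the mean on \(y\), so its product is negative.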

3.4.2.2 - Example of Computing r by Hand (Optional)

Again, you will not need to compute \(r\) by hand in this course. This example is meant to show you how \(r\) is computed with the intention of enhancing your understanding of its meaning. In this course, you will always be using Minitab or StatKey to compute correlations.

In this example we have data from a random sample of \(n = 9\) World Campus STAT 200 students from the Spring 2017 semester. WileyPlus scores had a maximum possible value of 100. Midterm exam scores had a maximum possible value of 50. Remember, the \(x\) and \(y\) variables do not need to be on the same scale to compute a correlation.

ID	WileyPlus	Midterm
A	82	37
B	100	47
C	96	33
D	96	36
E	80	44
F	77	35
G	100	50
H	100	49
I	94	45

Minitab was used to construct a scatterplot of these two variables. We need to examine the shape of the relationship before determining if Pearson's \(r\) is the appropriate correlation coefficient to use. Pearson's \(r\) can only be used to check for a linear relationship. For this example I am going to call WileyPlus grades the \(x\) variable and midterm exam grades the \(y\) variable because students completed WileyPlus assignments before the midterm exam.
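For readers who want to check the arithmetic with software other than Minitab or StatKey, the conceptual formula can be applied to the nine students in the table above. This is a sketch only; it uses Python's standard `statistics` module, and the variable names are hypothetical.

```python
from statistics import mean, stdev

# WileyPlus (x) and midterm (y) scores for students A through I, from the table above.
wileyplus = [82, 100, 96, 96, 80, 77, 100, 100, 94]
midterm = [37, 47, 33, 36, 44, 35, 50, 49, 45]

n = len(wileyplus)
x_bar, y_bar = mean(wileyplus), mean(midterm)
s_x, s_y = stdev(wileyplus), stdev(midterm)  # sample standard deviations

# z score for every x and y value, then the product for each student.
z_x = [(x - x_bar) / s_x for x in wileyplus]
z_y = [(y - y_bar) / s_y for y in midterm]
products = [a * b for a, b in zip(z_x, z_y)]

# The sum of the products divided by n - 1 is Pearson's r.
r = sum(products) / (n - 1)

for student, a, b, p in zip("ABCDEFGHI", z_x, z_y, products):
    print(f"{student}  z_x={a:6.3f}  z_y={b:6.3f}  product={p:6.3f}")
print(f"r = {r:.3f}")
```

Students whose two scores fall on the same side of their respective means contribute positive products, and students such as C, who is above the WileyPlus mean but below the midterm mean, contribute negative products.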