# Difference between revisions of "Correlation"

## Revision as of 15:24, 24 July 2009

Correlation refers to a statistical relationship between two continuous variables. In common speech if two factors are "correlated" it is often incorrectly taken to mean that one factor causes the other (see cause and effect). Importantly, however, a correlation only means that two factors tend to occur together and is not proof of causation. The co-occurrence of two factors may arise because one of them causes the other, but a correlation can also occur because both are caused by an unknown third factor (see correlation is not causation). For example, the prices of two unrelated goods might increase during a period of inflation; the two prices are correlated with each other because they co-occur but were both caused by the same third factor.

Once a correlation has been noted, it is often possible to determine if there is a causal relationship between the variables, though this usually requires additional research by scientists. Research is particularly necessary if there is not a clear ordering of the factors in time. For example, a correlation between the heights of fathers and sons may be produced by fathers directly causing their sons' heights (e.g. through genetics) but is unlikely to be the result of sons causing their fathers' heights. The latter would require a factor (i.e. height of sons) to cause something that occurred earlier in time (i.e. the height of the fathers), and is therefore implausible. In contrast, a correlation between cancer and consumption of alcohol might imply that alcohol causes cancer but could also imply that cancer causes the consumption of alcohol, for example to reduce the pain of the cancer. As there is no logically necessary ordering to these two factors (i.e. the cancer or the consumption could have started first) additional research is necessary to determine causation.

It is sometimes also possible to reject particular causal arguments based on theoretical or substantive knowledge even if there is no obvious ordering of the events in time. For example, are deaths from respiratory disease causing air pollution, or is air pollution causing deaths from respiratory disease? There is no theoretical or substantive reason to think that human deaths from respiratory disease can degrade air quality, but there are both theoretical and substantive reasons to think that chemicals like sulfur dioxide (typically from coal burning power plants) can produce respiratory disease. An association between the two is therefore suggestive that air pollution causes respiratory disease though it remains possible that both might be caused by a third factor.

Regulations which restrict air pollution are often made on the basis of these correlations, as well as on the cause and effect relationships which the correlations help scientists to discover. However, activists have sometimes selectively used data to create the appearance of a correlation where none exists. [http://www.robinsoncurriculum.com/view/rc/s31p59.htm] Likewise, manipulation of data can be used to obscure a correlation that does exist. It is therefore important that good practices be used in the collection and analysis of data to ensure that reported correlations, or the lack thereof, are reliable.

## Formal definition

*This section is at the level of advanced high school maths (e.g. A-level or Baccalaureat) and can be skipped by general readers.*

The **correlation coefficient**, also known as **Pearson's r**, is a statistical measure of association between two continuous variables. It is defined as:

$$r = \frac{\sum Z_x Z_y}{n}$$

Where: $Z_x$ = the Z-score of the independent variable X,

$Z_y$ = the Z-score of the dependent variable Y, and

$n$ = the number of observations of variables X and Y.

Thus, Pearson's r is the arithmetic mean of the products of the variable Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation and thus:

$$Z_x = \frac{X - M_x}{SD_x}$$

Where: $M_x$ = the arithmetic mean of the variable X and

$SD_x$ = the standard deviation of the variable X.

$$SD_x = \sqrt{\frac{\sum (X - M_x)^2}{n}}$$

Where: $n$ = the number of observations of variables X and Y.

$$Z_y = \frac{Y - M_y}{SD_y}$$

Where: $M_y$ = the arithmetic mean of the variable Y and

$SD_y$ = the standard deviation of the variable Y.

$$SD_y = \sqrt{\frac{\sum (Y - M_y)^2}{n}}$$

Where: $n$ = the number of observations of variables X and Y.
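The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `z_scores` and `pearson_r` and the hours/pay figures are assumptions made up for the example, not part of the source. The standard deviation divides by n (the population formula), as the definition requires.

```python
from math import sqrt

def z_scores(values):
    """Z-scores using the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    """Pearson's r as the arithmetic mean of the products of paired Z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical data: hours worked (X) and weekly pay (Y).
hours = [10, 20, 30, 40, 50]
pay = [120, 210, 330, 390, 520]

r = pearson_r(hours, pay)
print(round(r, 3))  # close to +1: pay rises almost linearly with hours
```

Using the sample standard deviation (dividing by n - 1) together with a division by n in the final mean would give a value slightly different from r, which is why the population formula matters here.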

Pearson's r varies between -1 and +1. A value of zero indicates that no linear association is present between the variables, a value of +1 indicates that the strongest possible positive association exists and a value of -1 indicates that there is the strongest possible negative association. In a positive relationship, as variable X increases in value, variable Y will also increase in value (e.g. as number of hours worked (X) increases, weekly pay (Y) also increases). In contrast, in a negative relationship, as variable X increases in value, variable Y will decrease in value (e.g. as number of alcoholic beverages consumed (X) increases, score on a test of hand-eye coordination (Y) decreases).

It is important to note that while a correlation coefficient may be calculated for any set of paired numbers, there is no guarantee that this coefficient is statistically significant. That is to say, without statistical significance we cannot be confident that the computed correlation reflects a real association rather than chance variation in the sample. Significance may be assessed using a variant on the standard Student's t-test defined as:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. typically a p-value of less than 5%) means that the probability of obtaining a correlation coefficient at least as large in magnitude (in either direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but many data are observed.
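As a rough illustration of the last point, the t statistic can be computed directly from r and n. The helper name `correlation_t` and the two (r, n) pairs below are hypothetical values chosen for the sketch, not results from any study.

```python
from math import sqrt

def correlation_t(r, n):
    """Return (t, degrees of freedom) for a Pearson's r based on n pairs."""
    if n < 3:
        raise ValueError("need at least 3 observations for n - 2 > 0")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| equals 1")
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2

# A strong correlation in a small dataset (illustrative values).
t_small, df_small = correlation_t(0.8, 10)

# A weak correlation in a large dataset (illustrative values).
t_large, df_large = correlation_t(0.1, 1000)

print(round(t_small, 2), df_small)
print(round(t_large, 2), df_large)
```

Both cases give a t statistic above 3, which exceeds the usual two-tailed 5% critical values (about 2.31 for 8 degrees of freedom and about 1.96 for very large samples), so both would be reported as significant despite the very different sizes of r.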

It is important to note that the correlation coefficient should not be calculated when either of the variables is not continuous, that is, when it does not vary continuously along a numeric scale. As such, correlating a dichotomous variable (e.g. sex) with a continuous variable (e.g. IQ) is inappropriate and will return uninterpretable results.

Additionally, correlation estimates a linear relationship between X and Y. Thus, an increase in variable X is assumed to exert the same influence on Y across all values of X and Y.
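The linearity assumption has a practical consequence that a small sketch can make concrete: a perfectly predictable but non-linear relationship can yield a correlation of zero. The data below (Y equal to the square of X, with X symmetric around zero) are invented for illustration, and `pearson_r` is a hypothetical helper following the Z-score definition given earlier.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via Z-scores, using the population standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sdx) * ((y - my) / sdy) for x, y in zip(xs, ys)) / n

# Y is completely determined by X, but the relationship is a curve, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(abs(r_curve) < 1e-9)  # r is zero up to floating-point rounding
```

A value of r near zero therefore rules out a linear association only; plotting the data remains the simplest way to spot a curved relationship.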

## Correlation and Causation

*Main article: Correlation is not causation*

Correlation is a linear statistic and makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert causation based on a correlation alone, even if that correlation is statistically significant. Therefore, when a significant correlation has been identified it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or that X <- Z -> Y (i.e. both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation and claims to this effect must be viewed with considerable skepticism.
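Both points can be illustrated with a small sketch, under the assumption of entirely made-up numbers: a hypothetical inflation index Z drives the prices of two otherwise unrelated goods, so the prices correlate strongly with each other, and swapping the roles of X and Y leaves r unchanged. The helper `pearson_r` and all variable names are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via Z-scores, using the population standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sdx) * ((y - my) / sdy) for x, y in zip(xs, ys)) / n

# Hypothetical third factor Z: a price index rising over six periods.
inflation = [1.00, 1.02, 1.05, 1.09, 1.14, 1.20]

# Two unrelated goods whose prices each follow Z plus small, fixed deviations.
noise_a = [0.01, -0.01, 0.02, -0.02, 0.01, -0.01]
noise_b = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1]
price_a = [2.0 * z + e for z, e in zip(inflation, noise_a)]
price_b = [50.0 * z + e for z, e in zip(inflation, noise_b)]

r_ab = pearson_r(price_a, price_b)
r_ba = pearson_r(price_b, price_a)

print(r_ab > 0.95)               # strong correlation, yet neither price causes the other
print(abs(r_ab - r_ba) < 1e-12)  # r is the same whichever variable is called X
```

Nothing in the value of r distinguishes this X <- Z -> Y situation from X -> Y or Y -> X; only knowledge of how the data were generated does.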

## References

Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. *Statistics for the Behavioral and Social Sciences: A Brief Course.* 4/e. Upper Saddle River, New Jersey: Prentice Hall.