# Correlation

## Revision as of 15:55, 5 January 2009

**Correlation** describes the relationship of two factors to one another (see cause and effect). In common usage, it denotes an association of one variable with another in quite general terms; for example, one might say, "success is correlated with hard work". In mathematics, however, and in science and engineering, which make use of mathematical concepts, correlation is a technical term with a precise definition.

Correlation must be distinguished from causation (see article on correlation is not causation). When one factor changes and another factor changes with it, there may be a direct relationship between the two factors, observed as a correlation. Alternatively, both changes could be the result of changes in a third factor. For example, the prices of two unrelated goods might increase during a period of inflation; the two price rises are correlated with each other, but neither has caused the other.

Once a correlation is established, scientists may conduct research to determine causation. Are respiratory deaths causing air pollution, or is it the other way around? It is easy to determine that sickness among the elderly does not cause air pollution. Rather, it is chemicals like sulfur dioxide (typically from coal-burning power plants) which are the culprits. Cities and states measure the amount of pollutants in the air, and epidemiologists can use these data, comparing them to the number of people who develop respiratory diseases.

Regulations which restrict air pollution are made on the basis of these correlations, and on the cause and effect relationships which the correlations help scientists to discover. However, activists have sometimes created false correlations by selective use of data. [1]

## Formal definition

*This section is at the level of advanced high school maths (e.g. A-level or Baccalaureat) and can be skipped by general readers.*

The **correlation coefficient**, also known as **Pearson's r**, is a statistical measure of association between two continuous variables. It is defined as:

**r**= ΣZ_{x}Z_{y}/n

Where: **Z_{x}**= the Z-score of the independent variable X, **Z_{y}**= the Z-score of the dependent variable Y, and **n**= the number of observations of variables X and Y.

Thus, Pearson's r is the arithmetic mean of the products of the variable Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation, thus:

**Z_{x}**= (X-M_{x})/SD_{x}

Where: **M_{x}**= the arithmetic mean of variable X and **SD_{x}**= the population standard deviation of variable X:

**SD_{x}**= √(Σ(X-M_{x})^{2}/n)

Likewise for Y:

**Z_{y}**= (Y-M_{y})/SD_{y}

Where: **M_{y}**= the arithmetic mean of variable Y and **SD_{y}**= the population standard deviation of variable Y:

**SD_{y}**= √(Σ(Y-M_{y})^{2}/n)

In each case, **n** is the number of observations of variables X and Y.
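The definition above translates directly into a few lines of code. The following is a minimal Python sketch, not a production implementation: it computes r as the mean of the products of Z-scores, using the population standard deviation as described:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the mean of the products of population Z-scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Population standard deviations: divide by n, not n - 1.
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    zx = [(x - mx) / sdx for x in xs]
    zy = [(y - my) / sdy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / n

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear: r ≈ 1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly inverse: r ≈ -1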

Pearson's r varies between -1 and +1. A value of zero indicates that no association is present between the variables, a value of +1 indicates that the strongest possible positive association exists and a value of -1 indicates that there is the strongest possible negative association. In a positive relationship, as variable X increases in value, variable Y will also increase in value (e.g. as number of hours worked (X) increases, weekly pay (Y) also increases). In contrast, in a negative relationship, as variable X increases in value, variable Y will decrease in value (e.g. as number of alcoholic beverages consumed (X) increases, score on a test of hand-eye coordination (Y) decreases).

It is important to note that, while a correlation coefficient may be calculated for any set of numbers, there is no guarantee that this coefficient is statistically significant. That is to say, without statistical significance we cannot be confident that the computed correlation reflects a real association rather than chance variation in the sample. Significance may be assessed using a variant of the standard Student's t-test, defined as:

**t**= (r)√(n-2)/√(1-r^{2})

The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. a low value of *P* in the *t* test) means that the chance of obtaining a correlation coefficient at least as large (in either direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but the dataset is large.
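The effect of sample size is easy to see by evaluating the t formula directly. A short Python sketch of the formula above (it computes only the t statistic, not the *P* value):

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The same r = 0.5 is far more convincing in a larger sample:
print(round(t_statistic(0.5, 12), 3))   # → 1.826
print(round(t_statistic(0.5, 120), 3))  # → 6.272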

It is important to note that the correlation coefficient should not be calculated when either of the variables is not continuous; Pearson's r assumes that both variables are measured on a continuous (interval or ratio) scale. As such, correlating a dichotomous variable (e.g. sex) with a ratio variable (e.g. IQ) is inappropriate and will return uninterpretable results.

Additionally, correlation estimates a linear relationship between X and Y. Thus, an increase in variable X is assumed to exert the same influence on Y across all values of X and Y.
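This linearity assumption matters in practice: a perfect but non-linear relationship can yield a correlation near zero. A small Python illustration, reusing the Z-score definition of r given earlier:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via population Z-scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) / sdx * ((y - my) / sdy) for x, y in zip(xs, ys)) / n

# y = x**2 is a perfect (deterministic) relationship, yet r ≈ 0
# because the relationship is not linear.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
print(pearson_r(x, y))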

## Correlation and Causation

*Main article: Correlation is not causation*

Correlation is a linear statistic and makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert causation based on a correlation alone, even if that correlation is statistically significant. Therefore, when a significant correlation has been identified, it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or X <- Z -> Y (i.e. both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation, and claims to this effect must be viewed with considerable skepticism.
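The symmetry between the variables is easy to verify numerically. The sketch below uses an equivalent product-moment form of Pearson's r; swapping the two variables leaves the coefficient unchanged:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r in product-moment form, equivalent to the Z-score definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]
# Nothing in the formula marks one variable as the cause:
print(pearson_r(x, y) == pearson_r(y, x))  # → True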

## References

Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. *Statistics for the Behavioral and Social Sciences: A Brief Course.* 4/e. Upper Saddle River, New Jersey: Prentice Hall.