Difference between revisions of "Correlation"

Revision as of 21:18, June 15, 2008

The correlation coefficient, also known as Pearson's r, is a statistical measure of association between two ratio variables. It is defined as:

r= ΣZ_xZ_y/n

Where: Z_x= the Z-score of the independent variable X, Z_y= the Z-score of the dependent variable Y, and n= the number of observations of variables X and Y.

Thus, Pearson's r is the arithmetic mean of the products of the variable Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation and thus:

Z_x= (X-M_x)/SD_x

Where: M_x= the arithmetic mean of the variable X and SD_x= the standard deviation of the variable X.

SD_x= √(Σ(X-M_x)²/n)

Where: n= the number of observations of variables X and Y

Z_y= (Y-M_y)/SD_y

Where: M_y= the arithmetic mean of the variable Y and SD_y= the standard deviation of the variable Y.

SD_y= √(Σ(Y-M_y)²/n)

Where: n= the number of observations of variables X and Y
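The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (population_sd, z_scores, pearson_r) and the hours/pay figures are invented for the example, and the standard deviation is taken as the square root of the mean squared deviation (the population formula).

```python
import math

def population_sd(values):
    """Population standard deviation: square root of the mean squared deviation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def z_scores(values):
    """Convert raw scores to Z-scores using the population standard deviation."""
    mean = sum(values) / len(values)
    sd = population_sd(values)
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    """Pearson's r as the arithmetic mean of the products of paired Z-scores."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have the same number of observations")
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / len(xs)

# Hypothetical data: hours worked (X) and weekly pay (Y) at a fixed hourly rate.
hours = [10, 20, 30, 40]
pay = [150, 300, 450, 600]
print(pearson_r(hours, pay))  # approximately 1, a perfect positive relationship
```

Note that if the sample formula (dividing by n-1) were used for the standard deviation instead, the mean of the Z-score products would no longer equal r.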

Pearson's r varies between -1 and +1. A value of zero indicates that no linear association is present between the variables, a value of positive one indicates that the strongest possible positive association exists, and a value of negative one indicates that the strongest possible negative association exists. In a positive relationship, as variable X increases in value, variable Y also tends to increase in value (e.g. as number of hours worked (X) increases, weekly pay (Y) also increases). In contrast, in a negative relationship, as variable X increases in value, variable Y tends to decrease in value (e.g. as number of alcoholic beverages consumed (X) increases, score on a test of hand-eye coordination (Y) decreases).

It is important to note that while a correlation coefficient may be calculated for any set of paired numbers, there is no guarantee that this coefficient is statistically significant. That is to say, without statistical significance we cannot be confident that the computed correlation reflects a real association rather than chance variation in the sample. Significance may be assessed using a variant on the standard Student's t-test defined as:

t= (r)√(n-2)/√(1-r²)

The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2.
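As a rough sketch, the test statistic and its degrees of freedom can be computed as follows. The function name correlation_t and the example values of r and n are assumptions made for illustration. The resulting t would still need to be compared against a t-distribution table or a statistics package.

```python
import math

def correlation_t(r, n):
    """Return (t, degrees of freedom) for testing a correlation r from n pairs."""
    if n <= 2:
        raise ValueError("at least three observations are needed (n - 2 > 0)")
    if abs(r) >= 1:
        raise ValueError("t is undefined when r is exactly -1 or +1")
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

# Hypothetical example: r = 0.5 computed from n = 27 paired observations.
t, df = correlation_t(0.5, 27)
print(round(t, 3), df)
```

The guard clauses reflect two limits of the formula: the degrees of freedom must be positive, and the denominator vanishes when r is exactly -1 or +1.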

It is important to note that the correlation coefficient should not be calculated when either of the variables is non-ratio, that is, when it does not vary continuously or lacks a meaningful zero. As such, correlating a dichotomous variable (e.g. sex) with a ratio variable (e.g. IQ) is inappropriate and will return uninterpretable results.

Additionally, correlation estimates a linear relationship between X and Y. Thus, an increase in variable X is assumed to exert the same influence on Y across all values of X and Y.
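The linearity assumption can be illustrated with a small, made-up example in which Y is completely determined by X but the relationship is curved. The helper pearson_r below is a hypothetical name, and it uses the deviation-score form of the formula, which is algebraically equivalent to averaging the Z-score products.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via deviation scores (equivalent to the mean Z-score product)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Illustrative data: Y = X squared, a perfect but non-linear relationship.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r = pearson_r(xs, ys)
print(r)  # essentially zero, even though Y depends entirely on X
```

Here r is essentially zero because the rising and falling halves of the curve cancel out, so a near-zero coefficient by itself does not rule out a strong non-linear relationship.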

Correlation and Causation

Correlation is a linear statistic and makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert causation based on a correlation alone, even if that correlation is statistically significant. Therefore, when a significant correlation has been identified it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or that X <- Z -> Y (i.e. both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation, and claims to this effect (e.g. atheism causes suicide) must be viewed with considerable skepticism.
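Both points can be seen in a short sketch. All numbers below are invented for illustration, and pearson_r is a hypothetical helper using the deviation-score form of the formula. X and Y are each constructed from a third variable Z and never from each other, yet they correlate strongly, and swapping the roles of X and Y leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via deviation scores (equivalent to the mean Z-score product)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Made-up common cause Z, plus fixed disturbances so the example is repeatable.
z = [1, 2, 3, 4, 5, 6]
x = [2 * v + e for v, e in zip(z, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [3 * v + d for v, d in zip(z, [-1, 1, 1, -1, -1, 1])]

r_xy = pearson_r(x, y)  # X and Y are related only through Z
r_yx = pearson_r(y, x)  # same value: r has no notion of direction
print(round(r_xy, 3), round(r_yx, 3))
```

Nothing in the coefficient distinguishes this common-cause arrangement from one in which X directly causes Y, which is why the direction of any causal link has to be established by other means.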

Correlation and Language

The term "correlation" or "correlated" has entered common usage as meaning "is associated with." It is important to note, however, that correlation is actually a technical term denoting a very specific class of relationship. Many common usages of the term are, therefore, substantively incorrect.

References

Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. Statistics for the Behavioral and Social Sciences: A Brief Course. 4/e. Upper Saddle River, New Jersey: Prentice Hall.

@@ Line 1: / Line 1: @@
-'''Correlation''' is defined by dict.org as:
+The '''correlation coefficient''', also known as '''Pearson's r''', is a statistical measure of association between two ratio variables. It is defined as:
-:a statistical relation between two or more variables such that systematic changes in the value of one variable are accompanied by systematic changes in the other [http://256.com/gray/thoughts/2004/20040511.html]
+'''r'''= ΣZ<sub>x</sub>Z<sub>y</sub>/n
-In other words, if you are studying two variables and both change at the same time (consistently, veritably, and repeatedly), then you still cannot conclude one causes the other. Instead all you can say is that there is a link between the two, but the directionality of that link still needs to be established by negative testing. The design of the system must also be reexamined to assure that the design is not causing a systematic connection between the two variables, by implicit means.
+Where: '''Z<sub>x</sub>'''= the Z-score of the independent variable X, '''Z<sub>y</sub>'''= the Z-score of the dependent variable Y, and '''n'''= the number of observations of variables X and Y.
-==Correlation and causation==
+Thus, Pearson's r is the arithmetic mean of the products of the variable Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation and thus:
+'''Z<sub>x</sub>'''= (X-M<sub>x</sub>)/SD<sub>x</sub>
+Where: '''M<sub>x</sub>'''= the arithmetic mean of the variable x and '''SD<sub>x</sub>'''= the standard deviation of the variable x.
+'''SD<sub>x</sub>'''= √(Σ(X-M<sub>x</sub>)<sup>2</sup>/n)
+Where: '''n'''= the number of observations of variables x and y
+'''Z<sub>y</sub>'''= (Y-M<sub>y</sub>)/SD<sub>y</sub>
+Where: '''M<sub>y</sub>'''= the arithmetic mean of the variable Y and '''SD<sub>y</sub>'''= the standard deviation of the variable Y.
+'''SD<sub>y</sub>'''= √(Σ(Y-M<sub>y</sub>)<sup>2</sup>/n)
+Where: '''n'''= the number of observations of variables X and Y
+Pearson's r varies between -1 and +1. A value of zero indicates that no association is present between the variables, a value of positive one indicates that the strongest possible positive association exists and a value of negative one indicates that the strongest possible negative association exists. In a positive relationship as variable X increases in value, variable Y will also increase in value (e.g. as number of hours worked (X) increases, weekly pay (Y) also increases). In contrast, in a negative relationship as variable X increases in value, variable Y will decrease in value (e.g. as number of alcoholic beverages consumed (X) increases, score on a test of hand-eye coordination (Y) decreases).
+It is important to note that while a correlation coefficient may be calculated for any set of numbers there are no guarantees that this coefficient is statistically significant. That is to say, without statistical significance we cannot be certain that the computed correlation is an accurate reflection of reality. Significance may be assessed using a variant on the standard [[Student's t-test]] defined as:
+'''t'''= (r)√(n-2)/√(1-r<sup>2</sup>)
+The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2.
+It is important to note that the correlation coefficient should not be calculated when either of the variables are non-ratio. That is, when they do not vary continuously and have a meaningful zero. As such, correlating a dichotomous variable (e.g. sex) with a ratio variable (e.g. IQ) is inappropriate and will return uninterpretable results.
+Additionally, correlation estimates a linear relationship between X and Y. Thus, an increase in variable X is assumed to exert the same influence on Y across all values of X and Y.
+==Correlation and Causation==
+Correlation is a linear statistic and makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert [[causation]] based on a correlation even if that correlation is statistically significant. Therefore, when a significant correlation has been identified it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or that X <- Z -> Y (i.e. Both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation and claims to this effect (e.g. atheism causes suicide) must be viewed with considerable skepticism.
+==Correlation and Language==
+The term "correlation" or "correlated" has entered common usage as meaning "is associated with." It is important to note, however, that correlation is actually a technical term denoting a very specific class of relationship. Many common usages of the term are, therefore, substantively incorrect.
-Correlation is often confused with [[causation]]; see an [[correlation is not causation|explanation]].
 ==References==
-*[http://www.stat.tamu.edu/stat30x/notes/node42.html Texas A&M University]
+Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. ''Statistics for the Behavioral and Social Sciences: A Brief Course.'' 4/e. Upper Saddle River, New Jersey: Prentice Hall.
 [[Category:statistics]]
