Difference between revisions of "Correlation"

Revision as of 13:55, July 15, 2016

Correlation describes the relationship of two factors to one another (see cause and effect). When two factors are found to go together, there are only three explanations. First, the two things may be entirely unrelated. Second, they may both be caused by another factor. Third, one of the factors might be causing the other. It is the task of statistics to point out the existence of a correlation, but it is the task of science to discover what relationship, if any, exists.

Correlation does not imply causation. Treating a correlation as proof of causation is a very common mistake. There is a set of principles of causal inference that need to be satisfied in order to infer cause and effect. [1]

Unrelated factors

Often cancer clusters have been said to indicate the existence of a biological hazard, but by the laws of chance in any group of people who get a rare disease (or win the lottery) there will always be some who live close together. This does not mean that something in their environment caused their tragedy or good fortune.

If you flip coins or throw dice, from time to time you will get runs of similar results like 4 heads in a row. You might try to find the cause for these runs in what you were thinking about at the time, hoping to find that you could influence the toss or roll. But this is nonsense.
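How often such runs turn up by chance alone can be checked with a small simulation. The sketch below is only an illustration: the function names, the choice of 20 tosses, the run length of 4 and the fixed seed are all assumptions made for the example. It counts a run of either heads or tails.

```python
import random


def longest_run(flips):
    """Return the length of the longest run of identical consecutive results."""
    best = 0
    current = 0
    previous = None
    for flip in flips:
        # Extend the current run, or start a new run of length 1.
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best


def share_with_run(n_flips=20, min_run=4, trials=10_000, seed=0):
    """Estimate how often n_flips fair tosses contain a run of at least min_run."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        flips = [rng.choice("HT") for _ in range(n_flips)]
        if longest_run(flips) >= min_run:
            hits += 1
    return hits / trials


print(longest_run("HHTHHHHT"))  # the longest run here is the four heads in a row
print(share_with_run())         # share of simulated 20-toss sequences with a run of 4+
```

Under these assumptions, roughly three quarters of the simulated sequences contain a run of four or more identical results, even though nothing links one toss to the next.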

When unrelated events occur together, their co-occurrence is dismissed as a matter of coincidence.

Factors which have a common cause

Two commuters might happen to see each other on a train, not because either intended to meet the other, but simply because each happens to dislike crowded trains and prefer trains with dining cars. When the regular train is crowded or lacks a dining car, both may independently decide to take a later train. Seeing each other in the dining car of the later train would not be merely a coincidence, but would stem from the conditions on the earlier train.

Cases of shark attacks correlate with sales of ice cream, but no one thinks that the bites make people buy ice cream, or that eating ice cream makes you vulnerable to sharks. Both things go up and down in an annual cycle, because people go to the beach more in the summer. They swim and buy ice cream because it's hot. Swimming exposes them to (rare) shark attacks.
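The shark and ice cream example can be sketched numerically. Everything in the block below is made up for illustration: the monthly temperatures, the coefficients linking temperature to sales and to attacks, the noise levels and the seed are assumptions, not real data. Neither series depends on the other, yet both depend on temperature.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r, computed with population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * (sd_x * sd_y))


rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Hypothetical average temperature for each month of one year.
temperature = [2, 4, 8, 13, 18, 22, 25, 24, 20, 14, 8, 3]

# Each series is driven by temperature plus its own independent noise.
ice_cream_sales = [50 + 10 * t + rng.gauss(0, 15) for t in temperature]
shark_attacks = [1 + 0.2 * t + rng.gauss(0, 0.4) for t in temperature]

r = pearson_r(ice_cream_sales, shark_attacks)
print(r)
```

With these made-up numbers the two series come out strongly positively correlated, although the code contains no link between them other than the shared temperature input.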

One thing causes another

In scientific subjects like chemistry, physics and astronomy many laws of cause and effect have been discovered. Put two chemicals together, and they form a compound. Push something, and its momentum increases. The scientific method is useful in describing the relationship between events, especially when despite their best efforts, no one has been able to find an exception (see independent review).

Terminology

In common usage, correlation denotes an association of one variable with another in quite general terms; for example, one might say, "success is correlated with hard work". In mathematics, however, and in science and engineering, which make use of mathematical concepts, correlation is a technical term with a precise definition.

Correlation must be distinguished from causation (see article on correlation is not causation). When one factor changes and another factor changes with it, the association is observed as a correlation, and there may be a direct relationship between the two factors. Alternatively, both changes could be the result of changes in a third factor. For example, the prices of two unrelated goods might increase during a period of inflation; the two price rises are correlated with each other but neither has caused the other.

Once a correlation is established, scientists may conduct research to determine causation. Are deaths from respiratory disease causing air pollution, or is it the other way around? It is easy to determine that sickness among the elderly does not cause air pollution. Rather, it is chemicals like sulfur dioxide (typically from coal-burning power plants) which are the culprits. Cities and states measure the amount of pollutants in the air, and epidemiologists can compare these data to the number of people who develop respiratory diseases.

Regulations which restrict air pollution are made on the basis of these correlations, and on the cause and effect relationships which the correlations help scientists to discover. However, activists have sometimes created false correlations by selective use of data. [2]

Formal definition

This section is at the level of advanced high school maths (e.g. A-level or Baccalaureat) and can be skipped by general readers.

The correlation coefficient, also known as Pearson's r, is a statistical measure of association between two continuous variables. It is defined as:

r= ΣZ_xZ_y/n

Where: Z_x= the Z-score of the independent variable X, Z_y= the Z-score of the dependent variable Y, and n= the number of observations of variables X and Y.

Thus, Pearson's r is the arithmetic mean of the products of the variable Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation and thus:

Z_x= (X-M_x)/SD_x

Where: M_x= the arithmetic mean of the variable X and SD_x= the standard deviation of the variable X.

SD_x= √(Σ(X-M_x)²/n)

Where: n= the number of observations of variables X and Y

Z_y= (Y-M_y)/SD_y

Where: M_y= the arithmetic mean of the variable Y and SD_y= the standard deviation of the variable Y.

SD_y= √(Σ(Y-M_y)²/n)

Where: n= the number of observations of variables X and Y
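The definition can be followed step by step in code. The block below is a minimal sketch, not a reference implementation: the function names and the sample figures for hours worked and weekly pay are assumptions made for the example. The standard deviation is taken as the square root of the mean squared deviation, dividing by n (the population formula).

```python
import math


def population_sd(values):
    """Standard deviation using the population formula (divide by n)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def z_scores(values):
    """Convert each value to its Z-score: (value - mean) / SD."""
    mean = sum(values) / len(values)
    sd = population_sd(values)  # a constant variable has SD 0 and raises ZeroDivisionError
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson's r as the arithmetic mean of the products of the Z-scores."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of observations")
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / len(xs)


# Hypothetical data: pay rises in exact proportion to hours worked.
hours_worked = [10, 20, 30, 40]
weekly_pay = [150, 300, 450, 600]
print(pearson_r(hours_worked, weekly_pay))  # very close to +1, up to rounding
```

Reversing one of the two lists gives a value very close to -1, the strongest possible negative association.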

Pearson's r varies between -1 and +1. A value of zero indicates that no linear association is present between the variables, a value of +1 indicates that the strongest possible positive association exists and a value of -1 indicates that there is the strongest possible negative association. In a positive relationship, as variable X increases in value, variable Y will also increase in value (e.g. as number of hours worked (X) increases, weekly pay (Y) also increases). In contrast, in a negative relationship, as variable X increases in value, variable Y will decrease in value (e.g. as number of alcoholic beverages consumed (X) increases, score on a test of hand-eye coordination (Y) decreases).

It is important to note that while a correlation coefficient may be calculated for any set of numbers there are no guarantees that this coefficient is statistically significant. That is to say, without statistical significance we cannot be certain that the computed correlation is an accurate reflection of reality. Significance may be assessed using a variant on the standard Student's t-test defined as:

t= (r)√(n-2)/√(1-r²)

The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. a low value of P in the t test) means that the chance of getting a correlation coefficient at least as large (whichever the direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but many data are observed.
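The effect of sample size on significance can be seen by applying the t formula to the same coefficient with different numbers of observations. The sketch below uses only the formula given above; the function name and the example values of r and n are assumptions, and the critical values quoted in the comments are the usual two-tailed 5% table values, rounded.

```python
import math


def correlation_t(r, n):
    """Return (t, degrees of freedom) for a correlation r based on n observations."""
    if n < 3:
        raise ValueError("at least 3 observations are needed (n - 2 degrees of freedom)")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and +1 for this formula")
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2


# The same r = 0.5 with a small and a larger sample.
print(correlation_t(0.5, 6))   # t is about 1.15 with 4 degrees of freedom
print(correlation_t(0.5, 27))  # t is about 2.89 with 25 degrees of freedom

# Two-tailed 5% critical values from a standard t table are about 2.78 for
# 4 degrees of freedom and about 2.06 for 25, so only the second result
# would be called significant at that level.
```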

One should note that for most random variables a correlation of 0 does not imply that they are independent. However, if the two variables are jointly normally distributed (that is, they follow a bivariate Normal Distribution) then a correlation of 0 does imply independence.
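A small constructed example shows a correlation of 0 between two variables that are completely dependent. The data and the function name below are assumptions chosen for illustration: Y is exactly the square of X, and the X values are symmetric around zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r, computed with population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * (sd_x * sd_y))


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # Y is fully determined by X

r = pearson_r(xs, ys)
print(r)  # the positive and negative cross-products cancel, so r is zero
```

Knowing X here tells you Y exactly, yet the coefficient reports no association, because the relationship is not linear.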

It is important to note that the correlation coefficient should not be calculated when either of the variables is not continuous, that is, when it does not vary continuously along a meaningful numeric scale. As such, correlating a dichotomous variable (e.g. sex) with a continuous variable (e.g. IQ) in this way is inappropriate and can return results that are hard to interpret.

Additionally, correlation estimates a linear relationship between X and Y. Thus, an increase in variable X is assumed to exert the same influence on Y across all values of X and Y.

Correlation and Causation

For a more detailed treatment, see Correlation is not causation.

Correlation is a linear statistic and makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert causation based on a correlation alone, even if that correlation is statistically significant. Therefore, when a significant correlation has been identified it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or that X <- Z -> Y (i.e. both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation and claims to this effect must be viewed with considerable skepticism.
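The lack of any distinction between the two variables can be checked directly: swapping X and Y leaves the coefficient unchanged. The sketch below uses made-up figures for drinks consumed and a coordination score; the data and the function name are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r, computed with population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * (sd_x * sd_y))


# Hypothetical observations for five people.
drinks = [0, 1, 2, 3, 4]
coordination_score = [95, 90, 82, 70, 61]

r_xy = pearson_r(drinks, coordination_score)
r_yx = pearson_r(coordination_score, drinks)

print(r_xy, r_yx)  # the same negative value, whichever variable is listed first
```

Because the number is the same in both directions, the coefficient by itself cannot say which variable, if either, is the cause.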

References

Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. Statistics for the Behavioral and Social Sciences: A Brief Course. 4/e. Upper Saddle River, New Jersey: Prentice Hall.

@@ Line 1: / Line 1: @@
-Correlation refers to a statistical relationship between two continuous variables. In common speech if two factors are "correlated" it is often incorrectly taken to mean that one factor causes the other (see [[cause and effect]]). Importantly, however, a correlation only means that two factors tend to occur together and is not proof of causation. The co-occurrence of two factors may arise because one of them causes the other, but a correlation can also occur because both are caused by an unknown third factor (see [[correlation is not causation]]). For example, the prices of two unrelated goods might increase during a period of [[inflation]]; the two prices are correlated with each other because they co-occur but were both caused by the same third factor.
+'''Correlation''' describes the relationship of two factors to one another (see [[cause and effect]]). When two factors are found to go together, there are only three explanations. First, the two things may be entirely unrelated. Second, they may both be caused by another factor. Third, one of the factors might be causing the other. It is the task of [[statistics]] to point out the existence of a correlation, but it is the task of science to discover what relationship, if any, exists.
-Once a correlation has been noted, it is often possible to determine if there is a causal relationship between the variables, though this usually requires additional research by scientists. Research is particularly necessary if there is not a clear ordering of the factors in time. For example, a correlation between the heights of fathers and sons may be produced by fathers directly causing their sons' heights (e.g. through genetics) but is unlikely to be the result of sons causing their fathers' heights. The latter would require a factor (i.e. height of sons) to cause something that occurred earlier in time (i.e. the height of the fathers), and is therefore implausible. In contrast, a correlation between cancer and consumption of alcohol might imply that alcohol causes cancer but could also imply that cancer causes the consumption of alcohol, for example to reduce the pain of the cancer. As there is no logically necessary ordering to these two factors (i.e. the cancer or the consumption could have started first) additional research is necessary to determine causation.
+* correlation '''does not imply causation'''. This is the most frequent mistake made by people. There are set of principles of [[causal inference]] that need to be satisfied in order to imply cause and effect. [http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node46.html]
-It is sometimes also possible to reject particular causal arguments based on theoretical or substantive knowledge even if there is no obvious ordering of the events in time. For example, are deaths from respiratory disease causing air pollution, or is air pollution causing deaths from respiratory disease? There is no theoretical or substantive reason to think that human deaths from respiratory disease can degrade air quality, but there are both theoretical and substantive reasons to think that chemicals like sulfur dioxide (typically from coal burning [[power plant]]s) can produce respiratory disease. An association between the two is therefore suggestive that air pollution causes respiratory disease though it remains possible that both might be caused by a third factor.
-Regulations which restrict air pollution are often made on the basis of these correlations, as well as on the cause and effect relationships which the correlations help scientists to discover. However, activists have sometimes selectively used data to create the appearance of a correlation where none exists. [http://www.robinsoncurriculum.com/view/rc/s31p59.htm] Likewise, manipulation of data can be used to obscure a correlation that does exist. It is therefore important that good practices be used in the collection and analysis of data to ensure that reported correlations, or the lack thereof, are reliable
+==Unrelated factors==
+Often [[cancer clusters]] have been said to indicate the existence of a biological hazard, but by the laws of chance in any group of people who get a rare disease (or win the lottery) there will always be some who live close together. This does not mean that something in their environment caused their tragedy or good fortune.
+If you flip coins or throw dice, from time to time you will get runs of similar results like 4 heads in a row. You might try to find the cause for these runs in what you were thinking about at the time, hoping to find that you could influence the toss or roll. But this is nonsense.
+When unrelated events occur, it is dismissed as a matter of coincidence.
+==Factors which have a common cause==
+Two commuters might happen to see each other on a train, not because either intended to meet the other, but simply because each happens to dislike crowded trains and prefer trains with dining cars. When the regular train is crowded or lacks a dining car, both may independently decide to take a later train. Seeing each other in the dining car of one would not be merely a coincidence, but would stem from the conditions on the earlier train.
+Cases of shark attacks correlate with sales of ice cream, but no one thinks that the bites make people buy ice cream, or that eating ice cream makes you vulnerable to sharks. Both things go up and down in an annual cycle, because people go to the beach more in the summer. They swim and buy ice cream because it's hot. Swimming exposes them to (rare) shark attacks.
+==One thing causes another==
+In scientific subjects like chemistry, physics and astronomy many laws of cause and effect have been discovered. Put two chemicals together, and they form a compound. Push something, and its momentum increases. The [[scientific method]] is useful in describing the relationship between events, especially when despite their best efforts, no one has been able to find an exception (see [[independent review]]).
+==Terminology==
+In common usage, it denotes an association of one variable with another in quite general terms; for example, one might say, "success is correlated with hard work". In mathematics, however, and in science and engineering, which make use of mathematical concepts, correlation is a technical term with a precise definition.
+Correlation must be distinguished from causation (see article on [[correlation is not causation]]). When one factor changes and another factor changes with it, there is usually a direct relationship between the two factors, observed as a correlation. Alternatively, both changed could be the result of changes in a third factor. For example, the prices of two unrelated goods might increase during a period of [[inflation]]; the two price rises are correlated with each other but neither has caused the other.
+Once a correlation is established, scientists may conduct research to determine causation. Are respiration deaths causing air pollution, or is it the other way around? It is easy to determine that sickness among the elderly does not cause air pollution. Rather, it is chemicals like sulfur dioxide (typically from coal burning [[power plant]]s) which are the culprits. Cities and states measure the amount of pollutants in the air and epidemiologists can use these data, comparing them to the number of people who develop respiratory diseases.
+Regulations which restrict air pollution are made on the basis of these correlations, and on the cause and effect relationships which the correlations help scientists to discover. However, activists have sometimes created false correlations by selective use of data. [http://www.robinsoncurriculum.com/view/rc/s31p59.htm]
 == Formal definition ==
@@ Line 41: / Line 65: @@
 '''t'''= (r)√(n-2)/√(1-r<sup>2</sup>)
-The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. typically a probability of occurring naturally of less than 5%) means that the chance of getting a correlation coefficient at least as large (whichever the direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but many data are observed.
+The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. a low value of ''P'' in the ''t'' test) means that the chance of getting a correlation coefficient at least as large (whichever the direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but many data are observed.
+One should note that for most random variables a correlation of 0 does not imply that they are independent.  However, if the two variables come from the [[Normal Distribution]] then it can be claimed that a correlation of 0 implies independence.
 It is important to note that the correlation coefficient should not be calculated when either of the variables is not continuous. That is, when they do not vary continuously and have a meaningful zero. As such, correlating a dichotomous variable (e.g. sex) with a ratio variable (e.g. IQ) is inappropriate and will return uninterpretable results.
@@ Line 49: / Line 75: @@
 ==Correlation and Causation==
-''Main article: [[Correlation is not causation]]''
+{{Main|Correlation is not causation}}
 Correlation is a linear statistic and makes no mathematical distinction between [[independent variable|independent]] and [[dependent variable]]s. As a consequence, it is impossible to assert [[causation]] based on a correlation even if that correlation is statistically significant. Therefore, when a significant correlation has been identified it is possible that X -> Y (i.e. X causes Y), Y -> X (i.e. Y causes X), or that X <- Z -> Y (i.e. Both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation using correlation and claims to this effect must be viewed with considerable skepticism.
@@ Line 57: / Line 83: @@
 Aron, Arthur, Elaine N. Aron and Elliot J. Coups. 2008. ''Statistics for the Behavioral and Social Sciences: A Brief Course.'' 4/e. Upper Saddle River, New Jersey: Prentice Hall.
-[[Category:statistics]]
+[[Category:Statistics]]
