# Chi-Square test

The chi-square test is a statistical test based on the chi-square distribution. It is non-parametric, so it makes fewer assumptions about the data being compared than parametric tests do, but it has less statistical power as a result. Its most common use is to test whether the difference between proportions in data sets is statistically significant. This usually takes the form of a bivariate tabular analysis: a cross-tabulation of counts for an independent variable against a dependent variable.

For example, one's independent variable might be political affiliation and the dependent variable might be support for a particular law. In this hypothetical example the data look like this:

| | Oppose | Support |
|---|---|---|
| Liberal | 45 | 5 |
| Conservative | 10 | 40 |

A chi-square test would be used to answer whether the difference in the proportions of support and opposition between the two political affiliations is statistically significant.

## Worked Example

To calculate the chi-square statistic, we go through a number of steps.

First, we need to calculate the expected number of people in each category, given the assumption that politics (being liberal or conservative, in this case) does not influence whether someone supports or opposes this measure. From the data, there are 50 liberals out of 100, so we assume the probability of being a liberal is 50/100 = 1/2, and similarly the probability of being a conservative is also 1/2. Also from the data, 55 people out of 100 oppose the law, so we assume the probability of opposing the law is 55/100 = 11/20. Similarly, the probability of supporting it is assumed to be 45/100 = 9/20.

From this, we can calculate the expected values for each cell in the table. The expected number of liberals who oppose the law is 100 * 1/2 * 11/20 = 27.5. This is the number of people in the survey, multiplied by the probability of being a liberal, and multiplied by the probability of opposing the law.

We can use this to calculate the other expected values. There are 55 people who oppose, so the expected number of conservatives who oppose the law is 55 - 27.5 = 27.5. That is the total number of those who oppose the law, minus the expected number of liberals who oppose it.

Similarly, there are 50 liberals, so the expected number of liberals who support the law, added to the expected number who oppose it (27.5), must be 50. So the expected number of liberals who support the law is 50 - 27.5 = 22.5. Finally, the number of conservatives is 50, so the expected number of conservatives who support the law, plus the expected number who oppose it (27.5), must add up to 50. The expected number of conservatives who support the law is therefore 50 - 27.5 = 22.5.

This gives the following table of expected values, given the assumption that support for this law is independent of political affiliation.

| | Oppose | Support |
|---|---|---|
| Liberal | 27.5 | 22.5 |
| Conservative | 27.5 | 22.5 |

Note that the row and column totals are the same as in the actual data. This will always be true when calculating expected values, and in effect we used it to calculate all the values apart from the top-left one. Note also that the expected values are quite distinct from the actual values, suggesting that the assumption we made is incorrect.
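The expected values can also be computed directly from the row and column totals, since the number of people multiplied by the two probabilities is the same as (row total × column total) / grand total. The following is a minimal Python sketch of that calculation; the names `observed` and `expected_counts` are chosen here for illustration and are not part of the original example.

```python
# Observed counts from the example.
# Rows: Liberal, Conservative. Columns: Oppose, Support.
observed = [[45, 5],
            [10, 40]]


def expected_counts(table):
    """Expected counts under independence:
    (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
print(expected)  # [[27.5, 22.5], [27.5, 22.5]]
```

Because every cell is built from the marginal totals, the rows and columns of the result necessarily sum to the same totals as the observed table.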

We now calculate the differences between the two tables, subtracting each expected value from the corresponding actual value. The results are shown below:

| | Oppose | Support |
|---|---|---|
| Liberal | 17.5 | -17.5 |
| Conservative | -17.5 | 17.5 |

Now, we must square the differences. This makes them positive, which is necessary because we will be adding the differences together, and otherwise they would cancel out. Below are the squares of the differences:

| | Oppose | Support |
|---|---|---|
| Liberal | 306.25 | 306.25 |
| Conservative | 306.25 | 306.25 |

The final stage is to divide these squared differences by the corresponding expected values, and sum them. The division matters because a difference of 10 from an expected value of 20 is large (a 50% difference), whereas the same difference from an expected value of 20,000 is negligible (a 0.05% difference). Below are the squared differences, divided by the corresponding expected values:

| | Oppose | Support |
|---|---|---|
| Liberal | 11.14 | 13.61 |
| Conservative | 11.14 | 13.61 |
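The steps so far (difference, square, divide by the expected value) can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the function names are hypothetical, and the expected counts are recomputed from the row and column totals of the observed table.

```python
# Observed counts from the example.
# Rows: Liberal, Conservative. Columns: Oppose, Support.
observed = [[45, 5],
            [10, 40]]


def expected_counts(table):
    """Expected counts under independence, from the marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_contributions(table):
    """Per-cell values of (observed - expected)**2 / expected."""
    expected = expected_counts(table)
    return [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]


contributions = chi_square_contributions(observed)

# The chi-square statistic is the sum of all per-cell contributions.
statistic = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(value, 2) for value in row])
print(round(statistic, 1))
```

Keeping the per-cell contributions separate from the final sum makes it easy to see which cells depart most from the expected values.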

Now, we sum these values to get 49.5 (to one decimal place). To interpret this, we need to consider the number of degrees of freedom. This is the number of cells that can be changed without changing the row and column totals. In this case, it is 1, as once we know the top-left cell, we can determine all the other cells. In general, with a two-dimensional table, it is (r-1)*(c-1), where r and c are the number of rows and columns.

We can compare the statistic with the critical values of the chi-square distribution. For 1 degree of freedom, the critical value for p = 0.001 is 10.83. This means that, if our assumption of independence is true, the probability of the chi-square statistic being 10.83 or higher is 0.001 (or 0.1%). Here, the statistic is 49.5, much larger than this, so the probability of the statistic being this high is less than 0.1%, given our assumption.
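The comparison with the critical value can be checked numerically. The sketch below uses only the Python standard library and relies on an identity that holds for 1 degree of freedom only: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(x/2)). The function names are hypothetical, and the statistic 49.5 is taken from the worked example above.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a two-dimensional table: (r-1)*(c-1)."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square statistic.

    Valid for 1 degree of freedom ONLY: it uses the fact that the
    statistic is then distributed as the square of a standard normal.
    """
    return math.erfc(math.sqrt(statistic / 2))


statistic = 49.5  # value from the worked example, to one decimal place

print(degrees_of_freedom(2, 2))
print(chi_square_p_value_1df(10.83))      # close to 0.001, the tabulated level
print(chi_square_p_value_1df(statistic))  # far smaller than 0.001
```

For tables with more than 1 degree of freedom this shortcut does not apply, and a statistics library would normally be used instead, for example `scipy.stats.chi2.sf`. Note that SciPy's `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so it would report a somewhat smaller statistic than the hand calculation here unless the correction is turned off.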

So, we can very confidently conclude, given this data, that political affiliation is linked to support for this law. Technically, the test alone does not tell us what sort of link exists between political affiliation and support, although the data make it clear that liberals are more likely to oppose the law and conservatives are more likely to support it.