# Difference between revisions of "Talk:Significance of E. Coli Evolution Experiments"

(credibility) |
(Removed supported claims.) |

## Revision as of 10:57, 14 March 2009

SJohnson, your assessment, while good in its use of the chi-squared test, is unfortunately incorrect. The Monte Carlo resampling gives a more accurate p-value than the chi-squared test. You may research the literature (e.g. publications in statistical mathematics; many actually compare Monte Carlo vs. chi-squared) to discover that this method is commonly used in advanced statistical work and how it is more accurate than the chi-squared test.--Able806 17:00, 4 March 2009 (EST)

- It doesn’t make sense to compare the chi-square test, which is a specific statistical hypothesis test, to Monte Carlo methods, which can be used for anything from fluid motion modeling to p-value computations. You can use Monte Carlo methods to compute the p-values of the chi-square test!

- Monte Carlo methods involve the generation of random realizations. Your broad claim that Monte Carlo methods are “more accurate” than the chi-square test is obviously incorrect because the accuracy of Monte Carlo methods always depends on the number of random realizations generated. When p-values are small, Monte Carlo methods are notoriously inaccurate unless the number of realizations generated is enormous.

- Which publications compare Monte Carlo to chi-square and show that the former is more accurate? Could you provide specific examples? Thanks. SJohnson 18:50, 4 March 2009 (EST)
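The point that a Monte Carlo method can supply the p-value of the chi-square statistic itself can be sketched as follows. This is a minimal illustration, not the analysis from the paper or from this thread: the cell counts, the null probabilities, the function names, and the add-one estimator `(hits + 1) / (n_sims + 1)` are all assumptions chosen for the example.

```python
import random

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def monte_carlo_p_value(observed, probs, n_sims=2000, seed=0):
    """Estimate the p-value of the chi-square statistic by simulation.

    observed: hypothetical cell counts; probs: null cell probabilities.
    Each realization draws sum(observed) events from the null distribution.
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat_obs = chi_square_stat(observed, expected)
    cells = range(len(probs))
    hits = 0
    for _ in range(n_sims):
        counts = [0] * len(probs)
        for cell in rng.choices(cells, weights=probs, k=n):
            counts[cell] += 1
        if chi_square_stat(counts, expected) >= stat_obs:
            hits += 1
    # Add-one estimator (an assumption of this sketch) avoids reporting p = 0.
    return (hits + 1) / (n_sims + 1)

# Hypothetical data: 20 events spread over 4 equally likely cells.
print(monte_carlo_p_value([8, 4, 5, 3], [0.25] * 4))
```

The test statistic and the method used to get its p-value are separate choices here: swapping `chi_square_stat` for another statistic would leave the simulation loop unchanged.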

- In furtherance of SJohnson's remarks with respect to rarely occurring events, the use of the basic Monte Carlo method is plainly incorrect for modeling a rarely occurring event, as the Lenski paper did. This has long been pointed out in Flaws in Richard Lenski Study. I know evolutionists will never admit a flaw in anything promoting their pet theory, but this flaw (and others) in that paper are undeniable.

- Watch how evolutionists defended obvious errors in the Lenski paper, and then realize why the Piltdown Man fraud was taught for 40 years without evolutionists admitting it was a hoax.--Andy Schlafly 09:55, 5 March 2009 (EST)

- Andy, how exactly is the Monte Carlo method incorrect to use in this case? I have seen it used in publications with much smaller datasets.--Able806 10:29, 5 March 2009 (EST)

- Able806, I'm interested in looking at the publications you mentioned that use Monte Carlo methods to analyze small data sets. Could you provide some examples? Thanks. SJohnson 16:41, 5 March 2009 (EST)

- SJohnson, here are two papers, 1 and 2. Most are in chemistry and genetics where you find the observed to be much smaller and have to use the MCM. You can search on the subject as well and find that how Lenski performed the test is the standard for microbiological genetic analysis.--Able806 10:19, 11 March 2009 (EDT)

- Those papers have nothing to do with hypothesis testing. One is an archeology paper. To be blunt, it seems like you’re just doing internet searches on “Monte Carlo” to find these links. SJohnson 10:10, 12 March 2009 (EDT)

- SJohnson, actually they do. Did you read the papers? If so you would see how they used the MCM for their data analysis of small data sets, which indeed was hypothesis testing and answers your inquiry about publications that use MCM for small data set analysis. If you wish I can try to track down some actual mathematical publications; however, I am not as familiar with mathematical journals as I am with science/medical journals (not knowing which mathematical journals are acceptable). I am assuming that you have a background in math and possibly access to mathematical journals, so if you know the reputable ones I can do the leg work.
- I believe the thing that needs to be looked at is whether there truly is a problem with the choice of test and, if so, what the alternative is. A Bayesian approach might be an option but seems difficult to employ in this situation.--Able806 12:36, 12 March 2009 (EDT)

- Able806, you still seem to miss the point about how inappropriate the Monte Carlo method (as used in the Lenski paper) is for evaluating rarely occurring events. You need to open your mind to be productive. If you simply cling to a view that Lenski (who I don't think has any meaningful education in statistics) must somehow be right, then you're not going to make any progress in understanding the flaws.--Andy Schlafly 17:07, 5 March 2009 (EST)

- Andy, you still have not answered what you find inappropriate about his use of the Monte Carlo method? I am a reasonable person and with evidence I do have an open mind. I provided examples last week, with a working model, showing that Monte Carlo is better than the chi-square in this case. I have also shown where the Chi-Square was inappropriate due to the occurrence size as well. So if you have any evidence that Monte Carlo should not be used in the way that Lenski used please let it be shown.--Able806 10:19, 11 March 2009 (EDT)

Sjohnson, I believe you just proved my point. In the literature of mean and covariance structure analysis, non-central chi-square distribution is commonly used to describe the behavior of the likelihood ratio statistic under alternative hypothesis; it is widely believed that the non-central chi-square distribution is justified by statistical theory. Actually, when the null hypothesis is not trivially violated, the non-central chi-square distribution cannot describe the LR statistic well even when data are normally distributed and the sample size is large. Monte Carlo results compare the strength of the normal distribution against that of the non-central chi-square distribution. In an association analysis comparing cases and controls with respect to allele frequencies at a highly polymorphic locus, a potential problem is that the conventional chi-squared test may not be valid for a large, sparse contingency table. Reliance on statistics with known asymptotic distribution is unnecessary, as Monte Carlo simulations can be performed to estimate the significance level of the test statistic.

Here is a link to a great page that provides an interactive example as to why the Chi Squared test would provide poor results compared to the Monte Carlo in relation to the Lenski data workup.

Something you may have overlooked was that the data set is actually too small to use the chi square method correctly. It is often accepted that if any of the analyzed data falls under 10 for a particular cell of the data set then the Yates correction needs to be applied; unfortunately the Yates correction can overcorrect, thus skewing the p-value. Lenski seemed to understand this by supporting his Monte Carlo p-value results with the Fisher z-transformation p-value.

I hope this helps.--Able806 10:27, 5 March 2009 (EST)

- I’m still waiting to hear which literature says that “Monte Carlo resampling” is “more accurate than the chi-squared test”. The page mentioned above [1] is a discussion of why statisticians “fail to reject the null” rather than “accepting the null” when the p-value is above 0.05 or so. The page says nothing about superiority of Monte Carlo methods. Why were alternate hypothesis distributions mentioned? Only the null hypothesis distribution is used to calculate a p-value. Yates’s correction is for 2x2 contingency tables [2]. It doesn’t apply in this case. Finally, what the heck do “covariance structure analysis” and “allele frequencies at a highly polymorphic locus” have to do with this problem? SJohnson 16:38, 5 March 2009 (EST)

- SJohnson, I am looking for this paper for you, I cited it for one of my past publications dealing with allele frequencies (I believe it came from the Duke Biostatistics group). To answer your question about allele frequencies, that is the issue at hand, more about the genetics than the math, but it is the item being studied. So you stated that Yates can not be used and statistics says the number of occurrences is too small to evaluate using the Chi-Squared test so what would you recommend instead of the Monte-Carlo Method?

- Regarding the "Fisher z-transformation p-value" from the paper, garbage in garbage out. If the p-values were bad to begin with, then why would a combination of them be meaningful? SJohnson 10:49, 9 March 2009 (EDT)

- You are assuming that p-values are wrong based on a test that is inappropriate in this case due to data limitations. Did you perform a z-transformation on the chi-squared for the three data groups?--Able806 10:19, 11 March 2009 (EDT)

- You asked about the “Fisher z-transformation p-value”. The z-transformation test and Fisher’s method are actually two different things (see Whitlock's 2005 paper - Ref. 49 in Blount et al.). But no, I haven’t tried either. SJohnson 10:10, 12 March 2009 (EDT)

- There's a large literature on various kinds of Monte Carlo test, a very short summary of which is that they're inevitably more accurate than parametric tests (e.g. F, t, chi-squared, etc.) because they don't make assumptions about the distribution of the data under the null hypothesis. See for example *Introduction to the Bootstrap* by B. Efron and R. Tibshirani and *The Jack-knife, the Bootstrap and Other Resampling Plans*, also by Efron. They're certainly applicable to small datasets and their accuracy is really only limited by the number of samples you care to take. E.g. 1000 M-C samples would give you a pretty accurate idea about significance at the alpha<1% level. (That book should answer SJohnson's questions of 18:50 on 4/3/09 and 16:38 on 5/3/09 about accuracy and Aschlafly's comment of 17:07 on 5/3/09 about appropriateness of Monte Carlo tests.) FredFerguson 16:53, 11 March 2009 (EDT)

- Your claim that Monte Carlo methods are “inevitably more accurate” than other tests is obviously wrong because the accuracy of MC methods always depends on the number of realizations used. You should have written $\alpha = 1\%$, not $\alpha < 1\%$. If 1,000 random realizations are generated, the number of realizations above the true $\alpha = 1\%$ level is binomial with mean 10 and variance about 10. Thus, the standard deviation of the MC estimate is >0.003. In this example, a Monte Carlo p-value could be off by 30% and still be within a standard deviation. Is that really “pretty accurate”?

- Using one million MC realizations (as done in the paper) at the $10^{-4}$ level means the standard deviation is about 10% of that level. The paper reported a p-value of less than 0.001 (experiment two). It wouldn’t surprise me to find out that the experiment two p-value for the flawed test is off because only one million realizations were used. My original statement, “When p-values are small, Monte Carlo methods are notoriously inaccurate unless the number of realizations generated is enormous” is correct. SJohnson 10:10, 12 March 2009 (EDT)
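The binomial argument above can be checked numerically. The sketch below assumes the Monte Carlo p-value estimate is a simple proportion of exceedances, so its standard error is $\sqrt{p(1-p)/N}$; the function name is hypothetical, and the $10^{-4}$ level used for the one-million-realization case is inferred from the "about 10%" figure rather than stated in the thread.

```python
import math

def mc_relative_std_error(p_true, n_realizations):
    """Standard error of a Monte Carlo p-value estimate, relative to p_true.

    The number of exceedances is binomial(n, p), so the estimate has
    standard deviation sqrt(p * (1 - p) / n); dividing by p gives the
    relative error.
    """
    return math.sqrt((1.0 - p_true) / (p_true * n_realizations))

# 1,000 realizations at the 1% level, and one million at the 1e-4 level.
for p, n in [(0.01, 1_000), (1e-4, 1_000_000)]:
    print(f"p={p}, N={n}: relative std. error = {mc_relative_std_error(p, n):.3f}")
```

Because the relative error scales as $1/\sqrt{Np}$, halving it requires four times as many realizations.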

- You're talking about minuscule differences in the accuracy of a test. 0.013 isn't very different from 0.007. In either case, it's very unlikely the experimenter would have obtained that result if the null hypothesis were true. If you're bothered about differences in P-values to the third decimal place (which would make you unusual!), just run more MC realisations, that's all. Not really a problem. FredFerguson 11:53, 12 March 2009 (EDT)

There’s still confusion about the difference between test statistics and Monte Carlo methods. Before you find a Monte Carlo estimate of a p-value, you need to select a test statistic to reduce the data set to a scalar. I am interested in hearing which test statistic you believe should be used in place of the chi-square test and why. SJohnson 10:10, 12 March 2009 (EDT)

Quick question for SJohnson: How many degrees of freedom did you choose when calculating the p-value? I'd like to know upon what condition you base that number. Thanks.--Argon 11:05, 5 March 2009 (EST)

- The degree of freedom for a contingency table is rows minus one times columns minus one. That is, $\mathrm{DOF} = (r - 1)(c - 1)$. Here’s a pretty good tutorial I came across: [3]. For the experiments from [4], the DOFs are 11, 11, and 13. For experiment one, the chi-square test statistic is $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \approx 14.82$,
- where $O_i$ is the observed value and $E_i$ is the expected null hypothesis value. So if you have MS Excel, another way to arrive at the p-value of 0.19 is to type “=CHIDIST(14.82,11)” into a cell. Cheers! SJohnson 16:38, 5 March 2009 (EST)
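For readers without Excel, the CHIDIST lookup above can be reproduced with the Python standard library. This is a sketch under stated assumptions: the function name `chi2_sf` is hypothetical, and the upper-tail probability is computed from the power series for the regularized lower incomplete gamma function, which is adequate for moderate arguments like this one but is not a general-purpose implementation.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom.

    Uses the series gamma(a, z) = z^a e^(-z) * sum_n z^n / (a (a+1) ... (a+n))
    with a = df / 2 and z = x / 2.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    log_prefix = a * math.log(z) - z - math.lgamma(a)
    return 1.0 - math.exp(log_prefix) * total

# Statistic and degrees of freedom quoted in the comment above.
print(round(chi2_sf(14.82, 11), 2))  # 0.19
```

If SciPy is available, `scipy.stats.chi2.sf(14.82, 11)` returns the same tail probability.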

- OK, thanks for the info. From what I'd calculated and looked up in tables, the numbers seemed close to a df=11 for a chi-square of ~14. (Aside: With terms having 17/3 in the denominator in the figures above, were you using the test of independence? I was using Pearson's test for fit of a distribution which returns a chi-squared value of 14 and roughly matched the p-values you reported, assuming the df was 11).

- Also, the first sentence of the article reads: "Blount, Borland, and Lenski[1] claimed that a key evolutionary innovation was observed during a laboratory experiment. That claim is false." A small correction: There were several claims in the paper. The 'key evolutionary innovation' was acquiring the ability to utilize citrate as a food source. That claim was demonstrated multiple times. The claim that pertains to this statistics discussion was that the Cit+ phenotype arose in a multi-step process, first requiring a rare, pre-adaptive mutation before additional mutation(s) led to the subsequent development of citrate utilization.--Argon 20:46, 5 March 2009 (EST)

- My biology-degreed wife assures me that mutation does not necessarily mean that evolution occurred. What the paper claimed is that evolution (a “key innovation”) occurred in the lab. The key innovation supposedly increased the mutation rate. In the experiments, the observed mutation rate increased after generation 31,000, but not enough to make a statistically significant claim that the rate is not constant. The analysis in the paper was similar to flipping a coin ten times, counting six heads and claiming that the coin must be biased against tails. In reality, there’s nothing surprising about a fair coin producing slightly more of one outcome than the other. Just like there's nothing surprising about there being slightly more mutations in later generations than early generations given the null hypothesis (constant mutation rate). SJohnson 10:46, 9 March 2009 (EDT)
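The coin analogy above can be made concrete with an exact binomial tail. This is a small sketch; the function name is hypothetical, and a one-sided test (at least six heads in ten flips of a fair coin) is assumed.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """Probability of at least k successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Six or more heads in ten flips of a fair coin.
print(round(binomial_upper_tail(10, 6), 3))  # 0.377
```

A p-value of about 0.38 is far above any conventional significance level, which is the sense in which six heads in ten flips says nothing about bias.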

- SJohnson, not to say anything about your wife, but has she had a 400 level molecular genetics course (most general biology degrees do not cover the detail unless they are specialized)? If so, she would have mentioned that if the mutation passes to the offspring and is selectively beneficial to the population then it is a step of evolution, as long as the conditions continue through the sharing of the mutation with the population and the environment is such that it reduces the growth rate of the non-transformed population. While not all mutations are signs that evolution occurred, the mutations that pass to offspring and provide a benefit compared to other offspring are very strong indicators. In the case of this paper the population that evolved the cit+ was able to metabolize a chemical in their environment which allowed for an adaptation advantage compared to the non-transformed colonies.--Able806 10:19, 11 March 2009 (EDT)

Let’s go back to the beginning. There appears to be confusion about the difference between test statistics and methods for computing p-values. As is noted at the beginning of the page [5], the fundamental problem with the paper is that it used a flawed test statistic, not that it used Monte Carlo methods to find the p-value for that flawed statistic.

Every hypothesis test uses a test statistic to reduce the data to a single number. The p-value for the test statistic can be calculated analytically (as I’ve done for the chi-square test statistic) or by Monte Carlo methods. In the paper, Monte Carlo methods were used to compute the p-value of the “mutation generation” test statistic. The key problem with the analysis from the paper is that it doesn’t work to use a weighted average to test for variations in mutation rate. This is like trying to use the sample variance to test for an increase in the mean in Gaussian-distributed data. A statistic should be selected based on the null and alternate hypothesis distributions of the data. The chi-square test (unlike the weighted average from the paper) is a reasonable choice for data that mutates at a constant rate under the null hypothesis, but mutates at varying rates under the alternate hypothesis.

Able806, you made a good point about the contingency table cell frequencies being relatively low, but were wrong when you said “the data set is actually too small to use the chi square method correctly”. In the low cell frequency case the chi-square test is still effective, but the null hypothesis distribution of the chi-square statistic starts to look less like the chi-square distribution. Thus, p-values calculated using the chi-square distribution may be a bit off. However, Monte Carlo p-values are always imperfect as well because it's impossible to generate an infinite number of random realizations. There are imperfections in p-values generated by analytic and Monte Carlo methods. However, low cell frequencies do not explain the >20x and >2.5x differences between chi-square p-values and p-values from the paper for experiments one and three. The reason for those huge differences was the use of the flawed test statistic (“mutation generation”) in the paper. SJohnson 16:38, 5 March 2009 (EST)

- SJohnson, the chi-squared test is a valuable statistical tool, but the limitations of the test must be acknowledged. The chi-squared test can only produce valid results if the assumptions that underlie the test are not violated. As an analogy, Newtonian models of motion fail to produce accurate results as velocities approach the speed of light; under those circumstances one must switch to a theory that accounts for relativistic effects.

- It seems that you have simply dismissed the widely-acknowledged fact that the chi-squared test is inappropriate for use in situations where n in any cell is less than a threshold number. Different authors set different thresholds, but all are well above the numbers seen in your chi-squared analysis - even the most liberal guidelines advise against the chi-squared test when any expected cell frequency is less than one or more than 20% of the table cells are less than 5; others require that expected values in all cells must be more than 5. With smaller amounts of data, the test is insensitive and errs on the side of rejecting the hypothesis. If you attempt your chi-squared statistical analysis with a program that is more sophisticated than MS Excel (as I did), you get an error message indicating that the results are invalid due to low expected cell counts.

- That issue aside, there are other reasons that the chi-squared test is inappropriate here. As the links above point out, the categories tested must be truly independent; one example is that you can't use the chi-squared test to compare age and ability to kick a field goal by testing the same experimental group twice, one year apart; you have to test one group of age A and a different group of age B. In the case of the Blount paper, the categories are not independent. Even if there were adequate numbers to address the low-expected-frequency problem, this would make the chi-squared an invalid test in this case.

- There are other significant problems with the use of the chi-squared test in this circumstance, but they can wait until you address these first major problems.--ElyM 12:18, 11 March 2009 (EDT)

- Wackerly et al. says in general it’s assumed that the cell frequencies are above five so that the chi-square statistic (under the null) is approximately chi-square distributed (see p. 703). That book does not say chi-square test results are invalid if frequencies are five or less. Your example of a chi-square test warning message (it said "warning" not "error" as you stated) in Minitab [6] said “approximation probably invalid” referring to the chi-square distribution approximation to the chi-square test statistic’s distribution. Your example did not say “chi-square test invalid”. I agree that when cell frequencies are low, the chi-square test statistic’s distribution starts to deviate from the chi-square distribution. I maintain that this deviation is not enough to explain the >2.5x and >20x differences in the chi-square test p-values and the p-values from the paper.

- As the numerous links in your post proved, the chi-square test is widely-used by statisticians. Can you give examples of statisticians using mean mutation generation as a test statistic? Also, did your software agree with the chi-square test p-values I presented? Thanks. SJohnson 10:10, 12 March 2009 (EDT)

- Thank you for giving page references for Wackerly; however it seems we have different editions, since page 703 in my copy (5th ed, 1996) does not deal with chi-squared issues at all. My copy does state the following, on page 622: "Although the mathematical proof is beyond the scope of this text, it can be shown that, when n is large [chi-squared] will possess approximately a chi-square probability distribution in repeated sampling." Then, on page 624: "Experience has shown that cell counts [n sub i] should not be too small in order that the chi-square distribution provide an accurate approximation to the distribution of [chi squared]. As a rule of thumb we require that all expected cell counts equal or exceed 5, although Cochran (1952) has noted that this value can be as low as 1 for some situations." Wackerly then goes on, in the problems sections, to describe the use of the chi-squared test as a "violation of good statistical practice" when "some expected counts [are] <5."

- It seems that you are already aware that the [chi-square] statistic under the null is no longer chi-square distributed for small n; this is precisely why the test should not be used under those conditions. I can claim to be able to accelerate a 1-kg mass to 10 times the speed of light by applying 1 N of force for 95 years by using F=ma and t= (vf-vi)/a. Plugging the numbers into those equations will produce the same result every time, but the answer is illegitimate because those equations are only valid under certain assumptions, which are violated as velocities approach the speed of light. Similarly, having a statistical program calculate a chi-squared value given the Blount data will produce a number result, but since the assumptions of the test are violated the result is not legitimate. Yes, if I put the Blount data in SAS 9.2, I get the same numerical answer as you do, but I also get the following message: "WARNING: >89% of the cells have expected counts less than 5. Chi-square may not be a valid test." You may argue that that's a warning, not an error; that's a semantic distinction. The reason that the program says that it MAY not be valid is that the chi-squared test skews in the direction of being too conservative at low n values; the test has an acceptable rate of false positives but an unacceptably high rate of false negatives. Comparing the results of the Monte Carlo and chi-squared results in this case is like comparing the results of Newtonian and relativistic equations of motion: they can produce very different results from the same input data.

- Your last paragraph has a major non sequitur in it: yes, many statisticians use the chi-square test. As long as the assumptions of the test are not violated, it is a valuable tool. That has nothing to do with the validity of using mean mutation generation as a test statistic. 'Mean number of werewolf attacks in Mumbai in the week centered on the new moon, by month, from 1654 to 1798' is a valid test statistic. I am quite sure that it has never been used in a peer-reviewed paper before. That does not mean that I can't perform valid statistical tests on that statistic. If, however, the incorrect test is applied, the results of the analysis will be flawed. Papers apply a (relatively small) standard repertoire of valid tests to a (potentially infinite) number of test statistics. The particular test statistic used in a paper may never have been used before and may never be used again; that does not address the validity of the analysis. In Blount's case, the test is the Monte Carlo analysis, which is also "widely-used by statisticians".

- We still haven't touched on the issue of the categories not being independent, which by itself is sufficient to invalidate the chi-squared technique. I'm new to this site, so I'm unsure as to the etiquette of making changes to the articles of another person - but the article here should at the very least mention that the chi-square test is being used here in a manner that violates its underlying assumptions in at least two fundamental ways, and the results are therefore suspect.--ElyM 17:34, 12 March 2009 (EDT)
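The expected-count rule of thumb quoted from Wackerly, and the kind of warning SAS and Minitab print, can be sketched as a simple check. Everything here is illustrative: the table is hypothetical (it is not the Blount data), the function names are made up for the example, and the threshold of 5 is just the rule-of-thumb value discussed in this thread.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def low_expected_count_report(table, threshold=5.0):
    """Summarize how many cells have expected counts below the threshold."""
    flat = [e for row in expected_counts(table) for e in row]
    below = sum(1 for e in flat if e < threshold)
    return {
        "cells": len(flat),
        "below_threshold": below,
        "fraction_below": below / len(flat),
        "min_expected": min(flat),
    }

# Hypothetical 2x3 table: rare events in the first row, non-events in the second.
report = low_expected_count_report([[1, 0, 2], [9, 10, 8]])
print(report)
```

A report like this only flags when the chi-square approximation is doubtful; it does not by itself say how far the resulting p-value is off.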

- It looks to me as though SJohnson has misinterpreted the application of the chi-squared test in quite a fundamental way. His/her analysis of Blount's data are therefore close to meaningless, regardless of whether the test used by Blount is appropriate or not. In my opinion, the entire page should therefore be deleted. FredFerguson 08:18, 13 March 2009 (EDT)

- "Fred", perhaps you mistakenly think this is Wikipedia, where censorship and deletion of pages for ideological reasons are common. Not here.--Andy Schlafly 10:23, 14 March 2009 (EDT)

- Umm... I'm suggesting deletion for mathematical reasons, not ideological reasons. Using an argument filled with mathematical errors to try to support your case only detracts from your credibility. FredFerguson 10:38, 14 March 2009 (EDT)

- Actually, I think correction is better than deletion. So that's what I've done. FredFerguson 11:01, 14 March 2009 (EDT)

- I find no credibility in your denial of having ideological reasons.--Andy Schlafly 11:04, 14 March 2009 (EDT)

## Misinterpretation of test

SJohnson, your analysis misinterprets the test. You say the null hypothesis is that this mutation cannot happen. They saw a mutation (4 mutations, in fact, in the data set you show) so the null hypothesis (as you state it) is disproved. That's perfectly straightforward.

I don't know what the "mean mutation generation" test is, but what you're doing when you apply a chi-squared test to this dataset is testing whether the mutations are evenly distributed throughout the generations. Your test says they are, so there's no strong evidence to suppose that mutations are likely to occur in one generation rather than another in the series of tests. Blount's test says they aren't, so it's more likely that the mutation will occur later in the series of tests. I can't tell which test is right without knowing more about the test that Blount used.

But that point (the foregoing paragraph) has no bearing at all on the null hypothesis, as you describe it. The mutation appeared, so that means the hypothesis that the mutation can't happen is disproved. Very simple. FredFerguson 21:10, 8 March 2009 (EDT)

- I never said that “the null hypothesis is that this mutation cannot happen”. The chi-square test statistic I'm using wouldn’t be defined if the null hypothesis mutation rate was zero because the term in the denominator of the statistic (see above equation) would be zero.

- The test statistic from the paper is the average of the generation numbers of observed mutations. For experiment one this number is
- The same number is shown in Table 2 of the paper. SJohnson 10:46, 9 March 2009 (EDT)

- SJohnson, the way you're calculating the chi-squared statistic implies that you're testing the null hypothesis of a constant mutation rate over time against an alternative hypothesis of a mutation rate which varies over time. FredFerguson 11:02, 9 March 2009 (EDT)

## Unreferenced Claims

I deleted the claim that mean mutation generation is an appropriate test statistic because no reference was produced that backs that claim. No reference was provided to back the claim that the chi-square test p-values are always conservative, either. SJohnson 12:57, 14 March 2009 (EDT)