# Talk:Significance of E. Coli Evolution Experiments

SJohnson, your assessment, while sound in its use of the chi-squared test, is unfortunately incorrect. Monte Carlo resampling gives a more accurate p-value than the chi-squared test. You may research the literature (i.e. publications in mathematical statistics; many publications actually compare Monte Carlo with the chi-squared test) to discover that this method is commonly used in advanced statistical work and is more accurate than the chi-squared test.--Able806 17:00, 4 March 2009 (EST)

It doesn’t make sense to compare the chi-square test, which is a specific statistical hypothesis test, to Monte Carlo methods, which can be used for anything from fluid motion modeling to p-value computations. You can use Monte Carlo methods to compute the p-values of the chi-square test!
Monte Carlo methods involve the generation of random realizations. Your broad claim that Monte Carlo methods are “more accurate” than the chi-square test is obviously incorrect because the accuracy of Monte Carlo methods always depends on the number of random realizations generated. When p-values are small, Monte Carlo methods are notoriously inaccurate unless the number of realizations generated is enormous.
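A small simulation makes this concrete (an illustrative sketch, not the paper's computation; the "true" p-value and realization counts below are made up). A Monte Carlo p-value is the fraction of simulated null statistics at least as extreme as the observed one, so its standard error is sqrt(p(1-p)/N):

```python
import random

random.seed(1)

def mc_pvalue(true_p, n_realizations):
    # Each realization "exceeds" the observed statistic with probability
    # true_p under the null; the MC p-value is the exceedance fraction.
    hits = sum(1 for _ in range(n_realizations) if random.random() < true_p)
    return hits / n_realizations

true_p = 0.001  # a small p-value, as in the replay experiments
for n in (1_000, 1_000_000):
    rel_se = (true_p * (1 - true_p) / n) ** 0.5 / true_p
    print(n, mc_pvalue(true_p, n), f"relative std err = {rel_se:.0%}")
```

With 1,000 realizations the relative standard error of a p-value near 0.001 is about 100%, which is the inaccuracy referred to above.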
Which publications compare Monte Carlo to chi-square and show that the former is more accurate? Could you provide specific examples? Thanks. SJohnson 18:50, 4 March 2009 (EST)
In furtherance of SJohnson's remarks with respect to rarely occurring events, the use of the basic Monte Carlo method is plainly incorrect for modeling a rarely occurring event, as the Lenski paper did. This has long been pointed out in Flaws in Richard Lenski Study. I know evolutionists will never admit a flaw in anything promoting their pet theory, but this flaw (and others) in that paper is undeniable.
Watch how evolutionists defended obvious errors in the Lenski paper, and then realize why the Piltdown Man fraud was taught for 40 years without evolutionists admitting it was a hoax.--Andy Schlafly 09:55, 5 March 2009 (EST)
Andy, how exactly is the Monte Carlo method incorrect to use in this case? I have seen it used in publications with much smaller datasets.--Able806 10:29, 5 March 2009 (EST)
Able806, I'm interested in looking at the publications you mentioned that use Monte Carlo methods to analyze small data sets. Could you provide some examples? Thanks. SJohnson 16:41, 5 March 2009 (EST)
SJohnson, here are two papers, 1 and 2. Most are in chemistry and genetics, where the observed counts are much smaller and the MCM must be used. You can search the subject as well and find that the way Lenski performed the test is the standard for microbiological genetic analysis.--Able806 10:19, 11 March 2009 (EDT)
Those papers have nothing to do with hypothesis testing. One is an archeology paper. To be blunt, it seems like you’re just doing internet searches on “Monte Carlo” to find these links. SJohnson 10:10, 12 March 2009 (EDT)
SJohnson, actually they do. Did you read the papers? If so, you would see how they used the MCM for their analysis of small data sets, which indeed was hypothesis testing and answers your inquiry about publications that use the MCM for small-data-set analysis. If you wish, I can try to track down some actual mathematical publications; however, I am not as familiar with mathematical journals as I am with science/medical journals (not knowing which mathematical journals are acceptable). I am assuming that you have a background in math and possibly access to mathematical journals, so if you know the reputable ones I can do the leg work.
I believe the thing that needs to be looked at is whether there truly is a problem with the choice of test and, if so, what the alternative is. A Bayesian approach might be an option but seems difficult to employ for this situation.--Able806 12:36, 12 March 2009 (EDT)
Able806, you still seem to miss the point about how inappropriate the Monte Carlo method (as used in the Lenski paper) is for evaluating rarely occurring events. You need to open your mind to be productive. If you simply cling to a view that Lenski (who I don't think has any meaningful education in statistics) must somehow be right, then you're not going to make any progress in understanding the flaws.--Andy Schlafly 17:07, 5 March 2009 (EST)
Andy, you still have not answered what you find inappropriate about his use of the Monte Carlo method. I am a reasonable person, and with evidence I do have an open mind. I provided examples last week, with a working model, showing that Monte Carlo is better than the chi-square in this case. I have also shown where the chi-square was inappropriate due to the occurrence size. So if you have any evidence that the Monte Carlo method should not be used in the way Lenski used it, please present it.--Able806 10:19, 11 March 2009 (EDT)

SJohnson, I believe you just proved my point. In the literature of mean and covariance structure analysis, the non-central chi-square distribution is commonly used to describe the behavior of the likelihood ratio statistic under the alternative hypothesis; it is widely believed that the non-central chi-square distribution is justified by statistical theory. Actually, when the null hypothesis is not trivially violated, the non-central chi-square distribution cannot describe the LR statistic well even when data are normally distributed and the sample size is large. Monte Carlo results compare the strength of the normal distribution against that of the non-central chi-square distribution. In an association analysis comparing cases and controls with respect to allele frequencies at a highly polymorphic locus, a potential problem is that the conventional chi-squared test may not be valid for a large, sparse contingency table. Reliance on statistics with known asymptotic distributions is unnecessary, as Monte Carlo simulations can be performed to estimate the significance level of the test statistic.

Here is a link to a great page that provides an interactive example of why the chi-squared test would provide poor results compared to the Monte Carlo method in relation to the Lenski data workup.

Something you may have overlooked is that the data set is actually too small to use the chi-square method correctly. It is often accepted that if any cell of the analyzed data set falls under 10, then the Yates correction needs to be applied; unfortunately, the Yates correction can overcorrect, thus skewing the p-value. Lenski seemed to understand this, supporting his Monte Carlo p-value results with the Fisher z-transformation p-value.

I hope this helps.--Able806 10:27, 5 March 2009 (EST)

I’m still waiting to hear which literature says that “Monte Carlo resampling” is “more accurate than the chi-squared test”. The page mentioned above [1] is a discussion of why statisticians “fail to reject the null” rather than “accepting the null” when the p-value is above 0.05 or so. The page says nothing about superiority of Monte Carlo methods. Why were alternate hypothesis distributions mentioned? Only the null hypothesis distribution is used to calculate a p-value. Yates’s correction is for 2x2 contingency tables [2]. It doesn’t apply in this case. Finally, what the heck do “covariance structure analysis” and “allele frequencies at a highly polymorphic locus” have to do with this problem? SJohnson 16:38, 5 March 2009 (EST)
SJohnson, I am looking for this paper for you; I cited it in one of my past publications dealing with allele frequencies (I believe it came from the Duke Biostatistics group). To answer your question about allele frequencies: that is the issue at hand, more about the genetics than the math, but it is the item being studied. So you stated that Yates cannot be used, and statistics says the number of occurrences is too small to evaluate using the chi-squared test, so what would you recommend instead of the Monte Carlo method?
Regarding the "Fisher z-transformation p-value" from the paper, garbage in garbage out. If the p-values were bad to begin with, then why would a combination of them be meaningful? SJohnson 10:49, 9 March 2009 (EDT)
You are assuming that p-values are wrong based on a test that is inappropriate in this case due to data limitations. Did you perform a z-transformation on the chi-squared for the three data groups?--Able806 10:19, 11 March 2009 (EDT)
You asked about the “Fisher z-transformation p-value”. The z-transformation test and Fisher’s method are actually two different things (see Whitlock's 2005 paper - Ref. 49 in Blount et al.). But no, I haven’t tried either. SJohnson 10:10, 12 March 2009 (EDT)
There's a large literature on various kinds of Monte Carlo test, a very short summary of which is that they're inevitably more accurate than parametric tests (e.g. F, t, chi-squared) because they don't make assumptions about the distribution of the data under the null hypothesis. See for example Introduction to the Bootstrap by B. Efron and R. Tibshirani and The Jackknife, the Bootstrap and Other Resampling Plans, also by Efron. They're certainly applicable to small datasets, and their accuracy is really only limited by the number of samples you care to take. E.g. 1000 M-C samples would give you a pretty accurate idea about significance at the alpha<1% level. (That book should answer SJohnson's questions of 18:50 on 4/3/09 and 16:38 on 5/3/09 about accuracy, and Aschalfly's comment of 17:07 on 5/3/09 about appropriateness of Monte Carlo tests.) FredFerguson 16:53, 11 March 2009 (EDT)
Your claim that Monte Carlo methods are “inevitably more accurate” than other tests is obviously wrong because the accuracy of MC methods always depends on the number of realizations used. You should have written $\alpha=1\%$, not $\alpha<1\%$. If 1,000 random realizations are generated, the number of realizations above the true $\alpha=1\%$ level is binomial with mean 10 and variance about 10. Thus, the standard deviation of the MC estimate is >0.003. In this example, a Monte Carlo p-value could be off by 30% and still be within a standard deviation. Is that really “pretty accurate”?
Using one million MC realizations (as done in the paper) at the $\alpha=0.001$ level means the relative standard deviation is about 3%. The paper reported a p-value of less than 0.001 (experiment two). It wouldn’t surprise me to find out that the experiment-two p-value for the flawed test is off because only one million realizations were used. My original statement, “When p-values are small, Monte Carlo methods are notoriously inaccurate unless the number of realizations generated is enormous”, is correct. SJohnson 10:10, 12 March 2009 (EDT)
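The arithmetic above can be checked directly (a sketch; the binomial standard-deviation formula is standard, and nothing here comes from the paper):

```python
def mc_std(alpha, n_realizations):
    # Std. dev. of a Monte Carlo p-value estimate: the exceedance count
    # is Binomial(n, alpha), so the estimate count/n has this std. dev.
    return (alpha * (1 - alpha) / n_realizations) ** 0.5

# 1,000 realizations at the alpha = 1% level: std. dev. > 0.003,
# i.e. roughly 30% of the p-value itself
print(mc_std(0.01, 1_000))                 # ~0.0031
# 1,000,000 realizations at alpha = 0.001: relative std. dev. ~3%
print(mc_std(0.001, 1_000_000) / 0.001)    # ~0.032
```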
You're talking about minuscule differences in the accuracy of a test. 0.013 isn't very different from 0.007. In either case, it's very unlikely the experimenter would have obtained that result if the null hypothesis were true. If you're bothered about differences in p-values in the third decimal place (which would make you unusual!), just run more MC realisations; that's all. Not really a problem. FredFerguson 11:53, 12 March 2009 (EDT)

There’s still confusion about the difference between test statistics and Monte Carlo methods. Before you find a Monte Carlo estimate of a p-value, you need to select a test statistic to reduce the data set to a scalar. I am interested in hearing which test statistic you believe should be used in place of the chi-square test and why. SJohnson 10:10, 12 March 2009 (EDT)

Quick question for SJohnson: How many degrees of freedom did you choose when calculating the p-value? I'd like to know upon what condition you base that number. Thanks.--Argon 11:05, 5 March 2009 (EST)

The degrees of freedom for a contingency table are rows minus one times columns minus one, that is, (r − 1)(c − 1). Here’s a pretty good tutorial I came across: [3]. For the experiments from [4], the DOFs are 11, 11, and 13. For experiment one, the chi-square test statistic is
$X^2 =\sum\limits_i\sum\limits_j \frac{\left(n_{i,j}-E\left[n_{i,j}\right]\right)^2} {E\left[n_{i,j}\right]}$
$=\frac{\left(0-1/3\right)^2}{1/3} +\frac{\left(6-17/3\right)^2}{17/3} +\frac{\left(0-1/3\right)^2}{1/3} +\ldots +\frac{\left(2-1/3\right)^2}{1/3} +\frac{\left(4-17/3\right)^2}{17/3}$
$\approx 14.82$
where $n_{i,j}$ is the observed count and $E\left[n_{i,j}\right]$ is the expected value under the null hypothesis. So if you have MS Excel, another way to arrive at the p-value of 0.19 is to type “=CHIDIST(14.82,11)” into a cell. Cheers! SJohnson 16:38, 5 March 2009 (EST)
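For readers without Excel, the same tail probability can be reproduced by numerically integrating the chi-square density (a stdlib-only sketch; a library call such as scipy.stats.chi2.sf(14.82, 11) would give the same answer directly):

```python
import math

def chi2_pdf(x, k):
    # Density of the chi-square distribution with k degrees of freedom
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def chi2_sf(x, k, upper=200.0, n=100_000):
    # P(X > x): trapezoidal integration of the density from x out to a
    # cutoff far in the tail (the mass beyond `upper` is negligible here)
    h = (upper - x) / n
    total = 0.5 * (chi2_pdf(x, k) + chi2_pdf(upper, k))
    for i in range(1, n):
        total += chi2_pdf(x + i * h, k)
    return total * h

print(round(chi2_sf(14.82, 11), 2))  # 0.19, matching CHIDIST(14.82,11)
```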
OK, thanks for the info. From what I'd calculated and looked up in tables, the numbers seemed close to a df=11 for a chi-square of ~14. (Aside: With terms having 17/3 in the denominator in the figures above, were you using the test of independence? I was using Pearson's test for fit of a distribution which returns a chi-squared value of 14 and roughly matched the p-values you reported, assuming the df was 11).
Also, the first sentence of the article reads: "Blount, Borland, and Lenski[1] claimed that a key evolutionary innovation was observed during a laboratory experiment. That claim is false." A small correction: There were several claims in the paper. The 'key evolutionary innovation' was acquiring the ability to utilize citrate as a food source. That claim was demonstrated multiple times. The claim which pertains to this statistics discussion was that the Cit+ phenotype arose in a multi-step process, first requiring a rare, pre-adaptive mutation before additional mutation(s) led to the subsequent development of citrate utilization.--Argon 20:46, 5 March 2009 (EST)
My biology-degreed wife assures me that mutation does not necessarily mean that evolution occurred. What the paper claimed is that evolution (a “key innovation”) occurred in the lab. The key innovation supposedly increased the mutation rate. In the experiments, the observed mutation rate increased after generation 31,000, but not enough to make a statistically significant claim that the rate is not constant. The analysis in the paper was similar to flipping a coin ten times, counting six heads and claiming that the coin must be biased against tails. In reality, there’s nothing surprising about a fair coin producing slightly more of one outcome than the other. Just like there's nothing surprising about there being slightly more mutations in later generations than early generations given the null hypothesis (constant mutation rate). SJohnson 10:46, 9 March 2009 (EDT)
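The coin analogy can be made exact (a quick check of the analogy's arithmetic, not anything from the paper): the chance that a fair coin gives six or more heads in ten flips is about 38%, nowhere near significance.

```python
from math import comb

# P(6 or more heads in 10 fair flips), one-sided
p = sum(comb(10, k) for k in range(6, 11)) / 2 ** 10
print(round(p, 3))  # 0.377
```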

>>Inserting a later comment first<< SJohnson, the paper's title is "Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli". As I mentioned earlier, the key innovation is the evolution of the Cit+ phenotype, not the timing or rate of its acquisition. And yes, it *is* evolution (call it microevolution, if you wish). Blount et al. went on to speculate how this evolutionary innovation arose and proposed the historical contingency hypothesis, in which 'pre-adaptive' mutations were required before the Cit+ phenotype developed. It is only this latter hypothesis that you are attempting to address with your chi-square analysis, not the fact that Cit+ mutants arose (which is the evolutionary innovation).--Argon 21:57, 18 March 2009 (EDT)

SJohnson, not to say anything about your wife, but has she had a 400-level molecular genetics course (most general biology degrees do not cover the detail unless they are specialized)? If so, she would have mentioned that if a mutation passes to the offspring and is selectively beneficial to the population, then it is a step of evolution, as long as the conditions continue through the sharing of the mutation with the population and the environment is such that it reduces the growth rate of the non-transformed population. While not all mutations are signs that evolution occurred, mutations that pass to offspring and provide a benefit compared to other offspring are very strong indicators. In the case of this paper, the population that evolved Cit+ was able to metabolize a chemical in its environment, which allowed for an adaptive advantage compared to the non-transformed colonies.--Able806 10:19, 11 March 2009 (EDT)

Let’s go back to the beginning. There appears to be confusion about the difference between test statistics and methods for computing p-values. As is noted at the beginning of the page [5], the fundamental problem with the paper is that it used a flawed test statistic, not that it used Monte Carlo methods to find the p-value for that flawed statistic.

Every hypothesis test uses a test statistic to reduce the data to a single number. The p-value for the test statistic can be calculated analytically (as I’ve done for the chi-square test statistic) or by Monte Carlo methods. In the paper, Monte Carlo methods were used to compute the p-value of the “mutation generation” test statistic. The key problem with the analysis from the paper is that it doesn’t work to use a weighted average to test for variations in mutation rate. This is like trying to use the sample variance to test for an increase in the mean in Gaussian-distributed data. A statistic should be selected based on the null and alternate hypothesis distributions of the data. The chi-square test (unlike the weighted average from the paper) is a reasonable choice for data that mutates at a constant rate under the null hypothesis, but mutates at varying rates under the alternate hypothesis.
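To illustrate the distinction with made-up numbers (this is not the paper's data): first a test statistic reduces the data to a scalar, then Monte Carlo estimates that statistic's p-value under the null. Here the statistic is the chi-square statistic and the null is a uniform mutation rate across generation bins:

```python
import random

random.seed(2)

def chi2_stat(observed, expected):
    # The test statistic: reduces the whole table to one number
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mc_pvalue(observed, expected, n_reps=20_000):
    # Monte Carlo p-value: fraction of simulated null data sets whose
    # statistic is at least as extreme as the observed one
    total = round(sum(expected))
    probs = [e / sum(expected) for e in expected]
    obs_stat = chi2_stat(observed, expected)
    exceed = 0
    for _ in range(n_reps):
        counts = [0] * len(expected)
        for _ in range(total):  # one multinomial draw under the null
            r, acc = random.random(), 0.0
            idx = len(probs) - 1
            for i, p in enumerate(probs):
                acc += p
                if r < acc:
                    idx = i
                    break
            counts[idx] += 1
        if chi2_stat(counts, expected) >= obs_stat:
            exceed += 1
    return exceed / n_reps

# hypothetical: 12 mutants over six generation bins, null = equal rates
observed = [0, 1, 1, 2, 3, 5]
expected = [2.0] * 6
print(mc_pvalue(observed, expected))
```

Any scalar-valued statistic could be plugged into `mc_pvalue` in place of `chi2_stat`, which is why the choice of statistic, not the Monte Carlo machinery, is the substantive issue.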

Able806, you made a good point about the contingency table cell frequencies being relatively low, but you were wrong when you said ”the data set is actually too small to use the chi square method correctly”. In the low-cell-frequency case the chi-square test is still effective, but the null hypothesis distribution of the chi-square statistic starts to look less like the chi-square distribution, so p-values calculated using the chi-square distribution may be a bit off. Monte Carlo p-values are always imperfect as well, because it is impossible to generate an infinite number of random realizations; there are imperfections in p-values generated by both analytic and Monte Carlo methods. However, low cell frequencies do not explain the >20x and >2.5x differences between the chi-square p-values and the p-values from the paper for experiments one and three. The reason for those huge differences was the use of the flawed test statistic (“mutation generation”) in the paper. SJohnson 16:38, 5 March 2009 (EST)

SJohnson, the chi-squared test is a valuable statistical tool, but the limitations of the test must be acknowledged. The chi-squared test can only produce valid results if the assumptions that underlie the test are not violated. As an analogy, Newtonian models of motion fail to produce accurate results as velocities approach the speed of light; under those circumstances one must switch to a theory that accounts for relativistic effects.
It seems that you have simply dismissed the widely acknowledged fact that the chi-squared test is inappropriate in situations where n in any cell is less than a threshold number. Different authors set different thresholds, but all are well above the numbers seen in your chi-squared analysis: even the most liberal guidelines advise against the chi-squared test when any expected cell frequency is less than one or more than 20% of the table cells are less than 5; others require that the expected values in all cells be more than 5. With smaller amounts of data, the test is insensitive and errs on the side of failing to reject the hypothesis. If you attempt your chi-squared statistical analysis with a program that is more sophisticated than MS Excel (as I did), you get an error message indicating that the results are invalid due to low expected cell counts.
That issue aside, there are other reasons that the chi-squared test is inappropriate here. As the links above point out, the categories tested must be truly independent; one example is that you can't use the chi-squared test to compare age and ability to kick a field goal by testing the same experimental group twice, one year apart; you have to test one group of age A and a different group of age B. In the case of the Blount paper, the categories are not independent. Even if there were adequate numbers to address the low-expected-frequency problem, this would make the chi-squared an invalid test in this case.
There are other significant problems with the use of the chi-squared test in this circumstance, but they can wait until you address these first major problems.--ElyM 12:18, 11 March 2009 (EDT)
Wackerly et al. says in general it’s assumed that the cell frequencies are above five so that the chi-square statistic (under the null) is approximately chi-square distributed (see p. 703). That book does not say chi-square test results are invalid if frequencies are five or less. Your example of a chi-square test warning message (it said "warning" not "error" as you stated) in Minitab [6] said “approximation probably invalid” referring to the chi-square distribution approximation to the chi-square test statistic’s distribution. Your example did not say “chi-square test invalid”. I agree that when cell frequencies are low, the chi-square test statistic’s distribution starts to deviate from the chi-square distribution. I maintain that this deviation is not enough to explain the >2.5x and >20x differences in the chi-square test p-values and the p-values from the paper.
As the numerous links in your post proved, the chi-square test is widely-used by statisticians. Can you give examples of statisticians using mean mutation generation as a test statistic? Also, did your software agree with the chi-square test p-values I presented? Thanks. SJohnson 10:10, 12 March 2009 (EDT)
Thank you for giving page references for Wackerly; however it seems we have different editions, since page 703 in my copy (5th ed, 1996) does not deal with chi-squared issues at all. My copy does state the following, on page 622: "Although the mathematical proof is beyond the scope of this text, it can be shown that, when n is large [chi-squared] will possess approximately a chi-square probability distribution in repeated sampling." Then, on page 624: "Experience has shown that cell counts [n sub i] should not be too small in order that the chi-square distribution provide an accurate approximation to the distribution of [chi squared]. As a rule of thumb we require that all expected cell counts equal or exceed 5, although Cochran (1952) has noted that this value can be as low as 1 for some situations." Wackerly then goes on, in the problems sections, to describe the use of the chi-squared test as a "violation of good statistical practice" when "some expected counts [are] <5."
It seems that you are already aware that the [chi-square] statistic under the null is no longer chi-square distributed for small n; this is precisely why the test should not be used under those conditions. I can claim to be able to accelerate a 1-kg mass to 10 times the speed of light by applying 1 N of force for 95 years by using F=ma and t= (vf-vi)/a. Plugging the numbers into those equations will produce the same result every time, but the answer is illegitimate because those equations are only valid under certain assumptions, which are violated as velocities approach the speed of light. Similarly, having a statistical program calculate a chi-squared value given the Blount data will produce a number result, but since the assumptions of the test are violated the result is not legitimate. Yes, if I put the Blount data in SAS 9.2, I get the same numerical answer as you do, but I also get the following message: "WARNING: >89% of the cells have expected counts less than 5. Chi-square may not be a valid test." You may argue that that's a warning, not an error; that's a semantic distinction. The reason that the program says that it MAY not be valid is that the chi-squared test skews in the direction of being too conservative at low n values; the test has an acceptable rate of false positives but an unacceptably high rate of false negatives. Comparing the results of the Monte Carlo and chi-squared results in this case is like comparing the results of Newtonian and relativistic equations of motion: they can produce very different results from the same input data.
For a finite amount of data, the chi-square statistic is never chi-square distributed under the null. The p-values are always approximate regardless of cell frequencies. The approximation becomes more accurate as the amount of data increases, but I don’t believe that this inaccuracy will change p-values that are about 0.2 (for experiments 1 and 3) into statistically significant p-values. How much do you expect the p-values to change if an exact computation is used in place of the chi-square distribution approximation? SJohnson 20:49, 18 March 2009 (EDT)
Your last paragraph has a major non sequitur in it: yes, many statisticians use the chi-square test. As long as the assumptions of the test are not violated, it is a valuable tool. That has nothing to do with the validity of using mean mutation generation as a test statistic. 'Mean number of werewolf attacks in Mumbai in the week centered on the new moon, by month, from 1654 to 1798' is a valid test statistic. I am quite sure that it has never been used in a peer-reviewed paper before. That does not mean that I can't perform valid statistical tests on that statistic. If, however, the incorrect test is applied, the results of the analysis will be flawed. Papers apply a (relatively small) standard repertoire of valid tests to a (potentially infinite) number of test statistics. The particular test statistic used in a paper may never have been used before and may never be used again; that does not address the validity of the analysis. In Blount's case, the test is the Monte Carlo analysis, which is also "widely-used by statisticians".
There are an infinite number of ways to reduce a data set to a single number. However, it’s foolish to think every method would be effective. I gave an example of a flawed test statistic in an earlier post [7]. Another example of a flawed test statistic is the one used in the paper because it does not always detect deviations from the null hypothesis (see: Significance of E. Coli Evolution Experiments#Test Statistics).
Test statistics are typically derived. The likelihood ratio test is a common method used to derive them. The chi-square test for independence is an approximation to the LRT. Where is the derivation saying that mean mutation generation is an appropriate test statistic for this problem? SJohnson 20:49, 18 March 2009 (EDT)
We still haven't touched on the issue of the categories not being independent, which by itself is sufficient to invalidate the chi-squared technique. I'm new to this site, so I'm unsure as to the etiquette of making changes to the articles of another person - but the article here should at the very least mention that the chi-square test is being used here in a manner that violates its underlying assumptions in at least two fundamental ways, and the results are therefore suspect.--ElyM 17:34, 12 March 2009 (EDT)
When generating random realizations of experiment outcomes, the authors assumed that the total number of mutants was fixed. Thus the paper assumed the numbers of mutants per generation are statistically dependent. Does this seem like a realistic model, or do you think that if the experiments were recreated that the total number of mutants could vary? For example, if experiment one were recreated, would the total number of mutants always be exactly four? SJohnson 20:49, 18 March 2009 (EDT)
It looks to me as though SJohnson has misinterpreted the application of the chi-squared test in quite a fundamental way. His/her analysis of Blount's data are therefore close to meaningless, regardless of whether the test used by Blount is appropriate or not. In my opinion, the entire page should therefore be deleted. FredFerguson 08:18, 13 March 2009 (EDT)
"Fred", perhaps you mistakenly think this is Wikipedia, where censorship and deletion of pages for ideological reasons are common. Not here.--Andy Schlafly 10:23, 14 March 2009 (EDT)
Umm... I'm suggesting deletion for mathematical reasons, not ideological reasons. Using an argument filled with mathematical errors to try to support your case only detracts from your credibility. FredFerguson 10:38, 14 March 2009 (EDT)
Actually, I think correction is better than deletion. So that's what I've done. FredFerguson 11:01, 14 March 2009 (EDT)
I find no credibility in your denial of having ideological reasons.--Andy Schlafly 11:04, 14 March 2009 (EDT)

## Misinterpretation of test

SJohnson, your analysis misinterprets the test. You say the null hypothesis is that this mutation cannot happen. They saw a mutation (4 mutations, in fact, in the data set you show), so the null hypothesis (as you state it) is disproved. That's perfectly straightforward.

I don't know what the "mean mutation generation" test is, but what you're doing when you apply a chi-squared test to this dataset is testing whether the mutations are evenly distributed throughout the generations. Your test says they are, so there's no strong evidence to suppose that mutations are likely to occur in one generation rather than another in the series of tests. Blount's test says they aren't, so it's more likely that the mutation will occur later in the series of tests. I can't tell which test is right without knowing more about the test that Blount used.

But that point (the foregoing paragraph) has no bearing at all on the null hypothesis, as you describe it. The mutation appeared, so that means the hypothesis that the mutation can't happen is disproved. Very simple. FredFerguson 21:10, 8 March 2009 (EDT)

I never said that “the null hypothesis is that this mutation cannot happen”. The chi-square test statistic I'm using wouldn’t be defined if the null hypothesis mutation rate was zero because the $E\left[n_{i,j}\right]$ term in the denominator of the statistic (see above equation) would be zero.
The test statistic from the paper is the average of the generation numbers of observed mutations. For experiment one this number is
$\frac{1}{4}\left(30500+31500+2\times32500\right)= 31750.$
The same number is shown in Table 2 of the paper. SJohnson 10:46, 9 March 2009 (EDT)
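That statistic is simple enough to verify (a one-line check of the arithmetic above):

```python
# Mean generation of the four observed mutants in experiment one
generations = [30500, 31500, 32500, 32500]
mean_generation = sum(generations) / len(generations)
print(mean_generation)  # 31750.0
```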
SJohnson, the way you're calculating the chi-squared statistic implies that you're testing the null hypothesis of a constant mutation rate over time against an alternative hypothesis of a mutation rate which varies over time. FredFerguson 11:02, 9 March 2009 (EDT)

As it currently stands, the article makes the following statement: "The expected outcomes under the null hypothesis (no evolutionary innovation occurs) are also shown." This misstates the null hypothesis of the paper, which is elaborated in the Introduction section of the paper, and repeated in the section Statistical Analysis of the Replay Experiments:

"For each experiment, we compared the observed mean generation of those clones that yielded Cit+ variants to the mean expected under the null hypothesis that clones from all generations have equal likelihood. The null thus corresponds to the rare-mutation hypothesis laid out in the Introduction."
[1]

The article also continues to describe 'mean mutation generation' as a test rather than a statistic to which the Monte Carlo test was applied.--ElyM 17:24, 14 March 2009 (EDT)

I'm a bit confused about why the chi-squared test, which we're told compares the results to a null hypothesis of a constant mutation rate, seems insensitive to the generations in which the Cit+ mutations are found. Instead, the chi-square test seems only to be evaluating whether the frequencies of Cit+ mutations in any particular generation are 'expected'. Thus the test is asking whether finding a distribution (e.g. in the first experiment) across nine periods with no mutations, two periods with one mutation, and one period with two mutations is a statistically significant deviation from what you'd expect if the mutations were randomly distributed. The number returned from the function is the same regardless of the order of the Cit+ results. But the number of mutations per bin is not the only question being asked; it's the order and temporal distribution of Cit+ mutants that the analyses probably need to confront. It's not whether one can get nine no-mutant, two single-mutant and one double-mutant results; it's a matter of when they occur and whether that distribution affects the significance of the results. Blount's hypothesis is that mutations should appear later in the experiment. When formulating a suitable null hypothesis, wouldn't one want to take the timing of the Cit+ mutants into consideration too?--Argon 22:25, 18 March 2009 (EDT)

## Unreferenced Claims

I deleted the claim that mean mutation generation is an appropriate test statistic because no reference was produced that backs that claim. No reference was provided to back the claim that the chi-square test p-values are always conservative, either. SJohnson 12:57, 14 March 2009 (EDT)

The reference is Everitt. I'll check I put it in the right place. FredFerguson 13:30, 14 March 2009 (EDT)

There was a typo in my edit summaries on the talk page and the main page. I meant to say "Removed unsupported claims" rather than "Removed supported claims". SJohnson 13:13, 14 March 2009 (EDT)

I do not believe that anyone has claimed that 'chi-square test p-values are always conservative'. The claim that has been made is that under certain circumstances, namely low n and low individual cell values, the chi-square test is an invalid test; that under those circumstances the power of the test is low and it becomes impossible to reject the null hypothesis even when it is false. You may have missed the pertinent sections in my links above, so I will directly quote the relevant sections. All the quoted sections refer to chi-square testing in particular. Any bolding below is mine.

This edit claimed that chi-square test p-values are conservative, but didn't back that claim with a reference: [8]. SJohnson 20:49, 18 March 2009 (EDT)
"Assumptions:
Even though a nonparametric statistic does not require a normally distributed population, there still are some restrictions regarding its use.
1. Representative sample (Random)
2. The data must be in frequency form (nominal data) or greater.
3. The individual observations must be independent of each other.
4. Sample size must be adequate. In a 2 x 2 table, Chi Square should not be used if n is less than 20. In a larger table, no expected value should be less than 1, and not more than 20% of the variables can have expected values of less than 5.
5. Distribution basis must be decided on before the data is collected.
6. The sum of the observed frequencies must equal the sum of the expected frequencies."
"Assumptions:
• Random sample data are assumed. As with all significance tests, if you have population data, then any table differences are real and therefore significant. If you have non-random sample data, significance cannot be established, though significance tests are nonetheless sometimes utilized as crude "rules of thumb" anyway.
• A sufficiently large sample size is assumed, as in all significance tests. Applying chi-square to small samples exposes the researcher to an unacceptable rate of Type II errors. There is no accepted cutoff. Some set the minimum sample size at 50, while others would allow as few as 20. Note chi-square must be calculated on actual count data, not substituting percentages, which would have the effect of pretending the sample size is 100.
• Adequate cell sizes are also assumed. Some require 5 or more, some require more than 5, and others require 10 or more. A common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells with zero count. When this assumption is not met, Yates' correction is applied.
• Independence. Observations must be independent. The same observation can only appear in one cell. This means chi-square cannot be used to test correlated data (ex., before-after, matched pairs, panel data).
• Similar distribution. Observations must have the same underlying distribution.
• Known distribution. The hypothesized distribution is specified in advance, so that the number of observations that are expected to appear in each cell in the table can be calculated without reference to the observed values. Normally this expected value is the crossproduct of the row and column marginals divided by the sample size.
• Non-directional hypotheses are assumed. Chi-square tests the hypothesis that two variables are related only by chance. If a significant relationship is found, this is not equivalent to establishing the researcher's hypothesis that A causes B, or that B causes A.
• Finite values. Observations must be grouped in categories.
• Normal distribution of deviations (observed minus expected values) is assumed. Note chi-square is a nonparametric test in the sense that it does not assume the parameter of normal distribution for the data -- only for the deviations.
• Data level. No assumption is made about level of data. Nominal, ordinal, or interval data may be used with chi-square tests."
"Assumptions:
-None of the expected values may be less than 1
-No more than 20% of the expected values may be less than 5"
[4]
"When performing a chi-square test, your data must satisfy important assumptions. Although these assumptions may be stated differently in different textbooks, they generally assert that:
1)The sample must be randomly drawn from the population
2)The sample size, n, must be large enough so that the expected cell count in each cell is greater than or equal to 5.
Both assumptions must be met in the process of collecting your data, and violations of the second assumption will appear in the Minitab output when you run the analysis.
...
You may wonder why the second assumption is necessary for performing the chi-square test. The second assumption arises because the distribution of counts under the null hypothesis is multinomial, and the normal distribution can be used to approximate the multinomial distribution if the sample size is sufficiently large and the probability parameters aren't too small. It can be shown via the Central Limit Theorem that the multinomial distribution converges to the normal distribution as the sample size approaches infinity; however, there is no easy way to show mathematically how and when the convergence fails."
"The chi-square test is simpler to calculate but yields only an approximate P value. ... You should definitely avoid the chi-square test when the numbers in the contingency table are very small (any number less than about six)."
[6]
"The most important things to remember to get a valid χ2 test are that the expected values are not too small in any bin (certainly 5 or more), and that the degrees of freedom are properly evaluated. Unless you have a very large amount of data, the test is not very sensitive and errs on the side of safety. If you get a significant result, however, it is not likely to be wrong."
[7]
"The critical assumptions of the chi-square test for k independent samples are similar to those for the chi-square test for two independent samples.
...
4. No more than 20% of the cells may have expected frequencies of less than 5, and no cell should have an expected frequency of less than 1.
The rule given in Assumption 4 is particularly important for a contingency table that is larger than 2X2"
[8]
"Special problems with small expected cell frequencies for the chi-square test:
The chi-square test involves using the chi-square distribution to approximate the underlying exact distribution. The approximation becomes better as the expected cell frequencies grow larger, and may be inappropriate for tables with very small expected cell frequencies.
For tables with expected cell frequencies less than 5, the chi-square approximation may not be reliable. A standard (and conservative) rule of thumb (due to Cochran) is to avoid using the chi-square test for tables with expected cell frequencies less than 1, or when more than 20% of the table cells have expected cell frequencies less than 5.
Another rule of thumb (due to Roscoe and Byars) is that the average expected cell frequency should be at least 1 when the expected cell frequencies are close to equal, and 2 when they are not. (If the chosen significance level is 0.01 instead of 0.05, then double these numbers.)
Koehler and Larntz suggest that if the total number of observations is at least 10, the number of categories is at least 3, and the square of the total number of observations is at least 10 times the number of categories, then the chi-square approximation should be reasonable.
Care should be taken when cell categories are combined (collapsed together) to fix problems of small expected cell frequencies. Collapsing can destroy evidence of non-independence, so a failure to reject the null hypothesis for the collapsed table does not rule out the possibility of non-independence in the original table.
As with most statistical tests, the power of the chi-square test increases with a larger number of observations. If there are too few observations, it may be impossible to reject the null hypothesis even if it is false."
[9]--ElyM 17:24, 14 March 2009 (EDT)
Thanks for this really excellent contribution, ElyM. The only thing I'd like to add is in relation to your initial statement, "I do not believe that anyone has claimed that 'chi-square test p-values are always conservative'". The question of whether a test is conservative in a particular situation is probabilistic. One can determine whether a test is likely to generate a p-value which is too high in a particular situation (e.g. for a chi-squared test, when there are lots of small expected values) but one needs an exact test (such as an appropriate Monte Carlo randomisation test) to determine whether the p-value in any particular test is in fact excessively high.
This edit also claimed that chi-square test p-values are conservative, but didn't back that claim with a reference: [9]. SJohnson 20:49, 18 March 2009 (EDT)
I hope careful reading of your very clear description will put SJohnson's mind at rest on this subject. FredFerguson 18:11, 14 March 2009 (EDT)

ElyM, you've provided nothing to address the basic flaw that "The paper incorrectly applied a Monte Carlo resampling test to exclude the null hypothesis for rarely occurring events." See Flaws in Lenski Study. Also, do not impose your view on the content page until after SJohnson has had an opportunity to respond to your posting. As to "Fred", his put-downs are getting tiresome and I'm going to review his edit pattern now to see if he's been contributing anything of value to this site.--Andy Schlafly 14:06, 15 March 2009 (EDT)

Mr. Schlafly, per your request I have not added anything to the content page as SJohnson has not yet responded to my posts. Since all of my comments have been in regard to SJohnson's use of the chi-square test in this particular article, I'm not sure why you expect me to address Blount's use of Monte Carlo - that issue seems to be addressed on the Flaws in Lenski Study page. SJohnson has added a reformulation of the chi-square test for two possible outcomes, and stated that the chi-square test is at a minimum when all success probabilities are equal. He then extrapolates from this to claim that the chi-square test is an effective test for the data from Blount.
The reformulation of the equations for two possible outcomes does not address the underlying problem that the chi-square test has universally accepted parameters outside of which it is considered an invalid test; I have provided references for these parameters and shown that the data from Blount lies outside them. None of the expected cells in SJohnson's analysis have values above one, and the total n is four. SJohnson's own reference states that the application of the chi-square test in this circumstance is a "violation of good statistical practice". Analogously, combining F=ma and t=(vf-vi)/a into t=(vf-vi)m/F and showing that t is a minimum when m approaches zero does not address the fact that those Newtonian equations do not apply as velocities approach the speed of light. The legitimacy of Blount's arguments cannot be determined by the application of illegitimate counterarguments. If SJohnson or others can point to references from the statistical literature that show that Blount has made methodological errors - as I have been able to do with SJohnson's chi-square analysis - I would welcome their input, and no doubt Conservapedia's other readers would as well, and this page would be greatly improved.
I have not seen a rebuttal from SJohnson in the four days since my last post, although he has added new material to the content page since then. In light of this, I would appreciate some guidelines as to when it is appropriate for me to add my information and references to the content page. I can add citations from the primary mathematical literature if necessary, but in general I find that these are less helpful as they are not easily accessible by readers without access to academic libraries.--ElyM 18:07, 18 March 2009 (EDT)
You say, "I'm not sure why you expect me to address Blount's use of Monte Carlo." The reason is obvious: the title of the content page is the "Significance of E. Coli Evolution Experiments." You haven't addressed the inappropriateness of using Monte Carlo simulations for assessing the significance of rarely occurring events, which was central to Lenski's statistical claims. I suggest you address this flaw if you want to be taken seriously.--Andy Schlafly 23:12, 18 March 2009 (EDT)
ASchlafly, again per your request, and based on your statement regarding the "inappropriateness of using Monte Carlo simulations for assessing the significance of rarely occurring events", I have spent the last several days reviewing the literature available to me on Monte Carlo and other resampling techniques, looking for ways in which Blount may have made a methodological error of the sort that SJohnson has made. I have been unable to find any examples of authors suggesting that Monte Carlo be avoided for low n, or for events with low probability regardless of n, much less providing specific cutoff numbers as are seen in the references that I provided for the chi-square test. Similarly, the technique that Blount used does not require/assume that categories are unrelated, as the chi-square test does. Of course, the absence of evidence is not evidence of absence, and I may have misinterpreted the basis of your objection. At this point I'll need you to explain your objection in more detail if you wish me to find the appropriate literature addressing your concerns. Do you believe that the number of resamplings was too low in Blount's paper? That the analysis should have been performed with a software package other than Statistics101? Some other procedural issue? Some issue of interpretation?
The statistical problem that Blount must address is straightforward: given a distribution of mutant cultures that appears to be skewed toward the higher generations, what is the probability that this same amount of skew (or a greater degree) could arise by chance, given the null hypothesis that every generation is equally likely to produce a mutant? Interestingly, in the case of the first replay experiment, the total number of ways to randomly select (equal probability, no replacement) four cultures from seventy-two is 72x71x70x69, or 24,690,960. This number is small enough that a program can brute-force-calculate the 'mean generation number' of all possible combinations of four cultures in a reasonable amount of time. An experimentally-derived 'mean generation number' can be checked against this exhaustive list, and the number of means equal to or larger than the experimental mean can be found exactly. Converting this number to a percentage of 24,690,960 provides an exact p-value for any given experimental 'mean generation number'. This exhaustive approach is different from the Monte Carlo technique, in that all possible outcomes are examined, rather than a random subset of all possible outcomes. For the first replay experiment, it provides a way to independently check Blount's Monte Carlo results. This approach is not possible for the second and third replay experiments, in which the total number of possible combinations becomes impractically large: 340!/335! = 4.41 x10^12 and 2800!/2792! = 3.74 x 10^27, respectively.
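The exhaustive approach described above can be sketched in a few lines. This is only an illustration: the generation labels in the worked example are hypothetical placeholders, since the actual generation of each of the 72 clones is not given in this discussion. Note that enumerating unordered combinations gives the same p-value as counting the 72x71x70x69 ordered selections, because each unordered set of four corresponds to exactly 24 orderings (1,028,790 x 24 = 24,690,960).

```python
from itertools import combinations

def exact_p_value(gens, k, observed_mean):
    """Fraction of all k-culture draws (without replacement) whose mean
    generation is >= the observed mean.  Duplicate generation labels are
    handled correctly because combinations() works over positions."""
    total = hits = 0
    for combo in combinations(gens, k):
        total += 1
        if sum(combo) / k >= observed_mean:
            hits += 1
    return hits / total

# Tiny hypothetical example: 4 clones, pick 2, observed mean 3.5.
# Of the six possible pairs, only (3, 4) has mean >= 3.5, so p = 1/6.
print(exact_p_value([1, 2, 3, 4], 2, 3.5))  # 0.1666...
```

With the real data, `gens` would hold the generation label of each of the 72 clones, and the loop would visit all 1,028,790 four-culture combinations, which runs in seconds.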
I asked a colleague to run just such a brute-force program for me on the first replay data. I also ran several Monte Carlo simulations (not using Statistics101) with Blount's data, using twenty-five million, one hundred million, and 493,819,200 resamplings - note that this last is twenty times the number of all possible combinations of 4 samples drawn without replacement from 72. The p-values from the 25M, 100M, and 493M Monte Carlo resamplings (0.00844, 0.00846, and 0.00846, respectively) compare favorably with Blount's 1M value of 0.0085 and the non-Monte-Carlo brute-force exact calculation, which provides a p-value of 0.008457. Thus it appears that Blount's statistical results are confirmed by a non-Monte Carlo technique, at least for the first replay experiment.
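For comparison with the exhaustive method, here is a hedged sketch of the Monte Carlo resampling test described above: repeatedly draw k cultures without replacement under the null hypothesis that every clone is equally likely to yield a mutant, and count how often the resampled mean generation meets or exceeds the observed mean. The 72-clone generation list and the observed mean of 60 below are hypothetical placeholders, not Blount's data.

```python
import random

def monte_carlo_p(gens, k, observed_mean, n_trials, seed=0):
    """Monte Carlo estimate of P(mean of k draws >= observed_mean)
    under equal-likelihood sampling without replacement."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(n_trials):
        if sum(rng.sample(gens, k)) / k >= observed_mean:
            hits += 1
    return hits / n_trials

# Hypothetical stand-in: 72 clones labeled 1..72, 4 mutants,
# observed mean generation 60.
gens = list(range(1, 73))
print(monte_carlo_p(gens, 4, 60.0, 100_000))
```

On a small case where the exact answer is known (pick 2 from [1, 2, 3, 4], observed mean 3.5, exact p = 1/6), the estimate converges to the exact value as the number of resamplings grows, which is the same check being performed above against the brute-force result.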
What do you get for the experiment two p-value using the method from the paper and at least ten million realizations? SJohnson 08:48, 25 March 2009 (EDT)
For the second replay experiment, Blount reports that one million resamplings gives a p-value of 0.0007. When I run the Monte Carlo simulations, ten million resamplings give a p-value of 0.00060; one hundred million resamplings give a p-value of 0.00062, and one billion resamplings give a p of 0.00061.
I got 0.0006 using ten million realizations and the flawed test statistic. The paper had 0.0007. The authors obviously didn't use enough Monte Carlo realizations. I'm going to add this to the list of flaws in the paper. [10] SJohnson 08:51, 26 March 2009 (EDT)
As to the brute-force method for the second replay: the 4.41x10^12 combinations of five cultures picked from 340 actually represents 'only' 36.8 billion unique combinations, since for the purposes of calculating a mean generation value, the ordering of the cultures does not matter: 0, 0, 0, 0, 10 gives the same mean as 10, 0, 0, 0, 0 and 0, 10, 0, 0, 0. With brute force, it turns out that out of the 36,760,655,568 unique combinations possible in the second replay, 22,536,306 have means that are greater than or equal to 32,100.
22,536,306 / 36,760,655,568 = 0.000613 = the exact p-value derived from exhaustive evaluation rather than Monte Carlo.
The third replay has 9.27 x 10^22 unique combinations; at a billion comparisons a minute it would take over 170,000,000 years to check them all.--ElyM 12:36, 25 March 2009 (EDT)
----
Here are pointers to the freely available Statistics 101 package [10] and the actual programs run through the package by Blount et al. [11]. The stats package is written in Java and should run under many operating systems. A 10 million trial run of the second experiment took a bit of time and yielded a p-value of 0.00061. Ten separate, one-million trial runs produced an average p-value of 0.00061 (std.dev=0.00002, n=10). Even with trial sizes of 5K, the numbers averaged about 0.0006 (std.dev=0.0004 n=10).--Argon 20:52, 25 March 2009 (EDT)
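The run-to-run spread reported above can be checked against theory: a Monte Carlo p-value estimate is a binomial proportion, so its standard error is sqrt(p(1-p)/N), where N is the number of resamplings. A quick sketch using p ≈ 0.0006 from the figures above (the helper function is my own):

```python
import math

def mc_standard_error(p, n_trials):
    """Standard error of a Monte Carlo (binomial) estimate of p."""
    return math.sqrt(p * (1 - p) / n_trials)

# Trial sizes matching the runs reported above.
for n in (5_000, 1_000_000, 10_000_000):
    print(n, mc_standard_error(0.0006, n))
```

The theoretical values (about 0.00035 at 5K trials and 0.000024 at 1M trials) line up with the empirical standard deviations quoted above (0.0004 and 0.00002), which is one way to judge whether a given number of realizations is "enough" for the precision being claimed.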
My intention is not to get caught up in a digression about Monte Carlo, though - I'd rather keep the focus on the fact that the main article should acknowledge that SJohnson is using chi-square in a way that violates accepted guidelines; this remains true whether Blount's analysis is valid or not.--ElyM 12:13, 23 March 2009 (EDT)

## Caveat

If SJohnson can provide citations to authors who support the use of chi-square where all expected cell counts are less than one, or where categories are not independent, I look forward to evaluating them.--ElyM 12:35, 27 March 2009 (EDT)

Wackerly et al. does not say to avoid the test because of low cell frequencies. You're still making a false claim that p-values are always high if cell frequencies are low. The last paragraph you added is just your opinions about the test being inappropriate. Modeling each trial as a statistically independent Bernoulli trial is reasonable. Thus, the chi-square test is appropriate. The assumption from the paper that the numbers of mutants per experiment would never change is an example of a bad way to model an experiment. Note that all p-values in Blount et al. were calculated under that unreasonable assumption. Do you think that if these experiments were recreated, the total number of mutants would always be exactly the same? SJohnson 08:58, 28 March 2009 (EDT)