Difference between revisions of "Talk:PNAS Response to Letter"
(Point 5 + reviewing)
Revision as of 15:36, 20 October 2008
Notice: misrepresentations are not going to be allowed on this page. Substantive comments only, please.
Point 5 Confirmed
I would like to contribute to this discussion because I have taught statistics to graduate biology students for 16 years.
The combination of data from several experiments is a specialist and sometimes difficult area of statistical theory, but a simple example shows why Aschlafly’s concern about combining the results of three different experiments is not justified and why this aspect of his criticism of Lenski’s recent paper in PNAS is not valid.
Suppose we want to conduct a test of whether or not men are taller than women on average. For the sake of the example, I generated random heights of people from a population in which men had an average height of 175cm (about 5’9”) and women of 165cm (about 5’5”). The standard deviation of height in both sexes was 7cm. I think these numbers are approximately correct for people in the UK but the details aren’t important.
Suppose we take 5 samples of 2 men and 2 women. Here are the numbers I generated:
|Man1||Man2||Woman1||Woman2||Men mean||Women mean||Mean difference||t||P|
P in the last column is the t-test probability for a one-sided test of women being shorter than men. (Formally, it’s the probability of getting a value of t at least as large as that calculated from the data if men are in fact no taller than women on average.)
Should the fact that, in the fourth sample, the average height of the women is taller than that of the men make us doubt that men are in fact taller on average? Should we be concerned about the last sample, in which the difference in height of the two sexes is rather small, though in the expected direction? No, in both cases. When we combine the data on all 10 men and all 10 women, we get this:
|Men mean||Women mean||Mean difference||t||P|
Clearly, combining the data from several similar experiments strengthens the conclusions considerably, as shown by the fact that P is much smaller for the combined data than for any individual sample.
Although the combination of data from several experiments is a specialised area of statistics, I see nothing particularly incorrect about the approach used by Lenski and his colleagues. The general point is that it is valid to combine the results of different experiments if it is scientifically meaningful to do so. (For example: A. Combining the results of five samples of the heights of men and women is clearly valid. B. Combining three samples of heights of men and women with two samples of lengths of male and female squid clearly isn’t.) Generally speaking, the outcome of a combined analysis of several small experiments which all point in the same direction (or at least in a similar direction) will be more significant than that of any one of those experiments, as is shown in the larger table above.
I hope this clarifies the extensive discussion on this point and puts Aschlafly’s mind at rest on this subject. KennyMac 08:20, 18 September 2008 (EDT)
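As a rough illustration of the height example above, the following Python sketch redraws five samples of two men and two women from the stated populations (means 175cm and 165cm, standard deviation 7cm) and computes the two-sample t statistic for each small sample and for the pooled data. The seed, the function names `pooled_t` and `simulate`, and the choice of the equal-variance t statistic are assumptions made for illustration; the numbers it prints are not KennyMac's, and it stops at t rather than computing P.

```python
import math
import random
import statistics


def pooled_t(men, women):
    """Two-sample Student t statistic (equal-variance form), men minus women."""
    n1, n2 = len(men), len(women)
    # Pooled variance combines the two within-group sample variances.
    pooled_var = ((n1 - 1) * statistics.variance(men)
                  + (n2 - 1) * statistics.variance(women)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(men) - statistics.mean(women)) / std_err


def simulate(seed=0, n_samples=5, per_group=2):
    """Draw several small samples; return each sample's t and the combined t."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    all_men, all_women, per_sample_t = [], [], []
    for _ in range(n_samples):
        men = [rng.gauss(175, 7) for _ in range(per_group)]
        women = [rng.gauss(165, 7) for _ in range(per_group)]
        per_sample_t.append(pooled_t(men, women))
        all_men += men
        all_women += women
    return per_sample_t, pooled_t(all_men, all_women)


if __name__ == "__main__":
    small, combined = simulate()
    print("t for each sample of 2+2:", [round(t, 2) for t in small])
    print("t for all 10+10 combined:", round(combined, 2))
```

Each small sample has only 2 degrees of freedom while the combined comparison has 18, so a given value of t corresponds to a much smaller P in the combined analysis. That is the effect the post describes.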
- That's very nicely put, thanks. You should work on some of the stats pages here. Of course, technically any sample is ultimately just a combination of n samples of size 1. MikeR 13:28, 18 September 2008 (EDT)
- I'll take a look at this Friday. It's not immediately obvious what the point is to your analysis above.--Aschlafly 23:46, 18 September 2008 (EDT)
- "The general point is that it is valid to combine the results of different experiments if it is scientifically meaningful to do so"--KingOfNothing 00:57, 19 September 2008 (EDT)
- This makes no sense as an argument. It may be true in this simple case that you can do one large or several small samples and get similar results - which is quite obvious and wouldn't need such a detailed rant. However, you provide no mathematical proof, just one example. Etc 01:08, 19 September 2008 (EDT)
- "It's not immediately obvious what the point is to your analysis above". No surprise, Aschlafly, really. Maybe you should take your own advice: "I suggest you try harder with an open mind". --CrossC 02:46, 19 September 2008 (EDT)
It is with great sadness that I note that the author of this - the only significant statistical explanation and discussion in this entire fiasco - has just been blocked for five years. Even his email is blocked, so he can't even appeal the action. I don't see such maneuvers as having contributed to the much vaunted "open mind" of which various people here speak. BenHur 10:27, 19 September 2008 (EDT)
REPLY: I have now reviewed the above analysis, and it supports Point 5 rather than the PNAS paper. Point 5 stated, "The Third Experiment was erroneously combined with the other two experiments based on outcome rather than sample size, thereby yielding a false claim of overall statistical significance." The analysis above does nothing more than reinforce Point 5 by combining experiments based on sample size.
In Pavlovian manner, some Lenski types nod their heads here in agreement at the above analysis, apparently unaware that it reinforces Point 5.
When combining results from samples that are vastly different in sample size, it is necessary to factor in the different sample sizes. Apparently the PNAS paper failed to do that, which helps explain why it refuses to provide a meaningful response to Point 5.--Aschlafly 19:24, 19 September 2008 (EDT)
(rants below were deleted for being non-substantive in violation of this page's rules.)--Aschlafly 19:24, 19 September 2008 (EDT)
- I understand point 5 now, or at least I think I do. What we have as a "sample" is either:
- 1. Individual cultures (Schlafly)
- 2. Cultures that developed cit+. (Lenski)
- Schlafly contends that the sample should be all the cultures and that Lenski has, improperly, filtered the sample by excluding the vast majority of it (i.e. all those cultures that did not become cit+). Am I right in thinking this is the argument? --Toffeeman 19:57, 19 September 2008 (EDT)
- No, we're talking about how Lenski combined a large study (which did not really support Lenski's hypothesis) with small studies (which Lenski claims do support his hypothesis). The studies were not combined in a logical manner with proper weighting given to the much bigger size of the large study.--Aschlafly 23:15, 19 September 2008 (EDT)
- You say "statistical technique used," but you should have said "statistical technique cited." In fact, a close reading of the Z-transform paper provides more support for Point 5: combined studies must be weighted based on sample size:
- "When there is variation in the sample size across studies, there can be a noticeable difference in the power of the two methods, with the weighted Z-approach being superior in all cases. As such, we should always prefer the weighted Z to the unweighted Z-approach when the independent studies test the same hypothesis." See p. 1371.
- In other words, the cited paper actually supports Point 5.--Aschlafly 09:34, 20 September 2008 (EDT)
(unindent)Lenski used the weighted method. See note 49 to the paper and the text around the combination. Of course there is the question of the basis on which Lenski weighted the results. Lenski weighted the results on the basis of the Cit+ numbers, and we may think it would have been better to weight on the basis of the number of replicates. I have below the calculations (not mine) of combined P-values based on 1) no weighting, 2) weighting on the basis of Cit+ and 3) weighting on the basis of replicates. The weighted Z-transform is: Weighted Z = SUM(Weight x Z-score for each run)/SQRT(SUM(Weight^2 for each run))
Applying the formula described above:
|Weighting||Weighted Z||P|
|By total Cit+||3.576||<0.001|
|By total replicates||1.825||0.034|
So weighting on the basis of the number of replicates considerably increases the P-value. It remains, however, well within the range of statistical significance (0<P<0.05). If we hold that Lenski should have weighted on the basis of replicates then he should have rejected the null hypothesis and reached exactly the same conclusions that he did. The entire paper would have been exactly the same except the sentence “the result is extremely significant (P<0.0001) whether or not….” would read “the result is significant (P<0.04) whether or not”. Point 5 establishes one number and an “extremely”. Point 5, therefore, has no weight (excuse the pun). --Toffeeman 15:18, 20 September 2008 (EDT)
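The weighted Z formula quoted above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not the calculation Lenski or anyone on this page actually ran: the function names `weighted_z` and `one_sided_p` are made up for illustration, and the three per-experiment P-values and the replicate-style weights in the usage section are placeholders, since the individual P-values are not given on this page.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def one_sided_p(z):
    """Upper-tail probability of a standard normal at z."""
    return 1 - _STD_NORMAL.cdf(z)


def weighted_z(p_values, weights):
    """Combine one-sided P-values: Z_w = SUM(w * Z) / SQRT(SUM(w^2))."""
    if len(p_values) != len(weights):
        raise ValueError("need one weight per P-value")
    # Convert each one-sided P-value to a Z-score (small P -> large positive Z).
    z_scores = [_STD_NORMAL.inv_cdf(1 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    z_w = numerator / sqrt(sum(w * w for w in weights))
    return z_w, one_sided_p(z_w)


if __name__ == "__main__":
    # Converting the two Z values in the table above back to one-sided P-values.
    print("P for Z = 3.576:", one_sided_p(3.576))
    print("P for Z = 1.825:", one_sided_p(1.825))

    # Placeholder P-values (NOT from the paper), combined three ways.
    placeholder_p = [0.05, 0.05, 0.5]
    for label, w in [("unweighted", [1, 1, 1]),
                     ("illustrative small weights", [4, 5, 8]),
                     ("illustrative large weights", [72, 340, 2800])]:
        print(label, weighted_z(placeholder_p, w))
```

Only the ratios between the weights matter: multiplying every weight by the same constant leaves the combined Z unchanged, so the disagreement in this thread is about which quantity the ratios should follow.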
- Sorry, Toffeeman, a falsehood is still a falsehood. Based on your own posting, if Lenski had applied the Whitlock Z-transform paper in a logical manner, the results would not have been nearly as striking as Lenski claimed (his paper said the results were "extremely significant"). Moreover, I found Lenski's description of his application of the Whitlock paper to be particularly misleading. Lenski's use of "whether or not" obscures the basic error that he did not apply Whitlock's paper in the straightforward, correct manner. I think the wording in the Lenski paper deliberately obscures this falsehood from the reader.
- People have free will to embrace and defend falsehoods. I don't expect them to change quickly or admit they were wrong. But you'll find me defending and promoting the truth.
- Point 5 remains valid and the falsehood remains uncorrected by PNAS or Lenski. Four other points in higher priority remain uncorrected by them also.--Aschlafly 16:47, 20 September 2008 (EDT)
- "a falsehood is still a falsehood". Precisely, the null hypothesis should be considered false and the conclusions of the paper stand.
- Oh? Do you mean Lenski's falsehood? And what falsehood is that? Lenski said that he had calculated the P-value without weighting and that had come out at <0.0001. That is true, not false. Lenski said he had calculated the P-value weighting on the basis of the Cit+ replicates and that had come out at <0.0001. That is true, not false. There is no falsehood. Lenski did not mention the results of weighting on the basis of replicate numbers. He thus made no claim about weighting on the basis of replicate numbers. If he made no claim he cannot have made a false claim.
- How do the words Lenski uses "mislead"? If you read them as written, what conclusion do you come to? You are led to the conclusion that the mutation was not "rare-but-equal"; instead it was contingent. That is the right conclusion. If Lenski had presented the data in a different manner (perhaps by including the results of weighting on the basis of replicate numbers), what conclusion would you come to? You are again led to the conclusion that the mutation was not "rare-but-equal"; instead it was contingent. That is the right conclusion. To be misled you must be led to a conclusion that is incorrect. By Lenski's paper you are not led to a conclusion that is incorrect. Thus it cannot be said to be "misleading".
- I shall not comment on your second paragraph, the temptation to "Tu Quoque" would be too great.
--Toffeeman 17:13, 20 September 2008 (EDT)
- The falsehood consists of pretending to apply the Whitlock Z-transform in a straightforward, logical and correct manner. I think the Lenski paper is intentionally misleading by using the "whether or not" wording, when both alternatives are nonsensical. Point 5 has been proven above to be correct in identifying an error in the Lenski paper.
- "Toffeeman", your blocking history suggests you have been less than straightforward yourself. Go elsewhere if you seek to be deceitful. You're not fooling anyone here.--Aschlafly 17:54, 20 September 2008 (EDT)
- Lenski did apply the Z-transformation correctly, both weighted and unweighted. The data points extracted from each replay are the generation numbers of those replicates that gave rise to Cit+ mutants. Thus Replay 1 produced four data points: 30,500 31,500 32,500 32,500. Replay 2 produced five data points: 32,000 32,000 32,000 32,000 32,500. Replay 3 produced eight data points: 20,000 20,000 27,000 27,000 31,000 31,500 32,000 32,000. Thus the N for replay 1 is 4; replay 2 is 5 and replay 3 is 8. The fact that replay 3 used 38 times as many replicates as replay 1 does not mean that it should be weighted 38 times as much; it only produced twice as much data, not 38 times as much.
- Suppose I want to find out what the average age of a murderer is in three cities. In L.A. I interview 72 random people and find that 4 of them were convicted of murder; I record the ages of the four. In Seattle I interview 340 people and find that 5 of them are convicted murderers; likewise in Singapore I interview 2800 and find 8 murderers. In the end, I have 4,5, and 8 data points from the three cities; the number of people I had to interview to obtain those data points doesn't factor into the analysis of what the average age of the murderers is.--Brossa 00:06, 21 September 2008 (EDT)
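To make the counting in Brossa's post concrete, the sketch below tabulates the data points listed there for each replay and derives the sample size, the mean generation, and the weight relative to the smallest replay. It reflects Brossa's reading that n is the number of Cit+ data points, which is the point under dispute in this thread; the names `replays` and `summarize` are hypothetical.

```python
from statistics import mean

# Generation numbers of the replicates that gave rise to Cit+ mutants,
# copied from the figures listed in the post above.
replays = {
    "Replay 1": [30500, 31500, 32500, 32500],
    "Replay 2": [32000, 32000, 32000, 32000, 32500],
    "Replay 3": [20000, 20000, 27000, 27000, 31000, 31500, 32000, 32000],
}


def summarize(data):
    """Return n, mean generation and weight relative to the smallest n."""
    smallest_n = min(len(points) for points in data.values())
    return {
        name: {
            "n": len(points),
            "mean_generation": mean(points),
            # Weighting by n: a replay with twice the data gets twice the weight.
            "relative_weight": len(points) / smallest_n,
        }
        for name, points in data.items()
    }


if __name__ == "__main__":
    for name, row in summarize(replays).items():
        print(name, row)
```

On this reading the number of cultures screened in each replay never enters the calculation, just as the number of people interviewed does not enter the murderer-age example.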
- In your first paragraph you simply repeat the error underlying Lenski's paper. You, like the paper, incorrectly apply Whitlock's Z-transform.
- The quality and reliability of data is proportional to sample size, and when different studies are combined they need to be weighted accordingly. The results from a very large sample size would not be weighted equally with the results from a small sample size, as you and Lenski have done. That's basic logic, though I'm not optimistic that you or Lenski will admit it. Open-minded people who respect logic have no difficulty elevating logic over personal whim.--Aschlafly 11:30, 21 September 2008 (EDT)
- Andy if you read Whitlock's paper you would see it say, and I quote, "Ideally each study is weighted proportional to the inverse of its error variance, that is, by the reciprocal of its squared standard error." It says nothing about weighting according to sample size, which is what you seem to be insisting should be done.
- Also Whitlock acknowledges in the paper that there is no preference for weighted versus equal weighting, so the fact that both equal weighting and weighting by the standard error give a statistically significant result shows that the 3 experiments combined support rejection of the null hypothesis. DanB 20:39, 21 September 2008 (EDT)
- ASchlafly, you state that I incorrectly apply "Whitlock's Z-transform" (actually the test belongs to Mosteller & Bush and/or Liptak). Whitlock describes weighting by the reciprocal of the squared standard error. The standard error of the mean is proportional to 1/sqrt(N), so the reciprocal of the squared standard error is proportional to N. Thus larger studies are given more weight. I maintain that the sample sizes N of the three replays are 4, 5, and 8 respectively. Weighting based on those three N does not weight all three replays equally as you claim: it gives replay 2 25% more weight and replay 3 100% more weight than replay 1. Rather than simply repeating that I am wrong, will you please state what you think the sample sizes of the three replay experiments are, and, in your opinion, what the correct application of the Z-transformation would be?--Brossa 18:00, 22 September 2008 (EDT)
- The Lenski paper states how it weighted the experiments, and that weighting is incorrect. Admit it. Moreover, the incorrect weighting in the Lenski paper was not likely an inadvertent error, as it inflated the significance of the results. I found the wording used by the Lenski paper to describe its (incorrect) weighting to be artfully misleading.
- Provide me with federal funding as Lenski received, and I'll write a paper for you. But I don't have to write an alternative paper to point out glaring errors in Lenski's paper.--Aschlafly 08:35, 23 September 2008 (EDT)
- I've been following this discussion for a while and I have to agree with ASchlafly. It hardly seems fair that he should have to, in his spare time, replicate an experiment done by a professional just to "earn" the right to criticize it. I am unfamiliar with statistics, but if some complicated transform goes against common sense, common sense should prevail. After all, there are lies, damned lies, and statistics... AndyM 10:57, 23 September 2008 (EDT)
(unindent)I'm not asking anyone to write a paper or replicate an experiment. I'm asking ASchlafly to support his statement "The results from a very large sample size would not be weighted equally with the results from a small sample size, as you and Lenski have done"(bolding mine). I have stated publicly, subject to challenge by others, that the sample sizes (n) of the three replays are four, five, and eight respectively. Furthermore, using n of 4, 5, and 8 in the weighted Z-method DOES NOT weight all the replay experiments equally - it weights replay 3 twice as much as replay 1 and 8/5 as much as replay 2. Tell you what: I'll drop all my questions about Monte Carlo and the Z-transform, and simply ask ASchlafly one question: what is the sample size, n, of the second replay experiment? He need not even do any calculations - a statement in words that will allow someone else to do the calculation will suffice. This is not a complicated question to answer; the paper states how many replicate cultures there were (340), how many cells there were in each replicate (3.9x10^8), how many replicates gave rise to Cit+ cells (5), and which generations those Cit+ replicates came from (4 from 32,000 and one from 32,500). I will even give my answer: five. Furthermore, I will say why I believe that, using the murderer/age analogy: performing the 340 replicates is the same as interviewing 340 people in order to find out if any of them are convicted murderers. Finding that five replicates gave rise to Cit+ mutants is the same as the survey finding that 5 of those 340 people were convicted murderers. Finding that the Cit+ mutants arose from 4 replicates from generation 32,000 and 1 from generation 32,500 is the same as finding the ages of the murderers. The five data points in the Lenski study allow one to calculate the 'mean generation of clones yielding Cit+': 32,100. This is the same as finding the mean age of the five murderers. 
If I want to compare this hypothetical murderer age study to some other study of the mean age of murderers, I would weight the studies based on how many murderers were in each study, not on how many non-murderers were included in the initial survey.
Surely ASchlafly can say what he thinks the n of the second replay is, even if he won't say why he thinks it. Is it five? 340? The number of replicates times the number of cells per replicate? Something else? No analysis need be performed on the resulting number.--Brossa 15:54, 23 September 2008 (EDT)
- OK - would you care to put in writing that after ASchlafly gives you his response, you won't start obfuscating the issue with Monte Carlo and Z-transform issues? You understand: it is typical of liberals, after being proven wrong, to start pretending that they were talking about an entirely different issue altogether. After ASchlafly states the sample size of the second replay experiment you will consider yourself answered. Correct?
- Brossa, your rant is misplaced. One cannot salvage an error in logic by questioning which of superior alternatives should be used instead. The sample size of an experiment is the number that comprises the underlying sample used in the experiment, not the number of a certain outcome from the experiment. Maybe you can debate yourself over what the correct underlying sample size is, but it is plainly not the number of a certain outcome from the experiment.--Aschlafly 19:28, 23 September 2008 (EDT)
Maybe the Journal of Nature can be your next letter submission source. International Weekly Journal of Science -- jp 21:17, 22 September 2008 (EDT)
Point 5 Rejected
1. Before I make my main point, would either Aschlafly or Bugler like to comment on why they've repeatedly reverted my contributions to this Talk Page? I thought we weren't supposed to delete other people's stuff on Talk pages unless it's for a very good reason (obscenity, racism, etc). In this case, what are they embarrassed about?
2. I do have specialist knowledge of this area, despite Bugler's assertions to the contrary (made without any evidence whatsoever, may I say). But I repeat, it really doesn't matter whether I have particular qualifications or not. In science, what matters is what is said, not who says it. The most outstanding scientific discoveries of the 20th century were made by a patent clerk who had no academic position at the time.
3. I repeat my cut-and-paste from Longstop's contribution last week. If Aschlafly or Bugler don't agree, kindly discuss it on this page, don't just delete it with no explanation.
"The main point he [Kennymac] makes is quite correct, that combining experiments will tend to give a more significant result (i.e. a lower P-value) than any single experiment on its own.
"The point that ASchlafly makes (headed REPLY, shortly after KennyMac's example) is slightly misleading. You don't choose which experiments to combine on the basis of their outcomes - you either combine all relevant experiments or none. In this case, Professor Lenski chose to combine all the experiments and present a single analysis of them. That's fine.
"The issue of whether to weight the Z-test according to sample size is not a straightforward issue. Before Whitlock's recent paper, the consensus was that the P-value already depends on sample size so it should not be weighted according to sample size. (Basically, a high degree of significance, i.e. a low P-value, can be achieved by having either a large difference between the treatments or a large experiment or, of course, both.) Whitlock's work, based on computer simulations, is an interesting contribution but cannot be regarded as the last word on the subject because any computer simulation involves assumptions about the structure of the particular experiment simulated. I cannot imagine any editor of a scientific journal rejecting a paper because it used an unweighted Z-test rather than the weighted version (or vice-versa).
"The conclusion, therefore, is that while there is a quantitative difference in the level of significance obtained by weighted and unweighted Z-tests, there is undeniably a significant biological effect. ASchlafly and other contributors should not be concerned that there is anything incorrect about the biological conclusions of Professor Lenski and his students or that there was anything at all underhand about their analysis or presentation of their data." DavyJones 16:36, 20 October 2008 (EDT)
Refereeing of Scientific Papers
Another point that Aschlafly and Bugler reverted with no explanation was my account of why CP readers shouldn't be concerned about the fact that the Lenski paper was reviewed within 14 days. This is perfectly normal practice for a leading journal. If you're invited to review a paper for a leading journal and they give you 7/10/14/etc days, you only agree to review it if you're sure you're going to have time before the journal's deadline. So there's absolutely no reason to suppose the review wasn't done properly.
Again, if Aschlafly or Bugler disagrees, please explain why, on this page. Don't just delete what I've written with no explanation. DavyJones 16:36, 20 October 2008 (EDT)
- "We also used the Z-transformation method (49) to combine the probabilities from our three experiments, and the result is extremely significant (P < 0.0001) whether or not the experiments are weighted by the number of independent Cit+ mutants observed in each one." (Lenski paper at 7902).
- Mosteller, F. & Bush, R.R. 1954. Selected quantitative techniques. In: Handbook of Social Psychology, Vol. 1 (G. Lindzey, ed.)
- Liptak, T. 1958. On the combination of tests. Magyar Tud. Akad. Mat. Kutato Int. Kozl. 3: 171-197