Statistics can be described as "the practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample." It covers every stage of working with data, from the initial collection through the analysis to the conclusions and interpretations drawn from it. It is used in all research-oriented disciplines, from physics, chemistry and biology to economics, anthropology and psychology, as well as in business and government.
Statistics analyzes data in two primary ways. The first, descriptive statistics, describes and summarizes the data. This often includes quantities such as the mean, the standard deviation, or the standard error, each of which is an example of a statistic. The second, inferential statistics, attempts to infer relationships between the data collected and various hypotheses or populations. Together, descriptive and inferential statistics make up applied statistics. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject.
Statistics takes its name from the fact that it was traditionally taught to monarchs to enable them to manage affairs of state.
Frequentist approaches are often referred to as classical approaches because they are the oldest and most widely used methods of statistical analysis. The heart of these approaches is to understand data as a relative frequency, the ratio of the number of times a particular outcome occurs to the total number of possible occurrences. For example, a frequentist would describe how often a coin turns up heads as the ratio of the number of heads to the total number of flips.
Frequentist approaches to descriptive statistics mostly involve averaging. For example, the mean of a sample is calculated as the sum of all observations divided by the number of observations, the standard deviation as the square root of the mean squared deviation from the mean, and the standard error of the mean as the standard deviation divided by the square root of N, the number of observations. Multiplying the standard error by a critical t* or z* value gives the margin of error of a confidence interval.
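As a minimal sketch of these calculations, the following Python snippet computes the three statistics for a small made-up sample. The function name describe and the data are illustrative assumptions, and the standard deviation here divides by N; many texts divide by N - 1 instead when estimating from a sample.

```python
import math

def describe(sample):
    """Return (mean, standard deviation, standard error of the mean)."""
    n = len(sample)
    mean = sum(sample) / n
    # Standard deviation: square root of the mean squared deviation
    # from the mean (population form, dividing by n).
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    # Standard error of the mean: standard deviation over sqrt(n).
    se = sd / math.sqrt(n)
    return mean, sd, se

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mean, sd, se = describe(sample)
print(mean, sd, se)
```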
These methods stem from the view of data as ratios and probabilities.
Frequentist approaches to inferential statistics primarily involve comparing descriptive statistics of two data sets to determine whether they are significantly different. One of the most common approaches is to test a given data set against a null hypothesis, that is, against the data that would be expected if the values were the result of random chance alone. For example, if a coin flipped 10 times came up heads 9 times and tails once, you would compare the observed number of heads, 9, to the number of heads expected if chance alone were operating, 5.
Testing against the null hypothesis is sometimes referred to as an omnibus test, because it tests only whether a given data set is the result of anything other than chance. It is often more useful to test specific data sets against each other.
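The coin example can be made concrete with an exact binomial test. The sketch below is one common way to do it, not the only one: it assumes 10 flips, a fair-coin null hypothesis, and a two-sided p-value defined as the total null probability of all outcomes no more likely than the observed one. The function name is hypothetical.

```python
from math import comb

def binomial_two_sided_p(heads, flips, p_null=0.5):
    """Exact two-sided binomial p-value under the null hypothesis."""
    # Probability of each possible head count if the null is true.
    pmf = [comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Sum the outcomes at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

expected_heads = 10 * 0.5           # 5 heads expected by chance alone
p_value = binomial_two_sided_p(9, 10)
print(expected_heads, p_value)
```

A small p-value says that a result this extreme would be unusual if only chance were operating; it does not give the probability that the coin is fair.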
Common frequentist methods
Chi-Square test (the test could be of independence/association, homogeneity, or goodness-of-fit, depending on the circumstance)
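As an illustration of the goodness-of-fit variant, the sketch below computes the Pearson chi-square statistic for hypothetical counts of 9 heads and 1 tail against the 5 and 5 expected from a fair coin. The p-value helper is valid for one degree of freedom only, and with expected counts this small the chi-square approximation is rough, so treat the result as illustrative.

```python
import math

def chi_square_gof(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df1(statistic):
    """Upper-tail p-value for exactly 1 degree of freedom.

    With df = 1 the statistic is a squared standard normal, so the
    tail probability equals erfc(sqrt(statistic / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2))

observed = [9, 1]   # hypothetical counts: heads, tails
expected = [5, 5]   # counts expected from a fair coin
stat = chi_square_gof(observed, expected)
p = chi_square_p_value_df1(stat)
print(stat, p)
```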
Bayesian statistics is a method of applying Bayes' theorem to data analysis. One of the biggest differences between Bayesian and frequentist approaches is that Bayesians attempt to determine the probability that a given hypothesis is true given the data, while frequentists attempt to determine the probability of obtaining the data given that a particular hypothesis is true.
Bayesian approaches are becoming more and more popular in science because what most people are interested in is the probability of the proposed hypothesis, not the probability of the data. They also do not need to make prior assumptions about the data, such as normality and homogeneity of variance. However, Bayesian methods have come under fire from many frequentist proponents. This led to heated debate in statistical circles about the relative validity of the two approaches, though the debate has largely died down. The primary complaint leveled at Bayesian statistics is that it must use a prior probability of a hypothesis in its analysis. This prior is intended to build contextual information into the analysis, but critics may see it as subjective or arbitrary. Commonly used prior distributions include the uniform distribution and the beta distribution.
Bayesian methods all use Bayes' equation; this applies to both descriptive and inferential statistics. To find quantities such as the mean and standard deviation, a prior probability must first be assigned to every candidate value of the mean and standard deviation. In practice this usually means assigning uniform probabilities to values equally spaced between what we think are the minimum and maximum values of the statistic we are interested in (the number of values depends on the grid density; a denser grid gives greater accuracy at the cost of longer computation time). The likelihood of each value is then calculated from the data, and Bayes' equation is used to assign a posterior probability to each value. These posterior probabilities can be plotted as a probability density function (PDF) to show the probability of each value given the data, or the value with the highest posterior probability is simply chosen.
Inferential statistics in Bayesian methods looks much the same as descriptive statistics, since both use Bayes' equation and the same basic approach. To compare two means you would calculate the PDF for each data set and then work with the distribution of their difference to find the probability that they differ.
To compare hypotheses, Bayesian model selection is often used. Each hypothesis you want to test is assigned a prior probability, and the likelihood of the data given each hypothesis being tested is calculated. You can then use Bayes' equation to determine the relative probabilities that each hypothesis is correct. This method almost always tests relative probabilities, since calculating an absolute probability would require knowing every possible hypothesis. Usually this is not possible, but sometimes the set of hypotheses is small enough that all of them can be tested.
Because of the large number of calculations needed for model selection, Bayesian approaches have only become practical and popular with the advent of computers. Even with the most modern computers available, many Bayesian models remain computationally intractable. Recent developments in applying Markov chain Monte Carlo methods to these problems have led to promising results.
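The grid procedure described above can be sketched as follows. To keep the example short it estimates a single quantity, the probability of heads for a coin, rather than a mean and standard deviation; the data (9 heads in 10 flips), the grid size, and the function name are all assumptions made for illustration.

```python
def grid_posterior(heads, flips, grid_size=101):
    """Posterior over a coin's heads probability on an even grid."""
    # Candidate values equally spaced between the minimum (0) and maximum (1).
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Uniform prior: every candidate value is equally probable.
    prior = [1.0 / grid_size] * grid_size
    tails = flips - heads
    # Likelihood of the data for each candidate value.
    likelihood = [t**heads * (1 - t)**tails for t in grid]
    # Bayes' equation: posterior is prior times likelihood, normalised.
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)
    return grid, [u / evidence for u in unnormalised]

grid, posterior = grid_posterior(heads=9, flips=10)
# Value with the highest posterior probability.
map_estimate = grid[posterior.index(max(posterior))]
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
print(map_estimate, posterior_mean)
```

A denser grid (a larger grid_size) tracks the posterior more closely but takes proportionally longer to compute.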
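One way to make such a comparison concrete is sketched below, using two coins rather than two means to keep the code short. It assumes uniform priors, hypothetical data (9 of 10 heads versus 5 of 10 heads), and independence between the two posteriors, and it reports the posterior probability that the first coin's heads probability exceeds the second's.

```python
def grid_posterior(heads, flips, grid_size=201):
    """Grid posterior for a heads probability under a uniform prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [t**heads * (1 - t)**(flips - heads) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

def prob_first_greater(post_a, post_b):
    """P(theta_A > theta_B) for independent posteriors on the same grid."""
    total = 0.0
    cumulative_b = 0.0  # P(theta_B is below the current grid value)
    for pa, pb in zip(post_a, post_b):
        total += pa * cumulative_b
        cumulative_b += pb
    return total

post_a = grid_posterior(9, 10)   # hypothetical data set A
post_b = grid_posterior(5, 10)   # hypothetical data set B
p_a_greater = prob_first_greater(post_a, post_b)
print(p_a_greater)
```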
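A minimal sketch of this procedure, assuming only two candidate hypotheses about a coin (fair, with heads probability 0.5, or biased, with heads probability 0.9), equal prior probabilities, and 9 heads observed in 10 flips. The numbers and the function name are illustrative; the resulting probabilities are relative to this pair of hypotheses only.

```python
from math import comb

def posterior_model_probs(heads, flips, hypotheses, priors):
    """Relative posterior probability of each hypothesised heads rate."""
    # Likelihood of the data under each hypothesis (binomial).
    likelihoods = [comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
                   for p in hypotheses]
    # Bayes' equation, normalised over the hypotheses being compared.
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

probs = posterior_model_probs(9, 10, hypotheses=[0.5, 0.9], priors=[0.5, 0.5])
print(probs)  # [P(fair | data), P(biased | data)]
```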
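To give a flavour of Markov chain Monte Carlo, the sketch below uses a random-walk Metropolis sampler, one of the simplest such methods, to draw samples from the posterior of a coin's heads probability. The data (9 heads in 10 flips), the uniform prior, the step size, the burn-in length, and the fixed seed are all assumptions chosen for illustration; real problems need far more care with tuning and convergence checks.

```python
import random

def metropolis_coin(heads, flips, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis samples of a coin's heads probability."""
    rng = random.Random(seed)
    tails = flips - heads

    def unnormalised_posterior(theta):
        # Uniform prior on (0, 1) times the binomial likelihood.
        if not 0.0 < theta < 1.0:
            return 0.0
        return theta**heads * (1 - theta)**tails

    theta = 0.5
    current = unnormalised_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = unnormalised_posterior(proposal)
        # Accept with probability min(1, candidate / current).
        if rng.random() < candidate / current:
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

samples = metropolis_coin(9, 10)
kept = samples[2000:]               # discard an initial burn-in period
estimate = sum(kept) / len(kept)    # approximate posterior mean
print(estimate)
```

The sampler only ever needs the posterior up to a constant factor, which is why it remains usable when the normalising sum over all hypotheses is too expensive to compute.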
Non-parametric and Bootstrapping methods
One of the greatest problems with frequentist approaches to statistics is that they often rely on prior assumptions about how the data are distributed and how they were collected. Most commonly the data must be normally distributed and have homogeneity of variance. Different statistical methods are more or less robust to violations of these assumptions, and some techniques have attempted to avoid them altogether. The end result, though, is usually a significant loss of power and an increased likelihood of error.
Non-parametric statistics are any of the many methods that attempt to describe data or make inferential claims without the need for tightly constrained parameters. The main goal is to eliminate the need for assumptions without sacrificing power and accuracy.
Bootstrapping statistics is a particularly popular non-parametric approach. Bootstrapping is computationally costly and has only recently become feasible for most data sets. It involves sampling with replacement from the given data set perhaps as many as 100,000 times in order to determine mean, error, best fits and comparisons of data sets.
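A minimal sketch of the idea, assuming a small made-up sample, 10,000 resamples, and a fixed random seed: the data are resampled with replacement many times, and the spread of the resampled means is used to estimate the standard error and a percentile interval for the mean. The function name and data are illustrative.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=10000, seed=0):
    """Means of resamples drawn with replacement from the data."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same size as the original data set.
    return [statistics.fmean(rng.choices(data, k=n))
            for _ in range(n_resamples)]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
means = sorted(bootstrap_means(data))
# Spread of the resampled means estimates the standard error.
boot_se = statistics.stdev(means)
# Percentile interval: middle 95% of the resampled means.
ci_low = means[int(0.025 * len(means))]
ci_high = means[int(0.975 * len(means)) - 1]
print(boot_se, ci_low, ci_high)
```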
Misuse of statistics
Statistical data can often be manipulated to make it seem like it proves a certain hypothesis, whereas in actuality it does not. Because of the advanced mathematics involved in computing some statistics, people can sometimes be deceived by this. This type of deceit has been used by politicians and their supporters to give a false impression of voter preferences, for example. It can also be used by scientists with their own agendas to try to "prove" various otherwise unsupported theories. Since it is also possible to misuse statistics by accident, statisticians must always be very careful; for example, polls can be skewed if the wording of questions or other polling techniques unintentionally result in bias.
- Soanes, C. and Stevenson, A. (eds.) (2005) 'Oxford Dictionary of English (2nd edition revised)' Oxford University Press, Oxford, U.K.