# Information entropy

Information theory is the study of "information" or data, in a statistical manner.

## Overview

The well known entity in the theory of information, seemingly founded by Claude Shannon, is the information entropy defined as



where the index i runs through all positive integers up to n, for instance (i = 1, 2, …, n). Note that the logarithm is taken to base 2. This entropy may calculated for any statistical frequency function, such that . The maximum in H (= log(n)) is obtained when all p(i) are equal. Mathematically, H is also equivalent to entropy and disorder in physics.

H may also be interpreted as an average information based on a certain axiom that when an event occurs with probability p(i), then we get the self-information - log [p(i) ]. This information is only about probability and has no mental meaning.

Shannon derived the formula from axioms, which in a natural way may be interpreted as information. When we toss a coin we may have two different messages, each with probability 0.5 which may be denoted 0 and 1. If the coin is tossed two times in a sequence we may have the following four possible messages ‘00’, ‘01’, ‘10’ and ‘11’, each with probability 0.25, which may be seen as the information from a pair of tosses.

A reasonable postulate is that the sum of information from any two different messages should equal the total information in the joined messages. And one way to accomplish this is to let the self-information equal the negative logarithm of the probability. Thus, the information in a pair of tosses becomes

 or 

Another reasonable postulate is that when a message increases in length, then its information increases. For instance, a triple of tosses gives messages with probability = 0.125 and a quadruple with probability = 0.0625. and



And an event occurring with probability = 1 is a certain event, which will carry no information. Consequently, -log(1) = 0.

If all these properties should be maintained for all possible divisions of messages written in different alphabets, then the negative logarithm of the probability is a very good candidate for being a measure of information.

More generally - the word “we” may be followed by “are”, but hardly ever by “am” – some part Y of a message may depend on another part X, making the average information H(Y|X) less than H(XY), i. e. the average information when X and Y are statistically independent. We have



and the only functions satisfying this formula are those proportional to the entropy H, Åslund, 1961. Then, entropy may be seen as average information.

## The continuous case

In the continuous case the discrete frequency function is replaced by a continuous probability density function, (p. d. f.), f(x), and H may be written



A problem here is that H becomes uncertain because of the infinitesimally small dx. A way to evade the problem is to define H in such way that H equals the logarithm of the volume – in analogy to the cardinality n in the discrete case - covered by a uniform p. d. f. over the volume. Thus H may be defined as



The “volume” covered by a Gaussian p. d. f. is proportional to the volume of the ellipsoid of concentration and equal to , where M is the moment matrix of the Gaussian.

The average information may be maximized by a certain genetic algorithm, Gaussian adaptation, keeping the mean fitness – i. e. the average probability of becoming a parent to new individuals in the progeny – constant (and without any need for any knowledge about average information as a criterion function). This is illustrated by the figure below, showing Gaussian adaptation climbing a mountain crest in a phenotypic landscape.

The lines in the figure are part of a contour line enclosing a region of acceptability in the landscape. At the start the cluster of red points represents a very homogeneous population with small variances in the phenotypes. Evidently, even small environmental changes in the landscape, may cause the process to become extinct.

After a sufficiently large number of generations, the increase in average information may result in the green cluster. Actually, the mean fitness is the same for both red and green cluster (about 65%). The effect of this adaptation is not very salient in a 2-dimensional case, but in a high-dimensional case, the efficiency of the search process may be increased by many orders of magnitude. Besides, a Gaussian distribution has the highest average information as compared to other distributions having the same second order moment matrix, Middleton, 1960.

## History of applications

A comprehensive outline of applications such as telecommunication, channel coding, computer science, physics, neurobiology, electrical engineering and so on may be found elsewhere on the Internet.

Less well known is the use of information theory in simulated evolution by random search. For instance, the average speed of a random walk with Gaussian distributed steps in a hypercube or simplex is proportional to -P log( P ), where P is the average probability that a step will lead to a new point inside the hypercube, Kjellström, 1969. This may be interpreted as the self-information –log(P) divided by the work or time – proportional to 1/P – needed to get the information on the average.

The function -P log( P ) is positive in the P-interval, 0 <= P <= 1, equals zero at the end points and attains its maximum at P = 1/e = 37%.

A more general theorem of efficiency based on information theory may be found in a paper by Kjellström, 1991.

## Applications to creationism

The German physicist Dr. Werner Gitt has built on the work of Claude Shannon and Norbert Wiener in information theory in his article Information, science and biology, extending it with fourteen mathematical theorems from which it follows that evolution cannot account for the information encoded in living systems. He has, in effect, offered irrefutable 'a priori' proof that evolution is impossible.