Difference between revisions of "Information entropy"

Revision as of 14:50, February 18, 2008

The well known entity in the theory of information, seemingly founded by Claude Shannon, is the information entropy defined as

  H = - sum { p(i) log [p(i) ] }

where the index i runs through all positive integers up to n, for instance (i = 1, 2, …, n). This entropy may calculated for any statistical frequency function, such that the sum { p(i) } = 1. The maximum in H (= log(n)) is obtained when all p(i) are equal. Mathematically, H is also equivalent to entropy and disorder in physics.

H may also be interpreted as an average information based on a certain axiom that when an event occurs with probability p(i), then we get the self-information - log [p(i) ]. This information is only about probability and has no mental meaning.

Shannon derived the formula from axioms, which in a natural way may be interpreted as information. When we toss a coin we may have two different messages, each with probability 0.5 which may be denoted 0 and 1. If the coin is tossed two times in a sequence we may have the following four possible messages ‘00’, ‘01’, ‘10’ and ‘11’, each with probability 0.25, which may be seen as the information from a pair of tosses.

A reasonable postulate is that the sum of information from any two different messages should equal the total information in the joined messages. And one way to accomplish this is to let the self-information equal the negative logarithm of the probability. Thus, the information in a pair of tosses becomes

  -log(0.5) - log(0.5) = -log(0.25)   or   log(2) + log(2) = log(4)

Another reasonable postulate is that when a message increases in length, then its information increases. For instance, a triple of tosses gives messages with probability = 0.125 and a quadruple with probability = 0.0625. and

  -log(0.0625)  > -log(0.125).

And an event occurring with probability = 1 is a certain event, which will carry no information. Consequently -log(1) = 0.

If all these properties should be maintained for all possible divisions of messages written in different alphabets, then the negative logarithm of the probability is a very good candidate for being a measure of information.

More generally - the world “we” may be followed by “are”, but hardly by “am” – some part Y of a message may depend on another part X, making the average information H(Y|X) less than H(XY), i. e. the average information when X and Y are statistically independent. We have

  H(Y|X) = H(XY) – H(X)

and the only functions satisfying this formula are those proportional to the entropy H, Åslund, 1961. Then, entropy may be seen as average information.

The continuous case

In the continuous case the discrete frequency function is replaced by a continuous probability density function, p. d. f., f(x), and H may be written

  H = integral{ f(x) log[ f(x) dx ] dx }.

A problem here is that H becomes uncertain because of the infinitely small dx. A way to evade the problem is to define H in such way that H equals the logarithm of the volume – in analogy to the cardinality n in the discrete case - covered by a uniform p. d. f. over the volume. Thus H may be defined as

  H = integral{ f(x) log[ f(x) ] dx }.

The “volume” covered by a Gaussian p. d. f. is proportional to the volume of the ellipsoid of concentration and equal to the square-root{ (2 pi e)^n determinant(M) }, where M is the moment matrix of the Gaussian.

The average information may be maximized by a certain genetic algorithm, Gaussian adaptation, keeping the mean fitness – i. e. the average probability of becoming a parent to new individuals in the progeny – constant (and without any need for any knowledge about average information as a criterion function). This is illustrated by the figure below, showing Gaussian adaptation climbing a mountain crest in a phenotypic landscape.

The lines in the figure are part of a contour line enclosing a region of acceptability in the landscape. At the start the cluster of red points represents a very homogeneous population with small variances in the phenotypes. Evidently, even small environmental changes in the landscape, may cause the process to become extinct.

web.telia.com/~u91131915/mountain_crest.GIF

After a sufficiently large number of generations, the increase in average information may result in the green cluster. Actually, the mean fitness is the same for both red and green cluster (about 65%). The effect of this adaptation is not very salient in a 2-dimensional case, but in a high-dimensional case, the efficiency of the search process may be increased by many orders of magnitude. Besides, a Gaussian distribution has the highest average information as compared to other distributions having the same second order moment matrix, Middleton, 1960.

References

Brooks, D. R. & Wiley, E. O. Evolution as Entropy, Towards a unified theory of Biology. The University of Chicago Press, 1986.
Kjellström, G. Network Optimization by Random Variation of component values. Ericsson Technics, vol. 25, no. 3, pp. 133-151, 1969.
Kjellström, G. On the Efficiency of Gaussian Adaptation. Journal of Optimization Theory and Applications, vol. 71, no. 3, Dec. 1991.
Shannon, C.E. A Mathematical Theory of Communication, Bell Syst. Techn. J., Vol. 27, pp 379-423, (Part I), 1948.
Åslund, N. The fundamental theorems of information theory (Swedish). Nordisk Matematisk Tidskrift, Band 9, Oslo 1961.

@@ Line 49: / Line 49: @@
 == See also ==
 * [[Information]]
+== References ==
+* Brooks, D. R. & Wiley, E. O. Evolution as Entropy, Towards a unified theory of Biology. The University of Chicago Press, 1986.
+* Kjellström, G. Network Optimization by Random Variation of component values. Ericsson Technics, vol. 25, no. 3, pp. 133-151, 1969.
+* Kjellström, G. On the Efficiency of Gaussian Adaptation. Journal of Optimization Theory and Applications, vol. 71, no. 3, Dec. 1991.
+* Shannon, C.E. A Mathematical Theory of Communication, Bell Syst. Techn. J., Vol. 27, pp 379-423, (Part I), 1948.
+* Åslund, N. The fundamental theorems of information theory (Swedish). Nordisk Matematisk Tidskrift, Band 9, Oslo 1961.

Difference between revisions of "Information entropy"

Revision as of 14:50, February 18, 2008

The continuous case

See also

References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Popular Links

donate

Edit Console