7. Statistics

Stefan Kober

How Formal Systems Reorganize Belief

From Probability To Statistics

Probability theory started with a random device. The probabilities governing its outcomes can be inferred (for example from symmetry) or are given, and from this device we can derive what kinds of data should appear by calculating forward.

In statistics we also work with random devices, but the probabilities involved are not known in advance. Instead we observe data and try to estimate which probabilities the device must have had in order to generate such observations. Think of it as throwing an asymmetric die and inferring the probabilities of each face.

In both cases we consider a random device together with one of the two other elements: either the probabilities, or the data it produces. In each case we try to infer the missing part.

Thus in statistics the direction is reversed: from observed outcomes back to the probabilities governing the random device that produced them.

The Weaker Convictional Force Of Statistics

Statistics does not exert strong convictional force immediately. Most of its convincing power arises later, from its success in the construction of knowledge.

Probability theory already relied on the concept of a random device, and logic itself built on earlier patterns of conviction. The same was probably true of geometry before humans began constructing buildings and roads on a large scale, although we tend to take this for granted today. There is no direct access to completely unshaped convictional forces. We can only reconstruct them in hindsight, and they are always forces in a context.

Statistical reasoning is even further removed from immediate sources of conviction. It presupposes the conceptual tools of probability theory and logic, and must additionally construct convincing ways of reasoning backwards.

Different statistical traditions carry this construction out in different ways. Three influential approaches illustrate, in simplified form, how this is achieved.

The Fisher Model: Randomization

The statistician Ronald Aylmer Fisher developed a method based on the following idea. Suppose we compare two treatments and assume that they are equally effective. If the treatments are assigned randomly to the subjects, then every possible assignment of treatment labels to the observed outcomes is equally probable. Randomization ensures that hidden differences between subjects cannot systematically favor one treatment over the other.

We focus on the outcomes because they are the quantities of interest, while the subjects merely carry those outcomes. In the simple case each subject receives one of two treatments, and that produces, or fails to produce, an outcome. Treatments may take many forms: medicines in clinical trials, fertilizers in agricultural experiments, different manufacturing procedures, or any other repeatable intervention of interest. The subjects receiving the treatments may be people, plants, machines, or materials.

The assumption of equal effectiveness lets us do two things at once. First, it lets us assign well-defined probabilities: under that assumption, every possible assignment of labels to outcomes has the same probability. Giving two equally effective treatments is like giving everybody the same thing. All differences must be based on random fluctuations. Second, it lets us test the assumption using these probabilities. To do this, we order the possible outcomes according to how strongly they contradict the assumption. We agree to reject the assumption if the actual outcome lies among the most extreme results opposing it.

That gives us a method for reasoning backwards from data to probabilities. If the data is too improbable given our assumption, we take the assumption to be untenable. Otherwise, we do not reject it.

The statistician Phillip I. Good often illustrated this idea with a small experiment in which he nearly gathered good data on the longevity effects of vitamin E (for example in Resampling Methods, ch. 3.1 ff.).

The experiment was simple. Good prepared eight petri dishes with cells, four filled with a conventional nutrient solution and four with vitamin E added.

After three weeks three dishes of each kind had survived, leaving six observations. The story has at least one other twist that I will not spoil here, but the essential point is that Good ended up with six longevity measurements and six treatment labels.

121 (vitamin E)
118 (vitamin E)
110 (vitamin E)
34 (control)
12 (control)
22 (control)

Under the assumption that the treatments are equivalent, any assignment of treatment labels (3 times "vitamin E", 3 times "control") to the observed outcomes (121, 118, 110, 34, 22, 12) is equally probable.

We can therefore create a table with all possible assignments of the labels and compute, for each case, the difference between the sums of the measurements assigned to the two treatments.

There are $\binom{6}{3} = 20$ distinct assignments, since an assignment is determined by choosing which three of the six measurements receive the label "vitamin E", and each is equally probable under the assumption that the treatments have the same effect.

To assess how strongly the observed result contradicts that assumption, we compare the observed difference with the differences that arise in the other possible cases.

This gives us the following table, ordered by the magnitude of the difference between the sums of the measurements assigned to the two treatments (compare to Resampling Methods, ch. 3.2):

|    | vitamin E   | control     | sum(vitamin E) - sum(control) | probability scale |
|----|-------------|-------------|-------------------------------|-------------------|
| 1  | 121 118 110 | 34 22 12    | 281                           | 5%                |
| 2  | 121 118 34  | 110 22 12   | 129                           | 10%               |
| 3  | 121 110 34  | 118 22 12   | 113                           | 15%               |
| 4  | 118 110 34  | 121 22 12   | 107                           | 20%               |
| 5  | 121 118 22  | 110 34 12   | 105                           | 25%               |
| 6  | 121 110 22  | 118 34 12   | 89                            | 30%               |
| 7  | 121 118 12  | 110 34 22   | 85                            | 35%               |
| 8  | 118 110 22  | 121 34 12   | 83                            | 40%               |
| 9  | 121 110 12  | 118 34 22   | 69                            | 45%               |
| 10 | 118 110 12  | 121 34 22   | 63                            | 50%               |
| 11 | 121 34 22   | 118 110 12  | -63                           | 55%               |
| 12 | 118 34 22   | 121 110 12  | -69                           | 60%               |
| 13 | 121 34 12   | 118 110 22  | -83                           | 65%               |
| 14 | 110 34 22   | 121 118 12  | -85                           | 70%               |
| 15 | 118 34 12   | 121 110 22  | -89                           | 75%               |
| 16 | 110 34 12   | 121 118 22  | -105                          | 80%               |
| 17 | 121 22 12   | 118 110 34  | -107                          | 85%               |
| 18 | 118 22 12   | 121 110 34  | -113                          | 90%               |
| 19 | 110 22 12   | 121 118 34  | -129                          | 95%               |
| 20 | 34 22 12    | 121 118 110 | -281                          | 100%              |

The table lists all outcomes that could have occurred under the assumption that the treatments are equivalent.

In this example we are interested in whether the vitamin E treatment performs better than the control, so we ignore outcomes that would instead suggest that the control performs better. Large positive differences count as stronger evidence against the assumption that the treatments are equivalent.

The table is ordered from the strongest evidence against the assumption, in light of the question whether vitamin E performs better, to the weakest such evidence.

The probability column shows how often a difference at least this favorable to vitamin E would appear if this experiment were repeated again and again and the treatments were actually equal. Under that assumption all 20 assignments are equally probable, so each row corresponds to an additional probability of $\frac{1}{20}$ or $5\%$.

The observed result appears in the first row, and only one out of the 20 possible outcomes provides evidence at least this strong against the assumption of equal treatments, given the question whether vitamin E is better. The probability of such an outcome under the assumption of equal treatments is therefore $5\%$. In other words, it lies among the $5\%$ of outcomes that oppose the assumption most strongly.

If the observed result appeared in the second row, two out of the 20 possible outcomes provide evidence at least this strong, giving a probability of $10\%$. If it appeared in the third row, the probability would be $15\%$, and so on.

In statistical practice it is common to reject the assumption of equal treatments if the probability of the observed result is below $5\%$. This does not mean that the probability of the assumption is below $5\%$. When this rule is followed, false rejections of this assumption due purely to chance will occur only rarely, in at most $5\%$ of such experiments in the long run.
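For readers who want to see the enumeration carried out mechanically, the following Python sketch reproduces the randomization test for Good's six measurements. The variable names and the output format are my own illustrative choices, not part of the original presentation.

```python
# Randomization test for Good's vitamin E example: enumerate every way of
# assigning the "vitamin E" label to three of the six measurements and
# compare the observed difference of sums with all possible differences.
from itertools import combinations

measurements = [121, 118, 110, 34, 22, 12]
total = sum(measurements)

observed_vitamin_e = (121, 118, 110)
observed_diff = sum(observed_vitamin_e) - (total - sum(observed_vitamin_e))  # 281

# All C(6, 3) = 20 assignments; under the assumption of equal treatments,
# each one is equally probable.
diffs = [sum(group) - (total - sum(group)) for group in combinations(measurements, 3)]

# One-sided question: how many assignments favor vitamin E at least as
# strongly as the observed one?
at_least_as_extreme = sum(1 for d in diffs if d >= observed_diff)
p_value = at_least_as_extreme / len(diffs)
print(f"{at_least_as_extreme} of {len(diffs)} assignments, p = {p_value:.2f}")  # 1 of 20, p = 0.05
```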

Fisher's method combines an assumption that introduces stable probabilities with a test of that assumption using those very probabilities. The convincing force of this idea rests on probability theory and logic.

Sampling And Distributions

In sampling-based statistics the mechanism can be pictured using the image of an urn filled with colored balls. Drawing a ball corresponds to choosing one individual from a population.

The urn may contain many kinds of balls. Some represent high values of a measurement, others low ones, or simply the presence or absence of an attribute.

The exact composition (the probabilities) of the urn (the random device) is unknown. What we observe are the balls that happen to be drawn (the data). These observations restrict the probable compositions of the urn, and the more observations, the stronger the restrictions.

Election surveys provide a familiar example. Before an election it is usually impossible to ask every voter how they intend to vote. Instead a smaller group of people is interviewed. Each interview can be viewed as drawing one ball from a very large urn representing the electorate.

If the urn contains 60% red balls and 40% blue balls, repeated draws will tend to reproduce that proportion. (With a small urn one would have to put each ball back and shake before the next draw; in a very large, well-mixed urn this hardly matters.) A single draw tells us almost nothing. Ten draws may still be misleading. But if many draws are taken, the relative frequencies begin to stabilize, and the mathematics behind this allows us to estimate the likely error for a given number of samples.
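To give a sense of the scale involved (the numbers here are my own illustration, not part of the example above): for $n$ independent draws from an urn with true proportion $p$ of red balls, the typical deviation of the observed proportion from $p$ is about $\sqrt{p(1-p)/n}$. With $p = 0.6$ and $n = 1000$ draws this is roughly $\sqrt{0.6 \times 0.4 / 1000} \approx 0.015$, so the observed share of red balls will usually lie within a few percentage points of 60%.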

This stabilization is captured mathematically by the law of large numbers. When many observations are taken from the same population, quantities such as averages or proportions tend to approach their population values.

The convincing force of sampling-based reasoning arises from this stabilization, and from how it makes some patterns of data likely while rendering others increasingly improbable. A similar pattern already appeared in the discussion of measurement error, where repeated measurements cluster around a value even though individual observations vary. Sampling-based statistics provides a systematic way of analyzing such patterns.
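As a quick illustration of this stabilization, here is a small Python simulation of the urn picture. The composition (60% red) and the sample sizes are illustrative choices of mine, not values from the text above.

```python
# Simulate drawing from a large urn with 60% red and 40% blue balls and
# watch the observed proportion of red stabilize as the sample grows.
import random

random.seed(1)          # fixed seed so the run is reproducible
TRUE_RED_SHARE = 0.6

for n in (10, 100, 1000, 10000):
    red_draws = sum(random.random() < TRUE_RED_SHARE for _ in range(n))
    observed_share = red_draws / n
    # Small samples fluctuate strongly; large samples settle near 0.6
    # (law of large numbers).
    print(f"n = {n:>5}: observed share of red = {observed_share:.3f}")
```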

While Fisher’s model asked whether an observed difference could plausibly arise by chance under the assumption of equivalence of treatments, sampling models instead ask what a limited set of observations reveals about the properties of a larger population.

Both approaches rely on the idea of repeated well-defined experiments and long-run frequencies. That is part of their convictional setup. For this reason they are usually grouped together under the name frequentist statistics.

Bayesian Updating

In the sampling model we imagined an urn filled with colored balls that have a certain ratio. The task was to estimate this ratio using properties of repeated experiments. But the true composition of the urn is never directly observed.

Bayesian reasoning reframes the situation. Instead of assuming a single unknown ratio, it considers many possible compositions of the urn and assigns probabilities to them.

Before seeing any data, some compositions may appear more plausible than others. A ratio of $60\%-40\%$ might receive a relatively high probability, while other possibilities, such as a more lopsided $70\%-30\%$ split or a perfectly balanced $50\%-50\%$, receive less weight.

Each of these possibilities represents a theory about the composition of the urn.

At this point two kinds of probabilities appear. First, there are the probabilities of observing certain data given a particular composition of the urn. These follow directly from the mechanism of drawing balls from the urn. Second, there are probabilities assigned to the different possible compositions of the urn themselves. These represent our current beliefs about which urn is more plausible. They are subjective and can vary between observers.

Each possible composition is a theory about the real mechanism producing the data.

When we draw a ball and observe its color, some of these theories become more plausible and others less so. If we repeatedly observe red balls, theories with many red balls gain weight, while theories dominated by blue balls lose credibility.

The rule governing this adjustment is Bayes' theorem, which we encountered earlier among the probability rules (as the definition of the probability of A given B). In Bayesian reasoning it becomes a rule for updating theories in light of new observations. It connects the probability of the data under each theory with the probability we assign to the theories themselves:

$P(theory|data) = \frac{P(data|theory)\times P(theory)}{P(data)}$.

Let's look at an example to understand how to apply this rule.

A magician has two coins that look identical. One is a normal coin that produces heads and tails with probability 0.5 each. The other is biased and produces heads with probability 0.9 and tails with probability 0.1.

One day he finds a coin on the floor of his study and is unsure which one it is. To find out, he flips it five times and observes the sequence:

head, head, tail, head, tail.

In Bayesian terms he considers two competing theories about the mechanism producing the data:

T1: $P(head|T1) = 0.5, P(tail|T1) = 0.5$.

T2: $P(head|T2) = 0.9, P(tail|T2) = 0.1$.

Since he initially has no reason to prefer one theory over the other, he assigns them equal probabilities:

$P(T1) = 0.5$, and $P(T2) = 0.5$.

But what is the probability of the data? The overall probability of head is $P(head|T1) \times P(T1) + P(head|T2) \times P(T2)$, because these are the only two ways in which a head can occur under our assumptions, weighted by the plausibility of the corresponding hypotheses. With the initial values this gives $0.5 \times 0.5 + 0.9 \times 0.5 = 0.7$.

When the first head appears, Bayes' rule updates the plausibility of the two theories (we simply plug in all the values):

$P(T1|head) = \frac{0.5 \times 0.5}{0.5 \times 0.5 + 0.9 \times 0.5} = \frac{0.25}{0.25 + 0.45} = \frac{0.25}{0.7} \approx 0.3571$

$P(T2|head) = \frac{0.9 \times 0.5}{0.5 \times 0.5 + 0.9 \times 0.5} = \frac{0.45}{0.25 + 0.45} = \frac{0.45}{0.7} \approx 0.6429$

The magician's initial assumptions were $P(T1) = 0.5$ and $P(T2) = 0.5$. After seeing the first head, he should now update his assumptions and believe that $P(T1) \approx 0.3571$ and $P(T2) \approx 0.6429$.

The same update rule is applied for each additional observation. Carrying out this process for the entire sequence gives the following result:

| P(T1) (before seeing data) | P(T2) (before seeing data) | data | P(T1\|data) | P(T2\|data) |
|----------------------------|----------------------------|------|-------------|-------------|
| 0.5                        | 0.5                        | head | 0.3571      | 0.6429      |
| 0.3571                     | 0.6429                     | head | 0.2358      | 0.7642      |
| 0.2358                     | 0.7642                     | tail | 0.6068      | 0.3932      |
| 0.6068                     | 0.3932                     | head | 0.4616      | 0.5384      |
| 0.4616                     | 0.5384                     | tail | 0.8108      | 0.1892      |

Result: $P(T1) \approx 0.81$, $P(T2) \approx 0.19$.
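The table can be reproduced with a few lines of code. The following Python sketch applies the same update rule observation by observation; the dictionary-based bookkeeping is merely an illustrative choice.

```python
# Sequential Bayesian updating for the magician's coin.
# T1: fair coin (P(head) = 0.5), T2: biased coin (P(head) = 0.9).
p_head = {"T1": 0.5, "T2": 0.9}
belief = {"T1": 0.5, "T2": 0.5}   # equal prior probabilities

flips = ["head", "head", "tail", "head", "tail"]

for flip in flips:
    # Probability of this observation under each theory.
    likelihood = {t: p_head[t] if flip == "head" else 1 - p_head[t] for t in belief}
    # P(data) = sum over theories of P(data|theory) * P(theory).
    p_data = sum(likelihood[t] * belief[t] for t in belief)
    # Bayes' rule: P(theory|data) = P(data|theory) * P(theory) / P(data).
    belief = {t: likelihood[t] * belief[t] / p_data for t in belief}
    print(flip, {t: round(p, 4) for t, p in belief.items()})

# Final beliefs: P(T1) ~ 0.8108, P(T2) ~ 0.1892, matching the table above.
```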

Heads are more likely under the biased coin, so each head shifts probability toward that theory. Tails are unlikely under the biased coin, so each tail shifts probability back toward the normal one.

After the five observed flips, Bayesian updating assigns a probability of roughly 0.8 to the hypothesis that the coin is the normal one.

In Bayesian reasoning the random device is internalized and multiplied. Two kinds of probabilities are assigned: probabilities that describe how each random device generates data, and probabilities that represent our degree of belief in each model. The data is used to update these degrees of belief.

Convictional force arises from the coherence of the updating rule and from its agreement with our intuition that evidence should favor the theories that best predict it. But as with all theories at this level of abstraction, the convictional force comes from arguments, procedures, successes, anecdotes, explanations, and endorsements, and is diminished by counterexamples and counter-endorsements.

Both Bayesian and frequentist approaches use probability theory to reason from data toward the mechanisms that may have produced them. They differ mainly in where probability is placed. Frequentist methods attach probabilities to outcomes of well-defined experiments. Bayesian reasoning attaches probabilities as degrees of belief to the possible mechanisms themselves.

In practice they are frequently used side by side, each suited to different kinds of problems.

The Exemplarity Of Statistics For Modern Formal Systems

Conviction formation theory holds that conviction comes naturally to us: "There is a house", "I am hungry", "Stones, when thrown into the air, fall back to the ground". Like breathing, conviction formation operates autonomously in the background, but can also be taken up deliberately. Formal systems such as logic, probability, and statistics refine and constrain the latter process from within, shaping explicit conviction formation rules. They can also influence the more automatic processes over time.

Of all the formal systems we have examined, statistical reasoning occupies the most indirect position with respect to convincing force. Its convincing force does not arise from a single immediately compelling operation. Instead it develops mainly through the construction of methods for reasoning backwards from data to probabilities and through the success of those mechanisms across many applications.

This indirectness is increased by another feature. In counting, measuring, geometry, and somewhat less in logic and probability, errors are often comparatively easy to spot, and different people can usually repeat the procedure for themselves without much difficulty. In statistics this is much less often the case. A result may fail because an assumption was violated, because the experiment was poorly designed, because the wrong model was chosen, or because an error entered later in calculation or interpretation. Such failures are often not locally visible, the mathematical machinery may be too involved for easy checking, and the underlying experiments may be costly, slow, or impossible to repeat.

Many modern formal systems across science and engineering work in a similar way and share these traits. Their credibility derives from mathematical models whose reliability becomes visible through repeated success in empirical use.

Conviction formation theory highlights that formal systems do not stand on their own. They grow out of patterns of conviction that are already present in everyday reasoning. Even mathematics is not independent of this, though it is often presented as if it were. It is built on patterns of conviction that were established long before formal systems made them explicit. For this reason, formal derivation alone rarely suffices to establish conviction in practice, especially as increasing complexity makes stable conviction harder to achieve directly.

But practice often runs ahead of reflection. Works such as Judea Pearl's The Book of Why (which shows how statistical reasoning can be extended into a formal theory of causality), E. T. Jaynes' Probability Theory: The Logic of Science (which argues that probability provides a general logic of scientific reasoning), or George Pólya's How to Solve It rarely rely on formal derivation alone. Instead they build conviction gradually through carefully chosen examples that connect unfamiliar ideas to simpler forms of reasoning the reader already trusts.