Reading note: Everything is Predictable
Everything is Predictable | Tom Chivers (2024)
I thought I already knew Bayes. I was wrong.
Tom Chivers’ Everything is Predictable goes beyond popular science. It’s really a crash course in the history and development of Bayes’ theorem.
Dry you say? Quite the opposite. I devoured this book over the past summer in Australia then gave away my copy to a friend. When I moved to London I was referencing it so often I had to go out and buy another.
Who is Bayes again?
Bayes’ theorem is named after 18th-century British statistician and theologian, Thomas Bayes. His remarkable (yet simple) theorem describes how to update the probability of a hypothesis based on new evidence.
Why is this important? Well, in traditional frequentist statistics (that you get taught in economics, psychology, medicine, and engineering), probability is interpreted as the long-run frequency of events in repeated trials (i.e. sampling).
A p-value from a sampling distribution tells us:
What is the chance of seeing this result, given some hypothesis?
Bayesian statistics, on the other hand, allows us to incorporate prior knowledge or beliefs into our analysis and update those beliefs as we gather more data. The central question of Bayes’ Theorem therefore is:
How likely is the hypothesis to be true, given the data I’ve seen?
This isn’t wordplay or semantics - it’s a completely different way of seeing the world.
We can write Bayes’ theorem using probability notation:
P(A|B) = P(B|A) × P(A) / P(B)
Where:
P(A|B) is the posterior probability (updated belief about A after observing B)
P(B|A) is the likelihood (probability of observing B, given that A is true)
P(A) is the prior probability (initial belief about A, before seeing B)
P(B) is the evidence (total probability of observing B, across all possible values of A)
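To make those pieces concrete, here’s a minimal sketch (mine, not Chivers’) of the theorem in Python for a binary hypothesis, with the evidence P(B) expanded via the law of total probability:

```python
def bayes_posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) for a binary hypothesis A."""
    prior_not_a = 1.0 - prior_a
    # P(B): total probability of the evidence, summed over A and not-A
    evidence = p_b_given_a * prior_a + p_b_given_not_a * prior_not_a
    return p_b_given_a * prior_a / evidence

# Sanity check: evidence that's equally likely either way shouldn't move the prior
assert abs(bayes_posterior(0.3, 0.5, 0.5) - 0.3) < 1e-12
```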
Flat priors aren’t to be taken for granted
Chivers does a marvellous job in stepping through the history of Bayes’ theorem, and where it departs from the frequentist canon.
We see in the numerator above that Bayes’ theorem relies on P(A), our initial belief about A before seeing B. This prior is implicit in the frequentist approach too. It’s just that frequentists - without thinking about it - put equal probability on each possible belief (we’ll start calling them priors from here).
This creates a subtle problem, known as Boole’s objection: sometimes, weighting every possible ‘A before seeing B’ equally involves making big assumptions.
Chivers explains:
Imagine an urn filled with balls, either black or white. If you have a flat prior on the total number of black balls in the urn, then any given mix of black and white balls is equally likely. (If there are only two balls in the urn, you now have three possibilities - two black, one black and zero black - and they’re all equally likely).
But if you assume that each ball is equally likely to be black or white - a flat prior on the probability of drawing a white or black each time - then your prior probability favours (very strongly if there are lots of balls) a roughly fifty-fifty mix in the urn as a whole.
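To see the clash in miniature (my own sketch, using the two-ball urn from Clayton’s example that Chivers returns to later), just enumerate the two priors:

```python
from math import comb

n_balls = 2  # the tiny urn from the example above

# Prior 1: flat on the *count* of black balls -- 0, 1 or 2 black, all equally likely
flat_on_count = {k: 1 / (n_balls + 1) for k in range(n_balls + 1)}

# Prior 2: each ball is independently black or white with probability 1/2,
# which puts a binomial prior on the count
flat_on_each_ball = {k: comb(n_balls, k) * 0.5 ** n_balls for k in range(n_balls + 1)}

print(flat_on_count)      # {0: 0.33, 1: 0.33, 2: 0.33}
print(flat_on_each_ball)  # {0: 0.25, 1: 0.5,  2: 0.25}
```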
What’s in a prior anyway?
The importance of priors becomes obvious when thinking about disease diagnosis.
Say you go to a doctor for a routine cancer screening. The disease (say, a rare cancer) occurs in 1 out of every 1,000 people.
The test is very accurate (only gives a false positive 2% of the time), but not perfect:
If you have cancer, the test is positive 100% of the time.
If you don’t have cancer, the test still gives a false positive 2% of the time.
You take the test, and your result is positive. What is the probability you actually have cancer? Most of us bush mathematicians default to ‘well of course it’s 98%’.
The truth is far more interesting when we plug the numbers into Bayes’ formula:
P(cancer | positive) = P(positive | cancer) × P(cancer) / P(positive) = (1.0 × 0.001) / (1.0 × 0.001 + 0.02 × 0.999) ≈ 0.048
Even though the test only gives a false positive 2% of the time, the rarity of the disease means that most positives are false alarms.
In a group of 1,000 people:
1 person has cancer → and they test positive.
999 people don’t have cancer → and the test produces 20 false positives (i.e. 2% of 999).
In total the test shows 21 positives, but only 1 out of 21 people who receive positive test results (4.8%) actually has cancer.
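Here’s the same arithmetic as a quick sketch in Python, using the numbers above:

```python
prevalence = 1 / 1_000        # P(cancer): 1 in every 1,000 people
sensitivity = 1.0             # P(positive | cancer): the test never misses
false_positive_rate = 0.02    # P(positive | no cancer)

# P(positive): total probability of a positive test across both groups
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(cancer | positive)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(round(p_cancer_given_positive, 3))  # 0.048 -- roughly 1 in 21
```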
From first principles
There’s one thought experiment that Bayes covered in his seminal work, An Essay Towards Solving a Problem in the Doctrine of Chances, that really cracks this whole priors thing open. Chivers explains:
To make his point, Bayes used a metaphor of a table, upon which balls are rolled…The table is hidden from your view, and a white ball is rolled on it in such a way that its final position is entirely random: ‘There shall be the same probability that it rests upon any one equal part of the plane as another.’
When the white ball comes to a rest, it is removed, and a line is drawn across the table where it had been. You are not told where it is. Then a number of red balls are also rolled onto the table. All you are told is how many of the balls lie to the left of the line, and how many to the right. You have to estimate where the line is.
Imagine that five balls are thrown, and you’re told that two of them landed to the left of the line, and three of them landed to the right.
Where do you think the line ought to be? Bayes said that the most likely place is three-sevenths of the way up the table from the left. Intuitively you might think that it should be two-fifths. After all, you’ve just rolled five balls, and two of them ended up on one side, and three on the other.
But Bayes said that you must take into account the prior probability – your best guess of what the situation was, before you got any information. But do you have a best guess? You don’t know anything, do you? The line could be anywhere. But that in itself is a form of prior information: it is equally likely (from your subjective point of view) that the line is right up against the left-hand cushion, or right up against the right-hand cushion, or anywhere in between.
You could draw a graph of the distribution of probability – how likely the line is to be in a given place on the table, before you had rolled another ball. If you have absolutely zero idea where the line is, then the probability of the next ball landing to the left of it is 0.5 – 50 per cent. After all, the line could be far to the right, so the ball would definitely land left; it could be far to the left, so the ball would definitely land right; it could be in the middle, so it would be fifty-fifty; or it could be anywhere else, with corresponding probabilities. The average position is exactly in the middle.
Essentially, Bayes’ big insight was that you must add any new information you get to the information you already have. In this case, you don’t have very much information. But it is something. What that means is that instead of just saying, ‘The most likely position of the line is two-fifths of the way along the table,’ you have to take account of your prior.
So Bayes said that the equation for working out the probability here is not ‘the number of red balls on the left divided by the total number of red balls’ – 2/5 – but the number of red balls on the left of the line PLUS ONE, divided by the total number of red balls PLUS TWO. It is, says Spiegelhalter, ‘equivalent to having already thrown two “imaginary” red balls, and one having landed each side of the dashed line’. That might seem odd, but it makes sense when you think of what it would look like if all the balls had landed on one side or the other. If all five had landed left, and we didn’t include those extra imaginary balls, then we’d say the probability of the next ball landing left was 5/5, or 1, or complete certainty.
But that’s silly – obviously you don’t know for sure that the next ball isn’t going to be to the right. With Bayes’ extra balls, your estimate would be 6/7. And no matter how many balls land on one side, you never end up with absolute certainty – if a million balls land to the left, then your estimate of the next ball’s probability of landing right would be 1/1,000,002. Each piece of new information pushes you closer to certainty, but you never quite get there.
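The ‘plus one, plus two’ rule is easy to play with in code. Here’s a small sketch (mine, not the book’s) of the estimate and how it behaves at the extremes:

```python
def rule_of_succession(left: int, total: int) -> float:
    """Estimate P(next ball lands left of the line), starting from a uniform prior.

    This is Bayes' (and Laplace's) rule: (left + 1) / (total + 2) -- Spiegelhalter's
    two 'imaginary' balls, one already on each side of the line.
    """
    return (left + 1) / (total + 2)

print(rule_of_succession(2, 5))                  # 3/7 ≈ 0.43, not the naive 2/5
print(rule_of_succession(5, 5))                  # 6/7 -- still not certainty
print(rule_of_succession(1_000_000, 1_000_000))  # leaves a 1-in-1,000,002 chance of 'right'
```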
What if I don’t know the prior? Yes, that’s an issue
Okay. The balls rolled on a table thought experiment makes enough sense. Priors are essential.
But, if we know very little about the world, our priors are going to be fuzzy (or at the very least, subjective). This matters.
Chivers drives this home by expanding on the ‘balls in an urn’ example:
In Bayes’ imaginary not-in-fact-a-billiard table, he assumed that it was equally likely that the white ball could be anywhere on the table. That’s called a uniform prior. That’s defensible – you can imagine that if you throw the ball hard enough it’ll be essentially random where it lands. But what about situations where you’re completely ignorant and don’t have good reasons to assume any prior?
A more technical objection was that of the mathematician and logician George Boole, who pointed out that there are different kinds of ignorance. A simplified example taken from Clayton: say that you have an urn with two balls in it. You know the balls are either black or white. Do you assume that two black balls, one black ball and zero black balls are all equally likely outcomes? Or do you assume that each ball is equally likely to be black or white?
This really matters. In the first example, your prior probabilities are ⅓ for each outcome. In the second, you have a binomial distribution: there’s only one way to get two black balls or zero black balls, but two ways to get one of each. So your prior probabilities are ¼ for two blacks, ½ for one of each, ¼ for two whites.
Your two different kinds of ignorance are completely at odds with each other. If you imagine your urn contains not four but 10,000 balls, under the first kind of ignorance, your urn is equally likely to contain one black and 9,999 whites as it is 5,000 of each. But under the second kind of ignorance, that would be like saying you’re just as likely to see 9,999 heads out of 10,000 coin-flips as you are 5,000, which of course is not the case. Under that second kind of ignorance, you know you’re far more likely to see a roughly 50–50 split than a 90–10 or 100–0 split in a large urn with hundreds or thousands of balls, even though you’re supposed to be ignorant.
So which prior do we assume? Do we think the colour of the balls is independent or correlated? You may say that you assume perfect ignorance, but there are different kinds of ‘ignorance’, and you have to pick one.
But the underlying problem of Bayesian priors is a philosophical one: they’re subjective. As we said earlier, they’re a statement not about the world, but about our own knowledge and ignorance. And that’s… uncomfortable.
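You can see how violently the two ‘ignorant’ priors disagree once the urn gets big. A sketch of the 10,000-ball comparison (working in log space so the binomial prior doesn’t underflow to zero):

```python
from math import lgamma, log, exp

n = 10_000  # balls in the urn

# Ignorance #1: flat prior on the number of black balls -- every count 0..n equally likely
flat_on_count = 1 / (n + 1)   # identical for 9,999 black and for 5,000 black

# Ignorance #2: each ball independently black/white with prob 1/2 -- Binomial(n, 0.5) on the count
def log_binomial_prior(k: int) -> float:
    log_n_choose_k = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_n_choose_k + n * log(0.5)

print(flat_on_count)                   # ~0.0001, the same for every possible mix
print(exp(log_binomial_prior(5_000)))  # ~0.008: a near 50-50 split is comparatively likely
print(log_binomial_prior(9_999))       # ~-6922 (natural log): effectively impossible
```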
Think priors are subjective? P-values aren’t much better
A lot of the criticism of Bayesian methods over the years has been the subjective nature of assigning priors. The frequentists have elegance on their side, even if there is an argument that they are measuring entirely the wrong thing (i.e. what is the chance of seeing this result, given some hypothesis?).
Chivers bluntly argues that p-values are wholly arbitrary and subjective too:
If a study finds a p-value of 0.05 or less, that means that you’d only see those results (or more extreme ones) by chance one time in twenty at most, doesn’t it? So surely, if every study is using that yardstick, you wouldn’t expect to see many false positives?
That is the idea, sure. But it’s not as straightforward as that. The easiest way to get a p<0.05 result - that is, something that you’d only see by coincidence one time in twenty - is to do twenty experiments, and then publish the one that comes up.
Whose methods are subjective now?
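That trick is easy to simulate. A quick sketch (mine): run batches of twenty ‘experiments’ where the null is true, and see how often at least one clears p < 0.05.

```python
import random

random.seed(0)

# Under a true null hypothesis, a well-calibrated p-value is uniform on [0, 1],
# so each "experiment" can be simulated by drawing a p-value directly.
n_batches = 10_000
experiments_per_batch = 20
alpha = 0.05

lucky_batches = sum(
    any(random.random() < alpha for _ in range(experiments_per_batch))
    for _ in range(n_batches)
)

print(lucky_batches / n_batches)  # ≈ 0.64, i.e. about 1 - 0.95 ** 20
```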
What now?
This is one of those books that left me with more questions than answers. Where do I find good base rate data? Can I still apply Bayesian methods if there’s asymmetric information on priors? Aren’t most LLMs (that rely on predicting the next token) just using a Bayesian approach?
While Chivers’ book didn’t cover everything you could want to know about Bayes, priors, and the limitations of the frequentist approach, there are a few glaringly obvious takeaways:
Priors (e.g. base rates) are crucial; if you don’t choose them actively, you’re implicitly assuming a flat (i.e. 50/50) prior anyway
P-values are bullshit and arbitrary
It’s difficult to make Bayesian methods the default wiring of my brain
Chivers’ book brain wormed me good, and I’m better for it.


