Coin flips and the Hot Hand

The Hot Hand is the belief that an actor, such as a basketball player, is more likely to succeed if he has just experienced a success than if he has just experienced a failure. For example, a basketball player is believed to be ‘hot’ when he has just made one (or more) shots in a row – and, as a result, is more likely to make his next shot than if he were not in this ‘hot’ state. In statistical language, the hot hand is the claim that the conditional probability of a success given a prior success (p(Success | Success)) is higher than the conditional probability of a success given a prior failure (p(Success | Failure)).
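To make this definition concrete, here’s a small R sketch (my own illustration – the function name cond.probs and the example sequence are made up, not data from any study) that estimates both conditional probabilities from a sequence of made (1) and missed (0) shots:

[code language=”r”]
# Estimate p(Success | Success) and p(Success | Failure) from a 0/1 sequence
cond.probs <- function(shots) {

  prev <- shots[-length(shots)]   # outcome of the previous shot
  curr <- shots[-1]               # outcome of the current shot

  c(p.success.after.success = mean(curr[prev == 1]),
    p.success.after.failure = mean(curr[prev == 0]))
}

# A made-up example sequence; the hot hand claim is that the first
# value is reliably higher than the second
cond.probs(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1))
[/code]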

While belief in the hot hand is pervasive, several decades of research, most notably a study by Gilovich, Vallone, and Tversky, have suggested that it does not in fact exist. Rather, it is the result of a cognitive fallacy: people expect random sequences to contain fewer strings of successes than actually occur by chance.

However, in the past few weeks, an unpublished manuscript by Joshua Miller and Adam Sanjurjo has apparently dropped an atom bomb on this debate. It was even picked up by the Wall Street Journal! The fact that the WSJ is writing about the results of an unpublished manuscript shows how big a deal this is. So what are Miller and Sanjurjo claiming that has set off this massive reaction? They claim that it is in fact scientists, rather than basketball fans, who have been making a statistical mistake. Specifically, Miller and Sanjurjo claim that, for a fair coin, the probability of a head given a previous head may not be 0.50!

Miller and Sanjurjo take the following example:

Take a coin and flip it 3 times, and calculate the percentage of flips where a head was followed by another head. Call this the “Head Repetition Probability.” Repeat this procedure 1 million times. What will the average Head Repetition Probability be?

If your answer is 50%, then you’re similar to most people. And like most people, you’re wrong. In fact, the expected average Head Repetition Probability is 41.7% – substantially lower than the intuitive 50% that most people, and prior researchers, had assumed. Why is this important? Well, if researchers find that real basketball players have a Basket Repetition Probability of 50% (and they have), then those players have a higher repetition probability than is expected by chance (i.e., 41.7%)! Thus, the often-detected 50% repetition probability in real players may in fact be evidence for the Hot Hand!

But where does this 41.7% figure come from? Does it really reflect the expected probability of getting a head given that the previous flip was a head? If so, then coins would be inherently anti-streaky, which would pretty much contradict everything we know from Probability 101. Thankfully, that’s not what the 41.7% number reflects. Rather, the number comes from an improper statistic that equates sequences that are actually quite different.

Where the 41.7% figure comes from (and how to fix it)

To see where the 41.7% figure comes from, I will follow Andrew Gelman’s example and reproduce the result in R using flips of a virtual fair coin. Using the coin analogy, we want to know: “What is the probability of flipping a head, given that a head occurred on the previous flip?”

However, in contrast to Gelman, I will not randomly simulate coin flips. Instead, I’ll use R to calculate all solutions analytically.
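(For reference, here’s roughly what the brute-force simulation looks like. This is my own sketch in the spirit of Gelman’s simulation, not his actual code, and the variable names are mine. Sequences with no heads in the first two flips produce an undefined proportion and are dropped via na.rm = TRUE.)

[code language=”r”]
set.seed(123)    # for reproducibility
n.reps <- 1e6    # number of simulated sequences
n.flips <- 3     # flips per sequence

flips <- matrix(sample(c(0, 1), n.reps * n.flips, replace = TRUE),
                nrow = n.reps, ncol = n.flips)

heads1 <- flips[, 1:(n.flips - 1)] == 1   # a head on flip t (t < n.flips)
heads2 <- flips[, 2:n.flips] == 1         # a head on flip t + 1

p.hh <- rowSums(heads1 & heads2) / rowSums(heads1)
mean(p.hh, na.rm = TRUE)   # comes out near 0.417, not 0.50
[/code]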

First, we’ll create a dataframe called possible containing all possible sequences of 3 flips of a fair coin (you can easily change the number of flips in the code by changing the n.flips value). We’ll use 1 to denote a Head and 0 to denote a Tail:

[code language=”r”]
n.flips <- 3   # number of flips in each sequence

# All 2 ^ n.flips possible sequences of Tails (0) and Heads (1)
possible <- expand.grid(lapply(1:n.flips, function(x) {c(0, 1)}))
names(possible) <- paste("f", 1:n.flips, sep = "")   # name the flips f1, f2, f3

possible

## f1 f2 f3
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 5 0 0 1
## 6 1 0 1
## 7 0 1 1
## 8 1 1 1
[/code]

As you can see, there are 8 possible sequences (2 to the power of 3). However, we need to remove the two cases where there are no heads in the first 2 flips. Why? Because if there are no heads in the first two flips, then there is no opportunity for a Head–Head repetition anywhere in the sequence, and the Head Repetition Probability is undefined.

[code language=”r”]
# Keep only the sequences with at least one head in the first two flips
possible <- possible[rowMeans(possible[, 1:(n.flips - 1)]) != 0, ]
possible

## f1 f2 f3
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 6 1 0 1
## 7 0 1 1
## 8 1 1 1
[/code]

Now we have 6 possible sequences that are equally likely to occur. For each of these sequences, how often was a head followed by another head (an HH repetition)? To answer this, let’s calculate three statistics for each sequence:

hh.streak.opportunities: How many opportunities are there for an HH streak? In other words, how many heads are there in the first 2 flips?

hh.streaks: How many actual HH streaks are there?

p.hh: Of those streak opportunities, what proportion of the time was there an actual streak?

[code language=”r”]
# For each possible sequence, count the HH streak opportunities, the actual
# HH streaks, and the proportion of opportunities that became streaks
for(i in 1:nrow(possible)) {

  seq.values <- as.numeric(possible[i, 1:n.flips])

  heads1 <- seq.values[1:(n.flips - 1)] == 1   # a head on flip t (t < n.flips)
  heads2 <- seq.values[2:(n.flips)] == 1       # a head on flip t + 1

  streak.opportunities.i <- sum(heads1)        # heads in the first 2 flips
  actual.streaks.i <- sum(heads1 & heads2)     # heads followed by another head

  p.hh <- actual.streaks.i / streak.opportunities.i

  possible$hh.streak.opportunities[i] <- streak.opportunities.i
  possible$hh.streaks[i] <- actual.streaks.i
  possible$p.hh[i] <- p.hh

}

possible

## f1 f2 f3 hh.streak.opportunities hh.streaks p.hh
## 2 1 0 0 1 0 0.0
## 3 0 1 0 1 0 0.0
## 4 1 1 0 2 1 0.5
## 6 1 0 1 1 0 0.0
## 7 0 1 1 1 1 1.0
## 8 1 1 1 2 2 1.0
[/code]

We’re almost to the 41.7% number! To calculate the expected Head Repetition Probability, we can take the average p.hh value across all 6 sequences:

[code language=”r”]
mean(possible$p.hh)
## [1] 0.4166667
[/code]

Wow! The value is indeed 41.67%! But does this number really represent the conditional probability of getting a head given a head? No, it doesn’t.

The Problem

Why is the 41.7% number misleading? Well, in our previous calculation, we treated all flip sequences as equally informative. That is, because each sequence is equally likely, we simply took the unweighted average of the proportion of heads following heads (p.hh) across all 6 sequences. But equally likely does not mean equally informative: some of these sequences contain more streak opportunities – and therefore more information about p(H | H) – than others.

To see this, we’ll take an extreme example. Consider flipping a coin 10 times. Here are two possible sequences. Assuming the coin is fair, both of these sequences are equally likely. For each sequence, I’ve calculated the number of HH streak opportunities, the number of actual HH streaks, and the percentage of HH streaks.

Sequence                       Streak opportunities   Number of streaks   Streak percentage
1, 1, 1, 1, 1, 1, 1, 1, 1, 1   9                      9                   1.0
0, 0, 0, 0, 0, 0, 0, 0, 1, 1   1                      1                   1.0

In the first sequence, there are 9 opportunities for a streak (in the 2nd through the 10th flip). Additionally, there were 9 actual streaks, so the proportion of streaks was 1.0.

Now, consider the second sequence. Here there is only one opportunity for an HH streak (on the final, 10th flip). And indeed, there was an H on that 10th and final flip, so the proportion of HH streaks out of all possible HH streaks was 100% – just like in the first sequence! But are these sequences really equally informative? I think not. In the first sequence, we used all 10 data points, while in the second sequence, we ignored the first 8 flips and only used the last two.
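Here’s a quick R check of those two 10-flip sequences (a sketch of my own; the helper name hh.stats is made up):

[code language=”r”]
# Streak opportunities, actual HH streaks, and their ratio for one sequence
hh.stats <- function(seq) {

  heads1 <- seq[-length(seq)] == 1   # a head on flip t (t < 10)
  heads2 <- seq[-1] == 1             # a head on flip t + 1

  c(streak.opportunities = sum(heads1),
    actual.streaks = sum(heads1 & heads2),
    p.hh = sum(heads1 & heads2) / sum(heads1))
}

hh.stats(rep(1, 10))           # 9 opportunities, 9 streaks, p.hh = 1
hh.stats(c(rep(0, 8), 1, 1))   # 1 opportunity,  1 streak,  p.hh = 1
[/code]

Both sequences get a perfect p.hh of 1.0, even though the first is based on nine times as many observations as the second.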

Think about a real-world example: if you saw a player hit 10 shots in a row, wouldn’t you think she provided stronger evidence for a hot hand than a player who missed 8 shots and then made her last 2 in a row?

Thankfully, there is an easy fix to the problem.

Fixing the Problem (and finally getting p(H|H) = 50%)

To fix the problem, we need to weight sequences according to how many opportunities they contain for an HH repetition. To do this, we can simply calculate the total number of HH streaks across all sequences and divide it by the total number of HH streak opportunities across all sequences. In other words, we pool all the flips together rather than treating each sequence as a separate unit.

Let’s go back to our coin flip dataframe and make this calculation. If I’m right, we should get a p(H|H) probability of 50%:

f1  f2  f3  hh.streak.opportunities  hh.streaks  p.hh
1   0   0   1                        0           0.0
0   1   0   1                        0           0.0
1   1   0   2                        1           0.5
1   0   1   1                        0           0.0
0   1   1   1                        1           1.0
1   1   1   2                        2           1.0

To do this, we’ll calculate the total number of actual streaks (the sum of the hh.streaks column) and divide it by the total number of streak opportunities (the sum of the hh.streak.opportunities column). We should get 4 / 8 = 50%:

[code language=”r”]
sum(possible$hh.streaks) / sum(possible$hh.streak.opportunities)
## [1] 0.5
[/code]

Phew, we got it! Again, the reason our 50% number is larger than the previous 41.7% number is that we’ve given more weight to sequences with more opportunities for HH events than to those with fewer opportunities.
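An equivalent way to see this, using the possible dataframe we already built: the pooled ratio is just a weighted average of the p.hh values, with each sequence weighted by its number of streak opportunities.

[code language=”r”]
# Weighted average of p.hh, weighting each sequence by its streak opportunities
weighted.mean(possible$p.hh, w = possible$hh.streak.opportunities)
## [1] 0.5
[/code]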

Conclusion

While the 41.7% number reported by Joshua Miller and Adam Sanjurjo (and subsequently confirmed by Andrew Gelman) sounds like a scary violation of the laws of statistics, it is in fact an artefact of an improper statistical procedure. As I’ve shown, taking the simple average proportion of HH events across many sequences is improper because it gives equal weight to sequences with many opportunities for HH events and to sequences with only a few.

By analogy, consider two basketball players who each shoot 100 shots. Player A misses his first 97 shots and makes his last 3 shots. In contrast, player B makes all 100 shots. Which player was hotter? According to the improper statistic, both are equally hot, because both have a p(H|H) value of 100%! Following this logic, both provide equivalent (positive) evidence for the hot hand. Of course, this is poor reasoning: we should give player B’s performance more weight. Moving back to the more general case of coin flips, we can avoid this bias by simply dividing the total number of HH events by the total number of possible HH events, which yields the correct (and intuitive) 50% figure.

Addendum: Using Bayes to help correct the error

After finishing the previous analysis, I wondered whether we could use Bayesian statistics to correct the bias inherent in the improper statistic. Specifically, if we allow Bayes’ theorem to weight the evidence according to the number of opportunities for success (which, as I’ve shown, is critical), can we get away with using the improper statistic after all?

To answer this question, I will repeat the improper-statistic procedure, but instead of averaging the raw sample p.HH proportions, I’ll first calculate, for each player (i.e., each simulated coin), the Bayesian posterior median of the p.HH value under a uniform Beta(1, 1) prior, and then average these posterior medians across all players.
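For intuition about what that prior does, take an extreme case (a sketch of my own, not from the manuscript): a player with 2 streaks out of 2 opportunities has a sample p.HH of 1.0, but under a uniform Beta(1, 1) prior the posterior is Beta(1 + 2, 1 + 0), whose median is pulled well back toward the middle.

[code language=”r”]
# Posterior median for 2 HH streaks out of 2 opportunities,
# starting from a uniform Beta(1, 1) prior
qbeta(.5, 1 + 2, 1 + 0)
## [1] 0.7937005
[/code]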

I’ll make this example a bit more interesting by simulating the results of baskets in a shooting tournament. I’ll generate the outcomes of 10 sequential shots from each of 1,000 basketball players. Importantly, each player will have a true 50% chance of success on each shot.

[code language=”r”]

n.players <- 1000   # number of simulated players
n.shots <- 10       # shots per player

# Each row is one player's sequence of 10 shots (1 = hit, 0 = miss),
# with a true 50% chance of success on every shot
shots <- as.data.frame(matrix(sample(c(0, 1), size = n.players * n.shots, replace = TRUE),
                              nrow = n.players, ncol = n.shots))

[/code]

Again, let’s delete the cases where the first 9 shots are all misses (for these players, the repetition probability is undefined):

[code language=”r”]
shots <- shots[rowMeans(shots[,1:9]) != 0,]
[/code]

Now, for each player, let’s calculate 4 statistics:

  • possible.HH: The number of hits in shots 1 through 9; this is the number of possible HH streaks in the sequence
  • actual.HH: The number of actual HH streaks in the sequence
  • actual.HH.p: The proportion of possible HH streaks that were actual streaks
  • bayes.HH.p: The posterior median of the HH probability, given a uniform Beta(1, 1) prior

[code language=”r”]
for(i in 1:nrow(shots)) {

  player.seq <- shots[i, 1:n.shots]   # this player's sequence of 10 shots

  # Streak opportunities (hits in shots 1-9) and actual HH streaks
  possible.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1)
  actual.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1 & player.seq[2:(n.shots)] == 1)

  n <- possible.HH.sequences
  p <- actual.HH.sequences / possible.HH.sequences

  # Update a uniform Beta(1, 1) prior with n opportunities, of which p * n were streaks
  post.a <- 1 + p * n
  post.b <- 1 + (1 - p) * n

  post.median <- qbeta(.5, post.a, post.b)   # posterior median
  lb <- qbeta(.025, post.a, post.b)          # lower bound of the 95% interval
  ub <- qbeta(.975, post.a, post.b)          # upper bound of the 95% interval

  shots$possible.HH[i] <- possible.HH.sequences
  shots$actual.HH[i] <- actual.HH.sequences
  shots$actual.HH.p[i] <- p
  shots$bayes.HH.p[i] <- post.median
  shots$lb[i] <- lb
  shots$ub[i] <- ub
}
[/code]

Now, let’s calculate the average HH repetition probability using each of our measures.

First, let’s use the original biased measure (the same one that gave us the ugly 41.7% value for 3 coin flips).

[code language=”r”]
mean(shots$actual.HH.p)
## [1] 0.4431845
[/code]

In my simulation, I got a value of 44.3% – higher than the 41.7% value from the 3-flip case, but still far from the desired 50%.

Now, let’s see how our average Bayesian solution did:

[code language=”r”]
mean(shots$bayes.HH.p)
## [1] 0.4701428
[/code]

The mean Bayesian posterior median is 47.0% – closer to 50% than before, but still not good enough. To be honest, I’m not totally sure why this value does not hit 50% (ignoring sampling error). I’ll still need to work on this.

Finally, let’s use our simple unbiased measure:

[code language=”r”]
sum(shots$actual.HH) / sum(shots$possible.HH)
## [1] 0.4975391
[/code]

Here, I got a value of 49.8% – close enough to be within sampling error of 50%.