Making fast, good decisions with the FFTrees R package


“…[W]e are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” ~ Malcolm Gladwell

In his book Blink, Malcolm Gladwell summarises a common misconception about good decision making. According to folk wisdom, the more time, information, and effort you put into a decision, the better it gets. In other words, “More is better.” If you are a doctor making a diagnosis, more medical tests are always better. If you are trying to decide whether your new Tinder match is worthy of a date, try to find them on Facebook, Instagram and Snapchat first. If you are deciding how to invest in the stock market, get as much data as you can and build a statistical model so complex that it describes the past perfectly.

However, decades of research in cognitive science and machine learning have shown that the “More is better” theory is, in many real-world decisions, flat-out wrong. In contrast, there are many cases where, as Dr. Gerd Gigerenzer has put it, “Less is more.” Why? For two key reasons. The first reason is statistical: complex decision models with many parameters can lead to overfitting. In essence, overfitting occurs when a model is very good at describing one specific past dataset, but fails in predicting new, unseen data (Gigerenzer & Brighton, 2009). For example, a complex economic model might describe changes in past stock prices very well, but be largely unable to predict future changes. This is why, as Burton Malkiel shows, it is so hard for complex trading algorithms to outperform simple index funds in the stock market (Malkiel, 1999). The second reason is psychological: even if a complex decision model is good at predicting new data, a person who can’t understand it, or can’t easily apply it in a real-world setting (like a doctor trying to use logistic regression in an emergency room), won’t use it.

What simple decision rules can people use to make good decisions? One popular class of simple decision rules are Fast and Frugal Trees (FFTs; Gigerenzer & Todd, 1999; Martignon, Katsikopoulos & Woike, 2008). Fast and frugal trees make very fast decisions based on a few (usually 1 to 5) pieces of information and ignore all other information. In other words, fast and frugal trees are noncompensatory, meaning that once they make a decision based on a few pieces of information, no additional information can change that decision. Because they are so simple to use, they have been applied to many real-world decision tasks, from diagnosing coronary artery disease (Green & Mehr, 1997) to diagnosing depression (Jenny, Pachur, Williams, Becker & Margraf, 2013). However, lest you think that fast and frugal trees are only useful when time is limited, research has shown that they can out-predict more complex models even in computer simulations where time pressure plays no role (Gigerenzer, Czerlinski & Martignon, 1999).

While fast and frugal trees have shown promise, there have so far been no off-the-shelf methods to create them. How can you create your own fast and frugal decision trees for your own dataset? Starting today, you can use the FFTrees R package available on CRAN. The main function in the package is FFTrees(), which takes standard formula and data arguments and returns a fast and frugal tree (FFTrees) object. From this object, you can view its underlying trees, along with many standard classification statistics (e.g., hit rate, false alarm rate, AUC) applied to both training and test (i.e., prediction) datasets. Finally, the function has two alternative classification algorithms, logistic regression and CART, built in, so you can always compare the accuracy of your fast and frugal trees to two gold standards in the classification literature. If you’re like me, you’ll be amazed at how well simple, transparent fast and frugal trees perform relative to these gold standards, especially in predicting new data!

The FFTrees package in action

You can install and load the FFTrees package from CRAN:

install.packages("FFTrees")
library("FFTrees")

Once you’ve installed the package, you can view the overview vignette by running FFTrees.guide(). However, for this blog post I’ll show you how to create fast and frugal trees for predicting breast cancer. The data we’ll use comes from the Wisconsin Breast Cancer Database (data source). The data is stored as a dataframe with 699 rows, representing 699 patients, and 10 columns. The columns represent 9 physiological measurements, from cell sizes to cell shapes, and 1 binary variable (diagnosis) indicating whether or not the patient truly has breast cancer. Here is how the first few rows of the dataframe look:

thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare chromatin nucleoli mitoses diagnosis
        5             1              1        1          2           1         3        1       1     FALSE
        5             4              4        5          7          10         3        2       1     FALSE
        3             1              1        1          2           2         3        1       1     FALSE
        6             8              8        1          3           4         3        7       1     FALSE
        4             1              1        3          2           1         3        1       1     FALSE
        8            10             10        8          7          10         9        7       1      TRUE
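If you want to print these rows yourself, something like the following should work (the breastcancer dataframe is included with the FFTrees package):

# Show the first few rows of the breast cancer data
head(breastcancer)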

To create a fast and frugal tree from the dataset, we’ll use the FFTrees() function, entering formula = diagnosis ~ ., meaning that we want to predict diagnosis as a function of (potentially) all other variables, and data = breastcancer. We’ll assign the result to a new object of class FFTrees called breastcancer.fft:

breastcancer.fft <- FFTrees(formula = diagnosis ~.,
                            data = breastcancer)

Now that we’ve created the object, we can print it to the console to get basic information:

breastcancer.fft
# "An FFTrees object containing 6 trees using 4 cues {cellsize.unif,cellshape.unif,nuclei.bare,epithelial} out of an original 9"
# "Data were trained on 683 exemplars. There were no test data"
# "FFT AUC: (Train = 0.98, Test = NA)"
# "My favorite tree is #3 [Training: HR = 0.93, FAR = 0.05], [Testing: HR = NA, FAR = NA]"

The printout tells us that the final FFTrees object contains 6 different trees, and the largest tree only uses 4 of the original 9 cues. To see the best tree, we can simply plot the FFTrees object:

plot(breastcancer.fft, 
     main = "Breastcancer FFT", 
     decision.names = c("Healthy", "Cancer"))

 

[Figure: the breast cancer FFT produced by plot()]

There’s one of our fast and frugal trees! In the top section of the plot, we see that the data had 444 true healthy cases and 239 true cancer cases. In the middle section, we see the actual tree. The tree starts by looking at the cue cellsize.unif. If its value is less than 3, the tree decides that the person is healthy. If not, the tree looks at cellshape.unif. If cellshape.unif <= 2, the tree decides the patient is healthy; if cellshape.unif > 2, the tree decides that the person does have cancer. That’s the whole decision algorithm! Now isn’t that a lot easier to interpret than something like logistic regression? Imagine handing a logistic regression equation to anyone without a degree in statistics.

Performance

Ok, so our fast and frugal tree is easy to understand and use, but how well does it perform? The bottom section of the above plot shows a series of performance statistics. In the bottom left corner, we can see a classification table showing how the tree’s decisions compare to the truth. Entries on the main diagonal (Cor Rej and Hit) correspond to correct decisions, while the off-diagonal entries correspond to incorrect decisions. As you can see, the tree performed exceptionally well: it made correct diagnoses in 646 (424 + 222) of all 683 cases (95% correct). Additional performance statistics, including specificity (1 – false alarm rate), hit rate, d-prime, and AUC (area under the curve), are also displayed. Finally, in the bottom right plot, you can see an ROC curve comparing the performance of the trees to CART (in red) and logistic regression (in blue).
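For reference, here is how the headline statistics can be reconstructed from the classification counts reported above (a quick sketch in base R, not code from the FFTrees package):

# Classification counts from the plot (true healthy = 444, true cancer = 239)
hits         <- 222          # cancer cases classified as cancer
misses       <- 239 - 222    # cancer cases classified as healthy
cor.rej      <- 424          # healthy cases classified as healthy
false.alarms <- 444 - 424    # healthy cases classified as cancer

hr     <- hits / (hits + misses)                   # hit rate, ~ .93
far    <- false.alarms / (false.alarms + cor.rej)  # false alarm rate, ~ .05
spec   <- 1 - far                                  # specificity
acc    <- (hits + cor.rej) / 683                   # overall accuracy, ~ .95
dprime <- qnorm(hr) - qnorm(far)                   # d-prime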

Viewing other trees

While this tree did well, it still made some errors in both detecting true cancer cases (i.e., hit rate) and in correctly rejecting true healthy cases (i.e., specificity). Now, what if you want a tree that rarely misses true cases, at the cost of additional false alarms? As Luan, Schooler, & Gigerenzer (2011) have shown, you can easily shift the balance of errors in a fast and frugal tree by adjusting the decisions it makes at each level of the tree. The FFTrees() function automatically builds several versions of the same general tree that make different error trade-offs. We can see the performance of each of these trees in the bottom-right ROC curve. Looking at the ROC curve, we can see that tree number 5 has a very high hit rate, but a lower specificity. We can look at this tree by adding the tree = 5 argument to plot():

plot(breastcancer.fft, 
     main = "Breastcancer FFT", 
     decision.names = c("Healthy", "Cancer"),
     tree = 5)

 

[Figure: FFT #5 for the breast cancer data]

Here is the resulting tree. As you can see, this tree uses an extra cue called nuclei.bare. This tree has a perfect hit rate of 100% (just as we wanted), but at the cost of a lower specificity of 80%.

Additional arguments

Cross-validation: The FFTrees() function allows you to easily create trees from a training dataset, and test the performance of the trees with a test dataset (aka, cross-validation). You can do this by either entering an explicit test dataset in the data.test argument, or by randomly splitting the main dataset into a separate training and testing sample with train.p. For example, train.p = .5 will randomly split the data into a 50% training set, which will be used to build the trees, and a 50% test set, which will be used to evaluate their prediction performance.

# Create a 50% training and 50% testing dataset with train.p = .5
breastcancer.test.fft <- FFTrees(formula = diagnosis ~ .,
                                 data = breastcancer,
                                 train.p = .5)

Restricting trees: If you want to explicitly decide which cues you want in the tree, you can specify this in the formula argument. For example, the following code will generate a tree from the breast cancer data using only the cues thickness, mitoses, and adhesion.

# Only use 3 cues in the trees
breastcancer.r.fft <- FFTrees(formula = diagnosis ~ thickness + mitoses + adhesion,
                              data = breastcancer)

Summary

The FFTrees package contains lots of other functions for visualising and comparing trees. In addition, it contains several real-world datasets you can use to predict who survived the Titanic (titanic), whether a mushroom is poisonous (mushrooms), or how a politician will vote on a new bill (voting). To see all the details, check out the package vignettes, either from within R or on CRAN (here). For all you judgment and decision making researchers out there, I will also be presenting the package at the annual meeting of the Society for Judgment and Decision Making (SJDM) in Boston in November 2016.

The package is also very much in development, so I am grateful for any recommendations, inevitable bug-reports, or criticisms. You can post bug-reports at www.github.com/ndphillips/FFTrees/issues, or email me directly at Nathaniel.D.Phillips.is@gmail.com

Alternative decision tree R packages

I hope you have as much fun with the package as I do, but I must warn you: this package was written by a psychologist who just happens to love programming in R for fun, rather than a statistician who really knows their shit about how best to build trees. For this reason, the code is messy, dreadfully inefficient, and without proper reference to established tree-building algorithms. While I hope to collaborate with statisticians in the future to improve the tree-building algorithm (likely by implementing one that already exists), it’s not quite there yet. For more established methods, I strongly recommend that you check out the rpart and party packages.

Coauthors

This package was developed in collaboration with Dr. Hansjoerg Neth and Dr. Wolfgang Gaissmaier at the University of Konstanz, and Dr. Jan Woike at the Max Planck Institute for Human Development in Berlin.

Code / Package Updates

This blog post has been updated to incorporate changes made in version 1.1.4 of the FFTrees package. Specifically, the original fft() function was renamed to FFTrees() to avoid confusion with the fast Fourier transform. All the original code (should) still work, but with some warnings.

References

Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107–143.

Gigerenzer, G., & Todd, P. M. (1999). Fast and frugal heuristics: The adaptive toolbox. In Simple heuristics that make us smart (pp. 3–34). Oxford University Press.

Gigerenzer, G., Czerlinski, J., & Martignon, L. (1999). How good are fast and frugal heuristics? In Decision science and technology (pp. 81–103). Springer.

Gladwell, M. (2007). Blink: The power of thinking without thinking. Back Bay Books.

Green, L., & Mehr, D. R. (1997). What alters physicians’ decisions to admit to the coronary care unit? Journal of Family Practice, 45(3), 219–226.

Jenny, M. A., Pachur, T., Williams, S. L., Becker, E., & Margraf, J. (2013). Simple rules for detecting depression. Journal of Applied Research in Memory and Cognition, 2(3), 149–157.

Luan, S., Schooler, L. J., & Gigerenzer, G. (2011). A signal-detection analysis of fast-and-frugal trees. Psychological Review, 118(2), 316.

Malkiel, B. G. (1999). A random walk down Wall Street: Including a life-cycle guide to personal investing. WW Norton & Company.

Marewski, J. N., & Gigerenzer, G. (2012). Heuristic decision making in medicine. Dialogues in Clinical Neuroscience, 14(1), 77–89.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352–361.

The new and improved pirateplot()! Now with themes!

Hello fellow R pirates! I recently wrote a post demonstrating the pirateplot() function in the yarrr package. The pirateplot() function replaces deceptive (and boring) bar plots with a full RDI plot, capable of plotting Raw data (and smoothed densities), Descriptive statistics (like means and medians), and Inferential statistics (a 95% Bayesian Highest Density Interval, or HDI).

In the past 48 hours I’ve had an overwhelmingly positive response to the pirateplot() function. At the same time, I received several great recommendations for tweaks and improvements to the function to make it even better. After consulting the original pirate text, I decided that the changes stay true to the original piratery intent of the pirate plot. I’m sure the founding members of the pirate plot would be proud.

Let’s go over the changes! To use the latest version of the pirateplot() function, be sure to install the latest version of the yarrr package:

# install.packages("devtools") # Only if you don't have the devtools library already installed
library("devtools")
install_github("ndphillips/yarrr")

Once you’ve installed the latest version, load the yarrr package with library()

library("yarrr")

Now you’re ready to make some pirate plots! Here are the major updates to the function:
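Before we get to the updates, here’s the most basic call for reference: just a formula and a dataframe (I’ll use the ChickWeight data that all of the examples below also use):

# A bare-bones pirate plot: weight as a function of Diet
pirateplot(formula = weight ~ Diet,
           data = ChickWeight)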

Opacity Themes

The five critical aspects of a pirate plot are the bars, beans, points, (average) lines, and HDIs. You can adjust the opacity of each of these elements with element-specific opacity arguments, such as bars.o, beans.o, and so on.

The biggest update to pirateplot() is the addition of opacity themes, which are designated by a new argument called theme.o. The input to this argument is an integer (currently 0, 1, 2, or 3) that defines an opacity profile across all five elements. Themes 1, 2, and 3 set specific opacity values for each of the elements, while Theme 0 sets all opacities to 0. Thankfully, the themes just set default values for the individual element opacities: you can still override the opacity of any specific element within a theme by including an element-specific opacity argument.
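For example, a call along these lines should keep Theme 1’s defaults but override just the bar opacity (a sketch; bars.o is one of the element-specific opacity arguments mentioned above):

# Theme 1 defaults, but with nearly transparent bars
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           theme.o = 1,
           bars.o = .1)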

Here are examples of the three different themes applied to the ChickWeight dataset:

Theme 1

Theme 1 emphasises the bars, with light points and beans (I’ll use the appletv palette for this one):


pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 1\nappletv palette",
           theme.o = 1,
           pal = "appletv")


[Figure: Theme 1 pirateplot with the appletv palette]

Theme 2

Theme 2 emphasises the points and beans (using the southpark palette)


pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 2\nsouthpark palette",
           theme.o = 2,
           pal = "southpark")

[Figure: Theme 2 pirateplot with the southpark palette]

Theme 3

Theme 3 emphasises the 95% Highest Density Intervals (HDIs). Keep in mind that calculating HDIs can take a few seconds for each bean. Here I’ll use the basel palette:


pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nbasel palette",
           theme.o = 3,
           pal = "basel")


[Figure: Theme 3 pirateplot with the basel palette]

Theme 0

In Theme 0, all opacities are set to 0 by default, so you can individually specify the opacity of each element. In this example, I’ll turn the lines on full-blast and turn the points on slightly. I’ll also increase the amount of jittering and the size of the points, and use the google palette.


pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 0\ngoogle palette",
           pal = "google",
           point.o = .2,
           line.o = 1,
           theme.o = 0,
           line.lwd = 10,
           point.pch = 16,
           point.cex = 1.5,
           jitter.val = .1)

[Figure: Theme 0 pirateplot with the google palette]

Of course, you can still change the colors of the plotting elements with the pal argument, and the background using the back.col argument. Here’s an x-ray version of Theme 3:

pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nlight color with black background",
           pal = "white",
           theme.o = 3,
           point.pch = 16,
           back.col = gray(.2))

[Figure: Theme 3 pirateplot with a light palette on a black background]

Gridlines

You can now include gridlines in your plot with the gl.col argument. Just specify the color of the lines and the function will put them in reasonable places. The following plot also shows how pirateplot() handles two independent variables:

pirateplot(formula = weight ~ Diet + Time,
           data = subset(ChickWeight, Time < 10),
           theme.o = 2,
           pal = "basel",
           point.pch = 16,
           gl.col = gray(.8),
           main = "Two IVs\nWith gridlines")

[Figure: pirateplot with two IVs and gridlines]

Other minor changes

I’ve also made the following smaller changes

  • The function no longer automatically sorts the levels of the IV. It will plot levels of the IV(s) in the order they are found in the original dataframe.
  • You can now manually change the color of the bar borders with the bar.border.col argument (see the sketch below).
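For example, a call along these lines should add black borders to the bars (a sketch using the same ChickWeight data as above):

# Theme 1 with black bar borders
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           theme.o = 1,
           bar.border.col = "black")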

Happy pirate plotting!

As always, I am very happy to receive more comments and feedback on the function. You can write me at yarrr.book@gmail.com.

www.thepiratesguidetor.com

I missed out on www.yarrr.com (some domain squatter is holding it for a pirate’s ransom), but I was able to snag www.thepiratesguidetor.com! This is now the official home of YaRrr! The Pirate’s Guide to R

For those of you who don’t know, this whole story started last summer. I was taking a late night swim in the Bodensee in Konstanz and saw a rusty object sticking out of the water. Upon digging it out, I realised it was an ancient USB stick with the word YaRrr inscribed on the side. Intrigued, I brought it home and plugged it into my laptop. On the stick, I found a single PDF file written entirely in pirate-speak. After watching several pirate movies, I learned enough pirate-speak to begin translating the text into English. Sure enough, the book turned out to be an introduction to R called The Pirate’s Guide to R.

This book clearly has both massive historical and pedagogical significance. Most importantly, it turns out that pirates were programming in R well before the earliest known advent of computers. Of slightly less significance is that the book has turned out to be a surprisingly up-to-date and approachable introductory text to R. For both of these reasons, I felt it was my duty to share the book with the world. You can download the latest version of the pdf (again, I am continually updating the translations) by clicking on the pirate’s beard below:

[Figure: cover of YaRrr! The Pirate’s Guide to R]

Coin flips and the Hot Hand

The Hot Hand is the belief that an actor, such as a basketball player, is more likely to be successful if he has just experienced success than if he has just experienced a failure. For example, a basketball player is believed to be ‘hot’ when he has just made one (or more) shots in a row – and as a result, is more likely to make his next shot than if he was not in this ‘hot’ state. In statistical language, the hot hand is the claim that the conditional probability of a success given a prior success (p(Success | Success)) is higher than the conditional probability of a success given a previous failure (p(Success | Failure)).
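In R terms, given a sequence of 0s (misses) and 1s (hits), you can estimate both conditional probabilities in a couple of lines. Here is a minimal sketch with a made-up shot sequence:

# A made-up sequence of shots: 1 = hit, 0 = miss
shot.seq <- c(1, 1, 0, 1, 0, 0, 1, 1, 1, 0)

previous <- shot.seq[-length(shot.seq)]   # outcome of the previous shot
current  <- shot.seq[-1]                  # outcome of the current shot

mean(current[previous == 1])   # p(Success | Success)
mean(current[previous == 0])   # p(Success | Failure)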

While belief in the hot hand is pervasive, several decades of research, notably a classic study by Gilovich, Vallone and Tversky, have suggested that it in fact does not exist. Rather, it is the result of a cognitive fallacy wherein people expect random sequences to have fewer strings of successes than actually occur by chance.

However, in the past few weeks, an unpublished manuscript written by Joshua Miller and Adam Sanjurjo has apparently dropped an atom bomb on this debate. It was even picked up by the Wall Street Journal! The fact that the WSJ is writing about the results from an unpublished manuscript shows how big of a deal this is. So what are Miller and Sanjurjo claiming that has set off this massive reaction? They claim that it is in fact scientists, rather than basketball fans, who have been making a statistical mistake. Specifically, Miller and Sanjurjo claim that, for a fair coin, the probability of a head given a previous head may not be 0.50!

Miller and Sanjurjo take the following example:

Take a coin and flip it 3 times, and calculate the percentage of times where a head was followed by a head. Call this the “Head Repetition Probability.” Repeat this procedure 1 million times. What will the average Head Repetition Probability be?

If your answer is 50%, then you’re similar to most people. And like most people, you’re wrong. In fact, the probability is 41.7% – substantially lower than the intuitive 50% number that most people, and prior researchers, had believed. Why is this important? Well, if researchers find that real basketball players have a Basket Repetition Probability of 50% (they have), then those players have a higher repetition probability than is expected by chance (i.e., 41.7%)! Thus, the often detected 50% repetition probability in real people may in fact be evidence for the Hot Hand!
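If you’d like to check that number by brute force before the analytic treatment below, here is a quick simulation sketch (using 3 flips, matching the calculation that follows):

# Simulate many 3-flip sequences and average the Head Repetition Probability
set.seed(100)
rep.p <- replicate(100000, {
  flips <- sample(c(0, 1), size = 3, replace = TRUE)   # 1 = Head, 0 = Tail
  opportunity <- flips[1:2] == 1        # a Head on flip 1 or 2 can start a streak
  if (any(opportunity)) {
    mean(flips[2:3][opportunity] == 1)  # proportion of Heads followed by a Head
  } else {
    NA                                  # no streak opportunities in this sequence
  }
})
mean(rep.p, na.rm = TRUE)   # approximately 0.417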

But where does this 41.7% figure come from? Does it really reflect the expected probability of getting a Head given that the previous outcome was a Head? If so, then coins would be inherently anti-streaky, which would pretty much contradict everything we know from Probability 101. Thankfully, that’s not what the 41.7% number reflects. Rather, the number comes from an improper statistic that equates sequences which are actually quite different.

Where the 41.7% figure comes from (and how to fix it)

To see where the 41.7% figure comes from, I will follow Andrew Gelman’s example of analysing flips of a virtual fair coin. Using the coin analogy, we want to know: “What is the probability of flipping an H given that an H occurred on the previous flip?”

However, in contrast to Gelman, I will not randomly simulate coin flips. Instead, I’ll use R to calculate all solutions analytically.

First, we’ll create a dataframe called possible showing all possible sequences of 3 flips of a fair coin (though you can easily change the number of flips in the code by changing the n.flips value). We’ll use 1 to denote a Head and 0 to denote a Tail:

n.flips <- 3
possible <- expand.grid(lapply(1:n.flips, function(x) {c(0, 1)}))
names(possible) <- paste("f", 1:n.flips, sep = "")

possible


##   f1 f2 f3
## 1  0  0  0
## 2  1  0  0
## 3  0  1  0
## 4  1  1  0
## 5  0  0  1
## 6  1  0  1
## 7  0  1  1
## 8  1  1  1

As you can see, there are 8 possible sequences (2 to the power of 3). However, we need to remove the two cases where there are no heads in the first 2 flips. Why? Because if there are no heads in the first two flips, then there is no possibility for a sequence of two heads anywhere in the sequence.

possible <- possible[rowMeans(possible[,1:(n.flips - 1)]) != 0,]
possible

##   f1 f2 f3
## 2  1  0  0
## 3  0  1  0
## 4  1  1  0
## 6  1  0  1
## 7  0  1  1
## 8  1  1  1

Now we have 6 possible sequences that are equally likely to occur. For each of these sequences, how often is there an HH (a Head-Head repetition)? To answer this, let’s calculate three statistics:

  • hh.streak.opportunities: How many opportunities are there for an HH streak? In other words, how many H values are there in the first 2 flips?
  • hh.streaks: How many actual HH streaks are there?
  • p.hh: Of those streak opportunities, what percentage of the time was there an actual streak?

for(i in 1:nrow(possible)) {

  seq.values <- as.numeric(possible[i, 1:n.flips])

  heads1 <- seq.values[1:(n.flips - 1)] == 1   # was flip t a Head?
  heads2 <- seq.values[2:(n.flips)] == 1       # was flip t + 1 a Head?

  streak.opportunities.i <- sum(heads1)
  actual.streaks.i <- sum(heads1 & heads2)

  p.hh <- actual.streaks.i / streak.opportunities.i

  possible$hh.streak.opportunities[i] <- streak.opportunities.i
  possible$hh.streaks[i] <- actual.streaks.i
  possible$p.hh[i] <- p.hh

}

possible

##   f1 f2 f3 hh.streak.opportunities hh.streaks p.hh
## 2  1  0  0                       1          0  0.0
## 3  0  1  0                       1          0  0.0
## 4  1  1  0                       2          1  0.5
## 6  1  0  1                       1          0  0.0
## 7  0  1  1                       1          1  1.0
## 8  1  1  1                       2          2  1.0

We’re almost to the 41.7% number! So, to calculate the expected probability of getting HH, we can calculate the average p.HH value across all 6 sequences:

mean(possible$p.hh)
## [1] 0.4166667

Wow! The value is indeed 41.67%! But does this number really represent the conditional probability of getting an H given an H? No, it doesn’t.

The Problem

Why is the 41.7% number misleading? Well, in our previous calculation, we treated all flip sequences as equally informative. That is, because each sequence is equally likely, we assumed that we could just take the average proportion of heads given heads (p.hh) across all 6 sequences. However, this is a false assumption! Some of these sequences are more informative than others.

To see this, we’ll take an extreme example. Consider flipping a coin 10 times. Here are two possible sequences. Assuming the coin is fair, both of these sequences are equally likely. For each sequence, I’ve calculated the number of HH streak opportunities, the number of actual HH streaks, and the percentage of HH streaks.

Sequence                       Streak opportunities   Number of streaks   Streak percentage
1, 1, 1, 1, 1, 1, 1, 1, 1, 1                      9                   9                 1.0
0, 0, 0, 0, 0, 0, 0, 0, 1, 1                      1                   1                 1.0

In the first sequence, there are 9 opportunities for a streak (in the 2nd through the 10th flip). Additionally, there were 9 actual streaks, so the proportion of streaks was 1.0.

Now, consider the second sequence. In this sequence, there is only one opportunity for an HH streak (completed on the final 10th flip). And indeed, there was an H on the 10th and final flip. Therefore, the proportion of HH streaks out of all possible HH streaks was 100% – just like in the first sequence! But are these sequences really equally informative? I think not. In the first sequence, we considered all 10 data points, while in the second sequence we ignored the first 8 flips and only considered the last two.
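You can check both rows of the table above with a couple of lines of R (a small helper function written just for this illustration):

# Count HH streak opportunities and actual HH streaks in a 0/1 sequence
hh.stats <- function(x) {
  opportunity <- x[-length(x)] == 1        # a Head that could start a streak
  streak <- opportunity & (x[-1] == 1)     # ...that was actually followed by a Head
  c(opportunities = sum(opportunity),
    streaks = sum(streak),
    p.hh = sum(streak) / sum(opportunity))
}

hh.stats(rep(1, 10))              # 9 opportunities, 9 streaks, p.hh = 1
hh.stats(c(rep(0, 8), 1, 1))      # 1 opportunity,  1 streak,  p.hh = 1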

Think about a real world example: if you saw a player hit 10 shots in a row, wouldn’t you think she provided stronger evidence for a hot hand than a player who missed 8 shots and then made only 2 in a row?

Thankfully, there is an easy fix to the problem.

Fixing the Problem (and finally getting p(H|H) = 50%)

To fix the problem, we need to weight sequences according to how many opportunities there were for an HH streak. To do this, we can simply calculate the total number of HH streaks across all sequences and divide it by the total number of HH streak opportunities across all sequences. In other words, we pool the data, ignoring differences between individual sequences (or people).

Let’s go back to our coin flip dataframe and make this calculation. If I’m right, we should get a p(H|H) probability of 50%:

f1 f2 f3 hh.streak.opportunities hh.streaks p.hh
 1  0  0                       1          0  0.0
 0  1  0                       1          0  0.0
 1  1  0                       2          1  0.5
 1  0  1                       1          0  0.0
 0  1  1                       1          1  1.0
 1  1  1                       2          2  1.0

To calculate this, we’ll take the total number of actual streaks (the sum of the hh.streaks column) and divide it by the total number of streak opportunities (the sum of the hh.streak.opportunities column). We should get 4 / 8 = 50%:

sum(possible$hh.streaks) / sum(possible$hh.streak.opportunities)
## [1] 0.5

Phew, we got it! Again, the reason our 50% number is larger than the previous 41.7% number is that we’ve given more weight to sequences with more opportunities for HH events than to those with fewer opportunities.
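Equivalently, the pooled ratio is just the opportunity-weighted mean of the per-sequence proportions:

# Weight each sequence's p.hh by its number of streak opportunities
weighted.mean(possible$p.hh, w = possible$hh.streak.opportunities)
## [1] 0.5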

Conclusion

While the 41.7% number reported by Joshua Miller and Adam Sanjurjo (and subsequently confirmed by Andrew Gelman) sounds like a scary violation of the laws of statistics, it is in fact an artefact of an improper statistical procedure. As I’ve shown, taking the simple average proportion of HH events across many sequences is improper because it gives equal weight to sequences with many opportunities for HH events and to those with only a few opportunities. By analogy, consider two basketball players who each take 100 shots. Player A misses his first 97 shots and makes his last 3, while player B makes all 100 shots. Which player was more hot? According to the improper statistic, both are equally hot, because both have a p(H|H) value of 100%! Following this logic, both provide equivalent (positive) evidence for the hot hand. Of course, this is poor reasoning: we should give player B’s performance more weight. Moving back to the more general case of coin flips, we can avoid this bias by simply dividing the total number of HH events by the total number of possible HH events, which yields the correct (and intuitive) 50% figure.

Addendum: Using Bayes to help correct the error

After finishing the previous analysis, I wondered if we could use Bayesian statistics to correct the bias inherent in the improper statistic. Specifically, if we allow Bayes theorem to weight evidence according to the number of opportunities for success (which as I’ve shown to be critical), can we return to using the improper statistic?

To answer this question, I will repeat the improper statistic procedure, but instead of averaging the raw sample pHH proportions, I’ll first calculate a Bayesian posterior estimate (the posterior median, under a uniform prior) of the pHH value for each player (or coin), and then average these posterior estimates across all players.

I’ll make this example a bit more interesting by simulating the results of baskets in a shooting tournament. I’ll generate outcomes of 10 sequential shots from 1,000 basketball players. Importantly, each player will have a true 50% chance of success on each shot.


n.players <- 1000
n.shots <- 10

shots <- as.data.frame(matrix(sample(c(0, 1), size = n.players * n.shots, replace = TRUE),
                              nrow = n.players, ncol = n.shots))

Again, let’s delete cases where the first 9 shots are all misses

shots <- shots[rowMeans(shots[,1:9]) != 0,]

Now, for each player, let’s calculate 4 statistics:

  • possible.HH: The number of hits in shots 1 – 9; this equates to the number of possible HH streaks in the sequence
  • actual.HH: The number of actual HH streaks in the sequence
  • actual.HH.p: The actual proportion of HH streaks in the sequence
  • bayes.HH.p: The Bayesian posterior estimate of the probability of an HH streak, given the player’s observed streaks and streak opportunities (using a uniform prior)

for(i in 1:nrow(shots)) {

  player.seq <- shots[i, 1:n.shots]

  possible.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1)
  actual.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1 & player.seq[2:(n.shots)] == 1)

  n <- possible.HH.sequences
  p <- actual.HH.sequences / possible.HH.sequences

  # Beta(1, 1) prior updated with the observed streaks and opportunities
  post.a <- 1 + p * n
  post.b <- 1 + (1 - p) * n

  post.median <- qbeta(.5, post.a, post.b)
  lb <- qbeta(.025, post.a, post.b)
  ub <- qbeta(.975, post.a, post.b)

  shots$possible.HH[i] <- possible.HH.sequences
  shots$actual.HH[i] <- actual.HH.sequences
  shots$actual.HH.p[i] <- p
  shots$bayes.HH.p[i] <- post.median
  shots$lb[i] <- lb
  shots$ub[i] <- ub
}

Now, let’s calculate the average HH probability from each of our measures.

First, let’s use the original biased measure (the same one that gave us the ugly 41.7% value in the coin flip example).

mean(shots$actual.HH.p)
## [1] 0.4431845

In my simulation, I got a value of 44.3%. Better than the 41.7% value, but still far away from the desired 50% value.

Now, let’s see how our average Bayesian solution did:

mean(shots$bayes.HH.p)
## [1] 0.4701428

The mean Bayesian posterior response is 47.0%, closer to 50% than before, but still not good enough. To be honest, I’m not totally sure why this value does not hit 50% (ignoring sampling error). I’ll still need to work on this.

Finally, let’s use our simple unbiased measure:

sum(shots$actual.HH) / sum(shots$possible.HH)
## [1] 0.4975391

Here, I got a value of 49.8% – close enough to be within sampling error of 50%.

The yarrr R package!

After reading Hilary Parker’s and Hadley Wickham’s great tutorials on building R packages, I’ve decided to build the yarrr package. The package contains all of the datasets, functions, and the game-changing jokes contained in YaRrr! The Pirate’s Guide to R.

To install the package from github, run the following code:

install.packages("devtools")
library("devtools")
install_github("ndphillips/yarrr")
library("yarrr")

Once you do, you should be able to see the pirates dataset – a historical survey of hundreds of pirates now in R form for your data analysis pleasures:

View(pirates)

P.S. Thanks to Hilary Parker for showing me how to do proper code formatting in WordPress!

Press Release: The Janus face of Darwinian competition

Press Release: http://msutoday.msu.edu/news/2015/too-many-candidates-spoil-the-stew/

Paper: The Janus face of Darwinian competition

This week Michigan State University released the following press release on a recent publication in Scientific Reports that I was privileged to be a co-author on! The paper, titled “The Janus face of Darwinian competition” was first-authored by Dr. Arend Hintze at Michigan State University. The paper uses computational techniques to test how competition affects the evolution of decision making strategies. Here is the punchline of the paper: some competition is good, but too much can spur the evolution of poor decision making strategies. The paper builds on the competitive sampling paradigm I co-developed along with Dr. Yaakov Kareev, Dr. Judith Avrahami, and Dr. Ralph Hertwig in our paper “Rivals in the dark: How competition influences search in decisions under uncertainty” published last year in Cognition.

Maximum-Likelihood parameter estimation in R: A brief tutorial

A brief tutorial on ML estimation in R

Recently, a colleague asked me to demonstrate how one can calculate maximum-likelihood (ML) parameter estimates in R. As I had been meaning to do this for my R course anyway, I decided to write up the following brief tutorial. In the tutorial, I go through the three basic steps of ML estimation in R for both a statistical example (the simple linear model) and a psychological model of choice (the Proportional Difference model, Gonzalez-Vallejo, 2001).
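To give a flavour of the general recipe (this is just a quick sketch of the idea, not the tutorial’s actual code): write the negative log-likelihood of your model as a function of its parameters, minimise it with optim(), and read off the estimates. For a simple linear model:

# Simulated data for a simple linear model: y = b0 + b1 * x + normal noise
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

# Step 1: the negative log-likelihood as a function of the parameters
neg.loglik <- function(par) {
  b0 <- par[1]
  b1 <- par[2]
  sigma <- exp(par[3])   # estimate the sd on the log scale to keep it positive
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sigma, log = TRUE))
}

# Step 2: minimise it with optim(), starting from rough guesses
ml.fit <- optim(par = c(0, 0, 0), fn = neg.loglik)

# Step 3: inspect the ML estimates
c(b0 = ml.fit$par[1], b1 = ml.fit$par[2], sigma = exp(ml.fit$par[3]))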

You can view the tutorial on the preceding link. If you have any comments, criticisms or questions, please let me know and I will try to update the document!

Note: I am indebted to Dr. Stephen Lewandowsky and Dr. Simon Farrell’s summer school on computational cognitive modelling for teaching me these techniques. You can purchase their great book on the subject here: Computational Modeling in Cognition: Principles and Practice.