This page is now outdated! Please go to ndphillips.github.io

My new page is at http://ndphillips.github.io, or the easier-to-remember link http://www.nathanieldphillips.org! From now on, all my future posts will be there.

From WordPress to GitHub Pages

Over the past two years, I’ve really enjoyed creating and maintaining this website with WordPress to share my R-related musings — from my e-book YaRrr! The Pirate’s Guide to R, to my R packages like yarrr and FFTrees.

However, over the past few months, I’ve realised that my content creation and sharing process has been flawed. Until now, I’ve been writing my posts in WordPress. To share a new post, I’ve had to write the post, including all code and output, in R Markdown, and then find a way to transfer the content to the WordPress editor. This often meant having to save and upload separate plotting images, hack HTML (which I don’t really know) to make sure the code is nicely formatted, all while trying to navigate the at times overwhelming assault of menus and customisations on the WordPress admin page. In short — it has worked, but it has been a bit of a pain.

Last week, my colleague Dirk Wulff introduced me to a much better solution to website management: R Markdown combined with GitHub pages. This solution allows you to do all webpage management within RStudio, while providing a much cleaner user experience. Thanks to the great tutorials provided by RStudio at http://rmarkdown.rstudio.com/rmarkdown_websites.html and Florian Prive at https://privefl.github.io/blog/A-website-and-blog-for-R-users/, I was able to create my elegant new webpage http://ndphillips.github.io in an afternoon.

Discovering this new process of website management with GitHub pages has been like getting my first iPhone, discovering R, and getting my first stick of deodorant: I didn’t know I needed it until I got it. And when I did, I couldn’t imagine how I lived without it.

I love the open-source community

I can only begin to express how much I appreciate the open-source programming community. With free online tutorials, free hosting provided by GitHub, the open-source programming language R, the simple Markdown language, and the programming environment RStudio, I was able to create something that would have taken me months, if not years, to do on my own from scratch.

I am really looking forward to using this new platform to share my thoughts and pass what I’ve learned on to someone else. It’s going to be fun 🙂

The yarrr package (0.0.8) is (finally!) on CRAN

Great news R pirates! The yarrr package, which contains the pirateplot, has now been updated to version 0.0.8 and is up on CRAN (after hiding in plain sight on GitHub). Let’s install the latest version and go over some of the updates:

[code language=”r”]
install.packages("yarrr") # Install package from CRAN
library("yarrr") # Load the package
yarrr.guide() # Open the package guide
[/code]

The most important function in the yarrr package is pirateplot(). What the heck is a pirateplot? A pirateplot is a modern way of visualising the relationship between a categorical independent variable and a continuous dependent variable. Unlike traditional plotting methods, like barplots and boxplots, a pirateplot is an RDI plotting trifecta which presents Raw data (all data as points, plus smoothed densities), Descriptive statistics (a horizontal line at the mean, or any other function you wish), and Inferential statistics (95% Bayesian Highest Density Intervals).
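The RDI idea itself can be sketched in a few lines of base R. This is just an illustration of the concept with made-up data and layout, not the actual pirateplot() implementation:

```r
# Sketch of an RDI display in base R (illustration only, not pirateplot() itself)
set.seed(42)
heights <- rnorm(50, mean = 170, sd = 10)   # raw data for one group

# Raw data: every observation as a (jittered) point
plot(jitter(rep(1, 50), amount = .05), heights,
     xlim = c(.5, 2), xlab = "", ylab = "Height", xaxt = "n",
     pch = 16, col = gray(.3, .3))

# Descriptive statistic: a horizontal line at the mean
segments(x0 = .8, y0 = mean(heights), x1 = 1.2, y1 = mean(heights), lwd = 2)

# Inference / distribution: a smoothed density, drawn sideways next to the points
dens <- density(heights)
polygon(1.2 + .3 * dens$y / max(dens$y), dens$x, col = gray(.5, .2), border = NA)
```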

[Figure: the elements of a pirateplot]

For a full guide to the package, check out the package guide at CRAN here. For now, here are some examples of pirateplots showing off some of the package updates.

Up to 3 IVs

You can now include up to three independent variables in your pirateplot. The first IV is presented as adjacent beans, the second is presented in different groups of beans in the same plot, and the third IV is shown in separate plots.

Here is a pirateplot of the heights of pirates based on three separate IVs: headband (whether the pirate wears a headband or not), sex, and eyepatch (whether the pirate wears an eye patch or not):

[code language=”r”]
pirateplot(formula = height ~ sex + headband + eyepatch,
           point.o = .1,
           data = pirates)
[/code]

[Figure: pirateplot with three IVs]

Here, we can see that male pirates tend to be the tallest, but there doesn’t seem to be a difference between those who wear headbands or not, and those who have eye patches or not.

New color palettes

The updated package has a few fun new color palettes contained in the piratepal() function. The first, called ‘xmen’, is inspired by my 90s Saturday morning cartoon nostalgia.

[code language=”r”]
# Display the xmen palette
piratepal(palette = "xmen",
          trans = .1, # Slightly transparent colors
          plot.result = TRUE)
[/code]

[Figure: the xmen palette]

Here, I’ll use the xmen palette to plot the distribution of the weights of chickens over time (if someone has a more suitable dataset for the xmen palette let me know!):

[code language=”r”]
pirateplot(formula = weight ~ Time,
           data = ChickWeight,
           main = "Weights of chickens by Time",
           pal = "xmen",
           gl.col = "gray")

mtext(text = "Using the xmen palette!",
      side = 3,
      font = 3)

mtext(text = "*The mean and variance of chicken\nweights tend to increase over time.",
      side = 1,
      adj = 1,
      line = 3.5,
      font = 3,
      cex = .7)
[/code]

[Figure: chicken weights plotted with the xmen palette]

The second palette, called "pony", is inspired by the bronies in our IT department.

[code language=”r”]
# Display the pony palette
piratepal(palette = "pony",
          trans = .1, # Slightly transparent colors
          plot.result = TRUE)
[/code]

[Figure: the pony palette]

Here, I’ll plot the distribution of the lengths of movies as a function of their MPAA ratings (where G is suitable for children, and R only for adults) using the pony palette:

[code language=”r”]
pirateplot(formula = time ~ rating,
           data = subset(movies, time > 0 & rating %in% c("G", "PG", "PG-13", "R")),
           pal = "pony",
           point.o = .05,
           bean.o = 1,
           main = "Movie times by rating",
           bean.lwd = 2,
           gl.col = "gray")

mtext(text = "Using the pony palette!",
      side = 3,
      font = 3)

mtext(text = "*Movies rated for children\n(G and PG) tend to be longer \nthan those rated for adults",
      side = 1,
      adj = 1,
      font = 3,
      line = 3.5,
      cex = .7)
[/code]

[Figure: movie running times plotted with the pony palette]

I have to be honest, the pony palette colors are not terribly well suited for this pirateplot — but I think they look better in a basic scatterplot. Because the piratepal() function returns a vector of colors (when plot.result = FALSE), you can also use it in other plots. Here, I’ll use the pony palette in a scatterplot:

[code language=”r”]
set.seed(100) # for replicability
x <- rnorm(100, mean = 10, sd = 1)
y <- x + rnorm(100, mean = 0, sd = 1)
point.sizes <- runif(100, min = .2, max = 2) # Just for fun

plot(x, y,
     main = "Scatterplot with the pony palette",
     pch = 21,
     bg = piratepal("pony", trans = .1),
     col = "white",
     bty = "n",
     cex = point.sizes)

grid() # Add gridlines
[/code]

[Figure: scatterplot using the pony palette]

To see all of the palettes (including those inspired by movies and a transit map of Basel), just run the function with "all" as the palette argument:

[code language=”r”]
piratepal(palette = "all")
[/code]

Of course, if you find that these color palettes give you a headache, you can always set a pirateplot to grayscale (or any other color), by specifying a single color in the palette argument. Here, I’ll create a grayscale pirateplot showing the distribution of movie budgets by their creative type:

[code language=”r”]
pirateplot(formula = budget ~ creative.type,
           data = subset(movies, budget > 0 &
                           creative.type %in% c("Multiple Creative Types", "Factual") == FALSE),
           point.o = .02,
           xlab = "Movie Creative Type",
           main = "Movie budgets (in millions) by creative type",
           gl.col = "gray",
           pal = "black")

mtext("Using a grayscale pirateplot",
      side = 3,
      font = 3)

mtext("*Superhero movies tend to have the highest budgets\n…by far!",
      side = 1, adj = 1, line = 3,
      cex = .8, font = 3)
[/code]

[Figure: grayscale pirateplot of movie budgets]

Looks like superhero movies have the highest budgets…by far!

And again, to get more tips on how to customise your palettes and pirateplots, check out the main package guide at https://cran.r-project.org/web/packages/yarrr/vignettes/guide.html, or by running the following code:

[code language=”r”]
yarrr.guide() # Open the yarrr package guide
[/code]

Acknowledgements and Comments

– The pirateplot is largely inspired by the great beanplot package by Peter Kampstra.
– Bayesian 95% HDIs are calculated using the truly amazing BayesFactor package by Richard Morey [Note: a previous version of this post incorrectly called Richard “Brian” — I blame lack of caffeine].
– The latest developer version of yarrr is always available at https://github.com/ndphillips/yarrr. Please post any bugs, issues, or feature requests at https://github.com/ndphillips/yarrr/issues

Making fast, good decisions with the FFTrees R package

[Figure: FFTrees]

“…[W]e are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” ~ Malcolm Gladwell

In his book Blink, Malcolm Gladwell summarises a common misconception about good decision making. According to folk wisdom, the more time, information, and effort you put into a decision, the better it gets. In other words, "More is better." If you are a doctor making a diagnosis, more medical tests are always better. If you are trying to decide if your new Tinder match is worthy of a date, try to find them on Facebook, Instagram and Snapchat first. If you are deciding how to invest in the stock market, get as much data as you can and build a statistical model so complex that it describes the past perfectly.

However, decades of research in cognitive science and machine learning have shown that the “More is better” theory is, in many real-world decisions, flat wrong. In contrast, there are many cases where, as Dr. Gerd Gigerenzer has put it, “Less is more.” Why? For two key reasons. The first reason is statistical: Complex decision models with many parameters can lead to overfitting. In essence, overfitting occurs when a model is very good at describing one specific past dataset, but fails in predicting new, unseen data (Gigerenzer & Brighton, 2009). For example, a complex economic model might describe changes in past stock prices very well, but is largely unable to predict future changes. This is why, as Burton Malkiel shows, it is so hard for complex trading strategies to outperform simple index funds in the stock market (Malkiel, 1999). The second reason is psychological: Even if a complex decision model is good at predicting new data, if a person can’t understand it, or easily implement it in a real-world decision, like a doctor trying to use logistic regression in an emergency room, they won’t use it.
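The statistical point can be demonstrated in a few lines of R with simulated data (an assumed toy example, not from the post): a flexible model always describes the data it was fit to at least as well as a simpler one, but often predicts held-out data worse.

```r
# Toy demonstration of overfitting (simulated data, illustration only)
set.seed(1)
make.data <- function(n) {
  x <- runif(n, 0, 1)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = .3))
}
train <- make.data(30)
test  <- make.data(30)

simple  <- lm(y ~ poly(x, 2),  data = train)  # few parameters
complex <- lm(y ~ poly(x, 15), data = train)  # many parameters: free to fit noise

# Root mean squared prediction error
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# The complex model always wins on the data it was fit to...
rmse(simple, train); rmse(complex, train)
# ...but typically does worse when predicting the new, unseen data
rmse(simple, test); rmse(complex, test)
```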

What simple decision rules can people use to make good decisions? One popular class of simple decision rules are Fast and Frugal Trees (FFTs, Gigerenzer & Todd, 1999; Martignon, Katsikopoulos & Woike, 2008). Fast and frugal trees make very fast decisions based on a few (usually 1 to 5) pieces of information and ignore all other information. In other words, Fast and frugal trees are noncompensatory, meaning that once they make a decision based on a few pieces of information, no additional information can ever change the decision. Because they are so simple to use, they have been used in many real-world decision tasks from making coronary artery disease diagnoses (Green & Mehr, 1997), to diagnosing depression (Jenny, Pachur, Williams, Becker & Margraf, 2013). However, lest you think that fast and frugal trees are only useful when time is limited, research has shown that fast and frugal trees can out-predict more complex models in decidedly non-human simulations (Gigerenzer, Czerlinski & Martignon, 1999).
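As a concrete illustration of what "noncompensatory" means, here is a toy fast and frugal tree written as plain R code. The cues and thresholds are invented for this example; they are not from any of the cited papers:

```r
# A toy fast and frugal tree (hypothetical cues and thresholds).
# Each node checks exactly one cue; once a node exits with a decision,
# all remaining cues are ignored -- the tree is noncompensatory.
fft.classify <- function(fever, chest.pain, age) {
  if (fever > 39)  return("high risk")  # cue 1 can decide by itself
  if (chest.pain)  return("high risk")  # cue 2, only reached if cue 1 didn't exit
  if (age > 65)    return("high risk")  # cue 3
  "low risk"                            # no cue triggered an exit
}

fft.classify(fever = 40, chest.pain = FALSE, age = 30)  # "high risk": decided on cue 1 alone
fft.classify(fever = 37, chest.pain = FALSE, age = 50)  # "low risk"
```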

While fast and frugal trees have shown promise, there are currently no off-the-shelf methods to create them. How can you create your own fast and frugal decision trees for your own dataset? Starting today, you can use the FFTrees R package available on CRAN. The main function in the package is FFTrees(), which takes a standard formula and data argument, and returns a fast and frugal tree (FFTrees) object. From this object, you can view its underlying trees, along with many standard classification statistics (e.g., hit rate, false alarm rate, AUC) applied to both training and test (i.e., prediction) datasets. Finally, the function has two alternative classification algorithms, logistic regression and CART, built in, so you can always compare the accuracy of your fast and frugal trees to two gold standards in the classification literature. If you’re like me, you’ll be amazed at how well simple, transparent fast and frugal trees perform relative to these gold standards, especially in predicting new data!

The FFTrees package in action

You can install and load the FFTrees package from CRAN:

[code language=”r”]
install.packages("FFTrees")
library("FFTrees")
[/code]

Once you’ve installed the package, you can view the overview vignette by running the code FFTrees.guide(). However, for this blog post I’ll show you how to create fast and frugal trees for predicting breast cancer. The data we’ll use comes from the Wisconsin Breast Cancer Database (data source). The data is stored as a dataframe with 699 rows, representing 699 patients, and 10 columns. The 10 columns represent 9 physiological measurements, from cell sizes to cell shapes, and 1 binary variable (diagnosis) indicating whether the woman truly does, or does not have breast cancer. Here is how the first few rows of the dataframe look:

[code language=”r”]
  thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare chromatin nucleoli mitoses diagnosis
1         5             1              1        1          2           1         3        1       1     FALSE
2         5             4              4        5          7          10         3        2       1     FALSE
3         3             1              1        1          2           2         3        1       1     FALSE
4         6             8              8        1          3           4         3        7       1     FALSE
5         4             1              1        3          2           1         3        1       1     FALSE
6         8            10             10        8          7          10         9        7       1      TRUE
[/code]

To create a fast and frugal tree from the dataset, we’ll use the FFTrees() function, entering formula = diagnosis ~ ., meaning that we want to predict diagnosis as a function of (potentially) all other variables, and data = breastcancer. We’ll assign the result to a new object of class FFTrees called breastcancer.fft:

[code language=”r”]
breastcancer.fft <- FFTrees(formula = diagnosis ~ .,
                            data = breastcancer)
[/code]

Now that we’ve created the object, we can print it to the console to get basic information

[code language=”r”]
breastcancer.fft
# "An FFTrees object containing 6 trees using 4 cues {cellsize.unif,cellshape.unif,nuclei.bare,epithelial} out of an original 9"
# "Data were trained on 683 exemplars. There were no test data"
# "FFT AUC: (Train = 0.98, Test = NA)"
# "My favorite tree is #3 [Training: HR = 0.93, FAR = 0.05], [Testing: HR = NA, FAR = NA]"
[/code]

The printout tells us that the final FFTrees object contains 6 different trees, and the largest tree only uses 4 of the original 9 cues. To see the best tree, we can simply plot the FFTrees object:

[code language=”r”]
plot(breastcancer.fft,
     main = "Breastcancer FFT",
     decision.names = c("Healthy", "Cancer"))
[/code]

 

[Figure: the fast and frugal tree for the breast cancer data]

 

There’s one of our fast and frugal trees! In the top section of the plot, we see that the data had 444 true healthy cases and 239 true cancer cases. In the middle section, we see the tree itself. The tree starts by looking at the cue cellsize.unif: if the value is less than 3, the tree decides that the person is healthy. If not, the tree looks at cellshape.unif: if cellshape.unif <= 2, the tree decides the patient is healthy; if cellshape.unif > 2, it decides that the person does have cancer. That’s the whole decision algorithm! Now isn’t that a lot easier to interpret than something like logistic regression? Imagine handing a full logistic regression equation to anyone without a degree in statistics.
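The displayed tree is simple enough to transcribe into a few lines of plain R (thresholds taken from the plot described above; this is a transcription of the tree, not FFTrees internals):

```r
# The two-cue breast cancer tree described above, as a plain R function
classify.patient <- function(cellsize.unif, cellshape.unif) {
  if (cellsize.unif < 3)    return("Healthy")  # exit 1: small, uniform cell sizes
  if (cellshape.unif <= 2)  return("Healthy")  # exit 2: regular cell shapes
  "Cancer"                                     # otherwise: predict cancer
}

classify.patient(cellsize.unif = 1, cellshape.unif = 1)   # "Healthy"
classify.patient(cellsize.unif = 8, cellshape.unif = 10)  # "Cancer"
```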

Performance

Ok, so our fast and frugal tree is easy to understand and use, but how well does it perform? The bottom section of the above plot shows a series of performance statistics. In the bottom left corner, we can see a classification table, which shows how the tree’s decisions compare to the truth. Entries on the main diagonal (Cor Rej and Hit) correspond to correct decisions, while the other entries correspond to incorrect decisions. As you can see, the tree performed exceptionally well: it made correct diagnoses in 646 (424 + 222) out of all 683 cases (95% correct). Additional performance statistics, including specificity (1 - false alarm rate), hit rate, d-prime, and AUC (area under the curve), are also displayed. Finally, in the bottom right plot, you can see an ROC curve which compares the performance of the trees to CART (in red) and logistic regression (in blue).
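Those statistics can be reproduced by hand from the classification table. Using the counts from the plot (222 hits and 424 correct rejections, out of 239 true cancer and 444 true healthy cases):

```r
# Reconstructing the performance statistics from the classification table
hits <- 222; correct.rejections <- 424
misses <- 239 - hits                       # missed cancer cases
false.alarms <- 444 - correct.rejections   # healthy cases flagged as cancer

hit.rate         <- hits / (hits + misses)
false.alarm.rate <- false.alarms / (false.alarms + correct.rejections)
accuracy         <- (hits + correct.rejections) / 683

round(c(HR = hit.rate, FAR = false.alarm.rate, Accuracy = accuracy), 2)
# HR = 0.93, FAR = 0.05, Accuracy = 0.95
```

These agree with the values reported in the plot; specificity is simply 1 - FAR.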

Viewing other trees

While this tree did well, it still made some errors in both detecting true cancer cases (i.e., hit rate) and in correctly rejecting true healthy cases (i.e., specificity). Now, what if you want a tree that rarely misses true cases, at the cost of additional false alarms? As Luan, Schooler, & Gigerenzer (2011) have shown, you can easily shift the balance of errors in a fast and frugal tree by adjusting the decisions it makes at each level of the tree. The FFTrees() function automatically builds several versions of the same general tree that make different error trade-offs. We can see the performance of each of these trees in the bottom-right ROC curve. Looking at the ROC curve, we can see that tree number 5 has a very high hit rate, but a smaller specificity. We can look at this tree by adding the tree = 5 argument to plot():

[code language=”r”]
plot(breastcancer.fft,
     main = "Breastcancer FFT",
     decision.names = c("Healthy", "Cancer"),
     tree = 5)
[/code]

 

[Figure: tree #5 for the breast cancer data]

 

Here is the resulting tree. As you can see, this tree uses an extra cue called nuclei.bare. It has a perfect hit rate of 100% (just as we wanted), but at the cost of a lower specificity of 80%.

Additional arguments

Cross-validation: The FFTrees() function allows you to easily create trees from a training dataset, and test the performance of the trees with a test dataset (aka, cross-validation). You can do this by either entering an explicit test dataset in the data.test argument, or by randomly splitting the main dataset into a separate training and testing sample with train.p. For example, train.p = .5 will randomly split the data into a 50% training set, which will be used to build the trees, and a 50% test set, which will be used to evaluate their prediction performance.

[code language=”r”]
# Create a 50% training and 50% testing dataset with train.p = .5
breastcancer.test.fft <- FFTrees(formula = diagnosis ~ .,
                                 data = breastcancer,
                                 train.p = .5)
[/code]

Restricting trees: If you want to explicitly decide which cues you want in the tree, you can specify this in the formula argument. For example, the following code will generate a tree from the breast cancer data, but only using the cues thickness, mitoses, and adhesion.

[code language=”r”]
# Only use 3 cues in the trees
breastcancer.r.fft <- FFTrees(formula = diagnosis ~ thickness + mitoses + adhesion,
                              data = breastcancer)
[/code]

Summary

The FFTrees package contains lots of other functions for visualising and comparing trees. In addition, it contains several real world datasets you can use to predict who survived the Titanic (titanic), whether a mushroom is poisonous or not (mushrooms), or how a politician will vote on a new bill (voting). To see all the details, check out the package vignettes either in the package or on CRAN (here). For all you judgment and decision making researchers out there, I will also be presenting the package at the annual meeting of the Society for Judgment and Decision Making (SJDM) in Boston in November 2016.

The package is also very much in development, so I am grateful for any recommendations, inevitable bug-reports, or criticisms. You can post bug-reports at www.github.com/ndphillips/FFTrees/issues, or email me directly at Nathaniel.D.Phillips.is@gmail.com

Alternative decision tree R packages

I hope you have as much fun with the package as I do, but I must warn you: this package was written by a psychologist who just happens to love programming in R for fun, rather than a statistician who really knows their shit about how best to build trees. For this reason, the code is messy, dreadfully inefficient, and without proper reference to established tree building algorithms. While I hope to collaborate with statisticians in the future to improve the tree building algorithm (likely by implementing one that already exists), it’s not quite there yet. For more established methods, I strongly recommend that you check out the rpart, and party packages.

Coauthors

This package was developed in collaboration with Dr. Hansjoerg Neth and Dr. Wolfgang Gaissmaier at the University of Konstanz, and Dr. Jan Woike at the Max Planck Institute for Human Development in Berlin.

Code / Package Updates

This blog post has been updated to incorporate updates made to version 1.1.4 of the FFTrees package. Specifically, the original fft() function was changed to FFTrees() to avoid confusion with fast fourier transform. All the original code (should) still work, but with some warnings.

References

Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107–143.

Gigerenzer, G., & Todd, P. M. (1999). Fast and frugal heuristics: The adaptive toolbox. In Simple heuristics that make us smart (pp. 3–34). Oxford University Press.

Gigerenzer, G., Czerlinski, J., & Martignon, L. (1999). How good are fast and frugal heuristics? In Decision science and technology (pp. 81–103). Springer.

Gladwell, M. (2007). Blink: The power of thinking without thinking. Back Bay Books.

Green, L., & Mehr, D. R. (1997). What alters physicians’ decisions to admit to the coronary care unit? Journal of Family Practice, 45(3), 219–226.

Jenny, M. A., Pachur, T., Williams, S. L., Becker, E., & Margraf, J. (2013). Simple rules for detecting depression. Journal of Applied Research in Memory and Cognition, 2(3), 149–157.

Malkiel, B. G. (1999). A random walk down Wall Street: including a life-cycle guide to personal investing. WW Norton & Company.

Marewski, J. N., & Gigerenzer, G. (2012). Heuristic decision making in medicine. Dialogues Clin Neurosci, 14(1), 77–89.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.

Luan, S., Schooler, L. J., & Gigerenzer, G. (2011). A signal-detection analysis of fast-and-frugal trees. Psychological Review, 118(2), 316.

The new and improved pirateplot()! Now with themes!

Note: this post refers to an outdated version of the pirateplot() function. For the latest description of the function, check out http://rpubs.com/yarrr/pirateplot

Hello fellow R pirates! I recently wrote a post demonstrating the pirateplot() function in the yarrr package. The pirateplot() function replaces deceptive (and boring) bar plots with a full RDI plot, capable of plotting Raw data (and smoothed densities), Descriptive statistics (like means and medians), and Inferential statistics (a 95% Bayesian Highest Density Interval, or "HDI").

In the past 48 hours I’ve had an overwhelmingly positive response to the pirateplot() function. At the same time, I received several great recommendations for tweaks and improvements to the function to make it even better. After consulting the original pirate text, I decided that the changes stay true to the original piratey intent of the pirate plot. I’m sure the founding members of the pirate plot would be proud.

Let’s go over the changes! First, let’s install and load the yarrr package, which contains the pirateplot() function:

[code language=”r”]
install.packages("yarrr")
library("yarrr")
[/code]

 

Now you’re ready to make some pirate plots! Here are the major updates to the function:

Opacity Themes

The five critical aspects of a pirate plot are the bars, beans, points, (average) lines, and HDIs. You can adjust the opacity of each of these elements with opacity arguments, such as bars.o, beans.o, and so on.

The biggest update to pirateplot() is the addition of opacity themes, which are designated by a new argument called theme.o. The input to this argument is an integer (currently 0, 1, 2, or 3) that defines an opacity theme across all five elements. Themes 1, 2, and 3 create specific opacity values for each of the elements, while Theme 0 sets all opacities to 0. Thankfully, the themes just set default values for the individual element opacities: you can still override the opacity of any specific element within a theme by including an element-specific opacity value.

Here are examples of the three different themes applied to the ChickWeight dataset:

Theme 1

Theme 1 emphasises the bars, with light points and beans (I’ll use the appletv palette for this one):

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 1\nappletv palette",
           theme.o = 1,
           pal = "appletv")
[/code]

[Figure: Theme 1 with the appletv palette]

Theme 2

Theme 2 emphasises the points and beans (using the southpark palette)

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 2\nsouthpark palette",
           theme.o = 2,
           pal = "southpark")
[/code]

[Figure: Theme 2 with the southpark palette]

Theme 3

Theme 3 emphasises the 95% Highest Density Intervals (HDIs). Keep in mind that calculating HDIs can take a few seconds for each bean. Here I’ll use the basel palette:

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nbasel palette",
           theme.o = 3,
           pal = "basel")
[/code]

[Figure: Theme 3 with the basel palette]

Theme 0

In Theme 0, all opacities are set to 0 by default, so you can individually specify the opacity of each element. In this example, I’ll turn the lines on full-blast and turn the points on slightly, while also increasing the amount of jittering and the size of the points. I’ll use the google palette:

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 0\ngoogle palette",
           pal = "google",
           point.o = .2,
           line.o = 1,
           theme.o = 0,
           line.lwd = 10,
           point.pch = 16,
           point.cex = 1.5,
           jitter.val = .1)
[/code]

[Figure: Theme 0 with the google palette]

Of course, you can still change the colors of the plotting elements with the pal argument, and the background using the back.col argument. Here’s an x-ray version of Theme 3:

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nlight color with black background",
           pal = "white",
           theme.o = 3,
           point.pch = 16,
           back.col = gray(.2))
[/code]

[Figure: Theme 3 with light colors on a black background]

Gridlines

You can now include gridlines in your plot with the gl.col argument. Just specify the color of the lines and the function will put them in reasonable places. The following plot also shows how pirateplot() handles two independent variables:

[code language=”r”]
pirateplot(formula = weight ~ Diet + Time,
           data = subset(ChickWeight, Time < 10),
           theme.o = 2,
           pal = "basel",
           point.pch = 16,
           gl.col = gray(.8),
           main = "Two IVs\nWith gridlines")
[/code]

[Figure: pirateplot with two IVs and gridlines]

Other minor changes

I’ve also made the following smaller changes

  • The function no longer automatically sorts the levels of the IV. It will plot levels of the IV(s) in the order they are found in the original dataframe.
  • You can now manually change the color of the bar borders with the bar.border.col argument.

Happy pirate plotting!

As always, I am very happy to receive more comments and feedback on the function. You can write me at yarrr.book@gmail.com.

The Pirate Plot (2.0) – The RDI plotting choice of R pirates


[Figure: example pirateplot]
Note: this post refers to an outdated version of the pirateplot() function. For the latest description of the function, check out http://rpubs.com/yarrr/pirateplot

Barplots suck

Plain vanilla barplots are as uninformative (and ugly) as they are popular. And boy, are they popular. From the floors of congress to our latest scientific articles, barplots surround us. The reason barplots are so popular is that they are so simple and easy to understand. However, that simplicity also carries costs: namely, barplots can mask important patterns in data, like multiple modes and skewness.

Instead of barplots, we should be using RDI plots, where RDI stands for Raw (data), Description, and Inference. Specifically, an RDI plot should present complete Raw data (including smoothed densities), Descriptive statistics (like means and medians), and Inferential statistics (like a Bayesian 95% Highest Density Interval, or HDI). The R community already has access to many great examples of plots that come close to the RDI trifecta. For example, beanplots, created by the beanplot() function, show complete raw data and smoothed distributions (Kampstra, 2008).

Today, the R community has access to a new RDI plot — the pirate plot. I discovered the original code underlying the pirate plot during a late night swim on the Bodensee in Konstanz, Germany. The pirate plot function was written in an archaic German pirate dialect on an old beer bottle and is unfortunately unusable. However, I have taken the time to painstakingly translate the original pirate code into a new R function called pirateplot(). The latest version (now 2.0) of the translation is stored in the yarrr package on GitHub at www.github.com/ndphillips/yarrr. To install the package and access the pirateplot() function within R, first install and load the yarrr package:

[code language=”r”]
install.packages("yarrr")
library("yarrr")
[/code]

 

Now you’re ready to make some pirate plots! Let’s create a pirate plot from the pirates dataset in the yarrr package. This dataset contains results from a survey of several pirates at the Bodensee in Konstanz. We’ll create a pirateplot showing the distribution of ages of pirates based on their favorite pirate:

[code language=”r”]
pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "My First Pirate Plot!")
[/code]

[Figure: a first pirate plot of age by favorite pirate]

The arguments for the pirateplot are very similar to those of other plotting functions like barplot() and beanplot(). The key arguments are formula, where you specify one (or two) categorical variable(s) for the x-axis, and a numerical variable for the y-axis.

In addition to the data arguments, there are arguments that dictate the opacity of the five key elements of a pirate plot: bar.o (the opacity of the bars), bean.o (the beans), point.o (the points), and line.o (the average lines at the top of the bars). Finally, hdi.o controls the opacity of the 95% Bayesian Highest Density Interval (HDI). The HDIs are calculated using the BEST package (Kruschke, 2013). Because calculating HDIs can be time-consuming, they are turned off by default (i.e., hdi.o = 0). In the next plots, I’ll turn them on so you can see them.

The pirateplot() function has built-in color arguments. You can control the overall color palette of the plot with pal, and the color of the plot background with back.col. Let’s change a few of these arguments. I’ll also include the 95% Highest Density Intervals (HDIs) by setting hdi.o = .7.

[code language=”r”]
pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "Black and White Pirate Plot",
           pal = "black",
           hdi.o = .7,
           line.o = 1,
           bar.o = .1,
           bean.o = .1,
           point.o = .1,
           point.pch = 16,
           back.col = gray(.97))
[/code]

[Figure: black and white pirateplot with 95% HDIs]

As you can see, the entire plot is now grayscale, and different elements of the plot have been emphasised by changing the opacity arguments. For example, now that we’ve set the opacity of the HDI to .7 (the default is 0), we can see the Bayesian 95% Highest Density Interval for the mean of each group.

Hopefully it’s clear how much better RDI plots are than standard bar plots. Now, in addition to just seeing one piece of information (the mean) of each group, we can see all the raw data, a smoothed density curve of the data (helpful for detecting multiple modes and skewness), as well as Bayesian inference.

Oh, and just for comparison purposes, we can create a standard barplot within the pirateplot() function by adjusting the opacity arguments:

[code language=”r”]

pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "A Standard Barplot",
           pal = "black",
           hdi.o = 0,
           line.o = 0,
           bar.o = 1,
           bean.o = 0,
           point.o = 0)
[/code]

[Figure: a standard barplot of age by favorite pirate]

Now how awful does that barplot look in comparison to the far superior pirate plot?!

You can also include multiple independent variables as arguments to the pirateplot() function. For example, I can plot the pirates’ beard lengths separated by sex and the college each pirate went to. For this plot, I’ll use the southpark palette and emphasise the HDI by turning its opacity up to .6.

[code language=”r”]

pirateplot(formula = beard.length ~ sex + college,
           data = pirates,
           main = "Beard lengths",
           pal = "southpark",
           xlab = "",
           ylab = "Beard Length",
           point.pch = 16,
           point.o = .2,
           hdi.o = .6,
           bar.o = .1,
           line.o = .5)
[/code]

[Figure: pirate plot of beard length by sex and college, southpark palette]

As you can see, it’s very easy to customise the look and focus of your pirate plot. Here are 6 different plots of the weights of chickens given one of 4 diets (from the ChickWeight dataframe in R). You can see the code for each by accessing the help menu for the pirateplot() function within R.

[Figure: a matrix of 6 customised pirate plots of the ChickWeight data]

Have fun creating your own pirate plots! If you have suggestions for further improvements, don’t hesitate to write my squire at yarrr.book@gmail.com.

References

Kampstra, P. (2008). Beanplot: A boxplot alternative for visual comparison of distributions. Journal of Statistical Software, Code Snippets, 28(1), 1-9. http://www.jstatsoft.org/v28/c01/

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573-603. doi:10.1037/a0029146

www.thepiratesguidetor.com

I missed out on www.yarrr.com (some domain squatter is holding it for a pirate’s ransom), but I was able to snag www.thepiratesguidetor.com! This is now the official home of YaRrr! The Pirate’s Guide to R.

For those of you who don’t know, this whole story started last summer. I was taking a late-night swim in the Bodensee in Konstanz and saw a rusty object sticking out of the water. Upon digging it out, I realised it was an ancient USB stick with the word YaRrr inscribed on the side. Intrigued, I brought it home and plugged it into my laptop. On the stick, I found a single pdf file written entirely in pirate-speak. After watching several pirate movies, I learned enough pirate-speak to begin translating the text to English. Sure enough, the book turned out to be an introduction to R called The Pirate’s Guide to R.

This book clearly has both massive historical and pedagogical significance. Most importantly, it turns out that pirates were programming in R well before the earliest known advent of computers. Of slightly less significance is that the book has turned out to be a surprisingly up-to-date and approachable introductory text to R. For both of these reasons, I felt it was my duty to share the book with the world. You can download the latest version of the pdf (again, I am continually updating the translations) by clicking on the pirate’s beard below:

[Figure: the YaRrr! book cover – click the pirate’s beard to download the pdf]

piratepal – An R pirate’s source of palettes inspired by everything from the Mona Lisa to the Evil Dead

When most people think about R, they think of statistics. They don’t think of horror films. I’m here to change that.

Let’s back up. While R is certainly one of the top two languages for statistical analysis (I’m looking over my shoulder at you, Python), it’s just as good for creating beautiful and informative plots – but only if you have the right colors. No matter what kind of plot you create, if you use colors that suck, your plot will look like shit. To pick colors that go well together, you need a color palette.

How can you select a color palette in R? Up until now I’ve been using the great RColorBrewer package. However, I’ve lost my enthusiasm for the palettes in that package – not because they’re bad, but because they didn’t mean anything to me. To fix this, I put together several palettes, some created by graphic artists and others inspired by movies, works of art, and my own aimless Googling. They are all stored in a function called piratepal() contained in the yarrr package.

To use piratepal(), you first need to download and install the yarrr package (if you don’t have the devtools package installed, you’ll need to install that first):

[code language=”r”]
install.packages("devtools")
library("devtools")
install_github("ndphillips/yarrr")
library("yarrr")
[/code]

Once you’ve installed the yarrr package, you can learn about the piratepal() function using the help menu:

[code language=”r”]
?piratepal
[/code]

Here, you’ll see that piratepal() has three arguments:

  • palette: A string defining the color palette to use (see examples). To use a random palette, use “random”. To view all palettes, use “all” combined with action = “show”
  • action: Either “return” to return a vector of colors, or “show” to show the palette. You can also use “r” or “s” for shorthand.
  • trans: A number in the interval [0, 1] indicating how transparent to make the colors. A value of 0 means no transparency and a value of 1 means complete transparency. Defaults to 0.

Let’s start by seeing all the different palettes in the package. To do this, set palette = “all”, and action = “show”:

[code language=”r”]
piratepal(palette = "all", action = "show")
[/code]

[Figure: overview of all the palettes in the yarrr package]

Here, you can see a brief overview of all the palettes in the package. The names of the palettes are on the left, and the colors in each palette are displayed to the right of the names. If you want to see the colors in a specific palette in more detail, you can see them (combined with an inspirational image) by using action = “show”.

For example, here is a palette called “scholar” that I got from this blog post on the Shutterstock website:

[Figure: the scholar palette]

Now if you want to use the colors in a palette, just use the action = “return” argument to return a vector of the palette colors:

[code language=”r”]
piratepal(palette = "scholar", action = "return")
[/code]

This code will return the following vector of the colors in the palette:

[code language=”r”]
    blue1     yellow      blue2       gray
#2B3E56FF  #CDA725FF  #C5CBD3FF  #F1F1E7FF
[/code]

Once you’ve assigned this vector of colors to an object (like my.colors), you can use them in any plot that you’d like!
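For example, here’s a minimal sketch of that idea using the scholar hex values shown above, stored by hand (so it runs even without the yarrr package installed):

```r
# The scholar palette colors shown above, stored by hand for illustration
my.colors <- c(blue1  = "#2B3E56FF",
               yellow = "#CDA725FF",
               blue2  = "#C5CBD3FF",
               gray   = "#F1F1E7FF")

# Use them like any other colors in R
barplot(c(3, 7, 5, 9), col = my.colors, border = NA,
        main = "A barplot colored with the scholar palette")
```

In practice you’d get the same vector directly from piratepal("scholar", action = "return") instead of typing the hex codes.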

You’ll notice that while some of these palettes have vague, generic names (like “scholar”), others are suspiciously recognisable. For example, what is this “monalisa” palette? Well, I discovered this blog post from a site called artvarsity that contains color palettes inspired by classical art. So naturally, I stole some and put them into the package. For example, let’s look at the “monalisa” palette:

[code language=”r”]
piratepal(palette = "monalisa", action = "show")
[/code]

[Figure: the monalisa palette]

Oh yeah, I’ve also got several palettes inspired by movies. Oh yes, palettes from movies. I discovered this badass site called Movies in Color which provides color palettes for tons of movies. Don’t get me wrong, these palettes aren’t necessarily the most beautiful in the world – but who doesn’t want to use a palette from a horror movie?!

Here’s a palette from “Eternal Sunshine of the Spotless Mind” – the film shown at every psychology undergraduate social event:

[code language=”r”]
piratepal(palette = "eternal", action = "show")
[/code]

[Figure: the eternal palette]

Since I love Pixar movies, and found this great website showing palettes for all the Pixar films, I added lots of palettes from Pixar movies. Let’s take a look at the “up” palette:

[code language=”r”]
piratepal(palette = "up", action = "show")
[/code]

[Figure: the up palette]

Just for fun, let’s draw the house from the movie “Up” using the “google” palette (using my handy digital color meter, I also stole the colors from the Google logo).

[code language=”r”]

google.colors <- piratepal("google", trans = .1)
n.balloons <- 500
x <- rnorm(n.balloons, 0)
y <- rnorm(n.balloons, 2, 1)

# Set up an empty plotting space
plot(1, xlim = c(-7, 7), ylim = c(-7, 7),
     xlab = "", ylab = "", type = "n", xaxt = "n", yaxt = "n", bty = "n")

# Draw the house
rect(-2, -6, 2, -2)                 # walls
polygon(c(-2, 0, 2), c(-2, 0, -2))  # roof
rect(-7, -7, -2, 100)               # background left of the house
rect(2, -7, 7, 100)                 # background right of the house
rect(-.5, -6, .5, -4)               # door
points(.3, -5)                      # doorknob

# Draw the balloon strings and balloons
line.start.x <- rnorm(n.balloons, 0, .4)
line.start.y <- -1 + rnorm(n.balloons, 0, .1)

segments(line.start.x, line.start.y, x, y, lty = 1, col = gray(.7), lwd = .2)
points(x, y, pch = 21, bg = sample(google.colors, n.balloons, replace = TRUE),
       col = gray(.9), cex = rnorm(n.balloons, 2, .3))
[/code]

Here’s the result! Oh and did you know that the Up house really exists in Seattle?!?! Here’s the real one!:

[Figure: the finished Up house plot, alongside the real Up house in Seattle]

Finally, for all you horror fans out there: here’s the Evil Dead palette. I’ll warn you, it’s pretty bleak – but honestly what did you expect?

[code language=”r”]
piratepal(palette = "evildead", action = "show")
[/code]

[Figure: the evildead palette]

Let’s take advantage of this friendly, colorful palette. Here are the domestic revenues of the 4 Evil Dead films (from the-numbers.com):

[code language=”r”]

evil.colors <- piratepal(palette = "evildead",
                         action = "return",
                         trans = .2)

revenues <- c(2400000, 5923044, 11502976, 54239856)

plot(1, xlim = c(.5, 4.5), ylim = c(0, 59000000), type = "n",
     xlab = "", ylab = "", xaxt = "n", bty = "n",
     yaxt = "n", main = "Evil Dead Movie Domestic Revenues")

mtext("Using the evildead palette in the yarrr package!", side = 3,
      cex = .8, line = .5)

segments(1:4, rep(0, 4), 1:4, revenues, lty = 2)

abline(h = seq(0, 55000000, 10000000), col = gray(.9))

adj <- 2000000
text(1:4, revenues + adj,
     labels = c("$2.4 Mil", "$5.9 Mil", "$11.5 Mil", "$54 Mil"), pos = 3)

points(1:4, revenues,
       col = evil.colors, cex = 4, pch = 16)

mtext(c("Evil Dead\n(1983)", "Evil Dead 2\n(1987)", "Army of Darkness\n(1993)",
        "Evil Dead\n(2013)"), at = 1:4, side = 1, line = 1)
[/code]

[Figure: Evil Dead movie domestic revenues plotted with the evildead palette]

Here are a few of the other movie-based palettes:

[code language=”r”]
piratepal(palette = "ghostbusters", action = "show")
[/code]

Dr Ray Stantz: Everything was fine with our system until the power grid was shut off by dickless here.
Walter Peck: They caused an explosion!
Mayor: Is this true?
Dr. Peter Venkman: Yes it’s true.
[pause]
Dr. Peter Venkman: This man has no dick.

[Figure: the ghostbusters palette]

[code language=”r”]
piratepal(palette = "rat", action = "show")
[/code]
Remy: [as Emile tastes a piece of cheese] Creamy, salty-sweet, an oaky nuttiness… You detect that?
Emile: Oh, I’m detecting nuttiness…

[Figure: the rat palette]

I hope these art / movie / random googling inspired palettes inspire people to make plots with better colors. Oh and if you have a favorite movie that you’d like me to add to the function in the form of a palette, let me know. I can always use a good movie recommendation. Bonus points if it’s available on German Netflix.

Coin flips and the Hot Hand

The Hot Hand is the belief that an actor, such as a basketball player, is more likely to be successful if he has just experienced success than if he has just experienced a failure. For example, a basketball player is believed to be ‘hot’ when he has just made one (or more) shots in a row – and as a result, is more likely to make his next shot than if he was not in this ‘hot’ state. In statistical language, the hot hand is the claim that the conditional probability of a success given a prior success (p(Success | Success)) is higher than the conditional probability of a success given a previous failure (p(Success | Failure)).

While belief in the hot hand is pervasive, several decades of research, notably a classic study by Gilovich, Vallone, and Tversky, have suggested that it in fact does not exist. Rather, it is the result of a cognitive fallacy wherein people expect random sequences to have fewer strings of successes than actually occur by chance.

However, in the past few weeks, an unpublished manuscript written by Joshua Miller and Adam Sanjurjo has apparently dropped an atom bomb on this debate. It was even picked up by the Wall Street Journal! The fact that the WSJ is writing about the results from an unpublished manuscript shows how big of a deal this is. So what are Miller and Sanjurjo claiming that set off this massive reaction? They claim that in fact scientists, rather than basketball fans, have been making a statistical mistake. Specifically, Miller and Sanjurjo claim that, for a fair coin, the probability of a head given a previous head may not be 0.50!

Miller and Sanjurjo take the following example:

Take a coin and flip it 3 times, and calculate the percentage of times where a head was followed by a head. Call this the “Head Repetition Probability.” Repeat this procedure 1 million times. What will the average Head Repetition Probability be?

If your answer is 50%, then you’re like most people. And like most people, you’re wrong. In fact, the expected probability is 41.7% – substantially lower than the intuitive 50% that most people, and prior researchers, believed. Why is this important? Well, if researchers find that real basketball players have a Basket Repetition Probability of 50% (and they have), then those players have a higher repetition probability than is expected by chance (i.e., 41.7%)! Thus, the often-detected 50% repetition probability in real players may in fact be evidence for the Hot Hand!
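This counterintuitive number is easy to check yourself with a quick Monte Carlo simulation in base R. Here’s a sketch of the 3-flip version (the case worked through analytically below), whose expected value is 5/12 ≈ 41.7%; the function name head.rep.prob is my own, used only for illustration:

```r
set.seed(1)

# For one sequence of 3 fair coin flips, compute the proportion of
# heads (on the first 2 flips) that were followed by another head
head.rep.prob <- function(n.flips = 3) {
  flips <- sample(c(0, 1), n.flips, replace = TRUE)
  heads1 <- flips[1:(n.flips - 1)] == 1
  if (!any(heads1)) return(NA)           # no H before the last flip: undefined
  mean(flips[2:n.flips][heads1] == 1)
}

# Average the per-sequence proportions over many simulated sequences
sims <- replicate(100000, head.rep.prob())
mean(sims, na.rm = TRUE)                 # close to 5/12 = 0.417, not 0.50
```

The key (and, as we’ll see, problematic) step is averaging the per-sequence proportions, exactly as in the thought experiment above.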

But where does this 41.7% figure come from? Does it really reflect the expected probability of getting a head given that the previous outcome was a head? If so, then coins would be inherently anti-streaky, which would pretty much contradict everything we know from Probability 101. Thankfully, that’s not what the 41.7% number reflects. Rather, the number comes from an improper statistic that treats sequences that are actually quite different as equivalent.

Where the 41.7% figure comes from (and how to fix it)

To see where the 41.7% figure comes from, I will follow Andrew Gelman’s example and simulate the result in R using flips of a virtual fair coin. Using the coin analogy, we want to know “What is the probability of flipping a H given that a H occurred on the previous flip?”

However, in contrast to Gelman, I will not randomly simulate coin flips. Instead, I’ll use R to calculate all solutions analytically.

First, we’ll create a dataframe called possible showing all possible sequences of 3 flips of a fair coin (you can easily change the number of flips in the code via the n.flips value). We’ll use 1 to denote a Head and 0 to denote a Tail:

[code language=”r”]
n.flips <- 3
possible <- expand.grid(lapply(1:n.flips, function(x) {c(0, 1)}))
names(possible) <- paste("f", 1:n.flips, sep = "")

possible

## f1 f2 f3
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 5 0 0 1
## 6 1 0 1
## 7 0 1 1
## 8 1 1 1
[/code]

As you can see, there are 8 possible sequences (2 to the power of 3). However, we need to remove the two cases where there are no heads in the first 2 flips. Why? Because if there are no heads in the first two flips, then there is no possibility for a sequence of two heads anywhere in the sequence.

[code language=”r”]
possible <- possible[rowMeans(possible[, 1:(n.flips - 1)]) != 0, ]
possible

## f1 f2 f3
## 2 1 0 0
## 3 0 1 0
## 4 1 1 0
## 6 1 0 1
## 7 0 1 1
## 8 1 1 1
[/code]

Now we have 6 possible sequences that are equally likely to occur. For each of these sequences, how often is there an HH (a Head-Head repetition)? To answer this, let’s calculate three statistics:

  • streak.opportunities: How many opportunities are there for an HH streak? In other words, how many H values are there in the first 2 flips?
  • actual.streaks: How many actual HH streaks are there?
  • p.HH: Of those possible streak opportunities, what percentage of the time was there an actual streak?

[code language=”r”]
for(i in 1:nrow(possible)) {

  seq.values <- as.numeric(possible[i, 1:n.flips])

  heads1 <- seq.values[1:(n.flips - 1)] == 1  # was flip i a head?
  heads2 <- seq.values[2:n.flips] == 1        # was flip i + 1 a head?

  streak.opportunities.i <- sum(heads1)
  actual.streaks.i <- sum(heads1 & heads2)

  p.hh <- actual.streaks.i / streak.opportunities.i

  possible$hh.streak.opportunities[i] <- streak.opportunities.i
  possible$hh.streaks[i] <- actual.streaks.i
  possible$p.hh[i] <- p.hh
}

possible

## f1 f2 f3 hh.streak.opportunities hh.streaks p.hh
## 2 1 0 0 1 0 0.0
## 3 0 1 0 1 0 0.0
## 4 1 1 0 2 1 0.5
## 6 1 0 1 1 0 0.0
## 7 0 1 1 1 1 1.0
## 8 1 1 1 2 2 1.0
[/code]

We’re almost to the 41.7% number! So, to calculate the expected probability of getting HH, we can calculate the average p.HH value across all 6 sequences:

[code language=”r”]
mean(possible$p.hh)
## [1] 0.4166667
[/code]

Wow! The value is indeed 41.67%! But does this number really represent the conditional probability of getting a H given a H? No, it doesn’t.

The Problem

Why is the 41.7% number misleading? Well, in our previous calculation, we treated all flip sequences as equally informative. That is, because each sequence is equally likely, we simply took the average proportion of hits given hits (p.HH) across all 6 sequences. However, this is a false assumption! Indeed, some of these sequences are more informative than others.

To see this, we’ll take an extreme example. Consider flipping a coin 10 times. Here are two possible sequences. Assuming the coin is fair, both of these sequences are equally likely. For each sequence, I’ve calculated the number of HH streak opportunities, the number of actual HH streaks, and the percentage of HH streaks.

Sequence                        Streak opportunities   Number of streaks   Streak percentage
1, 1, 1, 1, 1, 1, 1, 1, 1, 1    9                      9                   1.0
0, 0, 0, 0, 0, 0, 0, 0, 1, 1    1                      1                   1.0

In the first sequence, there are 9 opportunities for a streak (in the 2nd through the 10th flip). Additionally, there were 9 actual streaks, so the proportion of streaks was 1.0.

Now, consider the second sequence. In this sequence, there is only one opportunity for an HH streak (on the final 10th flip). And indeed, there was an H on the 10th and final flip. Therefore, the proportion of HH streaks out of all possible HH streaks was 100% – just like in the first sequence! But are these sequences really equally informative? I think not. In the first sequence, we considered all 10 data points, while in the second sequence, we ignored the first 8 flips and only considered the last two.
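To make the comparison concrete, here is a small helper (a sketch; streak.stats is my own hypothetical name, not part of yarrr) that computes these statistics for any 0/1 sequence:

```r
# Count HH opportunities, actual HH streaks, and their ratio for a 0/1 sequence
streak.stats <- function(flips) {
  n <- length(flips)
  opportunities <- sum(flips[1:(n - 1)] == 1)              # H on flips 1..(n-1)
  streaks <- sum(flips[1:(n - 1)] == 1 & flips[2:n] == 1)  # H followed by H
  c(opportunities = opportunities,
    streaks = streaks,
    p.hh = streaks / opportunities)
}

streak.stats(rep(1, 10))           # 9 opportunities, 9 streaks, p.hh = 1
streak.stats(c(rep(0, 8), 1, 1))   # 1 opportunity,  1 streak,  p.hh = 1
```

Both sequences get p.hh = 1, even though the first is based on 9 times as much evidence as the second.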

Think about a real-world example: if you saw a player hit 10 shots in a row, wouldn’t you think she provided stronger evidence for a hot hand than a player who missed 8 shots and then made only 2 in a row?

Thankfully, there is an easy fix to the problem

Fixing the Problem (and finally getting p(H|H) = 50%)

To fix the problem, we need to weight sequences according to how many opportunities there were for an HH streak. To do this, we can simply calculate the total number of HH streaks across all sequences and divide it by the total number of HH streak opportunities across all sequences. In other words, we pool the data across sequences instead of averaging the per-sequence proportions.

Let’s go back to our coin flip dataframe and make this calculation. If I’m right, we should get a p(H|H) probability of 50%:

f1 f2 f3   hh.streak.opportunities   hh.streaks   p.hh
 1  0  0                         1            0    0.0
 0  1  0                         1            0    0.0
 1  1  0                         2            1    0.5
 1  0  1                         1            0    0.0
 0  1  1                         1            1    1.0
 1  1  1                         2            2    1.0

To calculate this, we’ll take the total number of actual streaks (the sum of the 5th column) and divide it by the total number of streak opportunities (the sum of the 4th column). We should get 4 / 8 = 50%:

[code language=”r”]
sum(possible$hh.streaks) / sum(possible$hh.streak.opportunities)
## [1] 0.5
[/code]

Phew, we got it! Again, the reason why our 50% number is larger than the previous 41.7% number is because we’ve given more weight to sequences with more opportunities for HH events than those with fewer opportunities.
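Another way to see why the fix works: the pooled ratio is simply the opportunity-weighted mean of the per-sequence p.hh values. Using the six sequences from the table above:

```r
p.hh <- c(0, 0, .5, 0, 1, 1)  # per-sequence proportions of HH streaks
opp  <- c(1, 1, 2, 1, 1, 2)   # HH streak opportunities per sequence

mean(p.hh)                    # unweighted mean: 5/12 = 0.4167 (the biased value)
weighted.mean(p.hh, opp)      # opportunity-weighted mean: 0.5 (the correct value)
```

The biased 41.7% figure and the correct 50% figure are thus the same data summarised with and without the opportunity weights.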

Conclusion

While the 41.7% number reported by Joshua Miller and Adam Sanjurjo (and subsequently confirmed by Andrew Gelman) sounds like a scary violation of the laws of statistics, it is in fact an artefact of an improper statistical procedure. As I’ve shown, taking the simple average proportion of HH events across many sequences is improper because it gives equal weight to sequences with many opportunities for HH events and to those with only a few. By analogy, consider two basketball players who each take 100 shots. Player A misses his first 97 shots and makes his last 3. In contrast, player B makes all 100 shots. Which player was more hot? According to the improper statistic, both are equally hot, because both have a p(H|H) value of 100%! Following this logic, both provide equivalent (positive) evidence for the hot hand. Of course, this is poor reasoning: we should give player B’s performance more weight. Moving back to the more general example of coin flips, if we avoid this bias by simply dividing the total number of HH events by the total number of possible HH events, we reach the correct (and intuitive) 50% figure.

Addendum: Using Bayes to help correct the error

After finishing the previous analysis, I wondered if we could use Bayesian statistics to correct the bias inherent in the improper statistic. Specifically, if we allow Bayes’ theorem to weight evidence according to the number of opportunities for success (which, as I’ve shown, is critical), can we return to using the improper statistic?

To answer this question, I will repeat the improper statistic procedure, but I won’t simply calculate the mean of the sample pHH proportions. Instead, I’ll first calculate a Bayesian posterior estimate of the pHH value for each player (i.e., each simulated coin), and then calculate the average of these posterior estimates across all players.
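As a sketch of the idea for a single player, with a uniform Beta(1, 1) prior and made-up counts (3 streaks out of 6 opportunities; the numbers are purely illustrative):

```r
# Hypothetical player: 6 HH opportunities, 3 actual HH streaks
opportunities <- 6
streaks <- 3

# Beta(1, 1) prior + binomial data -> Beta(1 + successes, 1 + failures) posterior
post.a <- 1 + streaks
post.b <- 1 + (opportunities - streaks)

qbeta(.5, post.a, post.b)             # posterior median estimate of p(H|H)
qbeta(c(.025, .975), post.a, post.b)  # 95% credible interval
```

The loop below applies exactly this calculation to every simulated player.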

I’ll make this example a bit more interesting by simulating the results of baskets in a shooting tournament. I’ll generate outcomes of 10 sequential shots from 1,000 basketball players. Importantly, each player will have a true 50% chance of success on each shot.

[code language=”r”]

n.players <- 1000
n.shots <- 10
shots <- as.data.frame(matrix(sample(c(0, 1), size = n.players * n.shots, replace = TRUE),
                              nrow = n.players, ncol = n.shots))

[/code]

Again, let’s delete cases where the first 9 shots are all misses:

[code language=”r”]
shots <- shots[rowMeans(shots[,1:9]) != 0,]
[/code]

Now, for each player, let’s calculate four statistics:

  • possible.HH: The number of hits in shots 1-9; this equates to the number of possible HH streaks in the sequence
  • actual.HH: The number of actual HH streaks in the sequence
  • actual.HH.p: The observed proportion of HH streaks in the sequence
  • bayes.HH.p: The Bayesian posterior estimate of the probability of an HH streak, given the player’s observed shots

[code language=”r”]
for(i in 1:nrow(shots)) {

  player.seq <- shots[i, ]

  possible.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1)
  actual.HH.sequences <- sum(player.seq[1:(n.shots - 1)] == 1 & player.seq[2:n.shots] == 1)

  n <- possible.HH.sequences
  p <- actual.HH.sequences / possible.HH.sequences

  # Beta posterior, assuming a uniform Beta(1, 1) prior
  post.a <- 1 + p * n
  post.b <- 1 + (1 - p) * n

  post.median <- qbeta(.5, post.a, post.b)  # posterior median
  lb <- qbeta(.025, post.a, post.b)
  ub <- qbeta(.975, post.a, post.b)

  shots$possible.HH[i] <- possible.HH.sequences
  shots$actual.HH[i] <- actual.HH.sequences
  shots$actual.HH.p[i] <- p
  shots$bayes.HH.p[i] <- post.median
  shots$lb[i] <- lb
  shots$ub[i] <- ub
}
[/code]

Now, let’s calculate the average HH streaks from our measures.

First, let’s use the original biased measure (the same one that got us the ugly 41.7% value for 4 coin flips).

[code language=”r”]
mean(shots$actual.HH.p)
## [1] 0.4431845
[/code]

In my simulation, I got a value of 44.3%. Better than the 41.7% value, but still far away from the desired 50% value.

Now, let’s see how our average Bayesian solution did:

[code language=”r”]
mean(shots$bayes.HH.p)
## [1] 0.4701428
[/code]

The mean Bayesian posterior response is 47.0%, closer to 50% than before, but still not good enough. To be honest, I’m not totally sure why this value does not hit 50% (ignoring sampling error). I’ll still need to work on this.

Finally, let’s use our simple unbiased measure:

[code language=”r”]
sum(shots$actual.HH) / sum(shots$possible.HH)
## [1] 0.4975391
[/code]

Here, I got a value of 49.8% – close enough to be within sampling error of 50%.

The yarrr R package!

After reading Hilary Parker’s and Hadley Wickham’s great tutorials on building R packages, I’ve decided to build the yarrr package. The package contains all of the datasets, functions, and the game-changing jokes contained in YaRrr! The Pirate’s Guide to R.

To install the package from github, run the following code:

[code language=”r”]
install.packages("devtools")
library("devtools")
install_github("ndphillips/yarrr")
library("yarrr")
[/code]

Once you do, you should be able to see the pirates dataset – a historical survey of hundreds of pirates now in R form for your data analysis pleasures:

[code language=”r”]
View(pirates)
[/code]

P.S. Thanks to Hilary Parker for showing me how to do proper code formatting in wordpress!

Press Release: The Janus face of Darwinian competition

Press Release: http://msutoday.msu.edu/news/2015/too-many-candidates-spoil-the-stew/

Paper: The Janus face of Darwinian competition

This week Michigan State University released the following press release on a recent publication in Scientific Reports that I was privileged to be a co-author on! The paper, titled “The Janus face of Darwinian competition” was first-authored by Dr. Arend Hintze at Michigan State University. The paper uses computational techniques to test how competition affects the evolution of decision making strategies. Here is the punchline of the paper: some competition is good, but too much can spur the evolution of poor decision making strategies. The paper builds on the competitive sampling paradigm I co-developed along with Dr. Yaakov Kareev, Dr. Judith Avrahami, and Dr. Ralph Hertwig in our paper “Rivals in the dark: How competition influences search in decisions under uncertainty” published last year in Cognition.