The yarrr package (0.0.8) is (finally!) on CRAN

Great news R pirates! The yarrr package, which contains the pirateplot, has now been updated to version 0.0.8 and is up on CRAN (after hiding in plain sight on GitHub). Let’s install the latest version and go over some of the updates:

[code language=”r”]
install.packages("yarrr") # Install package from CRAN
library("yarrr") # Load the package
yarrr.guide() # Open the package guide
[/code]

The most important function in the yarrr package is pirateplot(). What the heck is a pirateplot? A pirateplot is a modern way of visualising the relationship between a categorical independent variable, and a continuous dependent variable. Unlike traditional plotting methods, like barplots and boxplots, a pirateplot is an RDI plotting trifecta which presents Raw data (all data as points), Descriptive statistics (as a horizontal line at the mean — or any other function you wish), and Inferential statistics (95% Bayesian Highest Density Intervals, and smoothed densities).

pirateplot-elements

For a full guide to the package, check out the package guide on CRAN. For now, here are some examples of pirateplots showing off some of the package updates.

Up to 3 IVs

You can now include up to three independent variables in your pirateplot. The first IV is presented as adjacent beans, the second is presented in different groups of beans in the same plot, and the third IV is shown in separate plots.

Here is a pirateplot of the heights of pirates based on three separate IVs: headband (whether the pirate wears a headband or not), sex, and eyepatch (whether the pirate wears an eye patch or not):

[code language=”r”]
pirateplot(formula = height ~ sex + headband + eyepatch,
           point.o = .1,
           data = pirates)
[/code]

threeivpp

Here, we can see that male pirates tend to be the tallest, but there doesn't seem to be a difference between those who do and don't wear headbands, or between those with and without eye patches.

New color palettes

The updated package has a few fun new color palettes contained in the piratepal() function. The first, called ‘xmen’, is inspired by my 90s Saturday morning cartoon nostalgia.

[code language=”r”]
# Display the xmen palette
piratepal(palette = "xmen",
          trans = .1, # Slightly transparent colors
          plot.result = TRUE)
[/code]

xmen_display

Here, I’ll use the xmen palette to plot the distribution of the weights of chickens over time (if someone has a more suitable dataset for the xmen palette let me know!):

[code language=”r”]
pirateplot(formula = weight ~ Time,
           data = ChickWeight,
           main = "Weights of chickens by Time",
           pal = "xmen",
           gl.col = "gray")

mtext(text = "Using the xmen palette!",
      side = 3,
      font = 3)

mtext(text = "*The mean and variance of chicken\nweights tend to increase over time.",
      side = 1,
      adj = 1,
      line = 3.5,
      font = 3,
      cex = .7)
[/code]

xmen_chikens

The second palette called “pony” is inspired by the Bronys in our IT department.

[code language=”r”]
# Display the pony palette
piratepal(palette = "pony",
          trans = .1, # Slightly transparent colors
          plot.result = TRUE)
[/code]

pony_image

Here, I'll plot the distribution of the lengths of movies as a function of their MPAA ratings (where G means suitable for children and R means suitable for adults) using the pony palette:

[code language=”r”]
pirateplot(formula = time ~ rating,
           data = subset(movies, time > 0 & rating %in% c("G", "PG", "PG-13", "R")),
           pal = "pony",
           point.o = .05,
           bean.o = 1,
           main = "Movie times by rating",
           bean.lwd = 2,
           gl.col = "gray")

mtext(text = "Using the pony palette!",
      side = 3,
      font = 3)

mtext(text = "*Movies rated for children\n(G and PG) tend to be longer \nthan those rated for adults",
      side = 1,
      adj = 1,
      font = 3,
      line = 3.5,
      cex = .7)
[/code]

pony_times

I have to be honest, the pony palette colors are not terribly well suited for this pirateplot — but I think they look better in a basic scatterplot. Because piratepal() returns a vector of colors (when plot.result = FALSE), you can also use it in other plots. Here, I'll use the pony palette in a scatterplot:

[code language=”r”]
set.seed(100) # for replicability
x <- rnorm(100, mean = 10, sd = 1)
y <- x + rnorm(100, mean = 0, sd = 1)
point.sizes <- runif(100, min = .2, max = 2) # Just for fun

plot(x, y,
     main = "Scatterplot with the pony palette",
     pch = 21,
     bg = piratepal("pony", trans = .1),
     col = "white",
     bty = "n",
     cex = point.sizes)

grid() # Add gridlines
[/code]

ponyscatter

To see all of the palettes (including those inspired by movies and a transit map of Basel), just run the function with “all” as the palette argument:

[code language=”r”]
piratepal(palette = "all")
[/code]

Of course, if you find that these color palettes give you a headache, you can always set a pirateplot to grayscale (or any other color), by specifying a single color in the palette argument. Here, I’ll create a grayscale pirateplot showing the distribution of movie budgets by their creative type:

[code language=”r”]
pirateplot(formula = budget ~ creative.type,
           data = subset(movies, budget > 0 &
                           creative.type %in% c("Multiple Creative Types", "Factual") == FALSE),
           point.o = .02,
           xlab = "Movie Creative Type",
           main = "Movie budgets (in millions) by creative type",
           gl.col = "gray",
           pal = "black")

mtext("Using a grayscale pirateplot",
      side = 3,
      font = 3)

mtext("*Superhero movies tend to have the highest budgets\n…by far!",
      side = 1, adj = 1, line = 3,
      cex = .8, font = 3)
[/code]

moviebudgetpp

Looks like super hero movies have the highest budgets…by far!

And again, to get more tips on how to customise your palettes and pirateplots, check out the main package guide at https://cran.r-project.org/web/packages/yarrr/vignettes/guide.html, or run the following code:

[code language=”r”]
yarrr.guide() # Open the yarrr package guide
[/code]

Acknowledgements and Comments

– The pirateplot is largely inspired by the great beanplot package by Peter Kampstra.
– Bayesian 95% HDIs are calculated using the truly amazing BayesFactor package by Richard Morey [Note: a previous version of this post incorrectly called Richard “Brian” — I blame lack of caffeine].
– The latest developer version of yarrr is always available at https://github.com/ndphillips/yarrr. Please post any bugs, issues, or feature requests at https://github.com/ndphillips/yarrr/issues

Making fast, good decisions with the FFTrees R package

fftreesblog

“…[W]e are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” ~ Malcolm Gladwell

In his book Blink, Malcolm Gladwell summarises a common misconception about good decision making. According to folk wisdom, the more time, information, and effort you put into a decision, the better it gets. In other words, “More is better.” If you are a doctor making a diagnosis, more medical tests are always better. If you are trying to decide if your new Tinder match is worthy of a date, try to find them on Facebook, Instagram and Snapchat first. If you are deciding how to invest in the stock market, get as much data as you can and build a statistical model so complex that it describes the past perfectly.

However, decades of research in cognitive science and machine learning have shown that the “More is better” theory is, in many real-world decisions, flat wrong. In contrast, there are many cases where, as Dr. Gerd Gigerenzer has put it, “Less is more.” Why? For two key reasons. The first reason is statistical: Complex decision models with many parameters can lead to overfitting. In essence, overfitting occurs when a model is very good at describing one specific past dataset, but fails in predicting new, unseen data (Gigerenzer & Brighton, 2009). For example, a complex economic model might describe changes in past stock prices very well, but is largely unable to predict future changes. This is why, as Burton Malkiel shows, it is so hard for complex trading strategies to outperform simple index funds in the stock market (Malkiel, 1999). The second reason is psychological: Even if a complex decision model is good at predicting new data, if a person can’t understand it, or easily implement it in a real-world decision, like a doctor trying to use logistic regression in an emergency room, they won’t use it.
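The statistical point about overfitting is easy to demonstrate in a few lines of base R. This is just an illustrative sketch with made-up sine-plus-noise data (the variable names and setup are my own, not taken from the cited papers): a flexible 10-degree polynomial fits the data it was trained on better than a straight line, but its advantage can shrink, or reverse, on new observations from the same process.

```r
set.seed(1)
x.train <- seq(0, 10, length.out = 30)
y.train <- sin(x.train) + rnorm(30, sd = 0.5)
y.new   <- sin(x.train) + rnorm(30, sd = 0.5)  # new observations at the same x values

simple  <- lm(y.train ~ x.train)           # 2 parameters
complex <- lm(y.train ~ poly(x.train, 10)) # 11 parameters

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# On the data they were fit to, the complex model always fits at least as well...
rmse(y.train, fitted(simple))
rmse(y.train, fitted(complex))

# ...but on new observations, its edge can shrink or reverse,
# because it partly fit the old noise rather than the signal
rmse(y.new, fitted(simple))
rmse(y.new, fitted(complex))
```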

What simple decision rules can people use to make good decisions? One popular class of simple decision rules are Fast and Frugal Trees (FFTs, Gigerenzer & Todd, 1999; Martignon, Katsikopoulos & Woike, 2008). Fast and frugal trees make very fast decisions based on a few (usually 1 to 5) pieces of information and ignore all other information. In other words, fast and frugal trees are noncompensatory, meaning that once they make a decision based on a few pieces of information, no additional information can ever change the decision. Because they are so simple to use, they have been applied to many real-world decision tasks, from making coronary artery disease diagnoses (Green & Mehr, 1997) to diagnosing depression (Jenny, Pachur, Williams, Becker & Margraf, 2013). However, lest you think that fast and frugal trees are only useful when time is limited, research has shown that fast and frugal trees can out-predict more complex models in decidedly non-human simulations (Gigerenzer, Czerlinski & Martignon, 1999).

While fast and frugal trees have shown promise, there are currently no off-the-shelf methods to create them. How can you create your own fast and frugal decision trees for your own dataset? Starting today, you can use the FFTrees R package available on CRAN. The main function in the package is FFTrees(), which takes a standard formula and data argument, and returns a fast and frugal tree (FFTrees) object. From this object, you can view its underlying trees, along with many standard classification statistics (e.g., hit rate, false alarm rate, AUC) applied to both training and test (i.e., prediction) datasets. Finally, the function has two alternative classification algorithms, logistic regression and CART, built in, so you can always compare the accuracy of your fast and frugal trees to two gold standards in the classification literature. If you're like me, you'll be amazed at how well simple, transparent fast and frugal trees perform relative to these gold standards, especially in predicting new data!

The FFTrees package in action

You can install and load the FFTrees package from CRAN:

[code language=”r”]
install.packages("FFTrees")
library("FFTrees")
[/code]

Once you’ve installed the package, you can view the overview vignette by running the code FFTrees.guide(). However, for this blog post I’ll show you how to create fast and frugal trees for predicting breast cancer. The data we’ll use comes from the Wisconsin Breast Cancer Database (data source). The data is stored as a dataframe with 699 rows, representing 699 patients, and 10 columns. The 10 columns represent 9 physiological measurements, from cell sizes to cell shapes, and 1 binary variable (diagnosis) indicating whether the woman truly does, or does not have breast cancer. Here is how the first few rows of the dataframe look:

[code language=”r”]
thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare chromatin nucleoli mitoses diagnosis
        5             1              1        1          2           1         3        1       1     FALSE
        5             4              4        5          7          10         3        2       1     FALSE
        3             1              1        1          2           2         3        1       1     FALSE
        6             8              8        1          3           4         3        7       1     FALSE
        4             1              1        3          2           1         3        1       1     FALSE
        8            10             10        8          7          10         9        7       1      TRUE
[/code]

To create a fast and frugal tree from the dataset, we'll use the FFTrees() function, entering formula = diagnosis ~ ., meaning that we want to predict diagnosis as a function of (potentially) all other variables, and data = breastcancer. We'll assign the result to a new object of class FFTrees called breastcancer.fft:

[code language=”r”]
breastcancer.fft <- FFTrees(formula = diagnosis ~ .,
                            data = breastcancer)
[/code]

Now that we've created the object, we can print it to the console to get basic information:

[code language=”r”]
breastcancer.fft
# "An FFTrees object containing 6 trees using 4 cues {cellsize.unif,cellshape.unif,nuclei.bare,epithelial} out of an original 9"
# "Data were trained on 683 exemplars. There were no test data"
# "FFT AUC: (Train = 0.98, Test = NA)"
# "My favorite tree is #3 [Training: HR = 0.93, FAR = 0.05], [Testing: HR = NA, FAR = NA]"
[/code]

The printout tells us that the final FFTrees object contains 6 different trees, and the largest tree only uses 4 of the original 9 cues. To see the best tree, we can simply plot the FFTrees object:

[code language=”r”]
plot(breastcancer.fft,
     main = "Breastcancer FFT",
     decision.names = c("Healthy", "Cancer"))
[/code]


bcfft


There's one of our fast and frugal trees! In the top section of the plot, we see that the data had 444 true healthy cases and 239 true cancer cases. In the middle section, we see the tree itself. The tree starts by looking at the cue cellsize.unif: if its value is less than 3, the tree decides that the person is healthy. If not, the tree looks at cellshape.unif: if cellshape.unif <= 2, the tree decides the patient is healthy; if cellshape.unif > 2, the tree decides that the person does have cancer. That's the whole decision algorithm! Now isn't that a lot easier to interpret than something like logistic regression? Imagine handing a logistic regression equation to anyone without a degree in statistics.
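In fact, the decision rule is simple enough to write down directly. Here is a sketch of that logic as a plain R function (illustrative only — this is not the FFTrees implementation; the cue names follow the breastcancer columns and the thresholds are read off the plot):

```r
# The noncompensatory tree as nested if-else: once a cue decides,
# all remaining information is ignored.
fft.classify <- function(cellsize.unif, cellshape.unif) {
  if (cellsize.unif < 3)   return("Healthy")  # first cue decides; rest ignored
  if (cellshape.unif <= 2) return("Healthy")  # second (and last) cue decides
  "Cancer"
}

fft.classify(cellsize.unif = 1, cellshape.unif = 1)   # "Healthy"
fft.classify(cellsize.unif = 8, cellshape.unif = 10)  # "Cancer"
```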

Performance

Ok, so our fast and frugal tree is easy to understand and use, but how well does it perform? The bottom section of the above plot shows a series of performance statistics. In the bottom left hand corner, we can see a classification table, which shows how the tree's decisions compare to the truth. Entries on the main diagonal (Cor Rej and Hit) correspond to correct decisions, while the off-diagonal entries correspond to incorrect decisions. As you can see, the tree performed exceptionally well: it made correct diagnoses in 646 (424 + 222) out of all 683 cases (95% correct). Additional performance statistics, including specificity (1 – false alarm rate), hit rate, d-prime, and AUC (area under the curve), are also displayed. Finally, in the bottom right plot, you can see an ROC curve which compares the performance of the trees to CART (in red) and logistic regression (in blue).
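You can check the headline numbers yourself from the counts in the plot (444 true healthy and 239 true cancer cases; 424 correct rejections and 222 hits):

```r
# Recomputing the tree's accuracy statistics from the classification table
n.healthy <- 444; n.cancer <- 239
cor.rej <- 424  # healthy cases correctly classified as healthy
hits    <- 222  # cancer cases correctly classified as cancer

accuracy    <- (cor.rej + hits) / (n.healthy + n.cancer)  # 646 / 683, i.e. ~95%
hit.rate    <- hits / n.cancer                            # ~0.93
far         <- (n.healthy - cor.rej) / n.healthy          # false alarm rate, ~0.05
specificity <- 1 - far                                    # ~0.95
```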

Viewing other trees

While this tree did well, it still made some errors in both detecting true cancer cases (i.e., hit rate) and in correctly rejecting true healthy cases (i.e., specificity). Now, what if you want a tree that rarely misses true cases, at the cost of additional false alarms? As Luan, Schooler, & Gigerenzer (2011) have shown, you can easily shift the balance of errors in a fast and frugal tree by adjusting the decisions it makes at each level of the tree. The FFTrees() function automatically builds several versions of the same general tree that make different error trade-offs. We can see the performance of each of these trees in the bottom-right ROC curve. Looking at the ROC curve, we can see that tree number 5 has a very high hit rate, but a smaller specificity. We can look at this tree by adding the tree = 5 argument to plot():

[code language=”r”]
plot(breastcancer.fft,
     main = "Breastcancer FFT",
     decision.names = c("Healthy", "Cancer"),
     tree = 5)
[/code]


Screen Shot 2016-08-16 at 15.21.31


Here is the resulting tree. As you can see, this tree uses an extra cue called nuclei.bare. It has a perfect hit rate of 100% (just as we wanted), but at the cost of a lower specificity of 80%.

Additional arguments

Cross-validation: The FFTrees() function allows you to easily create trees from a training dataset, and test the performance of the trees with a test dataset (aka, cross-validation). You can do this by either entering an explicit test dataset in the data.test argument, or by randomly splitting the main dataset into a separate training and testing sample with train.p. For example, train.p = .5 will randomly split the data into a 50% training set, which will be used to build the trees, and a 50% test set, which will be used to evaluate their prediction performance.

[code language=”r”]
# Create a 50% training and 50% testing dataset with train.p = .5
breastcancer.test.fft <- FFTrees(formula = diagnosis ~ .,
                                 data = breastcancer,
                                 train.p = .5)
[/code]

Restricting trees: If you want to explicitly decide which cues go into the tree, you can specify this in the formula argument. For example, the following code will generate a tree from the breast cancer data using only the cues thickness, mitoses, and adhesion.

[code language=”r”]
# Only use 3 cues in the trees
breastcancer.r.fft <- FFTrees(formula = diagnosis ~ thickness + mitoses + adhesion,
                              data = breastcancer)
[/code]

Summary

The FFTrees package contains lots of other functions for visualising and comparing trees. In addition, it contains several real world datasets you can use to predict who survived the Titanic (titanic), whether a mushroom is poisonous or not (mushrooms), or how a politician will vote on a new bill (voting). To see all the details, check out the package vignettes either in the package or on CRAN (here). For all you judgment and decision making researchers out there, I will also be presenting the package at the annual meeting of the Society for Judgment and Decision Making (SJDM) in Boston in November 2016.

The package is also very much in development, so I am grateful for any recommendations, inevitable bug-reports, or criticisms. You can post bug-reports at www.github.com/ndphillips/FFTrees/issues, or email me directly at Nathaniel.D.Phillips.is@gmail.com

Alternative decision tree R packages

I hope you have as much fun with the package as I do, but I must warn you: this package was written by a psychologist who just happens to love programming in R for fun, rather than a statistician who really knows their shit about how best to build trees. For this reason, the code is messy, dreadfully inefficient, and without proper reference to established tree building algorithms. While I hope to collaborate with statisticians in the future to improve the tree building algorithm (likely by implementing one that already exists), it’s not quite there yet. For more established methods, I strongly recommend that you check out the rpart, and party packages.

Coauthors

This package was developed in collaboration with Dr. Hansjoerg Neth and Dr. Wolfgang Gaissmaier at the University of Konstanz, and Dr. Jan Woike at the Max Planck Institute for Human Development in Berlin.

Code / Package Updates

This blog post has been updated to incorporate updates made to version 1.1.4 of the FFTrees package. Specifically, the original fft() function was changed to FFTrees() to avoid confusion with fast fourier transform. All the original code (should) still work, but with some warnings.

References

Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107–143.

Gigerenzer, G., & Todd, P. M. (1999). Fast and frugal heuristics: The adaptive toolbox. In Simple heuristics that make us smart (pp. 3–34). Oxford University Press.

Gigerenzer, G., Czerlinski, J., & Martignon, L. (1999). How good are fast and frugal heuristics? In Decision science and technology (pp. 81–103). Springer.

Gladwell, M. (2007). Blink: The power of thinking without thinking. Back Bay Books.

Green, L., & Mehr, D. R. (1997). What alters physicians’ decisions to admit to the coronary care unit? Journal of Family Practice, 45(3), 219–226.

Jenny, M. A., Pachur, T., Williams, S. L., Becker, E., & Margraf, J. (2013). Simple rules for detecting depression. Journal of Applied Research in Memory and Cognition, 2(3), 149–157.

Malkiel, B. G. (1999). A random walk down Wall Street: including a life-cycle guide to personal investing. WW Norton & Company.

Marewski, J. N., & Gigerenzer, G. (2012). Heuristic decision making in medicine. Dialogues Clin Neurosci, 14(1), 77–89.

Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.

Luan, S., Schooler, L. J., & Gigerenzer, G. (2011). A signal-detection analysis of fast-and-frugal trees. Psychological Review, 118(2), 316.

The new and improved pirateplot()! Now with themes!

Note: this post refers to an outdated version of the pirateplot() function. For the latest description of the function, check out http://rpubs.com/yarrr/pirateplot

Hello fellow R pirates! I recently wrote a post demonstrating the pirateplot() function in the yarrr package. The pirateplot() function replaces deceptive (and boring) bar plots with a full RDI plot, capable of plotting Raw data (and smoothed densities), Descriptive statistics (like means and medians), and Inferential statistics (a 95% Bayesian Highest Density Interval, or “HDI”).

In the past 48 hours I’ve had an overwhelmingly positive response to the pirateplot() function. At the same time, I received several great recommendations for tweaks and improvements to the function to make it even better. After consulting the original pirate text, I decided that the changes stay true to the original piratery intent of the pirate plot. I’m sure the founding members of the pirate plot would be proud.

Let's go over the changes! First let's install and load the yarrr package, which contains the pirateplot() function:

[code language=”r”]
install.packages("yarrr")
library("yarrr")
[/code]


Now you’re ready to make some pirate plots! Here are the major updates to the function:

Opacity Themes

The five critical aspects of a pirate plot are the bars, beans, points, (average) lines, and HDIs. You can adjust the opacity of each of these elements with a corresponding opacity argument: bar.o, bean.o, point.o, line.o, and hdi.o.

The biggest update to pirateplot() is the addition of opacity themes, which are designated by a new argument called theme.o. The input to this argument is an integer (currently 0, 1, 2, or 3) that defines an opacity theme across all five elements. Themes 1, 2, and 3 create specific opacity values for each of the elements, while Theme 0 sets all opacities to 0. Thankfully, the themes just set default values for the individual element opacities — you can still override the opacity of any specific element within a theme by including an element-specific opacity value.
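For example, here is a quick sketch of overriding a single element within a theme (using the ChickWeight data that ships with R; the specific opacity value is just for illustration):

```r
library("yarrr")  # the package introduced above

# Theme 1 sets the default opacities, but an explicit point.o wins
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           theme.o = 1,    # theme defaults for bars, beans, points, lines, HDIs
           point.o = .5)   # ...except the points, which we make more visible
```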

Here are examples of the three different themes applied to the ChickWeight dataset:

Theme 1

Theme 1 emphasises the bar with light points and beans (I’ll use the appletv palette for this one)

[code language=”r”]

pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 1\nappletv palette",
           theme.o = 1,
           pal = "appletv")

[/code]

theme1

Theme 2

Theme 2 emphasises the points and beans (using the southpark palette)

[code language=”r”]

pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 2\nsouthpark palette",
           theme.o = 2,
           pal = "southpark")

[/code]

theme2

Theme 3

Theme 3 emphasises the 95% Highest Density Intervals (HDIs). Keep in mind that calculating HDIs can take a few seconds for each bean… Here I’ll use the Basel palette.

[code language=”r”]

pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nbasel palette",
           theme.o = 3,
           pal = "basel")

[/code]

theme3

Theme 0

In Theme 0, all opacities are set to 0 by default, so you can individually specify the opacity of each element. In this example, I'll turn the lines on full-blast and turn the points on slightly. I'll also increase the amount of jittering and the size of the points, and use the google palette.

[code language=”r”]

pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 0\ngoogle palette",
           pal = "google",
           point.o = .2,
           line.o = 1,
           theme.o = 0,
           line.lwd = 10,
           point.pch = 16,
           point.cex = 1.5,
           jitter.val = .1)

[/code]

theme0

Of course, you can still change the colors of the plotting elements with the pal argument, and the background using the back.col argument. Here's an x-ray version of Theme 3:

[code language=”r”]
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           main = "Theme 3\nlight color with black background",
           pal = "white",
           theme.o = 3,
           point.pch = 16,
           back.col = gray(.2))

[/code]

xray

Gridlines

You can now include gridlines in your plot with the gl.col argument. Just specify the color of the lines and the function will put them in reasonable places. The following plot also shows how pirateplot() handles two independent variables:

[code language=”r”]
pirateplot(formula = weight ~ Diet + Time,
           data = subset(ChickWeight, Time < 10),
           theme.o = 2,
           pal = "basel",
           point.pch = 16,
           gl.col = gray(.8),
           main = "Two IVs\nWith gridlines")
[/code]

gridlines

Other minor changes

I've also made the following smaller changes:

  • The function no longer automatically sorts the levels of the IV. It will plot levels of the IV(s) in the order they are found in the original dataframe.
  • You can now manually change the color of the bar borders with the bar.border.col argument.
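For example, here is a minimal sketch of the new border argument (the argument name is taken from the list above; the color choice is just for illustration):

```r
library("yarrr")  # the package introduced above

# Darken the bar borders with the new bar.border.col argument
pirateplot(formula = weight ~ Diet,
           data = ChickWeight,
           theme.o = 1,
           bar.border.col = "black")
```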

As always, I am very happy to receive more comments and feedback on the function. You can write me at yarrr.book@gmail.com. Happy pirate plotting!

The Pirate Plot (2.0) – The RDI plotting choice of R pirates


pirateplotex3
Note: this post refers to an outdated version of the pirateplot() function. For the latest description of the function, check out http://rpubs.com/yarrr/pirateplot

Barplots suck

Plain vanilla barplots are as uninformative (and ugly) as they are popular. And boy, are they popular. From the floors of congress to our latest scientific articles, barplots surround us. Barplots are so popular because they are so simple and easy to understand. However, that simplicity also carries costs — namely, barplots can mask important patterns in data, like multiple modes and skewness.

Instead of barplots, we should be using RDI plots, where RDI stands for Raw (data), Description, and Inference. Specifically, an RDI plot should present the complete Raw data (including smoothed densities), Descriptive statistics (like means and medians), and Inferential statistics (like a Bayesian 95% Highest Density Interval, or HDI). The R community already has access to many great examples of plots that come close to the RDI trifecta. For example, beanplots, created by the beanplot() function, show complete raw data and smoothed distributions (Kampstra, 2008).

Today, the R community has access to a new RDI plot — the pirate plot. I discovered the original code underlying the pirate plot during a late night swim on the Bodensee in Konstanz, Germany. The pirate plot function was written in an archaic German pirate dialect on an old beer bottle and is unfortunately unusable. However, I have taken the time to painstakingly translate the original pirate code into a new R function called pirateplot(). The latest version (now 2.0) of the translation is stored in the yarrr package on GitHub at www.github.com/ndphillips/yarrr. To install the package and access the pirateplot() function within R, first install and load the yarrr package:

[code language=”r”]

install.packages("yarrr")
library("yarrr")

[/code]


Now you’re ready to make some pirate plots! Let’s create a pirate plot from the pirates dataset in the yarrr package. This dataset contains results from a survey of several pirates at the Bodensee in Konstanz. We’ll create a pirateplot showing the distribution of ages of pirates based on their favorite pirate:

[code language=”r”]

pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "My First Pirate Plot!")

[/code]

pp1

The arguments for the pirateplot are very similar to those of other plotting functions like barplot() and beanplot(). The key arguments are formula, where you specify one (or two) categorical variable(s) for the x-axis and a numerical variable for the y-axis, and data, the dataframe containing those variables.

In addition to the data arguments, there are arguments that dictate the opacity of the 5 key elements of a pirate plot: bar.o (the opacity of the bars), bean.o (the opacity of the beans), point.o (the opacity of the points), and line.o (the opacity of the average lines at the top of the bars). Finally, hdi.o controls the opacity of the 95% Bayesian Highest Density Interval (HDI). The HDIs are calculated using the BEST package (Kruschke, 2013). Because calculating HDIs can be time-consuming, they are turned off by default (i.e., hdi.o = 0). In the next plots, I'll turn them on so you can see them.

The pirateplot() function has built-in color arguments. You can control the overall color palette of the plot with pal, and the color of the plot background with back.col. Let’s change a few of these arguments. I’ll also include the 95% Highest Density Intervals (HDIs) by setting hdi.o = .7.

[code language=”r”]

pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "Black and White Pirate Plot",
           pal = "black",
           hdi.o = .7,
           line.o = 1,
           bar.o = .1,
           bean.o = .1,
           point.o = .1,
           point.pch = 16,
           back.col = gray(.97))
[/code]

pp2

As you can see, the entire plot is now grayscale, and different elements of the plot have been emphasised by changing the opacity arguments. For example, now that we've set the opacity of the HDI to .7 (the default is 0), we can see the Bayesian 95% Highest Density Interval for the mean of each group.

Hopefully it’s clear how much better RDI plots are than standard bar plots. Now, in addition to just seeing one piece of information (the mean) of each group, we can see all the raw data, a smoothed density curve of the data (helpful for detecting multiple modes and skewness), as well as Bayesian inference.

Oh, and just for comparison purposes, we can create a standard barplot within the pirateplot() function by adjusting the opacity arguments:

[code language=”r”]

pirateplot(formula = age ~ favorite.pirate,
           data = pirates,
           xlab = "Favorite Pirate",
           ylab = "Age",
           main = "Black and White Pirate Plot",
           pal = "black",
           hdi.o = 0,
           line.o = 0,
           bar.o = 1,
           bean.o = 0,
           point.o = 0)
[/code]

sad


Now how awful does that barplot look in comparison to the far superior pirate plot?!

You can also include multiple independent variables as arguments to the pirateplot() function. For example, I can plot the pirates' beard lengths separated by sex and the college each pirate went to. For this plot, I'll use the southpark palette and emphasise the HDI by turning its opacity up to .6:

[code language=”r”]

pirateplot(formula = beard.length ~ sex + college,
           data = pirates,
           main = "Beard lengths",
           pal = "southpark",
           xlab = "",
           ylab = "Beard Length",
           point.pch = 16,
           point.o = .2,
           hdi.o = .6,
           bar.o = .1,
           line.o = .5)
[/code]

pp4

As you can see, it’s very easy to customise the look and focus of your pirate plot. Here are 6 different plots of the weights of chickens given one of 4 diets (from the ChickWeight dataframe in R). You can see the code for each by accessing the help menu for the pirateplot() function within R.

ppmatrix


Have fun creating your own pirate plots! If you have suggestions for further improvements, don’t hesitate to write my squire at yarrr.book@gmail.com.

References

Kampstra, P. (2008) Beanplot: A Boxplot Alternative for Visual Comparison of Distributions. Journal of Statistical Software, Code Snippets, 28(1), 1-9. URL http://www.jstatsoft.org/v28/c01/

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. doi: 10.1037/a0029146

www.thepiratesguidetor.com

I missed out on www.yarrr.com (some domain squatter is holding it for a pirate’s ransom), but I was able to snag www.thepiratesguidetor.com! This is now the official home of YaRrr! The Pirate’s Guide to R

For those of you who don’t know, this whole story started last summer. I was taking a late night swim on the Bodensee in Konstanz and saw a rusty object sticking out of the water. Upon digging it out, I realised it was an ancient usb-stick with the word YaRrr inscribed on the side. Intrigued, I brought it home and plugged it into my laptop. Inside the stick, I found a single pdf file written entirely in pirate-speak. After watching several pirate movies, I learned enough pirate-speak to begin translating the text to English. Sure enough, the book turned out to be an introduction to R called The Pirate’s Guide to R.

This book clearly has both massive historical and pedagogical significance. Most importantly, it turns out that pirates were programming in R well before the earliest known advent of computers. Of slightly less significance is that the book has turned out to be a surprisingly up-to-date and approachable introductory text to R. For both of these reasons, I felt it was my duty to share the book with the world. You can download the latest version of the pdf (again, I am continually updating the translations) by clicking on the pirate’s beard below:

YaRrr Cover 2.001

piratepal – An R pirate’s source of palettes inspired by everything from the Mona Lisa to the Evil Dead

When most people think about R, they think of statistics. They don’t think of horror films. I’m here to change that.

Let’s back up. While R is certainly one of the top two languages for statistical analysis (I’m looking over my shoulder at you, Python), it’s just as good for creating beautiful and informative plots – but only if you have the right colors. No matter what kind of plot you create, if you use colors that suck, your plot will look like shit. To pick colors that go well together, you need a color palette.

How can you select a color palette in R? Up until now, I’ve been using the great RColorBrewer package. However, I’ve lost my enthusiasm for the palettes in the package – not because they’re bad, but because they didn’t mean anything to me. To fix this, I put together several palettes, some designed by graphic artists and others inspired by movies, works of art, and my own aimless Googling. They are all stored in a function called piratepal() contained in the yarrr package.

To use piratepal(), you first need to download and install the yarrr package (if you don’t have the devtools package installed, you’ll need to install that first):

[code language=”r”]
install.packages("devtools")
library("devtools")
install_github("ndphillips/yarrr")
library("yarrr")
[/code]

Once you’ve installed the yarrr package, you can learn about the piratepal() function using the help menu:

[code language=”r”]
?piratepal
[/code]

Here, you’ll see that piratepal() has three arguments:

  • palette: A string defining the color palette to use (see examples). To use a random palette, use “random”. To view all palettes, use “all” combined with action = “show”
  • action: Either “return” to return a vector of colors, or “show” to show the palette. You can also use “r” or “s” for shorthand.
  • trans: A number in the interval [0, 1] indicating how transparent to make the colors. A value of 0 means no transparency and a value of 1 means complete transparency. Defaults to 0.
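As a quick sketch of these arguments in action (assuming the yarrr package is installed), here is the “scholar” palette returned at 50% transparency, using the “r” shorthand for action:

[code language=”r”]
library("yarrr")

# Return the scholar palette at 50% transparency
# ("r" is shorthand for action = "return")
scholar.trans <- piratepal(palette = "scholar",
                           action = "r",
                           trans = .5)

# Show one large point per color
plot(1:length(scholar.trans), pch = 16, cex = 4,
     col = scholar.trans)
[/code]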

Let’s start by seeing all the different palettes in the package. To do this, set palette = “all”, and action = “show”:

[code language=”r”]
piratepal(palette = "all", action = "show")
[/code]

piratepalettes

Here, you can see a brief overview of all the palettes in the package. The names of the palettes are on the left, and the colors in each palette are displayed to the right of the names. If you want to see the colors in a specific palette in more detail, you can view them (combined with an inspirational image) by using action = “show”.

For example, here is a palette called “scholar” that I got from this blog on the Shutterstock website:

scholar

Now, if you want to use the colors in a palette, just use the action = “return” argument to return a vector of the palette colors:

[code language=”r”]
piratepal(palette = "scholar", action = "return")
[/code]

This code will return the following vector of the colors in the palette:

blue1 yellow blue2 gray
#2B3E56FF #CDA725FF #C5CBD3FF #F1F1E7FF

Once you’ve assigned this vector of colors to an object (like my.colors), you can use them in any plot that you’d like!
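For example, here’s a minimal sketch (assuming the yarrr package is installed) that stores the scholar palette in a vector and uses it in a barplot:

[code language=”r”]
library("yarrr")

# Store the palette colors in a vector
my.colors <- piratepal(palette = "scholar", action = "return")

# Use them as bar colors in any base R plot
barplot(height = c(3, 5, 2, 4),
        col = my.colors,
        names.arg = names(my.colors),
        main = "Four bars in scholar colors")
[/code]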

You’ll notice that while some of these palettes have vague, generic names (like “scholar”), others are suspiciously recognisable. For example, what is this “monalisa” palette? Well, I discovered this blog post from a site called artvarsity that contains color palettes inspired by classical art. So naturally, I stole some and put them into the package. For example, let’s look at the “monalisa” palette:

[code language=”r”]
piratepal(palette = "monalisa", action = "show")
[/code]

monalisa

Oh yeah, I’ve also got several palettes inspired by movies. Oh yes, palettes from movies. I discovered this badass site called Movies in Color which provides color palettes for tons of movies. Don’t get me wrong, these palettes aren’t necessarily the most beautiful in the world – but who doesn’t want to use a palette from a horror movie?!

Here’s a palette from “Eternal Sunshine of the Spotless Mind” – the film shown at every psychology undergraduate social event:

[code language=”r”]
piratepal(palette = "eternal", action = "show")
[/code]

eternal

Since I love Pixar movies, and found this great website showing palettes for all the Pixar films, I added lots of palettes from Pixar movies. Let’s take a look at the “up” palette:

[code language=”r”]
piratepal(palette = "up", action = "show")
[/code]

up

Just for fun, let’s draw the house from the movie “Up” using the “google” palette (I stole the colors from the Google logo using my handy digital color meter).

[code language=”r”]
google.colors <- piratepal("google", trans = .1)
n.balloons <- 500

# Balloon locations
x <- rnorm(n.balloons, 0)
y <- rnorm(n.balloons, 2, 1)

# Empty plotting canvas
plot(1, xlim = c(-7, 7), ylim = c(-7, 7),
     xlab = "", ylab = "", type = "n",
     xaxt = "n", yaxt = "n", bty = "n")

# The house
rect(-2, -6, 2, -2)                 # walls
polygon(c(-2, 0, 2), c(-2, 0, -2))  # roof
rect(-7, -7, -2, 100)               # left background
rect(2, -7, 7, 100)                 # right background
rect(-.5, -6, .5, -4)               # door
points(.3, -5)                      # doorknob

# Balloon strings
line.start.x <- rnorm(n.balloons, 0, .4)
line.start.y <- -1 + rnorm(n.balloons, 0, .1)
segments(line.start.x, line.start.y, x, y,
         lty = 1, col = gray(.7), lwd = .2)

# The balloons
points(x, y, pch = 21,
       bg = sample(google.colors, n.balloons, replace = TRUE),
       col = gray(.9), cex = rnorm(n.balloons, 2, .3))
[/code]

Here’s the result! Oh and did you know that the Up house really exists in Seattle?!?! Here’s the real one!:

uphouse

Finally, for all you horror fans out there: here’s the Evil Dead palette. I’ll warn you, it’s pretty bleak – but honestly what did you expect?

[code language=”r”]
piratepal(palette = "evildead", action = "show")
[/code]

evildead

Let’s take advantage of this friendly, colorful palette. Here are the domestic revenues of the 4 Evil Dead films (from the-numbers.com):

[code language=”r”]
evil.colors <- piratepal(palette = "evildead",
                         action = "return",
                         trans = .2)

# Domestic revenues of the four Evil Dead films
revenues <- c(2400000, 5923044, 11502976, 54239856)

# Empty plotting canvas
plot(1, xlim = c(.5, 4.5), ylim = c(0, 59000000), type = "n",
     xlab = "", ylab = "", xaxt = "n", bty = "n",
     yaxt = "n", main = "Evil Dead Movie Domestic Revenues")

mtext("Using the evildead palette in the yarrr package!",
      side = 3, cex = .8, line = .5)

# Reference lines and stems
segments(1:4, rep(0, 4), 1:4, revenues, lty = 2)
abline(h = seq(0, 55000000, 10000000), col = gray(.9))

# Revenue labels above each point
adj <- 2000000
text(1:4, revenues + adj,
     labels = c("$2.4 Mil", "$5.9 Mil", "$11.5 Mil", "$54 Mil"),
     pos = 3)

points(1:4, revenues, col = evil.colors, cex = 4, pch = 16)

mtext(c("Evil Dead\n(1983)", "Evil Dead 2\n(1987)",
        "Army of Darkness\n(1993)", "Evil Dead\n(2013)"),
      at = 1:4, side = 1, line = 1)
[/code]

evildeadrevenue

Here are a few of the other movie-based palettes:

[code language=”r”]
piratepal(palette = "ghostbusters", action = "show")
[/code]

Dr. Ray Stantz: Everything was fine with our system until the power grid was shut off by dickless here.
Walter Peck: They caused an explosion!
Mayor: Is this true?
Dr. Peter Venkman: Yes it’s true.
[pause]
Dr. Peter Venkman: This man has no dick.

ghostbusters

[code language=”r”]
piratepal(palette = "rat", action = "show")
[/code]
Remy: [as Emile tastes a piece of cheese] Creamy, salty-sweet, an oaky nuttiness… You detect that?
Emile: Oh, I’m detecting nuttiness…

rat

I hope these palettes – inspired by art, movies, and random Googling – help people make plots with better colors. Oh, and if you have a favorite movie that you’d like me to add to the function in the form of a palette, let me know. I can always use a good movie recommendation. Bonus points if it’s available on German Netflix.

Maximum-Likelihood parameter estimation in R: A brief tutorial

A brief tutorial on ML estimation in R

Recently, a colleague asked me to demonstrate how one can calculate maximum-likelihood (ML) parameter estimates in R. As I had been meaning to do this for my R course anyway, I decided to write up the following brief tutorial. In the tutorial, I go through the three basic steps of ML estimation in R for both a statistical example (the simple linear model) and a psychological model of choice (the Proportional Difference model, Gonzalez-Vallejo, 2001).
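As a minimal base-R sketch of those three steps for the simple linear model (hypothetical variable names; normal errors assumed): define a negative log-likelihood function, choose starting values, and minimise with optim():

[code language=”r”]
set.seed(100)

# Simulated data from a known linear model: y = 2 + 3x + noise
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50, sd = 1)

# Step 1: Define the negative log-likelihood of the data
#         given candidate parameter values
neg.loglik <- function(par, x, y) {
  a <- par[1]; b <- par[2]; sigma <- par[3]
  if (sigma <= 0) return(Inf)  # sigma must be positive
  -sum(dnorm(y, mean = a + b * x, sd = sigma, log = TRUE))
}

# Step 2: Choose starting values for the parameters
start <- c(a = 0, b = 0, sigma = 1)

# Step 3: Minimise the negative log-likelihood
fit <- optim(par = start, fn = neg.loglik, x = x, y = y)
fit$par  # ML estimates of intercept, slope, and error SD
[/code]

With simulated data like this, the estimates should land close to the true values of 2, 3, and 1.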

You can view the tutorial at the preceding link. If you have any comments, criticisms, or questions, please let me know and I will try to update the document!

Note: I am indebted to Dr. Stephen Lewandowsky and Dr. Simon Farrell’s summer school on computational cognitive modelling for teaching me these techniques. You can purchase their great book on the subject here: Computational Modeling in Cognition: Principles and Practice.

Create your own Visual Resume with my new Shiny App

Shiny Link: https://ndphillips.shinyapps.io/Visual_Resume/

How much do I love R? Enough that I maintain several ‘fun R’ projects. One of my favorites is my Visual Resume generator – which creates an individualised visual resume based on a person’s education and job history (and publications and research interests, etc.). I was inspired by this blog post on infographic resumes to build a version in R. For example, here is a copy of my visual resume (as of January 2014):

Nathaniel Phillips Infographic Resume

Since so many people like this, I decided to create a Shiny application where everyone can create their own. A beta version of the app is now live at https://ndphillips.shinyapps.io/Visual_Resume/. Play around with it and please send me feedback and recommendations for improvement!

Icon Array: An R function (and Shiny App!) to create icon arrays

What if I told you there was a new drug that has been proven to DOUBLE your chances of getting over a cough? Sounds great, doesn’t it? Well, before we (ahem) cough up the dough for drug X, let’s take a closer look at the clinical trial data:

First, let’s look at the results from the placebo condition:

Placebo

Ok, now let’s look at the results from the group of people given Drug X

Drug X

Hmmm, as we can see from the plots, while the users of Drug X were indeed twice as likely to get over the cough, they were also much more likely to develop side effects.

Frequency grid displays such as these can help people understand risks. But creating the plots yourself is a bit of a pain. To solve this, I wrote an R function called Icon.Array. A beta version of the function, implemented in a Shiny app, is located at https://ndphillips.shinyapps.io/IconArray_App/.
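For readers curious about how such a display can be drawn, here is a minimal hypothetical base-R sketch (not the actual Icon.Array function) showing 20 events out of 100 people:

[code language=”r”]
# Minimal frequency-grid sketch: 100 icons, 20 with the outcome
n <- 100
n.event <- 20

# A 10 x 10 grid of icon positions
grid <- expand.grid(x = 1:10, y = 1:10)

# First 20 icons colored red, the rest gray
cols <- c(rep("red", n.event), rep("gray", n - n.event))

plot(grid$x, grid$y, pch = 16, cex = 2, col = cols,
     xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n",
     main = "20 out of 100 people had the outcome")
[/code]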

My R course…now on YouTube!

Well, after talking about it for about a year, I’ve finally started converting my R course to a YouTube-friendly format. Since I’m in the middle of my university R course, the videos don’t quite start at the beginning. I’ll add the first 3-5 lectures after the semester ends. In the meantime, here’s the permanent link: http://goo.gl/a6L0up