Grouped means (or anything else…)
An easy one today, but something that stumped me for a while* the first time I tried it out.
How do you get a group mean (or other summary statistic) from R? Let's say you have a Y variable with repeated measurements for each combination of levels of however many factors (here, trt and alt).
You could subset the data by each combination of the X variables. Something like
trt1alt1 <- mean(data$Y[data$trt == 1 & data$alt == 1])
trt1alt2 <- mean(data$Y[data$trt == 1 & data$alt == 2])
trt1alt3 <- mean(data$Y[data$trt == 1 & data$alt == 3])
trt2alt1 <- mean(data$Y[data$trt == 2 & data$alt == 1])
trt2alt2 <- mean(data$Y[data$trt == 2 & data$alt == 2])
trt2alt3 <- mean(data$Y[data$trt == 2 & data$alt == 3])
...
would do the trick. But that's daft. For one thing it takes a long time to type or edit, especially if you have a lot of groups.
The better way is to use aggregate.
aggregate(data$Y, by = list(trt = data$trt, alt = data$alt), FUN=mean)
This outputs a table with one row per combination of the levels of trt and alt, plus the mean of Y for that group (in a column named x). No fuss, no bother. If you need to pass other arguments to mean, such as its na.rm argument, that's possible too…
aggregate(data$Y, by = list(trt = data$trt, alt = data$alt), FUN=mean, na.rm=TRUE)
aggregate can also apply other functions, custom built or otherwise. There are other options too, such as the data.table package or ddply from the plyr package. Some of the apply functions (tapply, for instance) can also do the simple single level stuff.
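As a rough sketch of the custom function idea, here is aggregate with a home-made standard error function on the built-in mtcars data. The helper name se and the choice of mpg grouped by cyl and am are my own assumptions for illustration, not anything special to aggregate.

```r
# Standard error of the mean; `se` is a hypothetical helper, not a base R function
se <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Grouping variables: number of cylinders and transmission type
grp <- list(cyl = mtcars$cyl, am = mtcars$am)

# Group means, then group standard errors (extra arguments are passed on to FUN)
mpg_mean <- aggregate(mtcars$mpg, by = grp, FUN = mean)
mpg_se   <- aggregate(mtcars$mpg, by = grp, FUN = se, na.rm = TRUE)

# tapply returns the same means, laid out as a cyl-by-am matrix
mpg_tab <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)
```

Both aggregate calls return a data frame with the grouping columns and the summary in a column named x, so the two results can be merged on cyl and am if you want them side by side.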
* I say a while….I mean an hour or so…so not long at all.
try ?ave
That also looks to be the ticket, although I don't see any advantages over aggregate.
ave gives you grouped means (or other FUNs) at the original level of data granularity. Say you had a data frame with sales by state and year, and you wanted to compute a new field representing each row's % of sales for that particular state in that year. I think ave is simpler than aggregate for that type of task. For example:
df <- warpbreaks
names(df) <- c('sales','state','year')
df$tot_sales <- ave(df$sales, df$state,df$year, FUN=sum)
df$pct <- df$sales / df$tot_sales
aggregate(df$sales, by = list(df$state, df$year), FUN=sum)
df
But aggregate is for aggregating data i.e. summarising/reducing, rather than creating a vector of the same length for use in later calculations. I completely agree with you though, ave is the function to use in your example, it saves using rep or similar and I can envisage various other scenarios. But for making a group mean for use in summarising the data in a graph or similar, I think aggregate has the advantage.
Thanks for pointing out ave though!! Another useful function!!
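To make the difference concrete, here is a small sketch on the built-in warpbreaks data (columns breaks, wool and tension). The variable names grp_means and grp_mean are just mine for illustration.

```r
df <- warpbreaks  # built-in data: breaks, wool, tension

# aggregate reduces the data: one row per wool/tension combination
grp_means <- aggregate(df$breaks,
                       by = list(wool = df$wool, tension = df$tension),
                       FUN = mean)
nrow(grp_means)

# ave keeps the original length: every row gets its own group's mean
df$grp_mean <- ave(df$breaks, df$wool, df$tension, FUN = mean)
length(df$grp_mean) == nrow(df)
```

So grp_means is what you would hand to a plotting function, while df$grp_mean is what you would use for row-wise calculations such as deviations from the group mean.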
Yes, I am a pretty big fan of ave…extremely useful for panel data sets. You can even get group cumulative products and sums if the data is sorted appropriately. For example:
df <- warpbreaks
names(df) <- c('sales','state','year')
df$t = c(1:nrow(df)) # add t to represent a more granular time
df = df[order(df$state, df$year, df$t),]
df$cum_sales <- ave(df$sales, df$state,df$year, FUN=cumsum)
df
You can simplify the statement even further if you use "[" rather than "$" to access the columns. "[" returns a data.frame/list instead of a vector. Try:
## Some dummy data
data <- data.frame(Y=1:18, trt=rep(c("A1", "A2"),9), alt=rep(c("B1", "B2", "B3"), 6))
aggregate(data["Y"], data[c("trt", "alt")], mean)
# trt alt Y
# 1 A1 B1 7
# 2 A2 B1 10
# 3 A1 B2 11
# 4 A2 B2 8
# 5 A1 B3 9
# 6 A2 B3 12
Certainly, although, as Tom mentions, the formula version is simpler still.
aggregate(Y ~ trt + alt, data, mean)
Given your data are in a dataframe, you could simplify & clarify your code by using the formula version of aggregate():
aggregate(Y~trt+alt,data=data, FUN=mean)
Great point, I’ve stumbled into the trap of using the by argument all of the time. I’ll have to update the post or make a new one…
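One thing to watch when switching between the two forms: as far as I know, the formula method defaults to na.action = na.omit, so rows with a missing Y are dropped before FUN is called, while the by version hands the NAs to FUN. A tiny sketch with made-up data (the data frame d is purely illustrative):

```r
# Made-up data with one missing value in group "b"
d <- data.frame(Y = c(1, 2, NA, 4), trt = c("a", "a", "b", "b"))

# by version: the NA reaches mean(), so group b comes out as NA
by_res <- aggregate(d$Y, by = list(trt = d$trt), FUN = mean)

# formula version: the incomplete row is dropped first, so group b is 4
f_res <- aggregate(Y ~ trt, data = d, FUN = mean)
```

Adding na.rm = TRUE to the by version brings the two results into line for this example.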
I know so many people who don’t know about aggregate(). It is one of the most useful R functions around for data summary, and very fast. Hopefully your post will enlighten many!
I totally agree that it's a great function! This is why I wrote the post – to enlighten those currently in the dark!
Aggregate is great when you stumble upon it. An example would add some value for people who find themselves on this page; something put together with the mtcars dataset, I reckon.
Yeah, aggregate isn't the easiest to find; it's not a very intuitive name. But it is a great function. Thanks for the example suggestion! I'll have to update the post with a proper example.
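In the meantime, a quick sketch of what an mtcars example might look like, using the formula interface. The choice of columns (mpg and hp by cyl and am) is arbitrary and only for illustration.

```r
# Mean mpg and hp for each combination of cylinders and transmission type
by_cyl_am <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
by_cyl_am

# Mean of every other column for each number of cylinders (dot shorthand)
by_cyl <- aggregate(. ~ cyl, data = mtcars, FUN = mean)
by_cyl
```

cbind() on the left-hand side summarises several variables at once, and the dot stands for all columns not already named in the formula.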
It's definitely worth looking more into the *ply functions in the plyr package (ddply being the most common) – they are very powerful.
For functions applied to groups, you probably want summarise / summarize. For example, see Hadley’s comment at http://stackoverflow.com/questions/3277326/group-by-in-r-ddply-with-weighted-mean.
I'm not familiar with plyr yet, but I should indeed look into it at some point…
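For anyone curious, here is a hedged sketch of a grouped weighted mean, first in base R and then with ddply and summarise. Using the car weight column wt as the weights is only an assumption to have something to weight by, and the plyr part runs only if the package is installed.

```r
# Base R: split the data by cyl, then take a weighted mean within each piece
wm_base <- sapply(split(mtcars, mtcars$cyl),
                  function(g) weighted.mean(g$mpg, g$wt))
wm_base

# plyr equivalent, skipped when plyr is not available
if (requireNamespace("plyr", quietly = TRUE)) {
  wm_plyr <- plyr::ddply(mtcars, "cyl", plyr::summarise,
                         wmpg = weighted.mean(mpg, wt))
  print(wm_plyr)
}
```

The base version returns a named vector (one element per cyl level), whereas ddply returns a data frame, which is handier when you summarise several things at once.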