Grouped means (or anything else…)

June 25, 2012

An easy one today, but something that stumped me for a while* the first time I tried it out.

How do you get a group mean (or other summary statistic) from R? Let's say you have a Y variable with repeated measurements for each combination of however many factors.

You could subset the data by each combination of the X variables. Something like

trt1alt1 <- mean(data$Y[data$trt == 1 & data$alt == 1])
trt1alt2 <- mean(data$Y[data$trt == 1 & data$alt == 2])
trt1alt3 <- mean(data$Y[data$trt == 1 & data$alt == 3])
trt2alt1 <- mean(data$Y[data$trt == 2 & data$alt == 1])
trt2alt2 <- mean(data$Y[data$trt == 2 & data$alt == 2])
trt2alt3 <- mean(data$Y[data$trt == 2 & data$alt == 3])
...

would do the trick. But that's daft. For one thing, it takes a long time to type or edit, especially if you have a lot of groups.

The better way is to use aggregate.

aggregate(data$Y, by = list(trt = data$trt, alt = data$alt), FUN=mean)

This outputs a table with one row per combination of the grouping variables and the mean of Y for each. No fuss, no bother. If you need to pass other arguments to mean, such as its na.rm argument, that's possible too…

aggregate(data$Y, by = list(trt = data$trt, alt = data$alt), FUN=mean, na.rm=TRUE)
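For something you can run straight away, here is a rough sketch using the built-in mtcars dataset. The columns cyl and gear are just stand-ins for trt and alt, and cars is a hypothetical copy made only so that a missing value can be introduced.

```r
# mtcars ships with R; cyl and gear are stand-ins for trt and alt
means <- aggregate(mtcars$mpg,
                   by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                   FUN = mean)
print(means)  # one row per cyl/gear combination present; the mean is in column x

# copy the data and blank out one value to see na.rm being passed on to mean
cars <- mtcars
cars$mpg[1] <- NA
means_narm <- aggregate(cars$mpg,
                        by = list(cyl = cars$cyl, gear = cars$gear),
                        FUN = mean, na.rm = TRUE)
print(means_narm)
```

Only combinations that actually occur in the data get a row, and the summary column is called x because a bare vector was passed in rather than a named column.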

aggregate can also apply other functions, custom built or otherwise. There are other options too, such as the data.table package or ddply from the plyr package. Some of the apply functions (tapply, for instance) can handle the simple single-grouping case as well.
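As a sketch of both points, again on mtcars: se is a hypothetical helper written for this example (standard error of the mean), and tapply stands in for "one of the apply functions" in the single-grouping case.

```r
# hypothetical helper: standard error of the mean, ignoring missing values
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# aggregate takes any function that reduces a vector to a single summary value
se_by_group <- aggregate(mtcars$mpg,
                         by = list(cyl = mtcars$cyl, am = mtcars$am),
                         FUN = se)
print(se_by_group)

# with a single grouping variable, tapply returns a named vector instead of a table
mean_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(mean_by_cyl)
```

The tapply result is handy for quick lookups such as mean_by_cyl["4"], whereas the data frame from aggregate is easier to merge or plot.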

* I say a while….I mean an hour or so…so not long at all.

16 Comments
  1. Brian

    try ?ave

    • That also looks to be the ticket, although I don't see any advantages over aggregate.

      • Brian

        ave gives you grouped means (or other functions) at the original level of data granularity. Say you had a data frame with sales by state and year, and you wanted to compute a new field representing each row's share of sales for its state and year. I think ave is simpler than aggregate for that type of task. For example:

        df <- warpbreaks
        names(df) <- c('sales','state','year')
        df$tot_sales <- ave(df$sales, df$state,df$year, FUN=sum)
        df$pct <- df$sales / df$tot_sales
        aggregate(df$sales, by = list(df$state, df$year), FUN=sum)
        df

      • But aggregate is for aggregating data i.e. summarising/reducing, rather than creating a vector of the same length for use in later calculations. I completely agree with you though, ave is the function to use in your example, it saves using rep or similar and I can envisage various other scenarios. But for making a group mean for use in summarising the data in a graph or similar, I think aggregate has the advantage.
        Thanks for pointing out ave though!! Another useful function!!

      • Brian

        Yes, I am a pretty big fan of ave… extremely useful for panel data sets. You can even get group cumulative products and sums if the data is sorted appropriately. For example:

        df <- warpbreaks
        names(df) <- c('sales','state','year')
        df$t = c(1:nrow(df)) # add t to represent a more granular time
        df = df[order(df$state, df$year, df$t),]
        df$cum_sales <- ave(df$sales, df$state,df$year, FUN=cumsum)
        df

  2. You can simplify the statement even further if you use "[" rather than "$" to access the columns. "[" returns a data.frame/list instead of a vector. Try:

    ## Some dummy data
    data <- data.frame(Y=1:18, trt=rep(c("A1", "A2"),9), alt=rep(c("B1", "B2", "B3"), 6))

    aggregate(data["Y"], data[c("trt", "alt")], mean)

    #   trt alt  Y
    # 1  A1  B1  7
    # 2  A2  B1 10
    # 3  A1  B2 11
    # 4  A2  B2  8
    # 5  A1  B3  9
    # 6  A2  B3 12

    • Certainly, although, as Tom mentions, the formula version is simpler still.
      aggregate(Y ~ trt + alt, data, mean)

  3. Tom

    Given your data are in a dataframe, you could simplify & clarify your code by using the formula version of aggregate():
    aggregate(Y~trt+alt,data=data, FUN=mean)

    • Great point, I’ve stumbled into the trap of using the by argument all of the time. I’ll have to update the post or make a new one…

  4. dangerouspenguin

    I know so many people who don’t know about aggregate(). It is one of the most useful R functions around for data summary, and very fast. Hopefully your post will enlighten many!

    • I totally agree that it's a great function! This is why I wrote the post – to enlighten those currently in the dark!

  5. Aggregate is great when you stumble upon it. An example would add some value for people who find themselves on this page; throw something together with the mtcars dataset, I reckon.

    • Yeah, aggregate isn't the easiest to find; it's not a very intuitive name. But it is a great function. Thanks for the example suggestion! I’ll have to update the post with a proper example.

  6. RJK

    It’s definitely worth looking more into the *ply functions in the plyr package (ddply being the most common) – they are very powerful.

    For functions applied to groups, you probably want summarise / summarize. For example, see Hadley’s comment at http://stackoverflow.com/questions/3277326/group-by-in-r-ddply-with-weighted-mean.

Trackbacks & Pingbacks

  1. Grouped means (again) « Insights of a PhD student
