So, the post I did yesterday on aggregate seemed to go down well.

One of the comments suggested I add an example. Other comments had other useful hints which I thought I’d pass on more formally. So here goes…

The mtcars dataset in base has data on various aspects of cars – miles per gallon, number of cylinders etc etc etc. This makes it a nice example dataset for aggregate as some of the variables make for suitable grouping factors.

First we get hold of the dataset. Thats easy as its a base dataset:

`data(mtcars)`

If we look at the help file (?mtcars) for the dataset we notice that the last column is the number of carburettors (“carb”), which could be an good grouping variable. Another is  whether the car is an automatic or not (“am”). As our dependant variable we’ll use the horsepower (“hp”)

To make the group means we simply use aggregate as I mentioned yesterday.

```aggregate(mtcars\$hp, by=list(carb=mtcars\$carb, am=mtcars\$am), mean)
#  carb am        x
#1    1  0 104.0000
#2    2  0 134.5000
#3    3  0 180.0000
#4    4  0 198.0000
#5    1  1  72.5000
#6    2  1  91.2500
#7    4  1 161.3333
#8    6  1 175.0000
#9    8  1 335.0000```

We can make this code a bit shorter though by using the formula interface:

`aggregate(hp ~ carb + am, mtcars, mean)`

Easy huh? Of course in order to use that table in further procedures (plotting or whatever) you should give it a name:

`mtcarsagg <- aggregate(hp ~ carb + am, mtcars, mean)`

We can also use other functions in aggregate. Min, max, range for example. A natural one that comes to mind is the standard error. First we’ll define a function to calculate standard error*, then apply it to the example above:

```st.err <- function(x) {
sd(x)/sqrt(length(x))
}
SE <- aggregate(hp ~ carb+am, mtcars, st.err)
#  carb am        hp
#1    1  0  3.785939
#2    2  0 18.777202
#3    3  0  0.000000
#4    4  0 20.136439
#5    1  1  6.837397
#6    2  1 13.930632
#7    4  1 51.333333
#8    6  1        NA
#9    8  1        NA```

It could be that you come up with NA values in the table if you have NAs in the dependant variable or is for st.err for example there is only a single value like for automatic cars with 6 or 8 carburettors. If so, adding the na.rm=TRUE argument to the  end of the argument list should clear it if theres NAs**. If theres only a single value, then mean and SE are probably not so useful anyway.

We can of course bind them together:

```mtcarsagg <- cbind(mtcarsagg, SE[,3])
mtcarsagg
#  carb am       hp     SE\$hp
#1    1  0 104.0000  3.785939
#2    2  0 134.5000 18.777202
#3    3  0 180.0000  0.000000
#4    4  0 198.0000 20.136439
#5    1  1  72.5000  6.837397
#6    2  1  91.2500 13.930632
#7    4  1 161.3333 51.333333
#8    6  1 175.0000        NA
#9    8  1 335.0000        NA```

A note about using aggregate with the by argument – it doesnt give the dependant variable column a name, so you have to do that manually
(names(mtcarsagg)[3] <- “hp” for instance). Using the formula interface does name the column automatically though, and as its shorter, its probably easier to use anyway!

As with anything in R, theres multiple ways to do everything. Other methods for the above include the ave function, Hadley Wickham’s plyr package and aggregate(data[“Y”], data[c(“trt”, “alt”)], mean)

Hope that helps!!!

* A slightly more advanced standard error function with NA removal support is:

```st.err <- function(x, na.rm=FALSE) {
if(na.rm==TRUE) x <- na.omit(x)
sd(x)/sqrt(length(x))
}```

**I attempted to find a dataset with NAs to provide an example, but it ignored the NAs anyway, despite mean giving an NA result (which is weird). Sorry about that.

From → R

Nice article! I would probably mention that the formula interface has some overhead in terms of execution speed compared to the more basic arguments. For small datasets, this probably isn’t an issue and the ease of use wins out. However, when data gets big, you may want to reconsider using the more basic arguments. When speed becomes an issue in general, I’d look into the data.table package. Here’s an example using 25 million rows:

dat <- data.frame(x1 = rep(1:500000, each = 5), x2 = rep(1:500, each = 10),
y = runif(25000000))
system.time(aggregate(y ~ x1 + x2, data = dat, FUN = mean))
user system elapsed
72.46 2.64 75.16
system.time(aggregate(dat\$y, list(x1 = dat\$x1, x2 = dat\$x2), FUN = mean))
user system elapsed
65.61 1.78 67.43 #nearly 8 seconds faster

#using data.table
require(data.table)
dt <- data.table(dat)
system.time(dt[, list(meany = mean(y)), by = c("x1", "x2")])
user system elapsed
12.33 0.36 12.73 #much faster yet

I compare lots of timings on a recent StackOverflow post, which may also be of interest to you and others: http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati/10748470#10748470

Again, nice work!
-Chase

• Thanks!
Those are some impressive time savings, especially using data.table! I’ll definitely have to look into it! Someone else mentioned it on another post (also related to time savings) so it sounds like it could be pretty useful!!

2. Hi, i’m completely new to R. Doing a mini project on some data and was wondering if you can guide me?
I’m looking for a way to group continuous data into categorical.
I have data from months 1 – 12 (1=January, 12= Dec). I want to split these into seasons, e.g. months 3 to 5 = spring (given a code 1), months 6 – 8 = Summer (code =2). Any idea how? I know it can either be done by ifelse function or cut, not sure about aggregate though.
Many thanks

• Hi,
The easiest way is probably to separate your dataframe, add a column and rbind the frames back together:
dat1 <- data[data\$month %in% 3:5,]
dat1\$season <- "spring"
do the same for summer, autumn and winter, then
data2 <- rbind(dat1, dat2, dat3, dat4)
You might need to check the structure of the season columns – if its a factor then it wont bind back together.
Otherwise you could do an ifelse or a for loop, but they have the potential to cause headaches, especially if youre new to R!
HTH

• Hi thanks, I’ve got that to work.

I’m trying to run a multivariate mixed effects model with random slopes as the seasons and random intercepts as the location code (labelled 1 to 8).
I keep getting this error when I try to view the predictions:

plot(augPred(mod2))
Error in sprintf(gettext(fmt, domain = domain), …) :
invalid type of argument[1]: ‘symbol’

any ideas how I can fix it? It may be something to do with the data set.

• Hi,
In short, sorry, no i dont.
Checking str(dataframe) is always handy to ensure that your location is coded as a factor etc.
I dont know what augPred uses sprintf() for, naming the axes perhaps…which could suggest some problem with how youve coded up your model – did you save it elsewhere and then submit it to lme? You are using lme and not lmer right?

3. well, trying to use nlme as it contains the augPred function. My data is the following:

str(new.data)
‘data.frame’: 345 obs. of 11 variables:
\$ WSZ_Code : int 2 6 7 7 7 5 1 5 8 1 …
\$ Treatment_Code: int 3 1 4 4 4 2 2 2 1 2 …
\$ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 …
\$ Month : int 1 2 2 2 3 3 3 3 3 3 …
\$ TTHM : num 30.7 24.8 60.4 58.1 62.2 40.3 20.8 36.3 40.5 47.8 …
\$ CL2_FREE : num 0.35 0.25 0.05 0.15 0.2 0.15 0.15 0.025 0.25 0.05 …
\$ BrO3 : num 0.5 0.5 0.5 0.5 0.5 …
\$ Colour : num 0.75 0.75 0.75 0.75 2 2 0.75 1.9 1.9 1.9 …
\$ PH : num 7.4 6.9 7.1 7.5 7.6 7.7 6.8 7.9 7.4 8.2 …
\$ TURB : num 0.055 0.2 0.055 0.055 0.055 0.055 0.11 0.11 0.055 0.16 …
\$ seasons : Factor w/ 4 levels “autumn”,”spring”,..: 4 4 4 4 2 2 2 2 2 2 …

Some models that work but can’t be plotted:
mod0 <- lme(tthm ~ 1, random= ~1 |loc_code, data=new.data, method="ML")
mod2 <- lme(tthm ~ cl2free, random= ~ 1| loc_code, data=new.data, method="ML")

I try plot(augPred(mod2)) and it gives me the error.
Sorry to bombard you, it's really hard to come across decent help!

• Hi again,
First thoughts…I dont see loc_code in your dataframe…or is that WSZ_Code? If so, that should probably be a factor rather than an integer. Other than that…Im afraid I dont know.

In terms of decent R help, the R Help Lists are really good. Check out https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models for mixed models. There are some very knowledgeable folk on there, some of whom are involved with package development (Doug Bates for instance who was involved in nlme and is now involved with lme4 along with Ben Bolker et al.). Theres also other lists you can subscribe to and ask questions of… see http://www.r-project.org/mail.html for details. My experience has been that people get back to you pretty quickly, normally within a day or so with pretty good answers.

HTH

• Ah OK, thank you. I’ve posted a few messages on some mailing lists, awaiting replies. I’m trying to categorize the integer variables into factor. Trying to use cut function, but I don’t think it’s very good.

• The easiest way to make a factor is
f_loc_code <- as.factor(loc_code)
The as. functions are incredibly useful! I put the f_ there just to be clear that its a factor. Its not something I regularly do, but I can see the usefulness.
With the mailing lists, make sure you post to the appropriate one (and sign up, you might not get the reply otherwise). If you post a message to the wrong list then you might not get any reply, or a much slower response. The MM list is generally pretty quick with replies though.