# Normalising data within groups

Occasionally it proves useful to normalise data. By this I mean scaling it so that (for non-negative data) the values fall between zero and one. Admittedly, most people frown on this, but there are papers out there that use this method*.

How do we go about this? It's a very simple formula to calculate:

y'[i] = y[i]/sqrt(sum(y^2))

So we square all of the ys, add them up and take the square root (call it the denominator). Then we divide each individual y value by the denominator.
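As a minimal sketch in base R, the calculation looks something like this. The vector `y` is made-up illustration data, not taken from any real dataset.

```r
# Made-up example values (an assumption, for illustration only)
y <- c(3, 4, 12)

# Denominator: square root of the sum of the squared values
denominator <- sqrt(sum(y^2))

# Divide each individual value by the denominator
y_norm <- y / denominator

# The normalised values now have a sum of squares of one
sum(y_norm^2)
```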

In R this is simple – for instance, decostand in the vegan package does exactly this with method = "normalize" (plus a whole heap of other standardisations).

But what I couldn't find was a function to take it a step further, a function that normalised within groups:

y'[ij] = y[ij]/sqrt(sum(y^2[j]))

The difference here is the js, of course. Or to go a step further still:

y'[ijk] = y[ijk]/sqrt(sum(y^2[jk]))

where the ks represent subgroups of j.
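To make the grouped formulas concrete, here is a rough sketch using `ave()` from base R. This is not the function described in this post; the data frame `d`, its column names and the helper `unit_norm` are all made up for illustration.

```r
# Made-up example data (names and values are assumptions)
d <- data.frame(
  sex    = factor(c("f", "f", "f", "f", "m", "m", "m", "m")),
  age    = factor(c("young", "young", "old", "old",
                    "young", "young", "old", "old")),
  weight = c(3, 4, 6, 8, 5, 12, 8, 15)
)

# Hypothetical helper: divide a vector by the square root of its sum of squares
unit_norm <- function(v) v / sqrt(sum(v^2))

# Normalise within one grouping factor (the j in the formula)
d$norm.by.sex <- ave(d$weight, d$sex, FUN = unit_norm)

# Normalise within each combination of two factors (the j and k)
d$norm.by.age.sex <- ave(d$weight, d$age, d$sex, FUN = unit_norm)
```

Within each group the squared normalised values sum to one, which is a handy check that the grouping did what was intended.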

I needed to do just this, so I wrote a function to do it!

You can get hold of it by running

source("http://db.tt/22hmSliJ")

in R. This provides you with a function called normalise with the following arguments:

dataframe – self explanatory

columns – a quoted variable name (e.g. "weight"). It currently only works on a single column, so the name is a bit of a misnomer, but it's easy enough to loop over it**

by – one or two grouping factors, again quoted and enclosed in c() if there are two

na.rm – logical, remove any NAs? Defaults to TRUE

For example, use

data <- normalise(data, "weight", by="sex")

to normalise weight according to sex, or

data <- normalise(data, "weight", by=c("age", "sex"))

to normalise weight by age and sex.

The function adds a column to the original dataframe with the original name preceded by “norm.”, so in this case it would be “norm.weight”.

Currently it only works if the by variables are factors, but I shall change that at some point and update this post. It might also change the order of the dataframe, but that's not much of a big deal, I don't think.
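For anyone who can't reach the link, here is a rough stand-in that mimics the interface described above. To be clear, `normalise_sketch` is a hypothetical name and this is not the code behind the link; it is just one way the same behaviour could be put together in base R with `ave()`. The example data is made up.

```r
# Hypothetical stand-in, NOT the function sourced from the link above
normalise_sketch <- function(dataframe, column, by, na.rm = TRUE) {
  # Divide a vector by the square root of its sum of squares
  unit_norm <- function(v) v / sqrt(sum(v^2, na.rm = na.rm))

  # Grouping columns as an unnamed list, so they are passed to ave() as "..."
  groups <- unname(as.list(dataframe[by]))

  # Add the new column, named after the original with a "norm." prefix
  dataframe[[paste0("norm.", column)]] <-
    do.call(ave, c(list(dataframe[[column]]), groups, list(FUN = unit_norm)))
  dataframe
}

# Made-up example data (an assumption, for illustration only)
example_data <- data.frame(
  sex    = factor(c("f", "f", "m", "m")),
  weight = c(3, 4, 6, 8)
)
example_data <- normalise_sketch(example_data, "weight", by = "sex")
example_data$norm.weight
```

Because `ave()` returns its result in the original row order, this sketch leaves the dataframe order alone. With na.rm = TRUE, NAs are ignored in the denominator but stay NA in the new column.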

Hope it helps!

* e.g. Risch AC, Jurgensen MF, Frank DA (2007) Effects of grazing and soil micro-climate on decomposition rates in a spatio-temporally heterogeneous grassland. Plant and Soil 298:191-201

**

for(i in c("height", "weight")){ data <- normalise(data, i, by="sex") }

That’s very nice. In data.table v1.8.1 you can add a column by group, like this:

DT[, newcol:=y/sqrt(sum(y^2)), by=colA] # group by one column

or

DT[, newcol:=y/sqrt(sum(y^2)), by=list(colA,colB)] # group by two columns

or

DT[, newcol:=y/sqrt(sum(y^2)), by=list(colA,anyRfunction(colB))] # group by expressions

You can group by factors, integers, numeric (double) or character, and as many of them as you like, inside the by=list(…). If you run that last for() loop over a lot of columns (or the objects are reasonably large) then it might be slow, because the entire object is copied for each new column added. So there are ways to write that generic normalise() using data.table if you like, which adds each new column by reference with no copies. But if speed or memory isn’t an issue then there's no need to look at data.table.

Thanks for the heads up! This was an initial attempt so I’m sure there's plenty that can be improved in the code.

I just had a quick look on CRAN and see that data.table is still at version 1.8.0. Is 1.8.1 the R-Forge version?

Having read through most of the introduction to data.table, it looks like a handy little package. As you say, it could speed normalise() up considerably by cutting out the vector scans for subsetting (although on the data.frames I'm using it on currently, the for loop executes in less than a second for 4 or 5 columns anyway, but for others that might use it…).