Data manipulation : aggregation, merge and format changes

This page describes how to apply functions on each line/column of a data frame, or on lines grouped in subsets using another variable. It will also show how to create a contingency table, and how to change from long to wide format, and reversely.

  1. Apply a function on a list/data.frame (apply, lapply, sapply)
  2. Aggregation (tapply, aggregate, by)
  3. Merge
  4. Split
  5. Contingency table
  6. Wide and long formats

Apply a function on a list/data.frame

The “apply” family of functions is useful when applying the same code on each element/line of a vector, list, data.frame etc. Each of the function can be applied on a list (or data frame) but returns results in a different format.

apply()

The most generic function.
Use it to apply another function on each column/row of a data frame.
The function to be applied can be already defined, or defined on-the-fly.
Supplementary parameters can be transmitted to already defined functions.

  • Use MAR=1 to loop on rows
  • Use MAR=2 to loop on columns
mydf <- data.frame(m0sd1 = rnorm(200, mean = 0, sd = 1), m0sd10 = rnorm(200, 
    mean = 0, sd = 10), m0sd.1 = rnorm(200, mean = 0, sd = 0.1), m1.5sd1 = rnorm(200, 
    mean = 1.5, sd = 1), m0sd5 = rnorm(200, mean = 0, sd = 5))
## Already defined function
apply(mydf, MARGIN = 2, mean)
##     m0sd1    m0sd10    m0sd.1   m1.5sd1     m0sd5 
## -0.059248  0.641841  0.006092  1.662241  0.457147
## Equivalent to mean(column, trim=0.1)
apply(mydf, MARGIN = 2, mean, trim = 0.1)
##     m0sd1    m0sd10    m0sd.1   m1.5sd1     m0sd5 
## -0.047388  0.699333  0.008574  1.673356  0.480492
## Function defined on-the-fly
apply(mydf, MARGIN = 2, FUN = function(column) {
    return(c(mean(column), median(column)))
})
##         m0sd1  m0sd10   m0sd.1 m1.5sd1  m0sd5
## [1,] -0.05925 0.64184 0.006092   1.662 0.4571
## [2,] -0.02901 0.09039 0.008551   1.653 0.6613

NB: When the function to be applied is simple, other specialized function are often more efficient, like rowSums(matrix) or rowMeans(matrix).

lapply()

Use it to apply a function on the elements of a list. It can be interesting to use as.data.frame on the result to get it in a more handy format. A better approach is to directly use sapply().

mylist <- list(m0sd1 = rnorm(200, mean = 0, sd = 1), m0sd10 = rnorm(2000, mean = 0, 
    sd = 10), m0sd.1 = rnorm(20, mean = 0, sd = 0.1), m1.5sd1 = rnorm(200, mean = 1.5, 
    sd = 1), m0sd5 = rnorm(200, mean = 0, sd = 5))
result <- lapply(mylist, FUN = function(x) {
    return(c(mean = mean(x), median = median(x)))
})
result
## $m0sd1
##     mean   median 
## -0.07216 -0.15783 
## 
## $m0sd10
##    mean  median 
## -0.2937 -0.2853 
## 
## $m0sd.1
##    mean  median 
## 0.01812 0.00932 
## 
## $m1.5sd1
##   mean median 
##  1.537  1.475 
## 
## $m0sd5
##    mean  median 
## -0.1611 -0.3490
as.data.frame(result)
##           m0sd1  m0sd10  m0sd.1 m1.5sd1   m0sd5
## mean   -0.07216 -0.2937 0.01812   1.537 -0.1611
## median -0.15783 -0.2853 0.00932   1.475 -0.3490

sapply()

Use it to apply on the elements of a list and simplify the result to a vector or a matrix, if possible.

mylist <- list(m0sd1 = rnorm(200, mean = 0, sd = 1), m0sd10 = rnorm(2000, mean = 0, 
    sd = 10), m0sd.1 = rnorm(20, mean = 0, sd = 0.1), m1.5sd1 = rnorm(200, mean = 1.5, 
    sd = 1), m0sd5 = rnorm(200, mean = 0, sd = 5))
sapply(mylist, FUN = function(col) {
    return(c(mean(col), median(col)))
})
##         m0sd1   m0sd10     m0sd.1 m1.5sd1   m0sd5
## [1,] -0.05533 -0.04112 -0.0062452   1.325 -0.3524
## [2,] -0.11246  0.14949  0.0005626   1.289 -0.3461

Aggregation

tapply()

Usage : tapply(data, factorForGroups, FUN)

Use it to apply a function FUN on a vector, after aggregating the values using a factor.

mydf <- data.frame(age = c(19, 36, 63, 31, 43, 86), height = c(157, 167, 182, 
    172, 168, 173), sex = c("F", "F", "M", "F", "M", "M"))
tapply(mydf$age, INDEX = mydf$sex, FUN = mean)
##     F     M 
## 28.67 64.00

On a data frame, one can use split to generate a list and combine it with lapply() or mapply().

mydf <- data.frame(age = c(19, 36, 63, 31, 43, 86), height = c(157, 167, 182, 
    172, 168, 173), sex = c("F", "F", "M", "F", "M", "M"))
split(mydf, mydf$sex)
## $F
##   age height sex
## 1  19    157   F
## 2  36    167   F
## 4  31    172   F
## 
## $M
##   age height sex
## 3  63    182   M
## 5  43    168   M
## 6  86    173   M
lapply(split(mydf[, c("age", "height")], mydf$sex), FUN = colMeans)
## $F
##    age height 
##  28.67 165.33 
## 
## $M
##    age height 
##   64.0  174.3

aggregate

aggregate(x, by, FUN)

Works similarly to tapply() but:

  • on a whole data.frame (multiple columns)
  • the 'by' argument must be a list
  • only produces scalar summaries
mydf <- data.frame(age = c(19, 36, 63, 31, 43, 86), height = c(157, 167, 182, 
    172, 168, 173), sex = c("F", "F", "M", "F", "M", "M"))
aggregate(mydf[, 1:2], mydf["sex"], mean)
##   sex   age height
## 1   F 28.67  165.3
## 2   M 64.00  174.3
aggregate(mydf[, 1:2], mydf["sex"], paste0)
##   sex age.1 age.2 age.3 height.1 height.2 height.3
## 1   F    19    36    31      157      167      172
## 2   M    63    43    86      182      168      173
data(iris)
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## Note that iris['Species'] is a list (as iris[5]), instead of
## iris$Species which is a vector (as iris[,5])
aggregate(iris[, 1:4], iris["Species"], FUN = mean)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

by

This function is similar to tapply, but more general as it applies to data frames.
It first splits the input into multiple data frames and applies the function FUN on each sub-data frame.

mydf <- data.frame(age = c(19, 36, 63, 31, 43, 86), height = c(157, 167, 182, 
    172, 168, 173), sex = c("F", "F", "M", "F", "M", "M"))
by(mydf[, 1:2], INDICES = mydf$sex, FUN = colMeans)
## mydf$sex: F
##    age height 
##  28.67 165.33 
## -------------------------------------------------------- 
## mydf$sex: M
##    age height 
##   64.0  174.3
mydf <- data.frame(age = c(19, 36, 63, 31, 43, 86), height = c(157, 167, 182, 
    172, 168, 173), sex = c("F", "F", "M", "F", "M", "M"))
by(mydf[, 1:2], INDICES = mydf$sex, FUN = function(df) {
    apply(df, 2, summary)
})
## mydf$sex: F
##          age height
## Min.    19.0    157
## 1st Qu. 25.0    162
## Median  31.0    167
## Mean    28.7    165
## 3rd Qu. 33.5    170
## Max.    36.0    172
## -------------------------------------------------------- 
## mydf$sex: M
##          age height
## Min.    43.0    168
## 1st Qu. 53.0    170
## Median  63.0    173
## Mean    64.0    174
## 3rd Qu. 74.5    178
## Max.    86.0    182

Merge

merge

Merge two data frames, by column or by row. By default the function finds all common column names and use them all.
Interesting parameters include :

  • 'by' (by.x, by.y) to specify common columns to use for the merge
  • 'all' (all.x, all.y) to specify if all lines should be kept (even the ones without correspondance in the other data frame).
x <- data.frame(index = c(2, 3, 1), fruit = c("orange", "lemon", "grapefruit"))
y <- data.frame(index = c(3, 1, 2, 1), quantity = c(13, 6, 3, 5))
merge(x, y)
##   index      fruit quantity
## 1     1 grapefruit        6
## 2     1 grapefruit        5
## 3     2     orange        3
## 4     3      lemon       13

A good practice is to specify each time by which column you want to merge for x and y.

Union (Set operations)

Intersetcion between values of two sets (vectors). Any duplicated value is discarded.

set.seed(0)
x <- trunc(runif(10, min = 0, max = 100), 0)
y <- trunc(runif(10, min = 0, max = 100), 0)
x
##  [1] 89 26 37 57 90 20 89 94 66 62
y
##  [1]  6 20 17 68 38 76 49 71 99 38
union(x, y)
##  [1] 89 26 37 57 90 20 94 66 62  6 17 68 38 76 49 71 99

Other set operations are also possible.

  • Intersection with intersect(x, y)
  • Difference (which ones of x are not in y) with setdiff(x,y). is.element() returns boolean values, as well as %in%.

match(x, reflist) is a more general way to find which values of a new list x are in a reference list (and at which position).


Split

split() splits a vector (or rows of a data frame) into separate elements of a list, ready for further processing.
See tapply().


Contingency table

A contingency table counts the combinations of two factors. To create a contingency table, use the table() function.

x <- data.frame(fruit = c("orange", "lemon", "lemon", "grapefruit", "grapefruit", 
    "orange", "lemon"), country = c("Spain", "Spain", "California", "California", 
    "Spain", "Spain", "California"))
table(x)
##             country
## fruit        California Spain
##   grapefruit          1     1
##   lemon               2     1
##   orange              0     2

Wide and long formats

stack() transforms a matrix with headers (wide format), into one 2-columns data-frame with measurements/labels (long format).

mydf <- data.frame(x = trunc(runif(10, min = 0, max = 100), 0), y = round(runif(10, 
    min = 0, max = 1), 1))
stack(mydf)
##    values ind
## 1    77.0   x
## 2    93.0   x
## 3    21.0   x
## 4    65.0   x
## 5    12.0   x
## 6    26.0   x
## 7    38.0   x
## 8     1.0   x
## 9    38.0   x
## 10   86.0   x
## 11    0.3   y
## 12    0.5   y
## 13    0.6   y
## 14    0.5   y
## 15    0.2   y
## 16    0.8   y
## 17    0.7   y
## 18    0.8   y
## 19    0.1   y
## 20    0.7   y

Correlation need wide format, modeling need long format.

reshape() has more possibilities, and is consequently more complex.


Creative Commons License
This work by Celine Hernandez is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.