R data structures

  1. R variables
    • Attributes
  2. R storage modes
    • Logical
    • Vector
    • Matrix
    • List
    • Data frame (Data access, Subset)
    • Factor
  3. How to get information about the content/structure of a variable
    • summary()
    • str()

R variables

Attributes

Every R object has two basic characteristics (loosely called attributes here):

  • its storage mode (or “type”) from R's point of view: logical, integer, double, complex, character, raw, list, NULL, closure, special, builtin… It is returned by the typeof() function.
  • its class (a label describing the object). It is returned by the class() function.

Example: a number.

typeof(1)
## [1] "double"
class(1)
## [1] "numeric"

Example: an integer.

typeof(1L)
## [1] "integer"
class(1L)
## [1] "integer"

Example: a data frame.

adf <- data.frame("ofes", "feslkfj", "lfdkj")
typeof(adf)
## [1] "list"
class(adf)
## [1] "data.frame"

Functions have type closure, builtin or special. A function is not a list-like object, so it cannot be subset with $.

f <- function() {
}
f$a
## Error: object of type 'closure' is not subsettable

Note that mode() gives a result similar to typeof() but more compatible with the S language.
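
As an illustrative sketch (the object names below are made up for this example), the three functions can be compared side by side on a few objects:

```r
# A few objects of different kinds (names are illustrative only).
objs <- list(
  a_double  = 1,
  an_int    = 1L,
  a_string  = "a",
  a_builtin = sum,
  a_closure = function() NULL
)

# typeof() reports the internal storage type.
types <- sapply(objs, typeof)
# mode() groups some types together ("numeric", "function").
modes <- sapply(objs, mode)
# class() reports the label describing the object.
classes <- sapply(objs, class)

print(cbind(typeof = types, mode = modes, class = classes))
```

The printed table should show that mode() merges "double" and "integer" into "numeric", and "builtin" and "closure" into "function".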


R storage modes

Logical

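Logical values convert easily to numeric and back (FALSE is 0, TRUE is 1). This easy conversion can cause problems, however, if one uses F and T as shortcuts for FALSE and TRUE: unlike FALSE and TRUE, which are reserved words, F and T are ordinary variables that can be reassigned.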

vector <- 1:10
vector[c(0, 1)]
## [1] 1
vector[c(F, T)]
## [1]  2  4  6  8 10
FALSE <- TRUE
## Error: invalid (do_set) left-hand side to assignment
F <- TRUE
vector[c(F, T)]
##  [1]  1  2  3  4  5  6  7  8  9 10

Advice: never use the F and T shortcuts for FALSE and TRUE.
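
The conversion between logical and numeric values mentioned above can be sketched as follows (the variable name flags is only illustrative):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(flags)        # 1 0 1 1
sum(flags)               # 3: number of TRUE values
mean(flags)              # 0.75: proportion of TRUE values

# Back-conversion: 0 becomes FALSE, any other number becomes TRUE.
back <- as.logical(c(0, 1, 2))
back                     # FALSE TRUE TRUE
```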


Vector

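A vector contains values of the same mode. When values of different modes are combined, R silently converts them all to the most general one.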

x <- c(1, 2, 3, 4)
x
## [1] 1 2 3 4
mode(x)
## [1] "numeric"
x <- c(1, 2, TRUE, 3)
x
## [1] 1 2 1 3
mode(x)
## [1] "numeric"
x <- c(1, 2, "TRUE", 4)
x
## [1] "1"    "2"    "TRUE" "4"
mode(x)
## [1] "character"
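
A small sketch of the conversion order seen above (logical, then integer, then double, then character), using typeof() to inspect the result. The last lines show that converting back is only partly possible:

```r
t1 <- typeof(c(TRUE, 1L))     # "integer"
t2 <- typeof(c(1L, 2.5))      # "double"
t3 <- typeof(c(2.5, "a"))     # "character"

# Text that does not look like a number becomes NA (R also emits a warning,
# silenced here to keep the example quiet).
x <- c(1, 2, "TRUE", 4)
back <- suppressWarnings(as.numeric(x))
back                          # 1 2 NA 4
```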

Matrix

Matrices are two-dimensional vectors. Elements are accessed by their coordinates (row, then column); the first index is 1 (and not 0 as in many other languages).

mat <- matrix(c(1:6), nrow = 2, byrow = TRUE)
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
mat[1, 2]
## [1] 2

One can also access an element with a single index, which follows “by column” order (R stores matrices column by column, even when they were filled with byrow = TRUE).

mat <- matrix(c(1:6), nrow = 2, byrow = TRUE)
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
mat[5]
## [1] 3
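
The relation between the two ways of indexing can be sketched as below. The names i, j and linear are illustrative, and the formula assumes the column-by-column storage described above:

```r
mat <- matrix(1:6, nrow = 2, byrow = TRUE)

i <- 1   # row
j <- 3   # column
# Position of element (i, j) when counting column by column.
linear <- (j - 1) * nrow(mat) + i     # 5
same <- mat[linear] == mat[i, j]      # TRUE

# arrayInd() performs the reverse conversion, from position to (row, column).
pos <- arrayInd(5, dim(mat))
pos                                   # row 1, column 3
```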

A matrix can also be constructed manually, starting from a vector and giving it dimensions. Setting the dim attribute is what makes the object a matrix; the explicit class assignment below is redundant.

mat <- 1:30
attr(mat, "dim") <- c(5, 6)
class(mat) <- "matrix"
mat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    6   11   16   21   26
## [2,]    2    7   12   17   22   27
## [3,]    3    8   13   18   23   28
## [4,]    4    9   14   19   24   29
## [5,]    5   10   15   20   25   30
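
A quick check, offered as a sketch, that the dimensions alone are enough for R to treat a vector as a matrix (the variable name v is illustrative):

```r
v <- 1:30
before <- is.matrix(v)     # FALSE: a plain vector
dim(v) <- c(5, 6)          # same effect as attr(v, "dim") <- c(5, 6)
after <- is.matrix(v)      # TRUE: no class assignment was needed
names(attributes(v))       # "dim" is the only attribute set
```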

As with vectors, all elements must be of the same mode. Modifying a matrix of numbers by inserting a text value turns all numeric values into characters.

mat <- matrix(c(1:6), nrow = 2, byrow = TRUE)
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
mat[1, 2] <- "a"
mat
##      [,1] [,2] [,3]
## [1,] "1"  "a"  "3" 
## [2,] "4"  "5"  "6"

List

A list is a collection of objects of possibly different modes (vectors, matrices, other lists…) and potentially different sizes.

mylist <- list(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), sex = c("M", 
    "F", "M", "M"))

Elements are accessed either by position or by name. Double brackets return the element itself, whereas single brackets return a list containing it.

mylist[[1]]
## [1] 12 35 34 62
mylist[1]
## $ages
## [1] 12 35 34 62
mylist$height
## [1] 135 128 164 165
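
The difference between single and double brackets is easy to miss in printed output. A minimal sketch that makes it visible with class() (variable names sub and elt are illustrative):

```r
mylist <- list(ages = c(12, 35, 34, 62), sex = c("M", "F", "M", "M"))

sub <- mylist[1]      # single brackets keep the list wrapper
elt <- mylist[[1]]    # double brackets extract the element itself

class(sub)            # "list"
class(elt)            # "numeric"

# Names work with both forms; [[ "sex" ]] is equivalent to $sex.
by_name <- mylist[["sex"]]
both <- mylist[c("ages", "sex")]
```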

How to turn a list into a data.frame:

mylist <- list(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), sex = c("M", 
    "F", "M", "M"))
mydf <- as.data.frame(mylist)
mydf
##   ages height sex
## 1   12    135   M
## 2   35    128   F
## 3   34    164   M
## 4   62    165   M

Data frames

A data.frame is a list in which all elements (the columns) have the same length.

data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), 
    sex = c("M", "F", "M", "M"))
data
##   ages height sex
## 1   12    135   M
## 2   35    128   F
## 3   34    164   M
## 4   62    165   M
class(data)
## [1] "data.frame"
mode(data)
## [1] "list"

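Important note:
In R versions before 4.0.0, data.frame() and read.table() convert all character values into factors by default, which is the behavior shown in the outputs of this document. Since R 4.0.0 the default is stringsAsFactors=FALSE and character columns are kept as they are.
With older versions, use stringsAsFactors=FALSE or as.is=TRUE to avoid the conversion. It can also be set in defaults by options(stringsAsFactors=FALSE).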
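
Since the default depends on the R version, one way to get predictable results is to always state the choice explicitly. A minimal sketch (the data frame names are illustrative):

```r
# Explicitly keep text as character.
as_text <- data.frame(sex = c("M", "F", "M"), stringsAsFactors = FALSE)
# Explicitly convert text to factor.
as_fact <- data.frame(sex = c("M", "F", "M"), stringsAsFactors = TRUE)

class(as_text$sex)    # "character"
class(as_fact$sex)    # "factor"
```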

Data access

Access to data can be done by index or by label.

data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), 
    sex = c("M", "F", "M", "M"))
data[1]
##   ages
## 1   12
## 2   35
## 3   34
## 4   62
data[[1]]
## [1] 12 35 34 62
data[, 1]
## [1] 12 35 34 62
data$height
## [1] 135 128 164 165
data[, "height"]
## [1] 135 128 164 165

A good rule of thumb is to always use labels for accessing data frame columns (in case the column order changes), and to write the labels in full, since a partial label forces R to check it against all column names.
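
Partial labels can be illustrated as follows. This sketch assumes the default settings, under which $ accepts an unambiguous partial label while [[ ]] requires the exact one:

```r
data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165),
    sex = c("M", "F", "M", "M"))

by_partial <- data$h           # "h" is completed to "height"
by_exact <- data[["h"]]        # exact matching: no column is called "h"
by_full <- data[["height"]]    # the full label always works

by_partial                     # 135 128 164 165
is.null(by_exact)              # TRUE
```

If a column whose name starts with "h" were added later, the partial form would no longer designate a single column, which is one more reason to prefer full labels.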

Subset

Using negative indices (not valid with labels).

data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), 
    sex = c("M", "F", "M", "M"))
data[, -2]
##   ages sex
## 1   12   M
## 2   35   F
## 3   34   M
## 4   62   M
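
Because a minus sign cannot be applied to a label, a column can instead be discarded by name by selecting all the other names. A possible sketch, using setdiff() or a logical mask (the names keep and by_mask are illustrative):

```r
data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165),
    sex = c("M", "F", "M", "M"))

# Keep every column whose name is not "height".
keep <- setdiff(names(data), "height")
by_name <- data[, keep]

# Equivalent selection with a logical mask.
by_mask <- data[, !(names(data) %in% "height")]

names(by_name)    # "ages" "sex"
```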

Using -grep(), but be careful about which columns really match.

data[, -grep("h", names(data))]
##   ages sex
## 1   12   M
## 2   35   F
## 3   34   M
## 4   62   M
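
One case deserves particular care: when the pattern matches no column at all. The sketch below shows what happens, and a variant based on grepl() that avoids the problem (the patterns and variable names are illustrative):

```r
data <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165),
    sex = c("M", "F", "M", "M"))

# No column name contains "z": grep() returns an empty index,
# and the negation of an empty index selects no column at all.
lost <- data[, -grep("z", names(data))]
ncol(lost)                                  # 0

# A logical mask keeps all columns when nothing matches.
kept <- data[, !grepl("z", names(data))]
ncol(kept)                                  # 3

# drop = FALSE keeps a data frame even when a single column remains.
one <- data[, !grepl("s", names(data)), drop = FALSE]
names(one)                                  # "height"
```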

Using subset() and a - sign before the column name you want to discard.

subset(data, select = -height)
##   ages sex
## 1   12   M
## 2   35   F
## 3   34   M
## 4   62   M

Factors

Factors represent categorical variables, each level being a distinct text label. This representation is generally more memory-efficient than a vector of characters.

hair <- factor(c("blond", "brown", "red", "blond"))
hair
## [1] blond brown red   blond
## Levels: blond brown red
class(hair)
## [1] "factor"
mode(hair)
## [1] "numeric"

To be more memory-efficient, factors encode the different levels internally as numbers.

as.numeric(hair)
## [1] 1 2 3 1
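
The internal coding can be explored a little further. The sketch below rebuilds the labels from the codes, and shows a classic trap when the labels themselves look like numbers (the variable sizes is made up for this example):

```r
hair <- factor(c("blond", "brown", "red", "blond"))

typeof(hair)                               # "integer"
codes <- as.integer(hair)                  # 1 2 3 1
labels <- levels(hair)[codes]              # back to the text labels

# With number-like labels, as.numeric() returns the codes, not the values.
sizes <- factor(c("10", "20", "10"))
wrong <- as.numeric(sizes)                 # 1 2 1
right <- as.numeric(as.character(sizes))   # 10 20 10
```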

A factor variable can be modified, but only by using already-defined levels.

hair <- factor(c("blond", "brown", "red", "blond"))
hair[2] <- "blond"
hair
## [1] blond blond red   blond
## Levels: blond brown red
hair[2] <- "grey"
## Warning: invalid factor level, NA generated
hair
## [1] blond <NA>  red   blond
## Levels: blond brown red

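To add a new value: convert the factor to character, modify the element, and convert the result back to a factor. Note that levels that are no longer used (brown below) disappear in the process.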

hair <- factor(c("blond", "brown", "red", "blond"))
hair
## [1] blond brown red   blond
## Levels: blond brown red
char_hair <- as.character(hair)
char_hair[2] <- "grey"
hair <- factor(char_hair)
hair
## [1] blond grey  red   blond
## Levels: blond grey red
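
An alternative, given here as a sketch rather than as the recommended method, is to declare the new level first with levels(), which keeps the existing levels in place. droplevels() can then remove the unused ones if needed:

```r
hair <- factor(c("blond", "brown", "red", "blond"))

# Declare the new level, then assign it.
levels(hair) <- c(levels(hair), "grey")
hair[2] <- "grey"

as.character(hair)     # "blond" "grey" "red" "blond"
levels(hair)           # "blond" "brown" "red" "grey" (brown is kept)

# Remove levels that are no longer used.
trimmed <- droplevels(hair)
```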

Factor variables are not easy to handle. For instance, they are difficult to concatenate: in R versions before 4.1.0, c() concatenates only the internal numeric codes, as shown below (since R 4.1.0 it returns a factor). As workarounds, one can first turn the factors into character vectors, or alternatively use the list()/unlist() pair of functions.

hair <- factor(c("blond", "brown", "red", "blond"))
c(hair, hair)
## [1] 1 2 3 1 1 2 3 1
# Workaround 1: concatenate as character vectors, then convert back to a factor
factor(c(as.character(hair), as.character(hair)))
## [1] blond brown red   blond blond brown red   blond
## Levels: blond brown red
# Workaround 2
unlist(list(hair, hair))
## [1] blond brown red   blond blond brown red   blond
## Levels: blond brown red

Use ordered=TRUE when factors represent ordered values. This allows meaningful comparisons between levels; otherwise, comparing the labels as text only gives their alphabetical order.

time <- factor(c(1, 2, 3, 2, 2, 1), levels = c(1, 2, 3), labels = c("never", 
    "sometimes", "always"), ordered = TRUE)
time
## [1] never     sometimes always    sometimes sometimes never    
## Levels: never < sometimes < always
time[2] < time[3]
## [1] TRUE
"sometimes" < "always"
## [1] FALSE
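
A short sketch of what the ordering makes possible, compared with a factor that is not ordered (the variable names are illustrative):

```r
freq <- factor(c("never", "always", "sometimes"),
    levels = c("never", "sometimes", "always"), ordered = TRUE)

# Comparisons follow the order of the levels, not the alphabet.
at_least <- freq >= "sometimes"     # FALSE TRUE TRUE
highest <- max(freq)                # always
sorted <- sort(freq)                # never sometimes always

# Without ordered = TRUE, "<" is not meaningful: R returns NA
# with a warning (silenced here).
plain <- factor(c("never", "always"))
cmp <- suppressWarnings(plain[1] < plain[2])
```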



How to get information about the content/structure of a variable

summary()

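This function displays a summary of the variable content; what it shows depends on the class of the variable.
NB: a Class of -none- means that the element has no explicit class attribute, so its class simply follows from its mode.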

mylist <- list(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), sex = c("M", 
    "F", "M", "M"))
summary(mylist)
##        Length Class  Mode     
## ages   4      -none- numeric  
## height 4      -none- numeric  
## sex    4      -none- character

When applied to a data frame, it summarizes each column's content: minimum, maximum, median, mean and quartiles for numeric columns, and the number of occurrences of each level for factor columns.

mydf <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), 
    sex = c("M", "F", "M", "M"))
summary(mydf)
##       ages          height    sex  
##  Min.   :12.0   Min.   :128   F:1  
##  1st Qu.:28.5   1st Qu.:133   M:3  
##  Median :34.5   Median :150        
##  Mean   :35.8   Mean   :148        
##  3rd Qu.:41.8   3rd Qu.:164        
##  Max.   :62.0   Max.   :165
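
The same dependence on the class can be seen on single vectors. A minimal sketch, reusing the values of the data frame above:

```r
# Numeric vector: minimum, quartiles, median, mean, maximum.
num_summary <- summary(c(12, 35, 34, 62))

# Factor: number of occurrences of each level.
fac_summary <- summary(factor(c("M", "F", "M", "M")))

# Character vector: only length, class and mode.
chr_summary <- summary(c("M", "F", "M", "M"))

print(num_summary)
print(fac_summary)
print(chr_summary)
```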

str()

This function displays compact but detailed information about the structure of an object: its class, its size, and the type and first values of each component.

mydf <- data.frame(ages = c(12, 35, 34, 62), height = c(135, 128, 164, 165), 
    sex = c("M", "F", "M", "M"))
str(mydf)
## 'data.frame':    4 obs. of  3 variables:
##  $ ages  : num  12 35 34 62
##  $ height: num  135 128 164 165
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 2 2
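
str() is also convenient for nested objects. In the sketch below, the list nested is made up for the example, and the max.level argument limits how deep the display goes:

```r
nested <- list(ages = c(12, 35, 34, 62),
    details = list(sex = c("M", "F", "M", "M"), id = 1:4))

# Full structure, including the inner list.
str(nested)

# Only the first level: the inner list is reported but not expanded.
str(nested, max.level = 1)
```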

Creative Commons License
This work by Celine Hernandez is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.