Data preprocessing with Data.table in R

2017/12/09

Categories: Data_Science

Processing Large Data sets with Data.table

Dealing with Data sets in r is slow with data.frame. It takes very long time to process data with R’s Built in data processing Tools. data.table is perfect companion to handle large data sets in least amount of time

By using data.table you will not only be reducing computing time, but programming time as well. As you are already familiar with R’s basic syntax of data.frame we will see difference between data.frame and data.table

We are using Data from Kaggle OUTBRAIN Click prediction competition. First we will use R’s basic read.csv file to load file into R’s memory. We will load a 1.4GB data set into memory

system.time(
  clicks_train <- read.csv("clicks_train.csv",header = T)
)
   user  system  elapsed 
166.964   2.912  169.869

It took 169.869 seconds to load a single data set into memory let load it using data.table

system.time(
  clicks_train <- fread("clicks_train.csv",header = T)
)
  user  system elapsed 
 9.468   0.396  13.786 

It took just 10 seconds to load data set . that is a massive improvement. the csv file is loaded in data.table format . we can convert normal data.frames to data.table by using as.data.table() command. we can create data.tables by using data.table command

DT <- data.table(V1=c(1L,2L),
                 V2=LETTERS[1:3],
                 V3=round(rnorm(4),4),
                 V4=1:12)

Data table syntax is mainly based on DT[i,j,by] command. it has three parts: i, j and by

SUBSETTING ROWS USING i

You can subset by row numbers

clicks_train[2:10]
   display_id  ad_id clicked
1:          1 139684       0
2:          1 144739       1
3:          1 156824       0
4:          1 279295       0
5:          1 296965       0
6:          2 125211       0
7:          2 156535       0
8:          2 169564       0
9:          2 308455       1

you can also subset by Use column names to select rows in i based on a condition using fast automatic indexing. Or for selecting on multiple values DT[column %in% c(“value1″,”value2”)], which selects all rows that have value1 or value2 in column.

clicks_train[display_id == 1]
   display_id  ad_id clicked
1:          1  42337       0
2:          1 139684       0
3:          1 144739       1
4:          1 156824       0
5:          1 279295       0
6:          1 296965       0
clicks_train[display_id %in% c(1,2,3)]
    display_id  ad_id clicked
 1:          1  42337       0
 2:          1 139684       0
 3:          1 144739       1
 4:          1 156824       0
 5:          1 279295       0

MANIPULATING ON COLUMNS IN J

Select several columns in j DT[,V2] , where coloum V2 is returned as a vector . if you want multiple coloums to be returned use DT[,.(V2,V3)] where V2 and V3 are column names

clicks_train[,ad_id]
clicks_train[,.(ad_id,display_id)]

.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.

You can call functions in j and operate on several coloums and assign names to them

clicks_train[,length(unique(ad_id))]
[1] 478950

# There are 478950 unique ad_id's in the set
 
clicks_train[,.(length(unique(ad_id)),mean(display_id))]
       V1      V2
1: 478950 8443768
 
# The mean of display id's is 8443768

clicks_train[,.(unique = length(unique(ad_id)),mean = mean(display_id))]
   unique    mean
1: 478950 8443768
 
# coloums get recycled if different length
 
clicks_train[,.(unique = ad_id,mean = mean(display_id))]
          unique    mean
       1:  42337 8443768
       2: 139684 8443768
       3: 144739 8443768
       4: 156824 8443768
       5: 279295 8443768

You can update coloums with out reassigning the above to another variable using := expression,Adding/updating several columns by reference using :=.

clicks_train[,mean := mean(display_id)]
head(clicks_train)
   display_id  ad_id clicked    mean
1:          1  42337       0 8443768
2:          1 139684       0 8443768
3:          1 144739       1 8443768
4:          1 156824       0 8443768
5:          1 279295       0 8443768
6:          1 296965       0 8443768
clicks_train[,c("mean","uniques") := list(mean(display_id),length(unique(ad_id)))]
head(clicks_train)
   display_id  ad_id clicked    mean uniques
1:          1  42337       0 8443768  478950
2:          1 139684       0 8443768  478950
3:          1 144739       1 8443768  478950
4:          1 156824       0 8443768  478950
5:          1 279295       0 8443768  478950
6:          1 296965       0 8443768  478950
# .() is same as list()

clicks_train[,c("mean","uniques") := .(mean(display_id),length(unique(ad_id)))]
head(clicks_train)
   display_id  ad_id clicked    mean uniques
1:          1  42337       0 8443768  478950
2:          1 139684       0 8443768  478950
3:          1 144739       1 8443768  478950
4:          1 156824       0 8443768  478950
5:          1 279295       0 8443768  478950
6:          1 296965       0 8443768  478950
# You can use the below formation also to get same result as above 
 
clicks_train[,':=' (mean = mean(display_id),uniques =length(unique(ad_id)))]

head(clicks_train)
   display_id  ad_id clicked    mean uniques
1:          1  42337       0 8443768  478950
2:          1 139684       0 8443768  478950
3:          1 144739       1 8443768  478950
4:          1 156824       0 8443768  478950
5:          1 279295       0 8443768  478950
6:          1 296965       0 8443768  478950

Grouping By J

The by part The final section of this data.table R tutorial focuses on the ‘by’ part. The ‘by’ part is used when we want to calculate the ‘j’ part grouped by a specific variable (or a manipulation of that variable). You will see that the ‘j’ expression is repeated for each ‘by’ group. It is simple to use: you just specify the column you want to group by in the ‘by’ argument. Lets find out no of clicks on each ad

clicks_train[,click := sum(clicked),by = ad_id]
head(clicks_train)
   display_id  ad_id clicked click
1:          1  42337       0  1970
2:          1 139684       0   199
3:          1 144739       1 18462
4:          1 156824       0  1897
5:          1 279295       0   109
6:          1 296965       0     4

The data.table is very fast when compared to base R and easy to carry out . if you have any question feel free comment below