Processing Large Data sets with Data.table
Dealing with Data sets in r is slow with data.frame. It takes very long time to process data with R’s Built in data processing Tools. data.table is perfect companion to handle large data sets in least amount of time
By using data.table you will not only be reducing computing time, but programming time as well. As you are already familiar with R’s basic syntax of data.frame we will see difference between data.frame and data.table
We are using Data from Kaggle OUTBRAIN Click prediction competition. First we will use R’s basic read.csv file to load file into R’s memory. We will load a 1.4GB data set into memory
system.time(
clicks_train <- read.csv("clicks_train.csv",header = T)
)
user system elapsed
166.964 2.912 169.869
It took 169.869 seconds to load a single data set into memory let load it using data.table
system.time(
clicks_train <- fread("clicks_train.csv",header = T)
)
user system elapsed
9.468 0.396 13.786
It took just 10 seconds to load data set . that is a massive improvement. the csv file is loaded in data.table format . we can convert normal data.frames to data.table by using as.data.table() command. we can create data.tables by using data.table command
DT <- data.table(V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(4),4),
V4=1:12)
Data table syntax is mainly based on DT[i,j,by] command. it has three parts: i, j and by
SUBSETTING ROWS USING i
You can subset by row numbers
clicks_train[2:10]
display_id ad_id clicked
1: 1 139684 0
2: 1 144739 1
3: 1 156824 0
4: 1 279295 0
5: 1 296965 0
6: 2 125211 0
7: 2 156535 0
8: 2 169564 0
9: 2 308455 1
you can also subset by Use column names to select rows in i based on a condition using fast automatic indexing. Or for selecting on multiple values DT[column %in% c(“value1″,”value2”)], which selects all rows that have value1 or value2 in column.
clicks_train[display_id == 1]
display_id ad_id clicked
1: 1 42337 0
2: 1 139684 0
3: 1 144739 1
4: 1 156824 0
5: 1 279295 0
6: 1 296965 0
clicks_train[display_id %in% c(1,2,3)]
display_id ad_id clicked
1: 1 42337 0
2: 1 139684 0
3: 1 144739 1
4: 1 156824 0
5: 1 279295 0
MANIPULATING ON COLUMNS IN J
Select several columns in j DT[,V2] , where coloum V2 is returned as a vector . if you want multiple coloums to be returned use DT[,.(V2,V3)] where V2 and V3 are column names
clicks_train[,ad_id]
clicks_train[,.(ad_id,display_id)]
.() is an alias to list(). If .() is used, the returned value is a data.table. If .() is not used, the result is a vector.
You can call functions in j and operate on several coloums and assign names to them
clicks_train[,length(unique(ad_id))]
[1] 478950
# There are 478950 unique ad_id's in the set
clicks_train[,.(length(unique(ad_id)),mean(display_id))]
V1 V2
1: 478950 8443768
# The mean of display id's is 8443768
clicks_train[,.(unique = length(unique(ad_id)),mean = mean(display_id))]
unique mean
1: 478950 8443768
# coloums get recycled if different length
clicks_train[,.(unique = ad_id,mean = mean(display_id))]
unique mean
1: 42337 8443768
2: 139684 8443768
3: 144739 8443768
4: 156824 8443768
5: 279295 8443768
You can update coloums with out reassigning the above to another variable using := expression,Adding/updating several columns by reference using :=.
clicks_train[,mean := mean(display_id)]
head(clicks_train)
display_id ad_id clicked mean
1: 1 42337 0 8443768
2: 1 139684 0 8443768
3: 1 144739 1 8443768
4: 1 156824 0 8443768
5: 1 279295 0 8443768
6: 1 296965 0 8443768
clicks_train[,c("mean","uniques") := list(mean(display_id),length(unique(ad_id)))]
head(clicks_train)
display_id ad_id clicked mean uniques
1: 1 42337 0 8443768 478950
2: 1 139684 0 8443768 478950
3: 1 144739 1 8443768 478950
4: 1 156824 0 8443768 478950
5: 1 279295 0 8443768 478950
6: 1 296965 0 8443768 478950
# .() is same as list()
clicks_train[,c("mean","uniques") := .(mean(display_id),length(unique(ad_id)))]
head(clicks_train)
display_id ad_id clicked mean uniques
1: 1 42337 0 8443768 478950
2: 1 139684 0 8443768 478950
3: 1 144739 1 8443768 478950
4: 1 156824 0 8443768 478950
5: 1 279295 0 8443768 478950
6: 1 296965 0 8443768 478950
# You can use the below formation also to get same result as above
clicks_train[,':=' (mean = mean(display_id),uniques =length(unique(ad_id)))]
head(clicks_train)
display_id ad_id clicked mean uniques
1: 1 42337 0 8443768 478950
2: 1 139684 0 8443768 478950
3: 1 144739 1 8443768 478950
4: 1 156824 0 8443768 478950
5: 1 279295 0 8443768 478950
6: 1 296965 0 8443768 478950
Grouping By J
The by part The final section of this data.table R tutorial focuses on the ‘by’ part. The ‘by’ part is used when we want to calculate the ‘j’ part grouped by a specific variable (or a manipulation of that variable). You will see that the ‘j’ expression is repeated for each ‘by’ group. It is simple to use: you just specify the column you want to group by in the ‘by’ argument. Lets find out no of clicks on each ad
clicks_train[,click := sum(clicked),by = ad_id]
head(clicks_train)
display_id ad_id clicked click
1: 1 42337 0 1970
2: 1 139684 0 199
3: 1 144739 1 18462
4: 1 156824 0 1897
5: 1 279295 0 109
6: 1 296965 0 4
The data.table is very fast when compared to base R and easy to carry out . if you have any question feel free comment below