Press "Enter" to skip to content

Importing Large NDJSON Files into R

I ran into this problem recently when trying to import the data my Twitter scraper produced and thought it might make a worthwhile post.

The file I was trying to import was ~30GB, which is absolutely monstrous. This was partly due to all of the fields I didn’t bother dropping before writing the records to my data.json file.

The Process

The first thing I needed to do was figure out a manageable size. Thankfully, the NDJSON format keeps each record on its own line, so I could split the file into however many segments it took, based on a known number of records my system could process within my memory (RAM) limit. I decided on 50,000 records per segment, knowing my system could handle about 800,000 before filling up my RAM and paging file, and that I planned on parallelizing the process across 16 threads to speed it up quite dramatically.

I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.
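Something along these lines does it — the splits/ folder and the segment_ prefix are just names I picked here, so use whatever you like:

```bash
# Split data.json into 50,000-line chunks (one NDJSON record per line).
# The pieces come out as splits/segment_aa, splits/segment_ab, and so on.
split -l 50000 data.json splits/segment_
```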

Simple, right? Now we will probably want to see the variables (technically properties, since these are JavaScript objects).
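One way is to grab the first record and let a regular expression pull out the quoted key names. This is just a sketch (it needs GNU grep for the -P flag); anything that extracts the keys works:

```bash
# Take the first record and list its property names.
# \K discards the opening quote; the lookahead stops the match before the colon.
head -n 1 data.json | grep -oP '"\K[A-Za-z_]+(?="\s*:)' | sort -u
```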

This gives you output similar to this (the exact list depends on which fields your scraper kept):
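```
created_at
entities
id
id_str
text
truncated
user
...
```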

Regular expressions are the best, aren’t they? Now for the R code which makes this buildup actually worthwhile.
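Here is a minimal sketch of that code. It assumes the split segments live in a splits/ folder and that the five columns worth keeping are the ones in `keep` below — both are placeholders, so substitute your own:

```r
library(jsonlite)
library(parallel)

# The five properties to keep -- placeholders, swap in the ones you actually need.
keep <- c("created_at", "id_str", "text", "lang", "retweet_count")

read_segment <- function(path) {
  df <- stream_in(file(path), verbose = FALSE)    # pulls in every property at first
  df <- flatten(df)                               # un-nest nested objects into flat columns
  df[, intersect(keep, names(df)), drop = FALSE]  # drop everything else so the memory is freed
}

files <- list.files("splits", full.names = TRUE)

# mclapply() forks one worker per file, up to 16 at a time (Linux/macOS);
# on Windows use parLapply() with a makeCluster(16) cluster instead.
tweets <- do.call(rbind, mclapply(files, read_segment, mc.cores = 16))
```

Trimming each segment down to the keep columns inside read_segment() is what stops the full ~30GB from ever sitting in memory at once.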

Now the system won’t bonk, since it’s only keeping 5 variables! You will notice your RAM usage fluctuate quite a bit while reading in the files, since the initial stream_in() loads all of the properties into the data frame (sometimes with nesting). Once the unneeded columns are dropped, the memory is freed up. Happy programming 🙂
