I ran into this problem recently when trying to import the data my Twitter scraper produced, and I thought it might make a worthwhile post.
The file I was trying to import was ~30 GB, which is absolutely monstrous. This was in part due to all of the fields I didn’t bother dropping before writing the records to my data.json file.
The Process
The first thing I needed to do was figure out a manageable size. Thankfully, the ndjson format keeps each record on a single line, so I could split the lines into however many files were needed, based on the number of records my system could process within its memory (RAM) limit. I settled on 50,000 records per file, knowing my system could handle about 800,000 before filling up my RAM and paging file, and that I planned to parallelize the process across 16 threads to speed it up quite dramatically.
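Before settling on a chunk size, it helps to know how many records you are actually dealing with; since ndjson is one record per line, a plain line count is enough. Something like this from the shell does the trick (the count for your own file will obviously differ):

wc -l data.json

Dividing that count by the records-per-file figure gives you a rough idea of how many split files to expect.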
I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.
split -l 50000 data.json ./import/tweets_
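If you want to sanity-check the split before moving on, something like the following lists the first few segments and confirms that a segment holds 50,000 lines (the last file will usually be smaller); tweets_aa is just the first suffix split assigns by default:

ls ./import | head
wc -l ./import/tweets_aa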
Simple, right? Now we will probably want to see the variables (technically properties, since these are JavaScript objects).
head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
This gives you output similar to this:
"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
"lang":
"following":
"notifications":
"geo":
"coordinates":
"place":
"contributors":
"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
...
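The listing contains duplicates, since nested objects (the user object, embedded tweets, and so on) re-use key names such as "id" and "text". If you only want the distinct property names, piping through sort -u is a standard way to de-duplicate the list, at the cost of the original ordering:

head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:' | sort -u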
Regular expressions are the best, aren't they? Now for the R code that makes this buildup actually worthwhile.
library("data.table")
library("parallel")
library("jsonlite")

# Parallelize this process across 16 threads
cluster <- makeCluster(16)

# Export the jsonlite function stream_in to the cluster workers
clusterExport(cluster, "stream_in")

# Create an empty list to hold the dataframe for each file
import <- list()

# Run this function on every file in the ./import directory
import <- parLapply(cluster, list.files(path = "./import"), function(file) {
  # jsonlite function to convert the ndjson file to a dataframe
  df <- stream_in(file(paste0("./import/", file)))
  # select which columns to keep
  df <- df[, c("text", "created_at", "lat", "lng", "id_str")]
  return(df)
})

# rbindlist() from data.table stacks the per-file dataframes into one table
df <- rbindlist(import)

# Now you can stop the cluster
stopCluster(cluster)
Now the system won’t bonk, since it is only keeping five variables! You will notice your RAM fluctuate quite a bit while reading in the files, since the initial stream_in() loads all of the properties into the dataframe (sometimes with nesting). Once the extra columns are dropped, the memory is freed up. Happy programming 🙂
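If you want to confirm the result and avoid re-running the whole import later, a quick sketch in R (the output filename here is just an example):

# How many rows made it in, and how much memory does the trimmed table use?
nrow(df)
format(object.size(df), units = "auto")

# Persist the trimmed table so the raw 30 GB file never has to be touched again
fwrite(df, "tweets_trimmed.csv")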