Press "Enter" to skip to content

Importing Large NDJSON Files into R

I ran into this problem recently when trying to import the data my Twitter scraper produced and thought it might make a worthwhile post.

The file I was trying to import was ~30GB, which is absolutely monstrous. This was partly due to all of the fields I didn’t bother dropping before writing the records to my data.json file.

The Process

The first thing I needed to do was figure out a manageable size. Thankfully, the NDJSON format keeps each record on its own line, so I could split the file into however many segments it took, based on a known number of records my system could process within my memory (RAM) limit. I decided on 50,000 records per segment, knowing my system could handle about 800,000 before filling up my RAM and paging file, and that I planned on parallelizing the process across 16 threads to speed it up quite dramatically.

I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.
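Something along these lines does it — the splits/ folder and the segment_ prefix are just names I picked here, so use whatever you like:

```bash
# Split data.json into 50,000-line chunks (one NDJSON record per line).
# The pieces come out as splits/segment_aa, splits/segment_ab, and so on.
split -l 50000 data.json splits/segment_
```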

Simple, right? Now we will probably want to see the variables (technically properties, since these are JavaScript objects).
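One way is to grab the first record and let a regular expression pull out the quoted key names. This is just a sketch (it needs GNU grep for the -P flag); anything that extracts the keys works:

```bash
# Take the first record and list its property names.
# \K discards the opening quote; the lookahead stops the match before the colon.
head -n 1 data.json | grep -oP '"\K[A-Za-z_]+(?="\s*:)' | sort -u
```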

This gives you output similar to this (the exact list depends on which fields your scraper kept):
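```
created_at
entities
id
id_str
text
truncated
user
...
```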

Regular expressions are the best, aren’t they? Now for the R code which makes this buildup actually worthwhile.
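Here is a minimal sketch of that code. It assumes the split segments live in a splits/ folder and that the five columns worth keeping are the ones in `keep` below — both are placeholders, so substitute your own:

```r
library(jsonlite)
library(parallel)

# The five properties to keep -- placeholders, swap in the ones you actually need.
keep <- c("created_at", "id_str", "text", "lang", "retweet_count")

read_segment <- function(path) {
  df <- stream_in(file(path), verbose = FALSE)    # pulls in every property at first
  df <- flatten(df)                               # un-nest nested objects into flat columns
  df[, intersect(keep, names(df)), drop = FALSE]  # drop everything else so the memory is freed
}

files <- list.files("splits", full.names = TRUE)

# mclapply() forks one worker per file, up to 16 at a time (Linux/macOS);
# on Windows use parLapply() with a makeCluster(16) cluster instead.
tweets <- do.call(rbind, mclapply(files, read_segment, mc.cores = 16))
```

Trimming each segment down to the keep columns inside read_segment() is what stops the full ~30GB from ever sitting in memory at once.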

Now the system won’t bonk, since it’s only keeping 5 variables! You will notice your RAM usage fluctuate quite a bit while reading in the files, since the initial stream_in() loads all of the properties into the data frame (sometimes with nesting). Once the unneeded columns are dropped, the memory is freed up. Happy programming 🙂
