Press "Enter" to skip to content

Visualizing Bike Share Data (NiceRide)

This tutorial will cover exploring and visualizing data through 2018 for the Minneapolis, MN bike sharing service NiceRide. Part of what makes R incredible is the number of great packages. Part of what makes packages like ggmap and gganimate great is how they build on existing packages.

Station Network

First step, as always, is to include the libraries we will be using.

Fetching the Data (Fail)

First we are going to cover how I would normally approach fetching tables like the ones shown on https://s3.amazonaws.com/niceride-data/index.html.

This approach did not work.

This is where we learn the approach didn’t work. Calling tbls gives us this return

Since there is only one table we would normally be able to use the html_tables function from rvest and have a nice data frame.

Instead, we get the return above, which has been commented out. Zero rows. This is because the data which populates the page is loaded after. There is still hope though.

Fetching the Data (Success)

This is the approach that worked. If we go up one level to https://s3.amazonaws.com/niceride-data/ we can see an xml file which we can use.

That last bit lets us see a sampling of repo

Awesome, this is something we can use.

The function as_list is needed because we are converting an xml_nodeset to a vector. We used str_match_all to find files that start with 2018. The %<>% operator passes repo to as_list and also sets repo to the return value of the chain. At this point if you were to call repo you would see the following…

Downloading and Unzipping

You could merge both actions into one loop, but I made two loops. The %T>% operator comes in handy here to unlink the concatenated filename instead of the return of unzip.

Read and Merge into a Dataframe

Now that we have the files unzipped we can read them into a list and merge the list into a dataframe. You could do this manually, but this will allow for you to change the year easily, or merge 2018 and 2017 into a dataframe to company ridership between years.

Additional Columns

Now that we have the data in rides, we can build some extra columns/features. Age is a much more intuitive field than birth_year, and grouping ages into bins will come in handy later.

We probably will want to see things in a more macro level view, so we should build out some date/time columns as well.

Some Tables for Context

You can print the info in a different way, I am just doing this to format everything better for the web.

From this we can see that tossing out anything below a half a percent would remove 80-100 and 100-120. We could probably remove 0-20 as well, but that is up to you.

The Delicious Part

Now comes the time to visually explore the data. For the same of simplicity I am just using theme_fivethirtyeight() and scale_.*_viridis.*() for the theme and colors of most of these plots.

Visualizing Ridership by Month

Ride Frequency Histogram

The first thing I notice is the 0-20 and 60-80 year old frequencies are so low they are nearly impossible to identify from this plot. If we want to learn what week of the year people are riding a density plot might work better. There is probably a better way to do this, but in order to get the red line I added an additional layer.

Ride Frequency Density

Looks like we are seeing a downward trend toward the end of the year for subscribers/20-80 and customers/20-80 with the biggest downward trends for subscribers/60-80. Makes sense, from the previous plot we saw week 40 fell in October and Minneapolis can get chilly.

Visualizing Start Time

I am dumping a big chunk of code here, don’t be alarmed. The first bit of code filters thanks to dplyr and only keeps the more active months. I also filtered out those age bins we talked about earlier. The droplevels() is because ggplot didn’t like when I had unused levels (from the months).

I also added breaks and labels for the time to correspond with peak times.

Ride Start Times

This plot gives us a lot of really interesting information! You can see how NiceRide is used by many people who are presumably commuting. There is also a lunch rush for NiceRide bikes.

Maps!

So far all of this information has been on a macro level for time based ridership. Now we will switch to location based ridership.

Static Network Map

In order to generate static heatmaps we will be using the ggmap package along with the ggplot2::geom_segment() function. There is also a ggplot2::geom_curve() function which you can use to generate curved line segments. For the curved line segments you must also use coord_cartesian()

In order to speed up the rendering of geom_segment() I aggregated the data and then scaled the opacity of the line segments. I don’t think this was a necessary step, but it made rendering (and tweaking the plot) much easier.

Now we have a handy dandy dataframe which has the number of rides between stations. The start_station_name and end_station_name are not required, but they are helpful if you want to further explore station data. You can also summarize on the fly for station information with plyr

Neat, right? Also, that Lake Street & Knox Ave S location had over six thousand round trips. Also pretty interesting that all of the top 10 were round trips except for Weisman Art Museum/Willey Hall. Changing to the top 20 stations we see that pop four times as a starting station.

Getting the Base Map

The package ggmap has a function called get_map() which we will be using. You can pass the name of the city, but I am passing the boundaries of our start/end coordinates.

Now that we have our city mpls tiles, we can add some layers, including our geom_segment() layer. The function scale_alpha() was a lifesaver here, and without that we wouldn’t have been able to aggregate our data in df.line as the whole plot would be either blacked out when using aes() or not determined by df.line$rides. The function theme_nothing() comes from ggmap and is useful for removing axis labels among other attributes.

I made the executive decision to aggregate data for geom_point() on the fly. Without aggregating, we would be left with multiple points for each start station.

Station Network

Some of those roundtrips threw off the scale, which is why sqrt() is used. You could substitute sqrt() for log(), but I found the results more palatable as written.

There are a couple of insights to be garnered from this plot. There is a nice little rectangle of trips down at the bottom along Minnehaha Creak, the chain of lanes, and then on Lake Street and either West River Parkway or Hiawatha. There is also a ton more activity in Minneapolis than St. Paul, as expected. This might also be a good reference for where to put additional NiceRide stations. You may want to first replace geom_segment() with geom_path() from the ggmap::route() function which I will go over in another, shorter tutorial.

Leaflet Interactive Map

This one also required a new dataframe. Essentially we are summarizing the start stations and end stations separately so we can see if there is a change in the heatmap when cycling between the two layers.

Hot Hot Heat

You don’t have to add two layers of heatmaps, but you can! I segmented the data between start and stop to see if there was any change. The function addLayersControl() will allow us to toggle between the two layers. You can tweak the parameters to your liking, but I found these settings produced a plot I was happy with.

There was some change, and it was negligible.

Bonus

Ride Calendar

Wrap Up

All in all there is a ton of data to go through, and plenty more can be done. Some ideas for those who want to continue this project…

  • Compare ridership by years
  • Append rides with weather information
  • Determine features with significant deviation
  • Use geom_path() and map suggested bike routes
  • Create static heatmaps using 2d density plots
  • Calendar Views

10 Comments

  1. Adi Sarid Adi Sarid March 3, 2019

    Very nice post! good work on the analysis and visualizations.
    What are the spikes in the ride frequency at June? (week 25 I think) and September (week 38)?

    P.S.: Axis titles!!!

  2. Richard Richard March 3, 2019

    Hello,

    I’ve got an error message when I run the below code.
    ————————————————————————
    ggplot(rides %>% filter(age 4,
    month % droplevels(),
    aes(x=mm, fill= age_bin)) +
    geom_density(alpha=.6) +
    scale_x_continuous(labels = c(“5am”,”8am”,”1:30pm”,”5pm”,”8pm”),
    breaks = c(300,480,750,1020,1200)) +
    labs(fill=””,title=”NiceRide 2018 Start Times”) +
    theme_fivethirtyeight() +
    theme(strip.background = element_rect(fill = “#FFFFFF”))+
    facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option=”A”)

    • matt matt March 3, 2019

      What was the error message? Did you make sure you have all of the libraries loaded?

      • Eszter Arato Eszter Arato March 3, 2019

        Hi All,

        I also got an error message at this chunk, it was this one:
        > ggplot(rides %>% filter(age 4,
        + month % droplevels(),
        + aes(x=mm, fill= age_bin)) +
        + geom_density(alpha=.6) +
        + scale_x_continuous(labels = c(“5am”,”8am”,”1:30pm”,”5pm”,”8pm”),
        + breaks = c(300,480,750,1020,1200)) +
        + labs(fill=””,title=”NiceRide 2018 Start Times”) +
        + theme_fivethirtyeight() +
        + theme(strip.background = element_rect(fill = “#FFFFFF”)) +
        + facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option=”A”)
        Error: Faceting variables must have at least one value

        • matt matt March 3, 2019

          Sounds like you may not have set start_day_type. Try table(rides$start_day_type) and see what you get.

          • Eszter Arato Eszter Arato March 3, 2019

            Hi Matt,

            thank you for your quick response! I tried your suggestion and here is the result:
            > table(rides$start_day_type)

            Weekday Weekend
            281226 131197

          • Eszter Arato Eszter Arato March 3, 2019

            I think it is strange that in the chunck previous to this one

            ggplot(data=rides[which(rides$age<=80),], aes(x=week, fill= age_bin)) +
            geom_histogram(alpha=.9,aes(y=..density..)) + theme_fivethirtyeight() + ggtitle("Ride Distribution by Week of Year") +
            geom_density(alpha=0,color=rgb(1,0,0,.4)) +
            facet_grid(vars(usertype), vars(age_bin)) + scale_fill_viridis_d()
            ggsave(filename = "ride-frequency-density.png",width = 8,units = "in")

            the faceting parameters are exactly the same, and that part of the code runs without any error.

          • matt matt March 4, 2019

            Maybe try removing that facet_grid() line and see what happens. They may just be an error that gets hit before and suppressed. While building the blog post I ran those segments again, so they should work.

          • Eszter Arato Eszter Arato March 4, 2019

            Sorry Mat, my last comment on the facet_grid() parameters being exactly the same in the preceding chunk is not correct. I made a mistake in choosing the right chunk to check on, and could not delete the comment once it was posted.

            Anyway, I left out the facet_grid() line from the part throwing the error, and now another error pops up:

            > ggplot(rides %>% filter(age 4,
            + month % droplevels(),
            + aes(x=mm, fill= age_bin)) +
            + geom_density(alpha=.6) +
            + scale_x_continuous(labels = c(“5am”,”8am”,”1:30pm”,”5pm”,”8pm”),
            + breaks = c(300,480,750,1020,1200)) +
            + labs(fill=””,title=”NiceRide 2018 Start Times”) +
            + theme_fivethirtyeight() +
            + theme(strip.background = element_rect(fill = “#FFFFFF”))
            Error in f(…, self = self) : Breaks and labels are different lengths
            > #+
            > # facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option=”A”)

            Apart from this, I wanted to thank you for the whole code. I find this case to be a beautiful analysis, very useful for my personal development. too.

  3. James James March 28, 2019

    The problem is the Month value, I changed as this code, it works.
    #Visualizing Start Time
    df=rides %>% filter(age % droplevels()

    ggplot(df,aes(x=mm, fill= age_bin)) +
    geom_density(alpha=.6) +
    scale_x_continuous(labels = c(“5am”,”8am”,”1:30pm”,”5pm”,”8pm”),
    breaks = c(300,480,750,1020,1200)) +
    labs(fill=””,title=”NiceRide 2018 Start Times”) +
    theme_fivethirtyeight() +
    theme(strip.background = element_rect(fill = “#FFFFFF”)) +
    facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option=”A”)

Leave a Reply

Your email address will not be published. Required fields are marked *