Another Bike-Share Data Post: Using Google’s BigQuery and Maps APIs to Investigate Austin Bike Data

Alex
Mar 26, 2020

Having seen a ton of these kinds of posts online, I felt I had to write a vanity post of my own. I decided to outline all the steps I took in building a “final product” to highlight some of the additional work involved in this (or any other) data project. One disclaimer: I haven’t edited the notebooks for clarity and cleanliness, so it might be harder to follow along with this post than with a typical code-along tutorial.

This project came about because I was interested in testing out Google’s BigQuery (BQ) and its included public datasets. In part because of all the bike-share stats posts I’d seen in the past, it made sense to look at the Austin bike-share dataset. The goals here are to illustrate pulling data with BQ, using the Distance Matrix API to grab distances, and then cutting up some of the data for additional analysis. At the end of the project, I’ll add in some maps to show off some additional functionality.

Part 1: Setting Up and Getting the Data

The first step in the process is to get the data — we’ll look at it in more detail later. To use the BQ API, we need to begin at the Google Cloud Console, and create a new project. (The relevant documentation is here.) The workflow as outlined on the docs is essentially:

  • Register a new project in the Cloud Console
  • Enable the APIs you’re going to use in the project. In this case we’ll be getting the data with the BQ API, plus a couple of the Google Maps APIs: Distance Matrix and Directions.
  • Get credentials and set up OAuth. You’ll get a JSON file with the proper credentials, which you’ll need to authenticate your API calls.

We should now have our credentials and, more or less, be ready to get working. Let’s get started by importing the modules we’ll need to get the data. We need to import a couple of modules specific to the BQ API, including six and bigquery (from google.cloud). I’m including snapshots of the Jupyter notebook I used along the way for reference (and practice for me).

Modules used at import — six, pandas, bigquery
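For reference, those imports look roughly like this (a minimal sketch; the BQ client class lives in the google.cloud package):

import six                          # compatibility shim used by the Google client libraries
import pandas as pd                 # DataFrames for the query results
from google.cloud import bigquery   # the BigQuery client itself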

So now that we have the tools we’ll be using, let’s actually get the data. I ran into some issues with the Google documentation, which recommends the following code:

from google_auth_oauthlib import flow

launch_browser = True   # set to False to use the console flow instead of a local browser
appflow = flow.InstalledAppFlow.from_client_secrets_file(
    'client_secrets.json',
    scopes=['https://www.googleapis.com/auth/bigquery'])

if launch_browser:
    appflow.run_local_server()
else:
    appflow.run_console()

credentials = appflow.credentials

I couldn’t get this flow to work, so I called appflow.run_console() directly, which takes you to a familiar Google OAuth screen. Because my app isn’t verified, I was redirected to another familiar Google screen:

Clicking through the Advanced menu will allow access in your notebook. Now, we can actually get our hands on some data.

We’ll keep this pretty basic at the outset. We’ll look at the Austin bike-share data available on BQ. First, let’s get our hands on the trip records. These cells show the basic syntax; there’s really nothing fancy about either the SQL or the Python. My goal was simply to demonstrate using the BQ API and then play around with the data retrieved. Note the quirk in the table syntax: we have to wrap the table name in back-ticks because the project name contains hyphens.
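Here’s a rough sketch of those cells, assuming the public dataset’s standard table names and a placeholder project ID for the client:

client = bigquery.Client(project='your-project-id', credentials=credentials)  # placeholder project ID

trips_query = """
    SELECT *
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
"""
stations_query = """
    SELECT *
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations`
"""

df = client.query(trips_query).to_dataframe()              # trip records
stations_df = client.query(stations_query).to_dataframe()  # station metadata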

We’ve now got two DataFrames — one for actual trips (1,166,411 rows) and one of the 96 unique stations in the Austin system.

Now that we have our data, we can start having some more fun with it. This is a pretty rich dataset, and opens up lots of possibilities for exploration and analysis. First, let’s do some basic EDA.

Part 2: Analysis and More

Since this is a post about, among other things, using Google APIs in the course of a project, let’s focus on some things we know Google has good APIs for: mapping and distance calculations.

We can already say some things about the average length of a bike trip in Austin with df.describe(), so let’s first see how long, on average, people are riding their bikes:

df['duration_minutes'].describe()

count    1.166411e+06
mean     2.877973e+01
std      1.123899e+02
min      0.000000e+00
25%      6.000000e+00
50%      1.200000e+01
75%      2.600000e+01
max      3.423800e+04

So it looks like the average trip is about 29 minutes long (the median is only 12 minutes, so the distribution is heavily right-skewed); one person took a trip that lasted 23.78 days.

Who’s riding these bikes? Turns out that in this dataset there are 74 different subscriber types. With that many factors, the data can be a bit cumbersome to work with. Since we’re not going for maximalism here, we can trim this down. I wanted to work with the fewest factors that accounted for most of the data. I ran the following code to get an idea of how to select those:
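That cell looked something along these lines (a sketch, assuming the column is named subscriber_type and that we keep the types that together cover roughly 90% of trips):

counts = df['subscriber_type'].value_counts()
share = counts.cumsum() / counts.sum()   # cumulative share of trips, most common types first

top_types = share[share <= 0.90].index.tolist()   # the handful of types covering ~90% of trips
df_top = df[df['subscriber_type'].isin(top_types)]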

Let’s make a histogram to check out the distribution.
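Something like the following, with trip duration broken out by those top subscriber types and long trips clipped so the plot stays readable (a sketch, not the exact cell):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
for sub_type in top_types:
    subset = df_top.loc[df_top['subscriber_type'] == sub_type, 'duration_minutes']
    subset.clip(upper=120).hist(bins=60, alpha=0.5, label=sub_type, ax=ax)   # cap at 2 hours
ax.set_xlabel('duration (minutes)')
ax.set_ylabel('trips')
ax.legend()
plt.show()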

So looks like there’s a pretty wide distribution in how folks are using the bikes. Weekenders tend to be concentrated in shorter trips, while users of the 24 hour kiosk option have a very wide distribution. Let’s see if we see the same pattern in the distances people are traveling.

For this, we can use the Distance Matrix API to build, well, a distance matrix. The API isn’t really designed for that; it’s built for what we typically think of when we think Google Maps: travel times and distances between specific origins and destinations for routing. It handles those tasks well, and we can repurpose it here to fill in an all-pairs matrix of biking distances between stations. It’s an expensive API, too: $5 / 1,000 calls, according to Google’s pricing guide. I’m using the free trial and have some credit left, so I won’t explicitly account for costs here, but we should be mindful of the pricing.

Part 3: Requests to the Distance Matrix API

Let’s get into the third part of this: finding the distance between each pair of stations. We’ll use a few familiar modules to make requests to the Distance Matrix API. From here, we have two options: 1) we can get a “Distance DF” we can use to map the actual biking distance travelled onto each trip; 2) we can build an actual distance matrix. We’ll take the second approach. This is a bit more involved than it might be because we want Google’s biking directions to fill in the values. If we were just looking at something like Euclidean (or even great-circle) distances, we could use some of the SciPy functionality to build the matrix, but that might be less fun/instructive.

First, we build our calls to the API. It’s basically a REST API, so we should expect a bit of URL formatting and JSON parsing to come up. First things first, we import the modules and set up some parameters:

import requests, json   # requests to call the API, json to parse and save the responses

url = 'https://maps.googleapis.com/maps/api/distancematrix/json?'   # base URL from Google's developer guide
key = "&key=[api_key]"       # your API key
how = "&mode=bicycling"      # we want biking routes (the API's mode parameter)

For the distance matrix, one of the articles I linked at the top of this post actually beat me to this idea, so credit where credit’s due. The approach in that article is really smart: we’re going to enumerate a list of URLs to call the API with. For example, we’ll take the first station, Capital Metro HQ - East 5th at Broadway (coordinates 30.2563, -97.71007), and create a URL with that station as the origin and all 94 other stations listed as destinations. We should get a loooong JSON object returned. Here’s the code I used to build the URLs, following the approach in the linked article.
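A sketch of that URL builder is below. I’m assuming the stations DataFrame has latitude and longitude columns (those column names are placeholders) and that we pass raw coordinates straight into the origins and destinations parameters:

def build_url(origin_idx, stations):
    # one origin, every station as a destination; the API separates destinations with '|'
    coords = [f"{row.latitude},{row.longitude}" for row in stations.itertuples()]
    origin = "origins=" + coords[origin_idx]
    destinations = "&destinations=" + "|".join(coords)
    return url + origin + destinations + how + key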

We can use the build_url function to call the API and build up our distance matrix.

The approach I’m following actually used curl instead of the Python wrapper to get results, so what I outline here will be slightly different, but the basic steps will be the same:

  • Call the API
  • Save the JSON output
  • Construct a DataFrame (which is our distance matrix) out of the saved files

Let’s get to building that. This code sends a request to the Distance Matrix API for a given station, gets back the distances between that station and all the others in the system, and saves the output to a separate folder.
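A minimal version of that loop, using the build_url helper sketched above and naming each output file by the station’s row index:

import os

os.makedirs('json_output', exist_ok=True)   # folder for the raw API responses

for i in range(len(stations_df)):
    response = requests.get(build_url(i, stations_df))
    with open(f'json_output/station_{i}.json', 'w') as f:
        json.dump(response.json(), f)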

So now we should have 95 JSON files to be the foundation of our distance matrix. We can check that we actually have all the required files with the os module:

import os

os.chdir("<path/to/json_output_folder>")
len(os.listdir())
# output: 95

Looks like we have all the required files, so let’s go ahead and build the distance matrix. What we’re going to do is build out the matrix row by row, create another matrix that is the transpose of the first, and add the two together. Hopefully it’s a bit clearer in the code:

(Have a look at this Stack Overflow question on the sum of matrices and their transposes for proof that this will indeed produce a symmetric matrix.)
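Here’s a sketch of that construction. I’m assuming each saved file holds a single origin row with every station as a destination, in the same order as the stations DataFrame, and that we keep only the upper triangle before adding the transpose (otherwise every distance would be counted twice):

import json
import numpy as np
import pandas as pd

names = list(stations_df['name'])
n = len(names)
raw = np.zeros((n, n))

for i in range(n):
    with open(f'json_output/station_{i}.json') as f:
        result = json.load(f)
    # one origin row per file; each element holds the biking distance in metres
    raw[i, :] = [el['distance']['value'] for el in result['rows'][0]['elements']]

upper = np.triu(raw, k=1)   # zero out the diagonal and lower triangle
distance_df = pd.DataFrame(upper + upper.T, index=names, columns=names)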

Now we have our distance matrix and can compute the *minimum* distance travelled during each trip. Note that this only acts as a lower bound, because riders could be taking circuitous routes to/from each station before docking again.

With our distance DataFrame, we can add in the (minimum) distance travelled per trip with

stacked = distance_df.stack()   # long format: (start station, end station) -> distance
stacked.name = 'distance'       # the join needs a named Series
df = df.join(stacked, on = ['start_station_name', 'end_station_name'])

And here’s a preview of the final product:

Let’s check the distance distribution and see how that compares with the duration distribution we generated earlier:

To put some numbers behind this, here are the median durations and distances for the most active subscriber types:
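Something along these lines, grouping the trimmed data by subscriber type (again assuming the subscriber_type column name):

summary = (df[df['subscriber_type'].isin(top_types)]
           .groupby('subscriber_type')[['duration_minutes', 'distance']]
           .median()
           .sort_values('distance', ascending=False))
print(summary)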

So Weekenders tend to go the farthest (people probably have a specific destination in mind that’s farther than they’d typically travel).

Wrapping up

The primary aim of this post was to illustrate the steps involved in seeing a project through, from acquiring the data to playing around with some of the fun things we can get out of it. There are obviously other things we could explore, and the posts I linked at the head of this article go on to do some pretty cool things (estimating rider speed, looking at the difference in trip duration among subscriber types, predicting rides based on the weather), but I’ll leave it here for now.
