
How to use R with BigQuery

InfoWorld | Jul 8, 2021

Learn how to analyze data in BigQuery with R and the bigrquery package.


Hi, I’m Sharon Machlis at IDG, here with Episode 61 of Do More With R: Use Google BigQuery data in R.
Do you need to analyze data in BigQuery as part of an R workflow? Thanks to the bigrquery R package, it's a pretty seamless experience. Set up a Google Cloud project, make a couple of very small tweaks to your code, and you’re good to go. Let’s take a look.
You’ll need a general Google account – many people already have one to use with services like Google Drive or Gmail. If you don’t have one yet, go to Google.com and create one.
Then, head to Google’s cloud console at console dot cloud dot google dot com and log in with your Google account. If you haven’t used Google Cloud before, you’ll need to create a new cloud project.
If you work in RStudio, you probably know that using projects is a good idea, though it’s optional there. In Google Cloud, however, projects are mandatory.
Click the New Project option to create a new project.
Once you finish your new project setup, you’ll see the general Google Cloud dashboard. It can be a bit overwhelming if you’ve never used Google Cloud before. What are all these things? I just want to use BigQuery, where is it?
Fortunately, you can “pin” BigQuery to the top of your left navigation menu. (If you don’t see a left nav, click the 3-line “hamburger” at the very top left to open it.) I can scroll all the way down, find BigQuery, hover my mouse over it until I see a pin icon, and click the pin.
Now BigQuery will always show up at the top of my Google Cloud Console left nav. Click on it, and you’ll get to the BigQuery console. You should see the name of your project and no data inside.
If you don’t see an Editor tab, click on the “Compose New Query” button at the top right.
So now what? People often start learning BigQuery by playing with an available public data set. You can “pin” other users' public data projects to your own project, including some data sets collected by Google. If I go to the URL for Google’s public data set in the same BigQuery browser tab I’m working in, the Google public data project should automatically pin itself to my project.
Now if I click on the triangle next to public data, I’ll see all data sets available there. A BigQuery data set is like a conventional database: It has one or more data tables. Click on the triangle next to a data set to see the table(s) it contains.
Clicking on a table lets you see its schema, along with a tab for previewing data.
But before we start playing around with this data, it’s important to understand how BigQuery pricing works.
BigQuery charges for both data storage and data queries.
If you use a data set created by someone else, like this one, they pay for the storage. If you create and store your own data in BigQuery, you pay, and the rate is the same whether you’re the only one using it, you share it with a few other people, or you make it public. You get 10 gigabytes a month of free storage.
Note that if you analyze someone else’s data and store your results in BigQuery, the new table becomes part of your storage allocation.
The price of a query is based on how much data the query processes and not how much data is returned. This is the key point about pricing. If your query returns the top 10 results after analyzing a 4-gigabyte data set, the query will still use 4 gigabytes of your analysis quota. It’s not important that the query only returned a tiny amount of data in those 10 rows. What matters is how much data was analyzed.
You get 1 terabyte of data queries free each month. Each additional terabyte is $5. So if your data set is 4 gigs, you still have a lot of free queries. If your query is running through 400 gigs, not so much.
Google advises never running a query on all columns in your data set, unless you really have to. Instead, query only the specific columns you need. That cuts down the data that’ll be processed – so less cost, plus your query will run faster.
So if you’re wondering, “How can I possibly know how much data my query will use before it runs?”, there’s an easy answer. In the BigQuery cloud editor, you can type a query without running it and see how much data it will process.
Remember to use the syntax `project name dot data set name dot table name` for the table in your query, not just the table name.
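For instance, here’s roughly what that looks like for the query in this episode. (A sketch: the `homeTeamName` column and the DISTINCT are my assumptions based on the home-team query we’ll run later in R; swap in your own column names.)

```r
# SQL you can paste into the BigQuery editor to preview the bytes processed.
# Note the fully qualified name -- project.dataset.table -- wrapped in backticks.
sql <- "
SELECT DISTINCT homeTeamName
FROM `bigquery-public-data.baseball.schedules`
"
```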
I don’t need to run the query. Looking at the line at the top right, I see that the query would run through 21.1 kilobytes of data. Whatever else my R code might do, the query cost shouldn’t change as long as I’m only selecting that column.
Queries on metadata are free, but you need to make sure your query is actually doing that. For example, SELECT COUNT(*), which gets the number of rows in a table, isn’t charged.
You don’t need to enter a credit card as a billing source to start using BigQuery. But without billing enabled, not all queries will work. I suggest adding billing to your account even if you don’t expect to exceed your free quotas. I’ll switch over to an existing project that has billing already enabled.
Now - finally! - let’s look at some R.
To work with BigQuery data in R, Step 1 is setting up a connection to a data set. In this first code block I load the bigrquery and dplyr packages, and use the dbConnect() function to create that connection.
The first argument here is the bigquery() function from the bigrquery package. That tells dbConnect I want to connect to a BigQuery data source. The other arguments outline the project ID, data set name, and billing project ID. (I stored my billing ID in an R system environment variable, but you could just put the project ID in a regular character string, too).
`bigquery-public-data` is the project argument because that’s where the data set lives. My project ID is the billing argument because my project's quota will be "billed" for queries.
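Here’s a minimal sketch of that first code block. (The environment variable name BIGQUERY_BILLING_PROJECT is my own choice; use whatever name you stored your billing project ID under, or just a plain character string.)

```r
library(bigrquery)
library(dplyr)

con <- DBI::dbConnect(
  bigquery(),                        # connect to a BigQuery data source
  project = "bigquery-public-data",  # the project where the data set lives
  dataset = "baseball",              # the data set we want to query
  billing = Sys.getenv("BIGQUERY_BILLING_PROJECT")  # my project pays for queries
)
```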
Nothing much happens when I run this code, except creating the connection variable. But the first time I try to use the connection, I'll be asked to authenticate my Google account in a browser window.
For example, to list all available tables in the baseball data set, I’d use the dbListTables() function on my connection.
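Assuming the connection object is named con, as above, that’s just:

```r
# Lists the tables in the connected data set as a character vector;
# for the baseball data set, that includes the schedules table used below.
DBI::dbListTables(con)
```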
To query one specific BigQuery table in R, use dplyr's tbl() function. That creates a table object referencing the table.
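A sketch for the schedules table, using the skeds variable name you’ll see below:

```r
# Creates a lazy reference to the schedules table; no data is downloaded yet.
skeds <- tbl(con, "schedules")
```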
If you use the base R str() command to find out more about skeds, you’ll see a list, not a data frame.
Fortunately, dplyr functions like glimpse() often work pretty seamlessly with a table object.
That’s mostly what you would expect -- except glimpse() doesn’t know how many rows are in the data.
In fact, glimpse() looks like it’s querying metadata. And I haven’t seen any data billed for glimpse() queries.
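Putting those two inspection commands together:

```r
str(skeds)      # a list with classes like tbl_BigQueryConnection / tbl_lazy,
                # not a data frame
glimpse(skeds)  # shows column names and types; the row count appears as ??
```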
You can run most dplyr commands on a table variable almost like they’re a regular data frame. But there’s usually one important addition: Pipe your dplyr results into the collect() function to actually get results.
This first block of code uses dplyr’s select() function to see what unique home team names are in the skeds table, and saves results to an R variable.
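Something like this (the `homeTeamName` column name is my assumption about the schedules schema):

```r
available_teams <- skeds %>%
  select(homeTeamName) %>%   # query only the one column we need
  distinct() %>%             # unique home team names
  collect()                  # run the query and return a local data frame
```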
When I checked a similar query using a SQL statement, the BigQuery Web editor showed that only 21.1 kilobytes of data were processed, not more than 10 megabytes.

Why so much more in R? In this case, it’s because queries have a 10-megabyte minimum.
If you want to store the results of an R query in a temporary BigQuery table instead of a local data frame, use the compute() function at the end of your pipe instead of collect(). But you’d need to be working in a project where you have permission to create tables, and Google’s public data project definitely is not that.
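In a project where you do have table-creation permission, the pipe would end like this instead (a hypothetical sketch; it will fail against the public data project):

```r
available_teams_bq <- skeds %>%
  select(homeTeamName) %>%
  distinct() %>%
  compute()   # stores results in a BigQuery table instead of downloading them
```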
If you run the same code without collect() or compute(), you are saving the query and not the query results. `available_teams` is now a query object.
You can run a saved query object by using its name alone in your script.
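For example:

```r
# No collect() or compute() at the end: this saves the query definition only.
available_teams <- skeds %>%
  select(homeTeamName) %>%
  distinct()

# Running the saved query object by name executes it and prints the results.
available_teams
```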
Here’s something super useful: You can see the underlying SQL being generated by your dplyr statements! Just use show_query() at the end of your chain instead of collect().
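Same pipe as before, with show_query() at the end:

```r
skeds %>%
  select(homeTeamName) %>%
  distinct() %>%
  show_query()   # prints the generated SQL instead of running the query
```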
You can cut and paste this SQL into the BigQuery Web interface to see how much data your specific query will process. Just remember to change the plain table name, such as `schedules`, to include the project name and data set name.
If you run the same exact query a second time in your R session, you won’t be billed again because BigQuery will use cached results.
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit-dot-l-y slash do more with R, all lowercase except for the R.

You can also find the Do More With R playlist on YouTube’s IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time. Stay healthy and safe, everyone!