Hi. I’m Sharon Machlis at IDG, here with episode 39 of Do More With R: points in polygons.
Calculating which polygon a point belongs to has all sorts of uses. For example, this powers apps that tell you where you vote or who your legislators are. In this episode, I’ll use points in polygons to answer a simple question: What Zip Code is that street address?
We’ll need just three things for this. 1) A way to geocode an address to get its latitude and longitude. 2) Shapefiles that outline Zip Code boundaries, and 3) the sf package.
For geocoding, I usually use the geocode-i-o API. It’s free for 2,500 lookups a day, but you need an API key to use it. To get around that bit of complexity for this video, I’ll use the free, open-source Open Street Map API . It doesn’t require a key. The tmaptools package has a function, geocode_OSM(), to use that API.
Let’s start coding.
I’ll load the packages I want: sf, tmaptools and also tmap and dplyr. Next, I’ll create a vector with two addresses: our IDG office in Framingham, Massachusetts on Old Connecticut Path; and the RStudio office in Boston. And then I’ll geocode them.
For Zip Code boundaries, the US Census Bureau has one “Zip Code Tabulation Area” file for the entire US. It’s a pretty big file. Instead, I went to censusreporter.org and searched for data just for Massachusetts by Zip Code Area and downloaded that.
I’ve already done the download to save some time. (Note: You could also use the tigris package to download a Census shapefile.)
I could unzip my downloaded file manually, but it’s easier in R. Here I use base R’s unzip() function. That junkpaths = TRUE argument says I don’t want unzip adding another subdirectory based on the name of the zip file.
OK, now finally some geospatial work. I’ll import the shapefile into R using sf’s st_read() function.
Look at the console. There’s some information displayed there, and at least one of those data points is important: the “e.p.s.g”. That says what coordinate reference system was used to create the file. Here it was 4326. Without getting too deep into the weeds, an e.p.s.g. basically says what system was used to translate areas on a three-dimensional globe – the Earth – to two-dimensional coordinates (latitude and longitude). This is important because there are a lot of different coordinate reference systems. I want my Zip Code polygons and my address points to use the same one, so they line up properly.
This file happens to include a polygon for the entire state of Massachusetts, which I don't need. So I'll filter out that Massachusetts row with dplyr::filter().
This next bit of code isn’t necessary, but it’s a nice check of my shapefile: I do a quick plot of my zip code sf object with tmap’s qtm() (for quick theme map). And it looks like my geometry makes sense for Zip Codes.
Now I want to use the geocoded address data. This is the one part that gave me trouble when I first started using sf. Right now, the latitude and longitude are stored in a plain data frame. But this needs to be converted into an sf geospatial object – and with the right coordinate system.
I can do that with sf’s st_as_sf() function. (sf functions that operate on spatial data generally start with st_. If you’re wondering, that stands for “spatial” and “temporal”.).
st_as_sf() takes several arguments. In this code, the first argument is the object to transform, my geocoded addresses. Second, I’m telling sf which columns have the x (longitude) and y (latitude) values. Third, I’m setting the coordinate reference system to 4326, so it’s the same as my Zip Code polygons.
Now that I’ve set up my two data sets, calculating the Zip Code for each address is super easy: I just use sf’s st_join() function. That uses this syntax: The point sf object, the polygon sf object, and the type of join. In this case, I’m running st_join() on the geocoded points first and the Zip Code polygons second. That order means all points are included, but only Zip Codes that match. And my join type is st_within.
That’s it! Now if I look at my results you can see that Zip Codes were assigned to both addresses.
If you’d like to map that, here’s how you would do it with tmap .
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at go dot infoworld dot com slash more with R, all lowercase except for the R
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel -- where you can subscribe so you never miss an episode. Hope to see you next time!