ggplot2 package is known for its beautiful plots in R, such as histograms, boxplots, scatter plots etc. However, one can do geospatial mapping as well with this package. Here we’ll first build the data step by step using map_data and fips data, and then use the geom_polygon that will enable us to perform geospatial mapping.
The ggplot2::map_data function has the polygon data of longitudes, and latitudes imported from the maps package.
# Import libraries
library(tidyverse)
library(data.table)
library(DT)
library(knitr)
library(stringr)
Let’s first check whether there are some counties which have special endings such as: city, county, parish.
x = map_data('county') %>% as.data.table()
setnames(x, c('region', 'subregion'), c('state','county'))
d_city = x[county %>% str_detect('city'), .(group = unique(group)), keyby = .(state, county)]
d_city
## state county group
## 1: maryland baltimore city 1183
## 2: missouri st louis city 1545
## 3: nevada carson city 1731
## 4: virginia charles city 2809
## 5: virginia james city 2837
x[county %>% str_detect('county'), .N]
## [1] 0
x[county %>% str_detect('parish'), .N]
## [1] 0
v_without_city = str_replace(d_city$county, 'city', '') %>%
str_trim(side = 'right')
d_city[, without_city := v_without_city]
v = list()
for (i in seq(nrow(d_city)) ) {
counties = x[state == d_city[i = i, state], unique(county)]
present = str_detect(string = counties, pattern = d_city$without_city[i])
v[[i]] = data.table(state = d_city$state[i], county = counties[present])
}
rbindlist(v) %>% datatable()
Two counties have a plain name, and same name ending with “city”, namely St Louis City, and Baltimore City. We need to keep this in mind when working with fips data, and other datasets in the future. Since, the “city” ending is unique and is separate area from the corresponding plain name, they will not be removed from their names.
fun_load_polygon = function(map_type) {
x = map_data(map = map_type) %>% as.data.table()
setnames(x, c('region', 'subregion'), c('state', 'county'))
x2 = x[, .(state, county)] %>%
map_df(.f = str_to_lower) %>%
map_df(.f = str_replace_all, pattern = '[[:punct:]]', replacement = '') %>%
map_df(.f = str_replace_all, pattern = '[[:space:]]', replacement = '') %>%
as.data.table()
if (map_type == 'county') {
x2[state == 'districtofcolumbia' & county == 'washington',
county := 'districtofcolumbia']
x2[state == 'montana' & county == 'yellowstonenational',
county := 'yellowstone']
}
x$state = x2$state
x$county = x2$county
return(x)
}
polygon_state = fun_load_polygon('state')
polygon_county = fun_load_polygon('county')
polygon_state[1:1000, ] %>% datatable(caption = 'State Polygon', rownames = F)
polygon_county[1:1000, ] %>% datatable(caption = 'County Polygon', rownames =F)
# polygon_county[state == 'montana' & county == 'yellowstone']
# polygon_county[state == 'montana' & county == 'park']
The punctutations and spacings have been removed, and the names have been converted to all lower case. The aim of this is to make the datasets as universal as possible, so that they can be merged with other datasets easily.
One interesting fact that I found out is that county Chesapeake in Virginia is not present in the county polygon data, but is present in state polygon data. However, Chesapeake in state polygon data is named wrong, and here together with Chincoteague represents the northern peninsula of Virginia that stretches to Maryland. It is actually the counties of Accomack, and Northampton. This peninsula is shown in the map below. So, it should not be confused with the real Chesapeake.
ggplot(data = polygon_state[state == 'virginia'], mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = 'ivory1', color = "limegreen", size = 0.1) +
geom_polygon(data = polygon_state[county %in% c('chesapeake', 'chincoteague')], col = 'black', size = 2) +
geom_polygon(data = polygon_county[state == 'virginia' & county %in% c('accomack', 'northampton')], aes(fill = county)) +
coord_map('mercator') +
theme_bw()
The county name “Washington” within the state of “District of Columbia” has been renamed to “District of Columbia”, and the county name “Yellowstone National” has been renamed to “Yellowstone”, as these names are more common in databases. Depending on your data that contains the values to be analyzed, one can play with the names of the counties to match them. Even better, using a common unique identifier can help in the process and decrease the chances of making an error. Fips is such a unique identifier used in the geographical mapping of state and county data. States and counties each have their own unique fips code.
Next step will involve the process of preparing the fips data. Maps package in R contains the county fips dataset.
library(maps)
data("county.fips")
head(county.fips) %>% kable()
fips | polyname |
---|---|
1001 | alabama,autauga |
1003 | alabama,baldwin |
1005 | alabama,barbour |
1007 | alabama,bibb |
1009 | alabama,blount |
1011 | alabama,bullock |
county.fips = str_split(string = county.fips$polyname, pattern = ',', simplify = T) %>%
as.data.frame() %>%
map_df(.f = str_to_lower) %>%
map_df(.f = str_replace_all, pattern = '[[:punct:]]', replacement = '') %>%
map_df(.f = str_replace_all, pattern = '[[:space:]]', replacement = '') %>%
cbind(county.fips$fips) %>%
as.data.table() %>%
setnames(c('state', 'county', 'fips'))
county.fips %>% datatable()
Let’s check whether the fips data has some common suffixes at the end of counties:
# Suspecting for some counties having suffixes such as:
x = county.fips
d_city = x[county %>% str_detect('city'), .SD]
d_city
## state county fips
## 1: maryland baltimorecity 24510
## 2: missouri stlouiscity 29510
## 3: nevada carsoncity 32510
## 4: virginia charlescity 51036
## 5: virginia jamescity 51095
d_county = x[county %>% str_detect('county'), .SD]
d_county
## Empty data.table (0 rows) of 3 cols: state,county,fips
d_parish = x[county %>% str_detect('parish'), .SD]
d_parish
## Empty data.table (0 rows) of 3 cols: state,county,fips
d_main = x[county %>% str_detect('main'), .SD]
d_main
## state county fips
## 1: florida okaloosamain 12091
## 2: northcarolina currituckmain 37053
## 3: texas galvestonmain 48167
## 4: virginia accomackmain 51001
## 5: washington piercemain 53053
fun_counties_with_extra_names = function(d = county.fips, d_extra, extra = 'main') {
v_without = str_sub(d_extra$county, 1, str_length(d_extra$county) - str_length(extra))
counties_extra = list_along(v_without)
for (i in seq_along(v_without)) {
counties_extra[[i]] = d[county %>% str_detect(v_without[i]) &
state == d_extra[i, state], .SD]
}
return(data.table::rbindlist(counties_extra))
}
Counties in the fips set do not end with either county or parish.
Counties that have city as their ending:
extra = 'city'
county_extra = fun_counties_with_extra_names(d = county.fips,
d_extra = d_city, extra)
county_extra %>% datatable()
The counties that have a plain name, and same name ending with city are unique, and so do not constitute a problem.
Counties that have main as their ending:
extra = 'main'
county_extra = fun_counties_with_extra_names(d = county.fips,
d_extra = d_main, extra)
county_extra %>% datatable()
The counties that have a plain name with various suffixes, and one ending with main are identical areas, indicated by the same fips code, and so do constitute a problem. We will need to get rid of these.
Let’s first get rid of the extra entries for counties with main.
fun_clean_counties_with_extra_names = function(d, d_extra, extra = 'main') {
# get rid of any county that has "extra" and other suffixes
d = d[!fips %in% county_extra$fips]
# get rid of "extra" part of counties
d_extra$county = str_sub(d_extra$county, 1, str_length(d_extra$county) - str_length(extra))
# append it to county.fips data
return( rbind(d, d_extra) )
}
extra = 'main'
county.fips = fun_clean_counties_with_extra_names(d = county.fips,
d_extra = d_main, extra)
Now, let’s check whether there are any other duplicates.
dup = county.fips[duplicated(x = county.fips$fips), fips ]
county.fips[fips %in% dup, ]
## state county fips
## 1: louisiana stmartinnorth 22099
## 2: louisiana stmartinsouth 22099
## 3: montana park 30067
## 4: montana yellowstonenational 30067
## 5: washington sanjuanlopezisland 53055
## 6: washington sanjuanorcasisland 53055
## 7: washington sanjuansanjuanisland 53055
county.fips[county %in% c('yellowstone', 'park') & state == 'montana', ] %>% kable()
state | county | fips |
---|---|---|
montana | park | 30067 |
montana | yellowstone | 30111 |
Indeed there are other duplicate entries. Let’s clean these remaining duplicates. It is found out Yellowstone in Montana has fips code 30111. From other datasets it is also verified that county Park in Montana has fips code 30067. Thus, for the repeated values in Montana, the one to be kept is Park.
county.fips[fips == 22099, county := 'stmartin']
county.fips[fips == 30067, county := 'park']
county.fips[fips == 53055, county := 'sanjuan']
county.fips = county.fips[!duplicated(county.fips)]
county.fips[duplicated(county.fips$fips)]
## Empty data.table (0 rows) of 3 cols: state,county,fips
county.fips[state == 'districtofcolumbia', county := 'districtofcolumbia']
Finally the fips dataset is clean and ready to be integrated into the polygon data.
Now, I will join the county polygon data with the fips code and save for later use. Since the polygon_county data’s counties match one to one with the ones at county.fips, we can join these to reach the desired dataset.
polygon_county[, .(county = unique(county)), by = state][, county] %>%
setdiff( county.fips[, county])
## character(0)
polygon_county = left_join(polygon_county, county.fips,
by = c('state', 'county')) %>% as.data.table()
polygon_county[is.na(fips)]
## Empty data.table (0 rows) of 7 cols: long,lat,group,order,state,county...
polygon_county[1:1000] %>% datatable()
Good. We have cleaned the fips data and joined with the cleaned polygon data.
The final step involves computing the center latitude and longitude of states and counties. This sort of data can be useful when we want to plot something in the center of states or counties.
# Input data is the polygon data
# d has long and lat variables for the state-county combinations (or fips)
fun_get_centers = function(d) {
# d[, .(atan2( sum(cos(long)), sum(cos(lat)) ) )]
d = copy(d)
d[, `:=`(long_rad = long*pi/180, lat_rad = lat*pi/180)]
xyz = d[, .( x = mean( cos(lat_rad) * cos(long_rad) ),
y = mean( cos(lat_rad) * sin(long_rad) ),
z = mean( sin(lat_rad) ) ), by = .(state, county, group)]
# central_long = atan2(y,x)
# central_square_root = sqrt(x^2 + y^2)
# central_lat = atan2(z, central_square_root)
xyz = xyz[, {
cent_long = atan2(y,x)
cent_sqrt = sqrt(x^2 + y^2)
cent_lat = atan2(z, cent_sqrt)
cent_lat = cent_lat * 180/pi
cent_long = cent_long * 180/pi
list(long = cent_long,
lat = cent_lat)
}, by = .(state, county, group)]
return(xyz)
}
center_county = fun_get_centers(polygon_county)
center_state = fun_get_centers(polygon_state)
center_county = left_join(center_county, county.fips) %>% as.data.table()
dup = center_county[duplicated(center_county$fips), fips]
center_county[fips %in% dup,] %>% kable()
state | county | group | long | lat | fips |
---|---|---|---|---|---|
florida | okaloosa | 335 | -86.56789 | 30.58918 | 12091 |
florida | okaloosa | 336 | -86.45881 | 30.43240 | 12091 |
louisiana | stmartin | 1128 | -91.72565 | 30.23448 | 22099 |
louisiana | stmartin | 1129 | -91.28414 | 29.84270 | 22099 |
montana | yellowstone | 1620 | -108.18860 | 46.02341 | 30111 |
montana | yellowstone | 1621 | -110.75128 | 44.98961 | 30111 |
northcarolina | currituck | 1883 | -75.95462 | 36.52606 | 37053 |
northcarolina | currituck | 1884 | -75.96341 | 36.35219 | 37053 |
northcarolina | currituck | 1885 | -75.70764 | 36.11839 | 37053 |
texas | galveston | 2575 | -95.04942 | 29.46282 | 48167 |
texas | galveston | 2576 | -94.56639 | 29.52865 | 48167 |
virginia | accomack | 2790 | -75.31457 | 37.94509 | 51001 |
virginia | accomack | 2791 | -75.69053 | 37.72515 | 51001 |
washington | pierce | 2917 | -122.05552 | 47.02515 | 53053 |
washington | pierce | 2918 | -122.66869 | 47.31016 | 53053 |
washington | sanjuan | 2919 | -122.82764 | 48.45774 | 53055 |
washington | sanjuan | 2920 | -122.83491 | 48.64715 | 53055 |
washington | sanjuan | 2921 | -123.02817 | 48.52786 | 53055 |
center_county[is.na(fips)]
## Empty data.table (0 rows) of 6 cols: state,county,group,long,lat,fips
# fwrite(polygon_county, str_c(files_dir, 'polygon_county.csv'))
# fwrite(polygon_state, str_c(files_dir, 'polygon_state.csv'))
# fwrite(center_county, str_c(files_dir, 'center_county.csv'))
# fwrite(center_state, str_c(files_dir, 'center_state.csv'))
It should be noted that 8 counties are divided by waterways, and thus have different groupings of polygon values. They are not duplicate records. This can also be seen in the center_state data. There is also no missing fips code in the polygon set. So the center_county and center_state datasets are clean and ready for analysis.