Previously, I took the fips data from the maps package, joined it with the polygon data, and we were ready to perform geographical mapping.
I have also found fips data published by the US Census Bureau. Here we will explore that dataset, clean it, and see whether it is more useful than the fips data shipped with the maps package.
As we'll see shortly, this dataset is more convoluted and requires more steps to get ready. Nevertheless, I'll still do it.
library(stringr)
library(knitr)
library(tidyverse)
library(data.table)
library(DT)
county.fips2 = fread(str_c(files_dir, 'fips_code_state_county.csv'), colClasses = "character")
county.fips2[, V5 := NULL]
setnames(county.fips2, c('state_alpha', 'state_fips', 'county_fips', 'county'))
county.fips2 = unite(data = county.fips2, col = 'fips', state_fips, county_fips, sep = '', remove = TRUE)
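As a quick sanity check on what this unite step is doing: a county's 5-digit FIPS code is simply the 2-digit state code followed by the 3-digit county code. A minimal sketch with made-up rows:

```r
library(stringr)

# A 5-digit county FIPS = 2-digit state FIPS + 3-digit county FIPS.
# Toy values for illustration:
state_fips  <- c("01", "01", "06")
county_fips <- c("001", "003", "037")
fips <- str_c(state_fips, county_fips)
fips
## [1] "01001" "01003" "06037"
```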
state_names = fread(str_c(files_dir, 'state_names.csv'))
county.fips2 = merge(county.fips2, state_names, by = 'state_alpha')
setnames(county.fips2, 'state_name', 'state')
county.fips2 = county.fips2[, c(1,4:5,3,2)]
county_clean = county.fips2 %>%
select(county, state) %>%
map_df(.f = str_to_lower) %>%
map_df(.f = str_replace_all, pattern = '[[:punct:]]', replacement = '') %>%
map_df(.f = str_replace_all, pattern = '[[:space:]]', replacement = '')
county.fips2$county = county_clean$county
county.fips2$state = county_clean$state
county.fips2[, fips := as.integer(fips)]
# Check for duplicates
county.fips2[duplicated(county.fips2$fips), ]
## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips
After importing the fips data, I did some basic cleaning by removing punctuation and spaces. There are no duplicated entries in this dataset. Before anything else, I'd like to see whether the number of entries and the fips values match the previous fips data that we obtained from the maps package.
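To make that cleaning step concrete, here is what the three transformations do to a few made-up names:

```r
library(stringr)
library(magrittr)

# Lower-case, then strip punctuation, then strip whitespace --
# the same transformations applied to county and state above.
raw <- c("St. Mary's", "De Witt", "O'Brien")
clean <- raw %>%
  str_to_lower() %>%
  str_replace_all('[[:punct:]]', '') %>%
  str_replace_all('[[:space:]]', '')
clean
## [1] "stmarys" "dewitt" "obrien"
```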
# county.fips 3075 points
county.fips2[, .N]
## [1] 3235
county.fips2[, .(unique(state_alpha))][, .N]
## [1] 57
The number of rows is different, but that's because county.fips2, obtained from the Census Bureau, includes Alaska, Hawaii, and the overseas territories. The county.fips data from the maps package covers only the 48 contiguous states and the District of Columbia; it has neither Alaska nor Hawaii. This alone might be a good reason to use the Census Bureau data for anyone who needs those states.
The following are the states present in the fips data obtained from Census Bureau, but not available in the one obtained from maps package:
- AK: Alaska
- HI: Hawaii
- MP: Northern Mariana Islands
- PR: Puerto Rico
- AS: American Samoa
- VI: U.S. Virgin Islands
- GU: Guam
- UM: U.S. Minor Outlying Islands
What if we exclude these states and only check for contiguous USA?
non_contiguous = c('AK', 'HI', 'MP', 'PR', 'AS', 'VI', 'GU', 'UM')
# Number of counties in 48 contiguous states
county.fips2[!state_alpha %in% non_contiguous, .N]
## [1] 3109
OK! So there are 34 more counties in the dataset obtained from the US Census Bureau. That data is likely more up-to-date, which might be another good reason to use it instead of the one in the maps package.
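To actually list the extra counties, one could diff the two fips vectors. A toy sketch of the pattern (the real comparison would use county.fips$fips from the maps package against county.fips2$fips):

```r
# Toy fips vectors standing in for the two datasets:
census_fips <- c(1001L, 1003L, 2013L)
maps_fips   <- c(1001L, 1003L)

# Codes present in the Census data but missing from maps:
extra <- setdiff(census_fips, maps_fips)
extra
## [1] 2013
```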
Let’s now check for some suspected suffixes or prefixes and see whether they are present.
d_city = county.fips2[str_detect(county, 'city')]
d_city %>% datatable()
d_borough = county.fips2[str_detect(county, 'borough')]
d_borough %>% datatable()
d_county = county.fips2[str_detect(county, 'county')]
d_county %>% datatable()
d_parish = county.fips2[str_detect(county, 'parish')]
d_parish %>% datatable()
d_muni = county.fips2[str_detect(county, 'municip')]
d_muni %>% datatable()
d_main = county.fips2[str_detect(county, 'main')]
d_main
## Empty data.table (0 rows) of 5 cols: state_alpha,state,state_ansi,county,fips
Investigation results:
1. 'borough' is part of the name in Florida and New Hampshire (Hillsborough); the Alaska boroughs are true suffixes.
2. 'city' is part of the name for James City and Charles City in Virginia; the rest can be discarded.
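The 'city' rule in point 2 can be sketched on toy names before applying it to the real table: strip the token everywhere except in the two names where it belongs.

```r
library(stringr)

# Keep 'city' in James City / Charles City, strip it elsewhere
# (richmondcity stands in for a Virginia independent city).
x <- c("jamescity", "charlescity", "richmondcity")
keep <- x %in% c("jamescity", "charlescity")
x[!keep] <- str_replace_all(x[!keep], "city", "")
x
## [1] "jamescity" "charlescity" "richmond"
```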
# Remove the "county" and "parish" suffixes from the county names
county.fips2[, county := county %>%
               str_replace_all('county', '') %>%
               str_replace_all('parish', '')]
# Clean "borough", "censusarea" and "and" from the Alaska entries
county.fips2[state == 'alaska', county := county %>%
               str_replace_all('borough', '') %>%
               str_replace_all('censusarea', '') %>%
               str_replace_all('and', '')]
# Clean "municipality" (Alaska) / "municipio" (Puerto Rico)
county.fips2[, county := str_replace_all(county, 'municipality|municipio', '')]
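One thing to watch here: matching only the bare stem 'municip' would leave the residue 'ality' (Alaska) or 'io' (Puerto Rico) behind, so it is worth matching the full words:

```r
library(stringr)

# Made-up cleaned names illustrating the two suffix variants:
x <- c("anchoragemunicipality", "adjuntasmunicipio")

str_replace_all(x, 'municip', '')                 # leaves residue
## [1] "anchorageality" "adjuntasio"

str_replace_all(x, 'municipality|municipio', '')  # clean removal
## [1] "anchorage" "adjuntas"
```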
# Clean "city"
county.fips2[state == 'virginia' & !county %in% c('jamescity', 'charlescity'),
             county := str_replace_all(county, 'city', '')]
Finally, the county.fips2 dataset is clean. There might be a few remaining issues, but those are left to the user who is interested in this dataset.
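One remaining issue I would expect (my own check, not part of the original cleaning): Virginia has independent cities that share a name with a county, such as Richmond, Fairfax, and Roanoke, so stripping 'city' can produce duplicate state + county keys. A toy sketch of the check:

```r
library(data.table)

# After stripping 'city', Richmond city and Richmond County collide:
dt <- data.table(state  = c("virginia", "virginia", "alaska"),
                 county = c("richmond", "richmond", "anchorage"))

# Rows whose state + county key already appeared earlier in the table:
dupes <- dt[duplicated(dt, by = c("state", "county"))]
dupes  # the second "virginia / richmond" row
```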