The code here is part of my work studying the impact of anti-immigration laws on employment. To study the effect of an event, a standard approach is to compare two groups that are broadly similar, where one received the treatment (e.g., passage of an anti-immigration law) and the other did not. We identify the event’s impact from the difference in the two groups’ outcomes before and after treatment.
For our purpose, we construct a series of county pairs in which one county is in a state that passed an anti-immigration law, and the other is in a neighboring state without similar laws. We ensure the two counties in each pair are highly alike by computing similarity scores based on each county’s industrial composition and keeping only the top 5% of pairs by score. Here is how I did it:
library(data.table)  # fread, fast joins, keyed lookups
library(dplyr)       # sample_n

setwd('E:/Github/proj1')
qwi = fread('qwiNeeded.csv')
qwibad = readRDS('qwi_datacheck.rds')

# Keep all-race, all-ethnicity, all-education aggregates only
qwi = qwi[race == 'A0' & ethnicity == 'A0' & education == 'E0']

# Drop county-industry cells flagged as bad in the data check
qwi = merge(qwi, qwibad, by = c('geo', 'industry'), all.x = TRUE)
qwi = qwi[bad == 0]
names(qwi)[7] = 'Emp'  # rename the employment column
sample_n(qwi, 10)
## geo industry dt race ethnicity education Emp HirA HirAS
## 1: 47081 52 2012-07-01 A0 A0 E0 46 NA NA
## 2: 13085 42 2011-01-01 A0 A0 E0 1046 130 95
## 3: 39027 72 2011-04-01 A0 A0 E0 1618 415 94
## 4: 28001 71 2010-04-01 A0 A0 E0 418 53 25
## 5: 17069 48-49 2014-07-01 A0 A0 E0 80 8 7
## 6: 28115 23 2011-07-01 A0 A0 E0 360 60 28
## 7: 13177 52 2010-10-01 A0 A0 E0 146 19 8
## 8: 21213 23 2011-04-01 A0 A0 E0 316 78 16
## 9: 35025 53 2014-07-01 A0 A0 E0 638 108 46
## 10: 13287 92 2010-07-01 A0 A0 E0 267 26 28
## statefipsn cntyfipsn bad
## 1: 47 81 0
## 2: 13 85 0
## 3: 39 27 0
## 4: 28 1 0
## 5: 17 69 0
## 6: 28 115 0
## 7: 13 177 0
## 8: 21 213 0
## 9: 35 25 0
## 10: 13 287 0
naics = unique(qwi$industry)           # industry codes present in the data
naicsdt = data.table(industry = naics)
naicsdt = naicsdt[industry != '00']    # drop the all-industry total
geo = unique(qwi$geo)
edu = unique(qwi$education)
time = unique(qwi$dt)
Some counties in qwi are missing one or more industries; we add those combinations back and fill the cells with NA:
qwiIndus = qwi[dt=='2010-01-01' , .(geo, industry, Emp)]
framework = as.data.table(expand.grid(geo = unique(qwiIndus$geo), industry = unique(qwiIndus$industry)))
qwiIndus = merge(framework, qwiIndus, by = c('geo','industry'), all.x = T)
qwiIndus = qwiIndus[,statefipsn := floor(geo/1000)][,cntyfipsn := geo - statefipsn*1000] # get state/cnty fips
sample_n(qwiIndus, 10)
## geo industry Emp statefipsn cntyfipsn
## 1: 1105 22 NA 1 105
## 2: 21165 21 NA 21 165
## 3: 18161 00 1306 18 161
## 4: 47087 31-33 169 47 87
## 5: 8051 92 614 8 51
## 6: 13079 00 1638 13 79
## 7: 18121 00 3272 18 121
## 8: 26043 52 345 26 43
## 9: 16081 42 67 16 81
## 10: 39119 23 897 39 119
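The expand.grid-plus-merge pattern above can be seen on a toy example (hypothetical mini-data, not the QWI file): crossing every county with every industry and left-joining the observed rows leaves NA in the combinations that were missing.

```r
library(data.table)

# Hypothetical mini-dataset: county 1 is missing industry '22',
# county 2 is missing industries '11' and '23'
obs = data.table(geo = c(1, 1, 2),
                 industry = c('11', '23', '22'),
                 Emp = c(10, 20, 30))

# Cross every county with every industry, then left-join the observed rows
frame = as.data.table(expand.grid(geo = unique(obs$geo),
                                  industry = unique(obs$industry),
                                  stringsAsFactors = FALSE))
full = merge(frame, obs, by = c('geo', 'industry'), all.x = TRUE)
full  # 6 rows; the 3 unobserved combinations have Emp = NA
```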
Here we create a variable called pctInState that captures each county’s industrial composition. We first compute each industry’s share of a county’s total employment (EmpShareCounty), then convert that share into its percentile rank among counties in the same state; that percentile rank is stored in pctInState. This variable will later be used to calculate two counties’ similarity score:
# Each industry's share of the county's total employment
qwiIndus = qwiIndus[, EmpTotCounty := sum(Emp, na.rm = TRUE), by = .(geo)]
qwiIndus = qwiIndus[, EmpShareCounty := Emp/EmpTotCounty]
# Percentile rank of that share among counties in the same state
qwiIndus = qwiIndus[!is.na(EmpShareCounty),
                    pctInState := rank(EmpShareCounty)/length(EmpShareCounty),
                    by = c('statefipsn','industry')]
qwiIndus = qwiIndus[order(statefipsn, industry, EmpShareCounty)]
setkey(qwiIndus, geo, industry)  # key used later for fast lookups in matchScores
head(qwiIndus, 20)
## geo industry Emp statefipsn cntyfipsn EmpTotCounty EmpShareCounty
## 1: 1001 00 12364 1 1 24658 0.501419418
## 2: 1001 11 169 1 1 24658 0.006853759
## 3: 1001 21 43 1 1 24658 0.001743856
## 4: 1001 22 133 1 1 24658 0.005393787
## 5: 1001 23 601 1 1 24658 0.024373429
## 6: 1001 31-33 1305 1 1 24658 0.052924000
## 7: 1001 42 262 1 1 24658 0.010625355
## 8: 1001 44-45 2064 1 1 24658 0.083705086
## 9: 1001 48-49 288 1 1 24658 0.011679779
## 10: 1001 51 104 1 1 24658 0.004217698
## 11: 1001 52 321 1 1 24658 0.013018087
## 12: 1001 53 165 1 1 24658 0.006691540
## 13: 1001 54 356 1 1 24658 0.014437505
## 14: 1001 55 NA 1 1 24658 NA
## 15: 1001 56 268 1 1 24658 0.010868684
## 16: 1001 61 1453 1 1 24658 0.058926109
## 17: 1001 62 1488 1 1 24658 0.060345527
## 18: 1001 71 81 1 1 24658 0.003284938
## 19: 1001 72 1937 1 1 24658 0.078554627
## 20: 1001 81 439 1 1 24658 0.017803553
## pctInState
## 1: 0.5671642
## 2: 0.4696970
## 3: 0.5263158
## 4: 0.5102041
## 5: 0.5151515
## 6: 0.1791045
## 7: 0.3166667
## 8: 0.8805970
## 9: 0.2968750
## 10: 0.4347826
## 11: 0.5873016
## 12: 0.8260870
## 13: 0.7222222
## 14: NA
## 15: 0.3396226
## 16: 0.6865672
## 17: 0.5074627
## 18: 0.4705882
## 19: 0.9846154
## 20: 0.9491525
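The rank(x)/length(x) transform used above maps each value to its within-group percentile; a quick standalone check with toy numbers (not QWI values) shows how it behaves:

```r
# Toy employment shares for five counties in one state-industry cell
shares = c(0.02, 0.10, 0.05, 0.30, 0.07)

# rank()/length() converts shares to percentiles in (0, 1]:
# the largest share maps to exactly 1, the smallest to 1/n
pct = rank(shares) / length(shares)
pct  # 0.2 0.8 0.4 1.0 0.6
```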
The chunk below defines the function getCountyPairs, which takes a vector of state FIPS codes (the first being the treated state, the rest its controls) and generates all possible treated-control county pairs:
getCountyPairs = function(statelist) {
  treat_state = statelist[1]     # first element: treated state
  control_state = statelist[-1]  # remaining elements: control states
  treat_counties = unique(qwiIndus[statefipsn %in% treat_state, geo])
  control_counties = unique(qwiIndus[statefipsn %in% control_state, geo])
  # Cross join: every treated county paired with every control county
  pairs = as.data.table(expand.grid(county1 = treat_counties,
                                    county2 = control_counties))
  return(pairs)
}
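The pairing step itself is just a cross join; with hypothetical county lists (the real function pulls them from qwiIndus) it looks like this:

```r
library(data.table)

# Hypothetical county FIPS lists: two treated counties, three controls
treat_counties = c(1001, 1003)
control_counties = c(12033, 12091, 28059)

# Every treated county crossed with every control county
pairs = as.data.table(expand.grid(county1 = treat_counties,
                                  county2 = control_counties))
nrow(pairs)  # 2 x 3 = 6 candidate pairs
```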
The following chunk defines the function matchScores, which calculates the correlation between the industrial compositions (the pctInState vectors) of the two counties in each pair and uses that correlation as the similarity score:
matchScores = function(countyPairs) {
  scores = numeric()
  for (i in 1:nrow(countyPairs)) {
    # if (i %% 1000 == 0) {print(i)}  # optional progress indicator
    geo1 = countyPairs[i, county1]
    geo2 = countyPairs[i, county2]
    # Keyed lookups on qwiIndus (key: geo, industry)
    vec1 = qwiIndus[.(geo1), pctInState]
    vec2 = qwiIndus[.(geo2), pctInState]
    scores[i] = cor(vec1, vec2, use = 'pairwise.complete.obs', method = 'pearson')
  }
  return(scores)
}
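Since cor() accepts a matrix and computes all pairwise column correlations at once, the loop above could in principle be replaced by casting pctInState into an industry-by-county matrix and calling cor() a single time. A minimal sketch on toy data (hypothetical values, not the author's code):

```r
library(data.table)

# Toy pctInState values: three counties, three industries (hypothetical)
toy = data.table(geo = rep(c(101, 102, 201), each = 3),
                 industry = rep(c('11', '23', '52'), times = 3),
                 pctInState = c(0.2, 0.8, 0.5,
                                0.3, 0.7, 0.6,
                                0.9, 0.1, 0.4))

# Cast to a wide industry-by-county matrix, then one cor() call
wide = dcast(toy, industry ~ geo, value.var = 'pctInState')
m = as.matrix(wide[, -1])
simm = cor(m, use = 'pairwise.complete.obs')
simm['101', '102']  # similarity score for the pair (101, 102)
```

This trades the per-pair loop for one vectorized call, at the cost of holding the full county-by-county correlation matrix in memory.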
The states of interest are Indiana, Arizona, Utah, Alabama, Georgia, and South Carolina, all of which passed harsh anti-immigration laws in the early 2010s. The chunk below creates six vectors that specify these states (first element) and their neighboring states that do not have such laws:
stateIN = c(18,26,17,21,39)
stateAZ = c(4,32,8,35,6)
stateUT = c(49,32,16,56,8)
stateAL = c(1,12,28,47)
stateGA = c(13,37,12,47)
stateSC = c(45,37)
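In each vector, the first FIPS code is the treated state and the rest are its control neighbors; a small lookup using the standard state FIPS codes makes that explicit (the fips_names table here is my own illustration, not part of the original code):

```r
# Standard state FIPS codes for the states used in the vectors above
fips_names = c('1' = 'AL', '4' = 'AZ', '12' = 'FL', '13' = 'GA', '17' = 'IL',
               '18' = 'IN', '21' = 'KY', '26' = 'MI', '28' = 'MS', '39' = 'OH',
               '45' = 'SC', '47' = 'TN')

stateAL = c(1, 12, 28, 47)
fips_names[as.character(stateAL)]  # AL first, then control neighbors FL, MS, TN
```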
Now we can run the previously defined functions on a list of states. For each county in Alabama, how similar is it to the counties in the neighboring states? Let’s build that table:
pairs_AL = getCountyPairs(stateAL)
scores_AL = matchScores(pairs_AL)
scores_AL_dt = cbind(pairs_AL, scores_AL)
scores_AL_Jefferson = scores_AL_dt[county1 == 1073]
sample_n(scores_AL_Jefferson, 15)
## county1 county2 scores_AL
## 1: 1073 47187 0.77440535
## 2: 1073 28057 -0.39996481
## 3: 1073 28093 -0.18778672
## 4: 1073 28109 -0.10970364
## 5: 1073 47057 -0.58552775
## 6: 1073 47105 0.05074975
## 7: 1073 47183 0.18787882
## 8: 1073 12111 0.07783896
## 9: 1073 12029 -0.90976931
## 10: 1073 12127 0.33768272
## 11: 1073 47165 0.46295008
## 12: 1073 47103 -0.01739118
## 13: 1073 28129 0.09167980
## 14: 1073 47155 0.24747681
## 15: 1073 47071 0.19843300
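As described in the introduction, only the top 5% of pairs by score are kept for the matched sample. One way to apply that cutoff, sketched on a hypothetical score table with the same shape as scores_AL_dt (the exact selection code is not shown in this chunk):

```r
library(data.table)

# Hypothetical score table in the same shape as scores_AL_dt
set.seed(1)
toy_scores = data.table(county1 = 1073,
                        county2 = 1:200,
                        scores = runif(200, -1, 1))

# Keep only pairs whose score is in the top 5% of the distribution
cutoff = quantile(toy_scores$scores, 0.95, na.rm = TRUE)
top_pairs = toy_scores[scores >= cutoff]
nrow(top_pairs)  # roughly 5% of 200, i.e. about 10 pairs
```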
We can easily visualize the county pairs’ similarity scores on a map. Let’s use Jefferson County, AL as an example:
library(choroplethr)  # county_choropleth

z1 = scores_AL_Jefferson[, -1]    # drop county1; keep county2 and its score
names(z1) = c('region', 'value')  # column names required by county_choropleth
# Shade Alabama counties too: Jefferson itself gets 1, all others -1
z2 = data.table(region = unique(scores_AL_dt$county1), value = -1)
z2[z2$region == 1073, 2] = 1
z1 = rbind(z1, z2)
z1[is.na(z1$value), 2] = -1       # counties with no score also get -1
county_choropleth(z1, state_zoom = c('alabama', 'tennessee', 'mississippi', 'florida'),
                  legend = 'Similarity Scores')
We can also do this for other states. Here are some examples:
Note: we did not include Utah in the figure above because Utah also passed some anti-immigration bill around the same time.
Note: Again, we did not include Alabama and South Carolina in the figure above because these two passed some anti-immigration bill around the same time.