Introduction

My another mentor, Dr. Javier Mella-Barahona at Universidad de los Andes, invited me to help out with his research project that studies the effects of political cycles on the cross-section of U.S. corporate bond returns. For this project, we need to merge two data sets together: bond returns (contains bond issuing company’s SIC code) and US government spending (uses IO code to classify industries), and the SIC-IO concordance table plays a key role here.

The Standard Industry Classification system (SIC for short) and the industry code employed by the Input-Output Account (IO for short) by the Bureau of Economic Analysis (BEA) have a many-to-many relationship. For instance, SIC 138 has four IO matches: 80000, 110601, 110602, and 120215. Meanwhile, IO 80000 has three matches in SIC, which are 131, 132, and 138. In the raw concordance table (txt file) provided by BEA, each observation is a unique IO code with its SIC matches. However, we want the opposite table in which each row is a unique SIC code with the corresponding IO codes. This task is challenging because it requires extensive string manipulation and careful use of regular expression. Here is how I did it.

Overview of The Raw Table

Remove Headers and Sub-headers

The observations in the raw table contains the following:

  1. Header: “      AGRICULTURE, FORESTRY, AND FISHERIES”

    • There are white spaces in front of the actual information in each header.
    • All letters are CAPITAL.
  2. Subheader: “1 Livestock and livestock products:”

  3. IO-SIC Pairs: “    1.0100 Dairy farm products 024,*019, *0259, *029”

    • The first part “1.0100 Dairy farm products” is the IO classification code, and the second part shows the SIC codes that corresponds to this IO code. Remove the Headers and Sub-headers of Each Sector

Firstly, we get rid of the white spaces in front of the headers, so the observations that start with white space are all actual SIC-IO pairs, which will be used as the condition to filter out the rows we want: the SIC-IO pairs. The loop below does this job for us:

z = as.data.table(io_sic_1987_raw)
for (i in 1:nrow(z)) {
  if (!str_detect(z$X1[i], "[:lower:]") & !str_detect(z$X1[i], "[:digit:]")) {
    z$X1[i] = str_remove(z$X1[i], "^[:blank:]+")
  }
}
print(z$X1[1:25])
##  [1] "AGRICULTURE, FORESTRY, AND FISHERIES"                                                                   
##  [2] "1     Livestock and livestock products:"                                                                
##  [3] "       1.0100 Dairy farm products                                           024,*019, *0259, *029"      
##  [4] "       1.0200 Poultry and eggs                                              0251-3, *0259, *019, *0219" 
##  [5] "                                                                              *029"                     
##  [6] "       1.0301 Meat animals                                                  0211-4, *0219, *019, *0259 "
##  [7] "                                                                             *029"                      
##  [8] "       1.0302 Miscellaneous livestock                                      0271-3, *0279, *019, *0219," 
##  [9] "                                                                              *0259, *029"              
## [10] "2     Other agricultural products:"                                                                     
## [11] "       2.0100 Cotton                                                       0131, *019, *0219, *0259,"   
## [12] "                                                                             *029"                      
## [13] "       2.0201 Food grains                                                  *011, *019, *0219, *0259,"   
## [14] "                                                                             *029"                      
## [15] "       2.0202 Feed grains                                                  *011, *0139, *019, *0219,"   
## [16] "                                                                             *0259, *029"               
## [17] "       2.0203 Grass seeds                                                  *0139, *019, *0219, *0259,"  
## [18] "                                                                             *029"                      
## [19] "       2.0300 Tobacco                                                      0132, *019, *0219, *0259,"   
## [20] "                                                                             *029"                      
## [21] "       2.0401 Fruits                                                       0171-2, 0174-5, *0179, "     
## [22] "                                                                             *019, *0219, *0259, *029"  
## [23] "       2.0402 Tree nuts                                                    0173, *0179, *019, *0219, "  
## [24] "                                                                             *0259, *029"               
## [25] "       2.0501 Vegetables                                                   0134, *0139, 016, *019,"

Now only the headers and sub-headers do not start with white space, I can get rid of these rows using this condition:

z1 = rep(NA, nrow(z))
for (i in 1:nrow(z)) {
  if (str_detect(z$X1[i], "^[:blank:]")) {
    z1[i] = z$X1[i]
  }
}
z1 = z1[!is.na(z1)]
print(z1[1:25])
##  [1] "       1.0100 Dairy farm products                                           024,*019, *0259, *029"      
##  [2] "       1.0200 Poultry and eggs                                              0251-3, *0259, *019, *0219" 
##  [3] "                                                                              *029"                     
##  [4] "       1.0301 Meat animals                                                  0211-4, *0219, *019, *0259 "
##  [5] "                                                                             *029"                      
##  [6] "       1.0302 Miscellaneous livestock                                      0271-3, *0279, *019, *0219," 
##  [7] "                                                                              *0259, *029"              
##  [8] "       2.0100 Cotton                                                       0131, *019, *0219, *0259,"   
##  [9] "                                                                             *029"                      
## [10] "       2.0201 Food grains                                                  *011, *019, *0219, *0259,"   
## [11] "                                                                             *029"                      
## [12] "       2.0202 Feed grains                                                  *011, *0139, *019, *0219,"   
## [13] "                                                                             *0259, *029"               
## [14] "       2.0203 Grass seeds                                                  *0139, *019, *0219, *0259,"  
## [15] "                                                                             *029"                      
## [16] "       2.0300 Tobacco                                                      0132, *019, *0219, *0259,"   
## [17] "                                                                             *029"                      
## [18] "       2.0401 Fruits                                                       0171-2, 0174-5, *0179, "     
## [19] "                                                                             *019, *0219, *0259, *029"  
## [20] "       2.0402 Tree nuts                                                    0173, *0179, *019, *0219, "  
## [21] "                                                                             *0259, *029"               
## [22] "       2.0501 Vegetables                                                   0134, *0139, 016, *019,"     
## [23] "                                                                             *0219, *0259, *029, *0119" 
## [24] "       2.0502 Sugar crops                                                  0133, *019, *0219, *0259,"   
## [25] "                                                                             *029"

After removing all the headers and sub-headers, I can go ahead and delete the white spaces in front of the observations left:

for (i in 1:length(z1)) {
  z1[i] = str_remove(z1[i], "^      ")
}
z = as.data.table(z1)
print(z$z1[1:25])
##  [1] " 1.0100 Dairy farm products                                           024,*019, *0259, *029"      
##  [2] " 1.0200 Poultry and eggs                                              0251-3, *0259, *019, *0219" 
##  [3] "                                                                        *029"                     
##  [4] " 1.0301 Meat animals                                                  0211-4, *0219, *019, *0259 "
##  [5] "                                                                       *029"                      
##  [6] " 1.0302 Miscellaneous livestock                                      0271-3, *0279, *019, *0219," 
##  [7] "                                                                        *0259, *029"              
##  [8] " 2.0100 Cotton                                                       0131, *019, *0219, *0259,"   
##  [9] "                                                                       *029"                      
## [10] " 2.0201 Food grains                                                  *011, *019, *0219, *0259,"   
## [11] "                                                                       *029"                      
## [12] " 2.0202 Feed grains                                                  *011, *0139, *019, *0219,"   
## [13] "                                                                       *0259, *029"               
## [14] " 2.0203 Grass seeds                                                  *0139, *019, *0219, *0259,"  
## [15] "                                                                       *029"                      
## [16] " 2.0300 Tobacco                                                      0132, *019, *0219, *0259,"   
## [17] "                                                                       *029"                      
## [18] " 2.0401 Fruits                                                       0171-2, 0174-5, *0179, "     
## [19] "                                                                       *019, *0219, *0259, *029"  
## [20] " 2.0402 Tree nuts                                                    0173, *0179, *019, *0219, "  
## [21] "                                                                       *0259, *029"               
## [22] " 2.0501 Vegetables                                                   0134, *0139, 016, *019,"     
## [23] "                                                                       *0219, *0259, *029, *0119" 
## [24] " 2.0502 Sugar crops                                                  0133, *019, *0219, *0259,"   
## [25] "                                                                       *029"

Make Each IO-SIC Pair Occupies Only One Row

As we can see, some IO-SIC pairs can take two or three rows, but I want each pair only occupies one row, so I have to paste the multiple cells of one pair together. To do so, I create four columns. The first column control is the first 7 characters of the observation, the second column control2 is equal to 1 if the corresponding control cell is empty, which indicates that row does not start a new io-sic pair. The third column control3 equals to 2 if the row above it does not start a new pair, and the forth column control4 is the max of control2 and control3.

z = z[, control := substr(z1, 1, 7)]
z = z[control == '       ', control2 := 1]
z$control3 = NA
z$control2[is.na(z$control2)] = 0
for (i in 2:nrow(z)) {
  if (z$control2[i] == 1 & z$control2[i+1] == 1) {
    z$control3[i+1] = 2
  }
}
z$control3[is.na(z$control3)] = 0
z$control4 = mapply(max, z$control2, z$control3)
print(z[1:25])
##                                                                                                    z1
##  1:        1.0100 Dairy farm products                                           024,*019, *0259, *029
##  2:   1.0200 Poultry and eggs                                              0251-3, *0259, *019, *0219
##  3:                                                                                              *029
##  4:  1.0301 Meat animals                                                  0211-4, *0219, *019, *0259 
##  5:                                                                                              *029
##  6:   1.0302 Miscellaneous livestock                                      0271-3, *0279, *019, *0219,
##  7:                                                                                       *0259, *029
##  8:     2.0100 Cotton                                                       0131, *019, *0219, *0259,
##  9:                                                                                              *029
## 10:     2.0201 Food grains                                                  *011, *019, *0219, *0259,
## 11:                                                                                              *029
## 12:     2.0202 Feed grains                                                  *011, *0139, *019, *0219,
## 13:                                                                                       *0259, *029
## 14:    2.0203 Grass seeds                                                  *0139, *019, *0219, *0259,
## 15:                                                                                              *029
## 16:     2.0300 Tobacco                                                      0132, *019, *0219, *0259,
## 17:                                                                                              *029
## 18:       2.0401 Fruits                                                       0171-2, 0174-5, *0179, 
## 19:                                                                          *019, *0219, *0259, *029
## 20:    2.0402 Tree nuts                                                    0173, *0179, *019, *0219, 
## 21:                                                                                       *0259, *029
## 22:       2.0501 Vegetables                                                   0134, *0139, 016, *019,
## 23:                                                                         *0219, *0259, *029, *0119
## 24:     2.0502 Sugar crops                                                  0133, *019, *0219, *0259,
## 25:                                                                                              *029
##                                                                                                    z1
##     control control2 control3 control4
##  1:  1.0100        0        0        0
##  2:  1.0200        0        0        0
##  3:                1        0        1
##  4:  1.0301        0        0        0
##  5:                1        0        1
##  6:  1.0302        0        0        0
##  7:                1        0        1
##  8:  2.0100        0        0        0
##  9:                1        0        1
## 10:  2.0201        0        0        0
## 11:                1        0        1
## 12:  2.0202        0        0        0
## 13:                1        0        1
## 14:  2.0203        0        0        0
## 15:                1        0        1
## 16:  2.0300        0        0        0
## 17:                1        0        1
## 18:  2.0401        0        0        0
## 19:                1        0        1
## 20:  2.0402        0        0        0
## 21:                1        0        1
## 22:  2.0501        0        0        0
## 23:                1        0        1
## 24:  2.0502        0        0        0
## 25:                1        0        1
##     control control2 control3 control4

We leave only the data column and control4 column. Now if the value in control4 column is 0, the corresponding pair takes only one row. And if the value is 1, the next row is also part of the pair; and if the value is 2, then the next two rows are both part of this item. Knowing that, I can paste the multiple cells of one IO-SIC pair together:

z = z[, c("z1", "control4")]
names(z) = c("info","control")
z$temp = NA
for (i in 1:(nrow(z)-2)) {
  if (z$control[i] == 0 & z$control[i+1] == 0) {
    z$temp[i] = z$info[i]
  } else if (z$control[i] == 0 & z$control[i+1] == 1 & z$control[i+2] == 0) {
    z$temp[i] = paste0(z$info[i],z$info[i+1])
  } else if (z$control[i] == 0 & z$control[i+1] == 1 & z$control[i+2] == 2) {
    z$temp[i] = paste0(z$info[i],z$info[i+1],z$info[i+2])
  }
}

i = nrow(z)-1
if (z$control[i] == 0 & z$control[i+1] == 0) {
  z$temp[i] = z$info[i]
  z$temp[i+1] = z$info[i+1]
} else if (z$control[i] == 0 & z$control[i+1] == 1) {
  z$temp[i] = paste0(z$info[i],z$info[i+1])
}
print(z$temp[1:25])
##  [1] " 1.0100 Dairy farm products                                           024,*019, *0259, *029"                                                                                                 
##  [2] " 1.0200 Poultry and eggs                                              0251-3, *0259, *019, *0219                                                                        *029"                
##  [3] NA                                                                                                                                                                                            
##  [4] " 1.0301 Meat animals                                                  0211-4, *0219, *019, *0259                                                                        *029"                
##  [5] NA                                                                                                                                                                                            
##  [6] " 1.0302 Miscellaneous livestock                                      0271-3, *0279, *019, *0219,                                                                        *0259, *029"         
##  [7] NA                                                                                                                                                                                            
##  [8] " 2.0100 Cotton                                                       0131, *019, *0219, *0259,                                                                       *029"                   
##  [9] NA                                                                                                                                                                                            
## [10] " 2.0201 Food grains                                                  *011, *019, *0219, *0259,                                                                       *029"                   
## [11] NA                                                                                                                                                                                            
## [12] " 2.0202 Feed grains                                                  *011, *0139, *019, *0219,                                                                       *0259, *029"            
## [13] NA                                                                                                                                                                                            
## [14] " 2.0203 Grass seeds                                                  *0139, *019, *0219, *0259,                                                                       *029"                  
## [15] NA                                                                                                                                                                                            
## [16] " 2.0300 Tobacco                                                      0132, *019, *0219, *0259,                                                                       *029"                   
## [17] NA                                                                                                                                                                                            
## [18] " 2.0401 Fruits                                                       0171-2, 0174-5, *0179,                                                                        *019, *0219, *0259, *029" 
## [19] NA                                                                                                                                                                                            
## [20] " 2.0402 Tree nuts                                                    0173, *0179, *019, *0219,                                                                        *0259, *029"           
## [21] NA                                                                                                                                                                                            
## [22] " 2.0501 Vegetables                                                   0134, *0139, 016, *019,                                                                       *0219, *0259, *029, *0119"
## [23] NA                                                                                                                                                                                            
## [24] " 2.0502 Sugar crops                                                  0133, *019, *0219, *0259,                                                                       *029"                   
## [25] NA

Now remove the rows that are NAs and get rid of all the ‘*’ and ‘.’ in each observation:

z1 = z[!is.na(z$temp)]
z1 = z1[, temp]
z1 = str_remove_all(z1, "\\*")
z1 = str_remove_all(z1, "\\.")
print(z1[1:25])
##  [1] " 10100 Dairy farm products                                           024,019, 0259, 029"                                                                                              
##  [2] " 10200 Poultry and eggs                                              0251-3, 0259, 019, 0219                                                                        029"              
##  [3] " 10301 Meat animals                                                  0211-4, 0219, 019, 0259                                                                        029"              
##  [4] " 10302 Miscellaneous livestock                                      0271-3, 0279, 019, 0219,                                                                        0259, 029"        
##  [5] " 20100 Cotton                                                       0131, 019, 0219, 0259,                                                                       029"                 
##  [6] " 20201 Food grains                                                  011, 019, 0219, 0259,                                                                       029"                  
##  [7] " 20202 Feed grains                                                  011, 0139, 019, 0219,                                                                       0259, 029"            
##  [8] " 20203 Grass seeds                                                  0139, 019, 0219, 0259,                                                                       029"                 
##  [9] " 20300 Tobacco                                                      0132, 019, 0219, 0259,                                                                       029"                 
## [10] " 20401 Fruits                                                       0171-2, 0174-5, 0179,                                                                        019, 0219, 0259, 029"
## [11] " 20402 Tree nuts                                                    0173, 0179, 019, 0219,                                                                        0259, 029"          
## [12] " 20501 Vegetables                                                   0134, 0139, 016, 019,                                                                       0219, 0259, 029, 0119"
## [13] " 20502 Sugar crops                                                  0133, 019, 0219, 0259,                                                                       029"                 
## [14] " 20503 Miscellaneous crops                                          0119, 0139, 019, 0219,                                                                       0259, 029"           
## [15] " 20600 Oil bearing crops                                            0116, 0119, 0139, 0219,                                                                       0259, 029"          
## [16] " 20701 Forest products                                              018, 019, 0219, 0259,                                                                       029"                  
## [17] " 20702 Greenhouse and nursery products                              018, 019, 0219, 0259                                                                       029"                   
## [18] " 30001 Forestry products                                            081, 083, 097"                                                                                                    
## [19] " 30002 Commercial fishing                                           091"                                                                                                              
## [20] " 40001 Agricultural, forestry, and fishery services                 0254, 0279, 071-2,                                                                       075-6, 085, 092"         
## [21] " 40002 Landscape and horticultural services                         078"                                                                                                              
## [22] " 50000 Iron and ferroalloy ores                                     101, 106"                                                                                                         
## [23] " 60100 Copper ore                                                   102"                                                                                                              
## [24] " 60200 Nonferrous metal ores, except copper                         103-4, 109, 108"                                                                                                  
## [25] " 70000 Coal                                                         122-3, 124"

Modify SIC Code

We need 3-digit (or less than 3-digit) sic code, so I replace all 4-digit sic with its first three digits using regular expression to detect the 4-digit pattern and make the replacement. Because there may be multiple matches in one cell, so I let it run 10 times:

for (k in 1:10) {
  for (i in 1:length(z1)) {
    if (grepl('\\d\\d\\d\\d-\\d', z1[i])) {
      i.str = str_extract(z1[i],"\\d\\d\\d\\d-\\d")
      i.str = str_extract(i.str,"\\d\\d\\d")
      z1[i] = str_replace(z1[i], "\\d\\d\\d\\d-\\d", i.str)
    } 
    if (grepl('\\d\\d\\d\\d,', z1[i])) {  
      i.str = str_extract(z1[i],"\\d\\d\\d\\d,")
      i.str = str_extract(i.str,"\\d\\d\\d")
      z1[i] = str_replace(z1[i], "\\d\\d\\d\\d,", paste0(i.str,", "))
    }
    if (grepl('\\(.+\\)', z1[i])) {
      z1[i] = str_remove_all(z1[i], "\\(.+\\)")
    }
    if (grepl('\\d\\d\\d\\d$', z1[i])) {
      i.str = str_extract(z1[i],"\\d\\d\\d\\d$")
      i.str = str_extract(i.str,"\\d\\d\\d")
      z1[i] = str_replace(z1[i], "\\d\\d\\d\\d$", i.str)
    }
  }
}
print(z1[1:25])
##  [1] " 10100 Dairy farm products                                           024,019, 025,  029"                                                                                             
##  [2] " 10200 Poultry and eggs                                              025, 025,  019, 0219                                                                        029"                
##  [3] " 10301 Meat animals                                                  021, 021,  019, 0259                                                                        029"                
##  [4] " 10302 Miscellaneous livestock                                      027, 027,  019, 021,                                                                         025,  029"          
##  [5] " 20100 Cotton                                                       013,  019, 021,  025,                                                                        029"                
##  [6] " 20201 Food grains                                                  011, 019, 021,  025,                                                                        029"                 
##  [7] " 20202 Feed grains                                                  011, 013,  019, 021,                                                                        025,  029"           
##  [8] " 20203 Grass seeds                                                  013,  019, 021,  025,                                                                        029"                
##  [9] " 20300 Tobacco                                                      013,  019, 021,  025,                                                                        029"                
## [10] " 20401 Fruits                                                       017, 017, 017,                                                                         019, 021,  025,  029"     
## [11] " 20402 Tree nuts                                                    017,  017,  019, 021,                                                                         025,  029"         
## [12] " 20501 Vegetables                                                   013,  013,  016, 019,                                                                       021,  025,  029, 011"
## [13] " 20502 Sugar crops                                                  013,  019, 021,  025,                                                                        029"                
## [14] " 20503 Miscellaneous crops                                          011,  013,  019, 021,                                                                        025,  029"          
## [15] " 20600 Oil bearing crops                                            011,  011,  013,  021,                                                                        025,  029"         
## [16] " 20701 Forest products                                              018, 019, 021,  025,                                                                        029"                 
## [17] " 20702 Greenhouse and nursery products                              018, 019, 021,  0259                                                                       029"                  
## [18] " 30001 Forestry products                                            081, 083, 097"                                                                                                   
## [19] " 30002 Commercial fishing                                           091"                                                                                                             
## [20] " 40001 Agricultural, forestry, and fishery services                 025,  027,  071-2,                                                                       075-6, 085, 092"        
## [21] " 40002 Landscape and horticultural services                         078"                                                                                                             
## [22] " 50000 Iron and ferroalloy ores                                     101, 106"                                                                                                        
## [23] " 60100 Copper ore                                                   102"                                                                                                             
## [24] " 60200 Nonferrous metal ores, except copper                         103-4, 109, 108"                                                                                                 
## [25] " 70000 Coal                                                         122-3, 124"

By now, no more 4-digit sic is left, but we still have some issues to deal with. There are sic codes like 15-17 or 155-7, but I want them to be 15, 16, 17 and 155, 156, 157. The chunck below does the job. it first converts 15-17 to numeric items: 15 and 17, then creates a sequence that starts with 15 and end with 17, and finally convert the sequence back to a string as “15, 16, 17”. Again, there may be multiple matches in one cell, so I let it run 5 times:

for (k in 1:5) {
  for (i in 1:length(z1)) {
    if (grepl('-\\d,', z1[i])) {
      i.start = as.integer(substr(z1[i], regexpr('-\\d,', z1[i])[1]-1,regexpr('-\\d,', z1[i])[1]-1))
      i.end = as.integer(substr(z1[i], regexpr('-\\d,', z1[i])[1]+1, regexpr('-\\d,', z1[i])[1]+1))
      i.str = str_extract(z1[i],"[:digit:]+-")
      i.str = as.integer(str_remove(i.str, "-"))
      i.str1 = ""
      for (j in 0:(i.end - i.start)) {
        i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
      }
      z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:],", i.str1)
    }
    if (grepl('-\\d$', z1[i])) {
      i.start = as.integer(substr(z1[i], regexpr('-\\d$', z1[i])[1]-1,regexpr('-\\d$', z1[i])[1]-1))
      i.end = as.integer(substr(z1[i], regexpr('-\\d$', z1[i])[1]+1, regexpr('-\\d$', z1[i])[1]+1))
      i.str = str_extract(z1[i],"[:digit:]+-")
      i.str = as.integer(str_remove(i.str, "-"))
      i.str1 = ""
      for (j in 0:(i.end - i.start)) {
        i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
      }
      i.str1 = str_remove(i.str1, ", $")
      z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:]", i.str1)
    }
    if (grepl('-\\d\\d', z1[i])) {
      i.start = as.integer(substr(z1[i], regexpr('-\\d\\d', z1[i])[1]-1,regexpr('-\\d\\d', z1[i])[1]-1))
      i.end = as.integer(substr(z1[i], regexpr('-\\d\\d', z1[i])[1]+2, regexpr('-\\d\\d', z1[i])[1]+2))
      i.str = str_extract(z1[i],"[:digit:]+-")
      i.str = as.integer(str_remove(i.str, "-"))
      i.str1 = ""
      for (j in 0:(i.end - i.start)) {
        i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
      }
      i.str1 = str_remove(i.str1, ", $")
      z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:]+", i.str1)
    }
  }
}
z1 = as.data.table(z1)
print(z1$z1[1:25])
##  [1] " 10100 Dairy farm products                                           024,019, 025,  029"                                                                                             
##  [2] " 10200 Poultry and eggs                                              025, 025,  019, 0219                                                                        029"                
##  [3] " 10301 Meat animals                                                  021, 021,  019, 0259                                                                        029"                
##  [4] " 10302 Miscellaneous livestock                                      027, 027,  019, 021,                                                                         025,  029"          
##  [5] " 20100 Cotton                                                       013,  019, 021,  025,                                                                        029"                
##  [6] " 20201 Food grains                                                  011, 019, 021,  025,                                                                        029"                 
##  [7] " 20202 Feed grains                                                  011, 013,  019, 021,                                                                        025,  029"           
##  [8] " 20203 Grass seeds                                                  013,  019, 021,  025,                                                                        029"                
##  [9] " 20300 Tobacco                                                      013,  019, 021,  025,                                                                        029"                
## [10] " 20401 Fruits                                                       017, 017, 017,                                                                         019, 021,  025,  029"     
## [11] " 20402 Tree nuts                                                    017,  017,  019, 021,                                                                         025,  029"         
## [12] " 20501 Vegetables                                                   013,  013,  016, 019,                                                                       021,  025,  029, 011"
## [13] " 20502 Sugar crops                                                  013,  019, 021,  025,                                                                        029"                
## [14] " 20503 Miscellaneous crops                                          011,  013,  019, 021,                                                                        025,  029"          
## [15] " 20600 Oil bearing crops                                            011,  011,  013,  021,                                                                        025,  029"         
## [16] " 20701 Forest products                                              018, 019, 021,  025,                                                                        029"                 
## [17] " 20702 Greenhouse and nursery products                              018, 019, 021,  0259                                                                       029"                  
## [18] " 30001 Forestry products                                            081, 083, 097"                                                                                                   
## [19] " 30002 Commercial fishing                                           091"                                                                                                             
## [20] " 40001 Agricultural, forestry, and fishery services                 025,  027,  71, 72,                                                                        75, 76,  085, 092"    
## [21] " 40002 Landscape and horticultural services                         078"                                                                                                             
## [22] " 50000 Iron and ferroalloy ores                                     101, 106"                                                                                                        
## [23] " 60100 Copper ore                                                   102"                                                                                                             
## [24] " 60200 Nonferrous metal ores, except copper                         103, 104,  109, 108"                                                                                             
## [25] " 70000 Coal                                                         122, 123,  124"

Divide z1 into two columns: io and info (includes industry names and SIC codes)

z1$io = str_sub(z1$z1, 1, 7)
z1$info = str_sub(z1$z1, 8, )
z1 = z1[, c("io","info")]
print(z1[1:25])
##          io
##  1:  10100 
##  2:  10200 
##  3:  10301 
##  4:  10302 
##  5:  20100 
##  6:  20201 
##  7:  20202 
##  8:  20203 
##  9:  20300 
## 10:  20401 
## 11:  20402 
## 12:  20501 
## 13:  20502 
## 14:  20503 
## 15:  20600 
## 16:  20701 
## 17:  20702 
## 18:  30001 
## 19:  30002 
## 20:  40001 
## 21:  40002 
## 22:  50000 
## 23:  60100 
## 24:  60200 
## 25:  70000 
##          io
##                                                                                                                                                                              info
##  1:                                                                                              Dairy farm products                                           024,019, 025,  029
##  2:                 Poultry and eggs                                              025, 025,  019, 0219                                                                        029
##  3:                 Meat animals                                                  021, 021,  019, 0259                                                                        029
##  4:           Miscellaneous livestock                                      027, 027,  019, 021,                                                                         025,  029
##  5:                 Cotton                                                       013,  019, 021,  025,                                                                        029
##  6:                  Food grains                                                  011, 019, 021,  025,                                                                        029
##  7:            Feed grains                                                  011, 013,  019, 021,                                                                        025,  029
##  8:                 Grass seeds                                                  013,  019, 021,  025,                                                                        029
##  9:                 Tobacco                                                      013,  019, 021,  025,                                                                        029
## 10:      Fruits                                                       017, 017, 017,                                                                         019, 021,  025,  029
## 11:          Tree nuts                                                    017,  017,  019, 021,                                                                         025,  029
## 12: Vegetables                                                   013,  013,  016, 019,                                                                       021,  025,  029, 011
## 13:                 Sugar crops                                                  013,  019, 021,  025,                                                                        029
## 14:           Miscellaneous crops                                          011,  013,  019, 021,                                                                        025,  029
## 15:          Oil bearing crops                                            011,  011,  013,  021,                                                                        025,  029
## 16:                  Forest products                                              018, 019, 021,  025,                                                                        029
## 17:                   Greenhouse and nursery products                              018, 019, 021,  0259                                                                       029
## 18:                                                                                                    Forestry products                                            081, 083, 097
## 19:                                                                                                              Commercial fishing                                           091
## 20:     Agricultural, forestry, and fishery services                 025,  027,  71, 72,                                                                        75, 76,  085, 092
## 21:                                                                                                              Landscape and horticultural services                         078
## 22:                                                                                                         Iron and ferroalloy ores                                     101, 106
## 23:                                                                                                              Copper ore                                                   102
## 24:                                                                                              Nonferrous metal ores, except copper                         103, 104,  109, 108
## 25:                                                                                                   Coal                                                         122, 123,  124
##                                                                                                                                                                              info

Remove all the unnecessary information in the IO code:

z1$io = str_remove_all(z1$io, "\\.")
z = z1$info
z = str_replace_all(z, ",", " ")
z = str_remove_all(z, "[:punct:]+")
z = str_remove_all(z, "[:alpha:]+")
z = str_remove(z, '^[:blank:]+')
print(z[1:25])
##  [1] "024 019  025   029"                                                                                              
##  [2] "025  025   019  0219                                                                        029"                 
##  [3] "021  021   019  0259                                                                        029"                 
##  [4] "027  027   019  021                                                                          025   029"          
##  [5] "013   019  021   025                                                                         029"                
##  [6] "011  019  021   025                                                                         029"                 
##  [7] "011  013   019  021                                                                         025   029"           
##  [8] "013   019  021   025                                                                         029"                
##  [9] "013   019  021   025                                                                         029"                
## [10] "017  017  017                                                                          019  021   025   029"     
## [11] "017   017   019  021                                                                          025   029"         
## [12] "013   013   016  019                                                                        021   025   029  011"
## [13] "013   019  021   025                                                                         029"                
## [14] "011   013   019  021                                                                         025   029"          
## [15] "011   011   013   021                                                                         025   029"         
## [16] "018  019  021   025                                                                         029"                 
## [17] "018  019  021   0259                                                                       029"                  
## [18] "081  083  097"                                                                                                   
## [19] "091"                                                                                                             
## [20] "025   027   71  72                                                                         75  76   085  092"    
## [21] "078"                                                                                                             
## [22] "101  106"                                                                                                        
## [23] "102"                                                                                                             
## [24] "103  104   109  108"                                                                                             
## [25] "122  123   124"

Final Output

Assume that there are at most 12 SIC codes corresponding to a certain IO code, we can separate the SIC character column into numeric columns where each new column contains one SIC code:

z = data.table(io = z1$io, sic = z)
nsic = 12
z1 = as.data.table(str_split_fixed(z$sic, "[:blank:]+", n = nsic))
z = cbind(z$io, z1)
z_names = vector(mode = "character", length = nsic)
for (i in 1:nsic) {
  z_names[i] = paste0("sic",i)
}
z_names = c("io", z_names)
names(z) = z_names
print(z[1:25])
##          io sic1 sic2 sic3 sic4 sic5 sic6 sic7 sic8 sic9 sic10 sic11 sic12
##  1:  10100   024  019  025  029                                           
##  2:  10200   025  025  019 0219  029                                      
##  3:  10301   021  021  019 0259  029                                      
##  4:  10302   027  027  019  021  025  029                                 
##  5:  20100   013  019  021  025  029                                      
##  6:  20201   011  019  021  025  029                                      
##  7:  20202   011  013  019  021  025  029                                 
##  8:  20203   013  019  021  025  029                                      
##  9:  20300   013  019  021  025  029                                      
## 10:  20401   017  017  017  019  021  025  029                            
## 11:  20402   017  017  019  021  025  029                                 
## 12:  20501   013  013  016  019  021  025  029  011                       
## 13:  20502   013  019  021  025  029                                      
## 14:  20503   011  013  019  021  025  029                                 
## 15:  20600   011  011  013  021  025  029                                 
## 16:  20701   018  019  021  025  029                                      
## 17:  20702   018  019  021 0259  029                                      
## 18:  30001   081  083  097                                                
## 19:  30002   091                                                          
## 20:  40001   025  027   71   72   75   76  085  092                       
## 21:  40002   078                                                          
## 22:  50000   101  106                                                     
## 23:  60100   102                                                          
## 24:  60200   103  104  109  108                                           
## 25:  70000   122  123  124                                                
##          io sic1 sic2 sic3 sic4 sic5 sic6 sic7 sic8 sic9 sic10 sic11 sic12