My another mentor, Dr. Javier Mella-Barahona at Universidad de los Andes, invited me to help out with his research project that studies the effects of political cycles on the cross-section of U.S. corporate bond returns. For this project, we need to merge two data sets together: bond returns (contains bond issuing company’s SIC code) and US government spending (uses IO code to classify industries), and the SIC-IO concordance table plays a key role here.
The Standard Industry Classification system (SIC for short) and the industry code employed by the Input-Output Account (IO for short) by the Bureau of Economic Analysis (BEA) have a many-to-many relationship. For instance, SIC 138 has four IO matches: 80000, 110601, 110602, and 120215. Meanwhile, IO 80000 has three matches in SIC, which are 131, 132, and 138. In the raw concordance table (txt file) provided by BEA, each observation is a unique IO code with its SIC matches. However, we want the opposite table in which each row is a unique SIC code with the corresponding IO codes. This task is challenging because it requires extensive string manipulation and careful use of regular expression. Here is how I did it.
The observations in the raw table contains the following:
Header: “ AGRICULTURE, FORESTRY, AND FISHERIES”
Subheader: “1 Livestock and livestock products:”
IO-SIC Pairs: “ 1.0100 Dairy farm products 024,*019, *0259, *029”
Firstly, we get rid of the white spaces in front of the headers, so the observations that start with white space are all actual SIC-IO pairs, which will be used as the condition to filter out the rows we want: the SIC-IO pairs. The loop below does this job for us:
z = as.data.table(io_sic_1987_raw)
for (i in 1:nrow(z)) {
if (!str_detect(z$X1[i], "[:lower:]") & !str_detect(z$X1[i], "[:digit:]")) {
z$X1[i] = str_remove(z$X1[i], "^[:blank:]+")
}
}
print(z$X1[1:25])
## [1] "AGRICULTURE, FORESTRY, AND FISHERIES"
## [2] "1 Livestock and livestock products:"
## [3] " 1.0100 Dairy farm products 024,*019, *0259, *029"
## [4] " 1.0200 Poultry and eggs 0251-3, *0259, *019, *0219"
## [5] " *029"
## [6] " 1.0301 Meat animals 0211-4, *0219, *019, *0259 "
## [7] " *029"
## [8] " 1.0302 Miscellaneous livestock 0271-3, *0279, *019, *0219,"
## [9] " *0259, *029"
## [10] "2 Other agricultural products:"
## [11] " 2.0100 Cotton 0131, *019, *0219, *0259,"
## [12] " *029"
## [13] " 2.0201 Food grains *011, *019, *0219, *0259,"
## [14] " *029"
## [15] " 2.0202 Feed grains *011, *0139, *019, *0219,"
## [16] " *0259, *029"
## [17] " 2.0203 Grass seeds *0139, *019, *0219, *0259,"
## [18] " *029"
## [19] " 2.0300 Tobacco 0132, *019, *0219, *0259,"
## [20] " *029"
## [21] " 2.0401 Fruits 0171-2, 0174-5, *0179, "
## [22] " *019, *0219, *0259, *029"
## [23] " 2.0402 Tree nuts 0173, *0179, *019, *0219, "
## [24] " *0259, *029"
## [25] " 2.0501 Vegetables 0134, *0139, 016, *019,"
Now only the headers and sub-headers do not start with white space, I can get rid of these rows using this condition:
z1 = rep(NA, nrow(z))
for (i in 1:nrow(z)) {
if (str_detect(z$X1[i], "^[:blank:]")) {
z1[i] = z$X1[i]
}
}
z1 = z1[!is.na(z1)]
print(z1[1:25])
## [1] " 1.0100 Dairy farm products 024,*019, *0259, *029"
## [2] " 1.0200 Poultry and eggs 0251-3, *0259, *019, *0219"
## [3] " *029"
## [4] " 1.0301 Meat animals 0211-4, *0219, *019, *0259 "
## [5] " *029"
## [6] " 1.0302 Miscellaneous livestock 0271-3, *0279, *019, *0219,"
## [7] " *0259, *029"
## [8] " 2.0100 Cotton 0131, *019, *0219, *0259,"
## [9] " *029"
## [10] " 2.0201 Food grains *011, *019, *0219, *0259,"
## [11] " *029"
## [12] " 2.0202 Feed grains *011, *0139, *019, *0219,"
## [13] " *0259, *029"
## [14] " 2.0203 Grass seeds *0139, *019, *0219, *0259,"
## [15] " *029"
## [16] " 2.0300 Tobacco 0132, *019, *0219, *0259,"
## [17] " *029"
## [18] " 2.0401 Fruits 0171-2, 0174-5, *0179, "
## [19] " *019, *0219, *0259, *029"
## [20] " 2.0402 Tree nuts 0173, *0179, *019, *0219, "
## [21] " *0259, *029"
## [22] " 2.0501 Vegetables 0134, *0139, 016, *019,"
## [23] " *0219, *0259, *029, *0119"
## [24] " 2.0502 Sugar crops 0133, *019, *0219, *0259,"
## [25] " *029"
After removing all the headers and sub-headers, I can go ahead and delete the white spaces in front of the observations left:
for (i in 1:length(z1)) {
z1[i] = str_remove(z1[i], "^ ")
}
z = as.data.table(z1)
print(z$z1[1:25])
## [1] " 1.0100 Dairy farm products 024,*019, *0259, *029"
## [2] " 1.0200 Poultry and eggs 0251-3, *0259, *019, *0219"
## [3] " *029"
## [4] " 1.0301 Meat animals 0211-4, *0219, *019, *0259 "
## [5] " *029"
## [6] " 1.0302 Miscellaneous livestock 0271-3, *0279, *019, *0219,"
## [7] " *0259, *029"
## [8] " 2.0100 Cotton 0131, *019, *0219, *0259,"
## [9] " *029"
## [10] " 2.0201 Food grains *011, *019, *0219, *0259,"
## [11] " *029"
## [12] " 2.0202 Feed grains *011, *0139, *019, *0219,"
## [13] " *0259, *029"
## [14] " 2.0203 Grass seeds *0139, *019, *0219, *0259,"
## [15] " *029"
## [16] " 2.0300 Tobacco 0132, *019, *0219, *0259,"
## [17] " *029"
## [18] " 2.0401 Fruits 0171-2, 0174-5, *0179, "
## [19] " *019, *0219, *0259, *029"
## [20] " 2.0402 Tree nuts 0173, *0179, *019, *0219, "
## [21] " *0259, *029"
## [22] " 2.0501 Vegetables 0134, *0139, 016, *019,"
## [23] " *0219, *0259, *029, *0119"
## [24] " 2.0502 Sugar crops 0133, *019, *0219, *0259,"
## [25] " *029"
As we can see, some IO-SIC pairs can take two or three rows, but I want each pair only occupies one row, so I have to paste the multiple cells of one pair together. To do so, I create four columns. The first column control is the first 7 characters of the observation, the second column control2 is equal to 1 if the corresponding control cell is empty, which indicates that row does not start a new io-sic pair. The third column control3 equals to 2 if the row above it does not start a new pair, and the forth column control4 is the max of control2 and control3.
z = z[, control := substr(z1, 1, 7)]
z = z[control == ' ', control2 := 1]
z$control3 = NA
z$control2[is.na(z$control2)] = 0
for (i in 2:nrow(z)) {
if (z$control2[i] == 1 & z$control2[i+1] == 1) {
z$control3[i+1] = 2
}
}
z$control3[is.na(z$control3)] = 0
z$control4 = mapply(max, z$control2, z$control3)
print(z[1:25])
## z1
## 1: 1.0100 Dairy farm products 024,*019, *0259, *029
## 2: 1.0200 Poultry and eggs 0251-3, *0259, *019, *0219
## 3: *029
## 4: 1.0301 Meat animals 0211-4, *0219, *019, *0259
## 5: *029
## 6: 1.0302 Miscellaneous livestock 0271-3, *0279, *019, *0219,
## 7: *0259, *029
## 8: 2.0100 Cotton 0131, *019, *0219, *0259,
## 9: *029
## 10: 2.0201 Food grains *011, *019, *0219, *0259,
## 11: *029
## 12: 2.0202 Feed grains *011, *0139, *019, *0219,
## 13: *0259, *029
## 14: 2.0203 Grass seeds *0139, *019, *0219, *0259,
## 15: *029
## 16: 2.0300 Tobacco 0132, *019, *0219, *0259,
## 17: *029
## 18: 2.0401 Fruits 0171-2, 0174-5, *0179,
## 19: *019, *0219, *0259, *029
## 20: 2.0402 Tree nuts 0173, *0179, *019, *0219,
## 21: *0259, *029
## 22: 2.0501 Vegetables 0134, *0139, 016, *019,
## 23: *0219, *0259, *029, *0119
## 24: 2.0502 Sugar crops 0133, *019, *0219, *0259,
## 25: *029
## z1
## control control2 control3 control4
## 1: 1.0100 0 0 0
## 2: 1.0200 0 0 0
## 3: 1 0 1
## 4: 1.0301 0 0 0
## 5: 1 0 1
## 6: 1.0302 0 0 0
## 7: 1 0 1
## 8: 2.0100 0 0 0
## 9: 1 0 1
## 10: 2.0201 0 0 0
## 11: 1 0 1
## 12: 2.0202 0 0 0
## 13: 1 0 1
## 14: 2.0203 0 0 0
## 15: 1 0 1
## 16: 2.0300 0 0 0
## 17: 1 0 1
## 18: 2.0401 0 0 0
## 19: 1 0 1
## 20: 2.0402 0 0 0
## 21: 1 0 1
## 22: 2.0501 0 0 0
## 23: 1 0 1
## 24: 2.0502 0 0 0
## 25: 1 0 1
## control control2 control3 control4
We leave only the data column and control4 column. Now if the value in control4 column is 0, the corresponding pair takes only one row. And if the value is 1, the next row is also part of the pair; and if the value is 2, then the next two rows are both part of this item. Knowing that, I can paste the multiple cells of one IO-SIC pair together:
z = z[, c("z1", "control4")]
names(z) = c("info","control")
z$temp = NA
for (i in 1:(nrow(z)-2)) {
if (z$control[i] == 0 & z$control[i+1] == 0) {
z$temp[i] = z$info[i]
} else if (z$control[i] == 0 & z$control[i+1] == 1 & z$control[i+2] == 0) {
z$temp[i] = paste0(z$info[i],z$info[i+1])
} else if (z$control[i] == 0 & z$control[i+1] == 1 & z$control[i+2] == 2) {
z$temp[i] = paste0(z$info[i],z$info[i+1],z$info[i+2])
}
}
i = nrow(z)-1
if (z$control[i] == 0 & z$control[i+1] == 0) {
z$temp[i] = z$info[i]
z$temp[i+1] = z$info[i+1]
} else if (z$control[i] == 0 & z$control[i+1] == 1) {
z$temp[i] = paste0(z$info[i],z$info[i+1])
}
print(z$temp[1:25])
## [1] " 1.0100 Dairy farm products 024,*019, *0259, *029"
## [2] " 1.0200 Poultry and eggs 0251-3, *0259, *019, *0219 *029"
## [3] NA
## [4] " 1.0301 Meat animals 0211-4, *0219, *019, *0259 *029"
## [5] NA
## [6] " 1.0302 Miscellaneous livestock 0271-3, *0279, *019, *0219, *0259, *029"
## [7] NA
## [8] " 2.0100 Cotton 0131, *019, *0219, *0259, *029"
## [9] NA
## [10] " 2.0201 Food grains *011, *019, *0219, *0259, *029"
## [11] NA
## [12] " 2.0202 Feed grains *011, *0139, *019, *0219, *0259, *029"
## [13] NA
## [14] " 2.0203 Grass seeds *0139, *019, *0219, *0259, *029"
## [15] NA
## [16] " 2.0300 Tobacco 0132, *019, *0219, *0259, *029"
## [17] NA
## [18] " 2.0401 Fruits 0171-2, 0174-5, *0179, *019, *0219, *0259, *029"
## [19] NA
## [20] " 2.0402 Tree nuts 0173, *0179, *019, *0219, *0259, *029"
## [21] NA
## [22] " 2.0501 Vegetables 0134, *0139, 016, *019, *0219, *0259, *029, *0119"
## [23] NA
## [24] " 2.0502 Sugar crops 0133, *019, *0219, *0259, *029"
## [25] NA
Now remove the rows that are NAs and get rid of all the ‘*’ and ‘.’ in each observation:
z1 = z[!is.na(z$temp)]
z1 = z1[, temp]
z1 = str_remove_all(z1, "\\*")
z1 = str_remove_all(z1, "\\.")
print(z1[1:25])
## [1] " 10100 Dairy farm products 024,019, 0259, 029"
## [2] " 10200 Poultry and eggs 0251-3, 0259, 019, 0219 029"
## [3] " 10301 Meat animals 0211-4, 0219, 019, 0259 029"
## [4] " 10302 Miscellaneous livestock 0271-3, 0279, 019, 0219, 0259, 029"
## [5] " 20100 Cotton 0131, 019, 0219, 0259, 029"
## [6] " 20201 Food grains 011, 019, 0219, 0259, 029"
## [7] " 20202 Feed grains 011, 0139, 019, 0219, 0259, 029"
## [8] " 20203 Grass seeds 0139, 019, 0219, 0259, 029"
## [9] " 20300 Tobacco 0132, 019, 0219, 0259, 029"
## [10] " 20401 Fruits 0171-2, 0174-5, 0179, 019, 0219, 0259, 029"
## [11] " 20402 Tree nuts 0173, 0179, 019, 0219, 0259, 029"
## [12] " 20501 Vegetables 0134, 0139, 016, 019, 0219, 0259, 029, 0119"
## [13] " 20502 Sugar crops 0133, 019, 0219, 0259, 029"
## [14] " 20503 Miscellaneous crops 0119, 0139, 019, 0219, 0259, 029"
## [15] " 20600 Oil bearing crops 0116, 0119, 0139, 0219, 0259, 029"
## [16] " 20701 Forest products 018, 019, 0219, 0259, 029"
## [17] " 20702 Greenhouse and nursery products 018, 019, 0219, 0259 029"
## [18] " 30001 Forestry products 081, 083, 097"
## [19] " 30002 Commercial fishing 091"
## [20] " 40001 Agricultural, forestry, and fishery services 0254, 0279, 071-2, 075-6, 085, 092"
## [21] " 40002 Landscape and horticultural services 078"
## [22] " 50000 Iron and ferroalloy ores 101, 106"
## [23] " 60100 Copper ore 102"
## [24] " 60200 Nonferrous metal ores, except copper 103-4, 109, 108"
## [25] " 70000 Coal 122-3, 124"
We need 3-digit (or less than 3-digit) sic code, so I replace all 4-digit sic with its first three digits using regular expression to detect the 4-digit pattern and make the replacement. Because there may be multiple matches in one cell, so I let it run 10 times:
for (k in 1:10) {
for (i in 1:length(z1)) {
if (grepl('\\d\\d\\d\\d-\\d', z1[i])) {
i.str = str_extract(z1[i],"\\d\\d\\d\\d-\\d")
i.str = str_extract(i.str,"\\d\\d\\d")
z1[i] = str_replace(z1[i], "\\d\\d\\d\\d-\\d", i.str)
}
if (grepl('\\d\\d\\d\\d,', z1[i])) {
i.str = str_extract(z1[i],"\\d\\d\\d\\d,")
i.str = str_extract(i.str,"\\d\\d\\d")
z1[i] = str_replace(z1[i], "\\d\\d\\d\\d,", paste0(i.str,", "))
}
if (grepl('\\(.+\\)', z1[i])) {
z1[i] = str_remove_all(z1[i], "\\(.+\\)")
}
if (grepl('\\d\\d\\d\\d$', z1[i])) {
i.str = str_extract(z1[i],"\\d\\d\\d\\d$")
i.str = str_extract(i.str,"\\d\\d\\d")
z1[i] = str_replace(z1[i], "\\d\\d\\d\\d$", i.str)
}
}
}
print(z1[1:25])
## [1] " 10100 Dairy farm products 024,019, 025, 029"
## [2] " 10200 Poultry and eggs 025, 025, 019, 0219 029"
## [3] " 10301 Meat animals 021, 021, 019, 0259 029"
## [4] " 10302 Miscellaneous livestock 027, 027, 019, 021, 025, 029"
## [5] " 20100 Cotton 013, 019, 021, 025, 029"
## [6] " 20201 Food grains 011, 019, 021, 025, 029"
## [7] " 20202 Feed grains 011, 013, 019, 021, 025, 029"
## [8] " 20203 Grass seeds 013, 019, 021, 025, 029"
## [9] " 20300 Tobacco 013, 019, 021, 025, 029"
## [10] " 20401 Fruits 017, 017, 017, 019, 021, 025, 029"
## [11] " 20402 Tree nuts 017, 017, 019, 021, 025, 029"
## [12] " 20501 Vegetables 013, 013, 016, 019, 021, 025, 029, 011"
## [13] " 20502 Sugar crops 013, 019, 021, 025, 029"
## [14] " 20503 Miscellaneous crops 011, 013, 019, 021, 025, 029"
## [15] " 20600 Oil bearing crops 011, 011, 013, 021, 025, 029"
## [16] " 20701 Forest products 018, 019, 021, 025, 029"
## [17] " 20702 Greenhouse and nursery products 018, 019, 021, 0259 029"
## [18] " 30001 Forestry products 081, 083, 097"
## [19] " 30002 Commercial fishing 091"
## [20] " 40001 Agricultural, forestry, and fishery services 025, 027, 071-2, 075-6, 085, 092"
## [21] " 40002 Landscape and horticultural services 078"
## [22] " 50000 Iron and ferroalloy ores 101, 106"
## [23] " 60100 Copper ore 102"
## [24] " 60200 Nonferrous metal ores, except copper 103-4, 109, 108"
## [25] " 70000 Coal 122-3, 124"
By now, no more 4-digit sic is left, but we still have some issues to deal with. There are sic codes like 15-17 or 155-7, but I want them to be 15, 16, 17 and 155, 156, 157. The chunck below does the job. it first converts 15-17 to numeric items: 15 and 17, then creates a sequence that starts with 15 and end with 17, and finally convert the sequence back to a string as “15, 16, 17”. Again, there may be multiple matches in one cell, so I let it run 5 times:
for (k in 1:5) {
for (i in 1:length(z1)) {
if (grepl('-\\d,', z1[i])) {
i.start = as.integer(substr(z1[i], regexpr('-\\d,', z1[i])[1]-1,regexpr('-\\d,', z1[i])[1]-1))
i.end = as.integer(substr(z1[i], regexpr('-\\d,', z1[i])[1]+1, regexpr('-\\d,', z1[i])[1]+1))
i.str = str_extract(z1[i],"[:digit:]+-")
i.str = as.integer(str_remove(i.str, "-"))
i.str1 = ""
for (j in 0:(i.end - i.start)) {
i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
}
z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:],", i.str1)
}
if (grepl('-\\d$', z1[i])) {
i.start = as.integer(substr(z1[i], regexpr('-\\d$', z1[i])[1]-1,regexpr('-\\d$', z1[i])[1]-1))
i.end = as.integer(substr(z1[i], regexpr('-\\d$', z1[i])[1]+1, regexpr('-\\d$', z1[i])[1]+1))
i.str = str_extract(z1[i],"[:digit:]+-")
i.str = as.integer(str_remove(i.str, "-"))
i.str1 = ""
for (j in 0:(i.end - i.start)) {
i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
}
i.str1 = str_remove(i.str1, ", $")
z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:]", i.str1)
}
if (grepl('-\\d\\d', z1[i])) {
i.start = as.integer(substr(z1[i], regexpr('-\\d\\d', z1[i])[1]-1,regexpr('-\\d\\d', z1[i])[1]-1))
i.end = as.integer(substr(z1[i], regexpr('-\\d\\d', z1[i])[1]+2, regexpr('-\\d\\d', z1[i])[1]+2))
i.str = str_extract(z1[i],"[:digit:]+-")
i.str = as.integer(str_remove(i.str, "-"))
i.str1 = ""
for (j in 0:(i.end - i.start)) {
i.str1 = paste0(i.str1, as.character(i.str+j), ", ")
}
i.str1 = str_remove(i.str1, ", $")
z1[i] = str_replace(z1[i], "[:digit:]+-[:digit:]+", i.str1)
}
}
}
z1 = as.data.table(z1)
print(z1$z1[1:25])
## [1] " 10100 Dairy farm products 024,019, 025, 029"
## [2] " 10200 Poultry and eggs 025, 025, 019, 0219 029"
## [3] " 10301 Meat animals 021, 021, 019, 0259 029"
## [4] " 10302 Miscellaneous livestock 027, 027, 019, 021, 025, 029"
## [5] " 20100 Cotton 013, 019, 021, 025, 029"
## [6] " 20201 Food grains 011, 019, 021, 025, 029"
## [7] " 20202 Feed grains 011, 013, 019, 021, 025, 029"
## [8] " 20203 Grass seeds 013, 019, 021, 025, 029"
## [9] " 20300 Tobacco 013, 019, 021, 025, 029"
## [10] " 20401 Fruits 017, 017, 017, 019, 021, 025, 029"
## [11] " 20402 Tree nuts 017, 017, 019, 021, 025, 029"
## [12] " 20501 Vegetables 013, 013, 016, 019, 021, 025, 029, 011"
## [13] " 20502 Sugar crops 013, 019, 021, 025, 029"
## [14] " 20503 Miscellaneous crops 011, 013, 019, 021, 025, 029"
## [15] " 20600 Oil bearing crops 011, 011, 013, 021, 025, 029"
## [16] " 20701 Forest products 018, 019, 021, 025, 029"
## [17] " 20702 Greenhouse and nursery products 018, 019, 021, 0259 029"
## [18] " 30001 Forestry products 081, 083, 097"
## [19] " 30002 Commercial fishing 091"
## [20] " 40001 Agricultural, forestry, and fishery services 025, 027, 71, 72, 75, 76, 085, 092"
## [21] " 40002 Landscape and horticultural services 078"
## [22] " 50000 Iron and ferroalloy ores 101, 106"
## [23] " 60100 Copper ore 102"
## [24] " 60200 Nonferrous metal ores, except copper 103, 104, 109, 108"
## [25] " 70000 Coal 122, 123, 124"
Divide z1 into two columns: io and info (includes industry names and SIC codes)
z1$io = str_sub(z1$z1, 1, 7)
z1$info = str_sub(z1$z1, 8, )
z1 = z1[, c("io","info")]
print(z1[1:25])
## io
## 1: 10100
## 2: 10200
## 3: 10301
## 4: 10302
## 5: 20100
## 6: 20201
## 7: 20202
## 8: 20203
## 9: 20300
## 10: 20401
## 11: 20402
## 12: 20501
## 13: 20502
## 14: 20503
## 15: 20600
## 16: 20701
## 17: 20702
## 18: 30001
## 19: 30002
## 20: 40001
## 21: 40002
## 22: 50000
## 23: 60100
## 24: 60200
## 25: 70000
## io
## info
## 1: Dairy farm products 024,019, 025, 029
## 2: Poultry and eggs 025, 025, 019, 0219 029
## 3: Meat animals 021, 021, 019, 0259 029
## 4: Miscellaneous livestock 027, 027, 019, 021, 025, 029
## 5: Cotton 013, 019, 021, 025, 029
## 6: Food grains 011, 019, 021, 025, 029
## 7: Feed grains 011, 013, 019, 021, 025, 029
## 8: Grass seeds 013, 019, 021, 025, 029
## 9: Tobacco 013, 019, 021, 025, 029
## 10: Fruits 017, 017, 017, 019, 021, 025, 029
## 11: Tree nuts 017, 017, 019, 021, 025, 029
## 12: Vegetables 013, 013, 016, 019, 021, 025, 029, 011
## 13: Sugar crops 013, 019, 021, 025, 029
## 14: Miscellaneous crops 011, 013, 019, 021, 025, 029
## 15: Oil bearing crops 011, 011, 013, 021, 025, 029
## 16: Forest products 018, 019, 021, 025, 029
## 17: Greenhouse and nursery products 018, 019, 021, 0259 029
## 18: Forestry products 081, 083, 097
## 19: Commercial fishing 091
## 20: Agricultural, forestry, and fishery services 025, 027, 71, 72, 75, 76, 085, 092
## 21: Landscape and horticultural services 078
## 22: Iron and ferroalloy ores 101, 106
## 23: Copper ore 102
## 24: Nonferrous metal ores, except copper 103, 104, 109, 108
## 25: Coal 122, 123, 124
## info
Remove all the unnecessary information in the IO code:
z1$io = str_remove_all(z1$io, "\\.")
z = z1$info
z = str_replace_all(z, ",", " ")
z = str_remove_all(z, "[:punct:]+")
z = str_remove_all(z, "[:alpha:]+")
z = str_remove(z, '^[:blank:]+')
print(z[1:25])
## [1] "024 019 025 029"
## [2] "025 025 019 0219 029"
## [3] "021 021 019 0259 029"
## [4] "027 027 019 021 025 029"
## [5] "013 019 021 025 029"
## [6] "011 019 021 025 029"
## [7] "011 013 019 021 025 029"
## [8] "013 019 021 025 029"
## [9] "013 019 021 025 029"
## [10] "017 017 017 019 021 025 029"
## [11] "017 017 019 021 025 029"
## [12] "013 013 016 019 021 025 029 011"
## [13] "013 019 021 025 029"
## [14] "011 013 019 021 025 029"
## [15] "011 011 013 021 025 029"
## [16] "018 019 021 025 029"
## [17] "018 019 021 0259 029"
## [18] "081 083 097"
## [19] "091"
## [20] "025 027 71 72 75 76 085 092"
## [21] "078"
## [22] "101 106"
## [23] "102"
## [24] "103 104 109 108"
## [25] "122 123 124"
Assume that there are at most 12 SIC codes corresponding to a certain IO code, we can separate the SIC character column into numeric columns where each new column contains one SIC code:
z = data.table(io = z1$io, sic = z)
nsic = 12
z1 = as.data.table(str_split_fixed(z$sic, "[:blank:]+", n = nsic))
z = cbind(z$io, z1)
z_names = vector(mode = "character", length = nsic)
for (i in 1:nsic) {
z_names[i] = paste0("sic",i)
}
z_names = c("io", z_names)
names(z) = z_names
print(z[1:25])
## io sic1 sic2 sic3 sic4 sic5 sic6 sic7 sic8 sic9 sic10 sic11 sic12
## 1: 10100 024 019 025 029
## 2: 10200 025 025 019 0219 029
## 3: 10301 021 021 019 0259 029
## 4: 10302 027 027 019 021 025 029
## 5: 20100 013 019 021 025 029
## 6: 20201 011 019 021 025 029
## 7: 20202 011 013 019 021 025 029
## 8: 20203 013 019 021 025 029
## 9: 20300 013 019 021 025 029
## 10: 20401 017 017 017 019 021 025 029
## 11: 20402 017 017 019 021 025 029
## 12: 20501 013 013 016 019 021 025 029 011
## 13: 20502 013 019 021 025 029
## 14: 20503 011 013 019 021 025 029
## 15: 20600 011 011 013 021 025 029
## 16: 20701 018 019 021 025 029
## 17: 20702 018 019 021 0259 029
## 18: 30001 081 083 097
## 19: 30002 091
## 20: 40001 025 027 71 72 75 76 085 092
## 21: 40002 078
## 22: 50000 101 106
## 23: 60100 102
## 24: 60200 103 104 109 108
## 25: 70000 122 123 124
## io sic1 sic2 sic3 sic4 sic5 sic6 sic7 sic8 sic9 sic10 sic11 sic12