Parsing a complex string column into new columns in R

Time: 2022-10-16 22:54:17

I have the following data:

id,response,date
123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03
124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56
125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23

To be clear, the "response" column values are what you see within the curly brackets.

I'd need to break that response into new columns, but the string doesn't always have the same number of values. The desired output would be this:

id,answer,details,user_id,company_id,category,date
123,NA,NA,2423553,NA,NA,2015-01-11 02:23:03
124,Never,NA,4933822,992211,apple,2015-01-12 16:06:56
125,Other,"I like bread",2390922,119988,banana,2015-01-12 18:27:23

The NA can also be blank or NULL; I'm indifferent. On row 3, "answer" could also be a concatenation of the two replies, "Sometimes.Other", or it could be broken out into a new column called answer2. There will never be more than 2 values in the incoming "answer" field (95% of the time it will be 1 value).

Any clues on how to approach this would be welcome.

1 Solution

#1

Here's a start:

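library(stringr)
library(dplyr)
library(jsonlite)
library(data.table)

# Read the raw file; the first line is the header row
lines <- readLines("data.txt")

# x is the match matrix for one line:
# x[1] = full match, x[2] = id, x[3] = JSON response, x[4] = date
build_cols <- function(x) {
  data.frame(cbind(id=x[2], date=x[4], rbind(fromJSON(x[3]))))
}

# Split every data line into id / JSON / date, parse the JSON,
# then stack the rows; fill=TRUE pads keys that a response lacks
rbindlist(lapply(str_match_all(lines[2:length(lines)], 
                               "([[:digit:]]+),(\\{.*\\}),(.*$)"),
                 build_cols), fill=TRUE) %>%
  select(id,answer,details,user_id,company_id,category,date)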

##     id          answer       details user_id company_id category                 date
## 1: 123            NULL          NULL 2423553       NULL     NULL  2015-01-11 02:23:03
## 2: 124           Never          NULL 4933822     992211    apple  2015-01-12 16:06:56
## 3: 125 Sometimes,other I like bread. 2390922     119988   banana  2015-01-12 18:27:23
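
The output above still holds NULL entries and list-valued cells. As a possible next step, here is a minimal sketch that produces plain character columns with NA for missing or null fields and joins a two-valued "answer" with ".", as the question allows. It swaps in base R plus jsonlite rather than the stringr/data.table pipeline above. The helper names `scalar_or_na` and `parse_line` and the `wanted` vector are made up for illustration, and the sample lines are copied from the question so the block runs without `data.txt`.

```r
library(jsonlite)

# Sample lines copied from the question (header row omitted).
lines <- c(
  '123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03',
  '124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56',
  '125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23'
)

# Fields to keep from the JSON response, in output order (assumed from
# the desired output shown in the question).
wanted <- c("answer", "details", "user_id", "company_id", "category")

# Hypothetical helper: reduce a field to a single string. Missing or
# null fields become NA; multi-valued fields are joined with ".".
scalar_or_na <- function(v) {
  if (is.null(v) || length(v) == 0) NA_character_
  else paste(v, collapse = ".")
}

# Hypothetical helper: turn one raw line into a one-row data frame.
parse_line <- function(line) {
  # m[1] = full match, m[2] = id, m[3] = JSON, m[4] = date
  # (whitespace between the comma and the date is dropped).
  m <- regmatches(line,
                  regexec("^([0-9]+),([{].*[}]),[[:space:]]*(.*)$", line))[[1]]
  resp <- fromJSON(m[3])
  fields <- lapply(resp[wanted], scalar_or_na)
  names(fields) <- wanted  # restore names for keys absent from the JSON
  data.frame(id = m[2], fields, date = m[4], stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(lines, parse_line))
print(result)
```

Every column comes back as character here, so `user_id` and `company_id` would need `as.integer()` if numeric types matter. To get an `answer2` column instead of a joined string, the same helper could be applied to the first and second elements of `resp$answer` separately.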
