SQL Regex - 替换为另一个字段的子字符串

时间:2023-02-01 21:47:20

I have a database table (Oracle 11g) of questionnaire feedback, including multiple choice, multiple answer questions. The Options column has each value the user could choose, and the Answers column has the numerical values of what they chose.

我有一个数据库表(Oracle 11g)的问卷反馈,包括多项选择,多个答案问题。 Options列具有用户可以选择的每个值,Answers列具有他们选择的数值。

ID_NO     OPTIONS                               ANSWERS
1001      Apple Pie|Banana-Split|Cream Tea      1|2
1002      Apple Pie|Banana-Split|Cream Tea      2|3
1003      Apple Pie|Banana-Split|Cream Tea      1|2|3

I need a query that will decode the answers, with the text versions of the answers as a single string.

我需要一个能够解码答案的查询,并将答案的文本版本作为单个字符串。

ID_NO     ANSWERS     ANSWER_DECODE
1001      1|2         Apple Pie|Banana-Split
1002      2|3         Banana-Split|Cream Tea
1003      1|2|3       Apple Pie|Banana-Split|Cream Tea

I have experimented with regular expressions to replace values and get substrings, but I can't work out a way to properly merge the two.

我已经尝试使用正则表达式来替换值并获得子串,但我无法找到一种方法来正确合并这两者。

WITH feedback AS (
  SELECT 1001 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2' answers FROM DUAL UNION
  SELECT 1002 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '2|3' answers FROM DUAL UNION
  SELECT 1003 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2|3' answers FROM DUAL )
SELECT 
  id_no,
  options,
  REGEXP_SUBSTR(options||'|', '(.)+?\|', 1, 2) second_option,
  answers,
  REGEXP_REPLACE(answers, '(\d)+', ' \1 ') answer_numbers,
  REGEXP_REPLACE(answers, '(\d)+', REGEXP_SUBSTR(options||'|', '(.)+?\|', 1, To_Number('2'))) "???"
FROM feedback

I don't want to have to manually define or decode the answers in the SQL; there are many surveys with different questions (and differing numbers of options), so I'm hoping that there's a solution that will dynamically work for all of them.

我不想手动定义或解码SQL中的答案;有许多调查都有不同的问题(以及不同数量的选项),所以我希望有一个解决方案可以动态地为所有问题工作。

I've tried to split the options and answers into separate rows by LEVEL, and re-join them where the codes match, but this runs exceedingly slow with the actual dataset (a 5-option question with 600 rows of responses).

我试图通过LEVEL将选项和答案拆分成单独的行,并在代码匹配的地方重新加入它们,但这对于实际数据集(具有600行响应的5选项问题)运行得非常慢。

WITH feedback AS (
  SELECT 1001 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2' answers FROM DUAL UNION
  SELECT 1002 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '2|3' answers FROM DUAL UNION
  SELECT 1003 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2|3' answers FROM DUAL )
SELECT
    answer_rows.id_no,
    ListAgg(option_rows.answer) WITHIN GROUP(ORDER BY option_rows.lvl)
FROM
  (SELECT DISTINCT
    LEVEL lvl,
    REGEXP_SUBSTR(options||'|', '(.)+?\|', 1, LEVEL) answer
  FROM
    (SELECT DISTINCT
      options,
      REGEXP_COUNT(options||'|', '(.)+?\|') num_choices
    FROM
      feedback)
  CONNECT BY LEVEL <= num_choices
  ) option_rows
  LEFT OUTER JOIN
  (SELECT DISTINCT
    id_no,
    to_number(REGEXP_SUBSTR(answers, '(\d)+', 1, LEVEL)) answer
  FROM
    (SELECT DISTINCT
      id_no,
      answers,
      To_Number(REGEXP_SUBSTR(answers, '(\d)+$')) max_answer
    FROM
      feedback)
  WHERE
    to_number(REGEXP_SUBSTR(answers, '(\d)+', 1, LEVEL)) IS NOT NULL
  CONNECT BY LEVEL <= max_answer
  ) answer_rows
    ON option_rows.lvl = answer_rows.answer
GROUP BY
    answer_rows.id_no
ORDER BY
  answer_rows.id_no

If there isn't a solution just using Regex, is there a more efficient way than LEVEL to split the values? Or is there another approach that would work?

如果没有使用正则表达式的解决方案,是否有比LEVEL更有效的方法来分割值?还是有另一种方法可行吗?

4 个解决方案

#1


1  

It's slow because you're expanding each row too many times; the connect-by clauses you're using are looking across all the rows, so you're ending up with a huge amount of data to then sort - which is presumably why you ended up with the DISTINCT in there.

它很慢,因为你将每行扩展太多次;您正在使用的连接子句正在查看所有行,因此您最终需要大量数据才能进行排序 - 这可能就是为什么您最终会在那里使用DISTINCT。

You can add two PRIOR clauses to the connect-by, firstly so the ID_NO is preserved, and a second to avoid a loop - any non-deterministic function will do for this, I've picked dbms_random.value but you can use sys_guid if you prefer, or something else. You also don't need to many subqueries, you can do it with two; or as CTEs which I think it s a bit clearer:

您可以向连接添加两个PRIOR子句,首先保留ID_NO,然后第二个以避免循环 - 任何非确定性函数都会为此做,我选择了dbms_random.value但是您可以使用sys_guid if你喜欢什么,或者别的什么。你也不需要很多子查询,你可以用两个;或者作为我认为更清楚的CTE:

WITH feedback AS (
  SELECT 1001 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2' answers FROM DUAL UNION
  SELECT 1002 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '2|3' answers FROM DUAL UNION
  SELECT 1003 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2|3' answers FROM DUAL
),
option_rows AS (
  SELECT
    id_no,
    LEVEL answer,
    REGEXP_SUBSTR(options, '[^|]+', 1, LEVEL) answer_text
  FROM feedback
  CONNECT BY LEVEL <= REGEXP_COUNT(options, '[^|]+')
  AND id_no = PRIOR id_no
  AND PRIOR dbms_random.value IS NOT NULL
),
answer_rows AS (
  SELECT
    id_no,
    REGEXP_SUBSTR(answers, '[^|]+', 1, LEVEL) answer
  FROM feedback
  CONNECT BY LEVEL <= REGEXP_COUNT(answers, '[^|]+')
  AND PRIOR id_no = id_no
  AND PRIOR dbms_random.value IS NOT NULL
)
SELECT
  option_rows.id_no,
  LISTAGG(option_rows.answer, '|') WITHIN GROUP (ORDER BY option_rows.answer) AS answers,
  LISTAGG(option_rows.answer_text, '|') WITHIN GROUP (ORDER BY option_rows.answer) AS answer_decode
FROM option_rows
JOIN answer_rows
ON option_rows.id_no = answer_rows.id_no
AND option_rows.answer = answer_rows.answer
GROUP BY option_rows.id_no
ORDER BY option_rows.id_no;

Which gets:

哪个获得:

     ID_NO ANSWERS    ANSWER_DECODE                          
---------- ---------- ----------------------------------------
      1001 1|2        Apple Pie|Banana-Split                  
      1002 2|3        Banana-Split|Cream Tea                  
      1003 1|2|3      Apple Pie|Banana-Split|Cream Tea  

I've also changed your regex pattern so you don't have to append or strip the |.

我也改变了你的正则表达式模式,所以你不必附加或删除|。

#2


1  

Check out this compact solution:

看看这个紧凑的解决方案:

   with sample_data as
(
  select 'ala|ma|kota' options, '1|2' answers from dual
  union all
  select 'apples|oranges|bacon', '1|2|3' from dual
  union all
  select 'a|b|c|d|e|f|h|i','1|3|4|5|8' from dual
)
select answers, options,
regexp_replace(regexp_replace(options,'([^|]+)\|([^|]+)\|([^|]+)','\' || replace(answers,'|','|\')),'[|]+','|') answer_decode
from sample_data;

Output:

输出:

  ANSWERS   OPTIONS              ANSWER_DECODE
--------- -------------------- ---------------------------
1|2       ala|ma|kota          ala|ma
1|2|3     apples|oranges|bacon apples|oranges|bacon
1|3|4|5|8 a|b|c|d|e|f|h|i      a|c|d|f|h|i

#3


0  

I've written a close solution in MySQL (don't have Oracle installed right now) - but I've written what needs to be changed in order for the query to work in Oracle.

我在MySQL中编写了一个紧密的解决方案(现在没有安装Oracle) - 但我已经编写了需要更改的内容,以便查询在Oracle中运行。

Also, the ugliest part of my code will be much better looking in Oracle since it has a much better INSTR function.

此外,我的代码中最丑陋的部分在Oracle中会更好看,因为它具有更好的INSTR功能。

The idea is to do a CROSS JOIN with a list of numbers (1 to 10 in order to support up to 10 options per survey), and break-down the OPTIONS field into different rows... (you do this by using both the list of numbers and Oracle's INSTR function, see the comments).

我们的想法是使用一个数字列表进行CROSS JOIN(1到10,以便每次调查最多支持10个选项),并将OPTIONS字段分解为不同的行...(通过同时使用两个行来实现数字列表和Oracle的INSTR函数,请参阅注释)。

From there you filter out the rows that were not selected and group everything back together.

从那里,您筛选出未选中的行,并将所有内容重新组合在一起。

-- I've used GROUP_CONCAT in MySQL, but in Oracle you'll have to use WM_CONCAT
select ID_NO, ANSWERS, group_concat(broken_down_options,'|') `OPTIONS`
from (
    select your_table.ID_NO, your_table.ANSWERS, 
            -- Luckily, you're using ORACLE so you can use an INSTR function that has the "occurrence" parameter
            -- INSTR(string, substring, [position, [occurrence]])
            -- use the nums.num field as input for the occurrence parameter
            -- and just put '1' under "position"
            case when nums.num = 1 
                then substr(your_table.`OPTIONS`, 1, instr(your_table.`OPTIONS`, '|') - 1)
                when nums.num = 2
                then substr(substr(your_table.`OPTIONS`, instr(your_table.`OPTIONS`, '|') + 1), 1, instr(substr(your_table.`OPTIONS`, instr(your_table.`OPTIONS`, '|') + 1), '|') - 1)
                else substr(your_table.`OPTIONS`,  length(your_table.`OPTIONS`) - instr(reverse(your_table.`OPTIONS`), '|') + 2) end broken_down_options
    from (select 1 num union all
        select 2 num union all
        select 3 num union all
        select 4 num union all
        select 5 num union all
        select 6 num union all
        select 7 num union all
        select 8 num union all
        select 9 num union all
        select 10 num
        ) nums 
        CROSS JOIN
        (select 1001 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '1|2' ANSWERS union
        select 1002 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '2|3' ANSWERS union
        select 1003 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '1|2|3' ANSWERS
        ) your_table
    -- for example: 2|3 matches 2 and 3 but not 1
    where your_table.ANSWERS like concat(concat('%',nums.num),'%')
) some_query
group by ID_NO, ANSWERS

#4


0  

Create a stored predure and do below steps

创建存储的predure并执行以下步骤

  • Declare an array of your size.
  • 声明一个你的大小的数组。
  • Get option data from first row. Use a regex or level to extract values between pipes and then store them in array. Note: This will be only a one time itwration. So you dont need to repeat it for every row.
  • 从第一行获取选项数据。使用正则表达式或级别在管道之间提取值,然后将它们存储在数组中。注意:这只是一次性的。所以你不需要为每一行重复它。
  • Now in a loop, for each row, select answers and use array values to assign the values of answers
  • 现在循环中,对于每一行,选择答案并使用数组值来分配答案的值

#1


1  

It's slow because you're expanding each row too many times; the connect-by clauses you're using are looking across all the rows, so you're ending up with a huge amount of data to then sort - which is presumably why you ended up with the DISTINCT in there.

它很慢,因为你将每行扩展太多次;您正在使用的连接子句正在查看所有行,因此您最终需要大量数据才能进行排序 - 这可能就是为什么您最终会在那里使用DISTINCT。

You can add two PRIOR clauses to the connect-by, firstly so the ID_NO is preserved, and a second to avoid a loop - any non-deterministic function will do for this, I've picked dbms_random.value but you can use sys_guid if you prefer, or something else. You also don't need to many subqueries, you can do it with two; or as CTEs which I think it s a bit clearer:

您可以向连接添加两个PRIOR子句,首先保留ID_NO,然后第二个以避免循环 - 任何非确定性函数都会为此做,我选择了dbms_random.value但是您可以使用sys_guid if你喜欢什么,或者别的什么。你也不需要很多子查询,你可以用两个;或者作为我认为更清楚的CTE:

WITH feedback AS (
  SELECT 1001 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2' answers FROM DUAL UNION
  SELECT 1002 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '2|3' answers FROM DUAL UNION
  SELECT 1003 id_no, 'Apple Pie|Banana-Split|Cream Tea' options, '1|2|3' answers FROM DUAL
),
option_rows AS (
  SELECT
    id_no,
    LEVEL answer,
    REGEXP_SUBSTR(options, '[^|]+', 1, LEVEL) answer_text
  FROM feedback
  CONNECT BY LEVEL <= REGEXP_COUNT(options, '[^|]+')
  AND id_no = PRIOR id_no
  AND PRIOR dbms_random.value IS NOT NULL
),
answer_rows AS (
  SELECT
    id_no,
    REGEXP_SUBSTR(answers, '[^|]+', 1, LEVEL) answer
  FROM feedback
  CONNECT BY LEVEL <= REGEXP_COUNT(answers, '[^|]+')
  AND PRIOR id_no = id_no
  AND PRIOR dbms_random.value IS NOT NULL
)
SELECT
  option_rows.id_no,
  LISTAGG(option_rows.answer, '|') WITHIN GROUP (ORDER BY option_rows.answer) AS answers,
  LISTAGG(option_rows.answer_text, '|') WITHIN GROUP (ORDER BY option_rows.answer) AS answer_decode
FROM option_rows
JOIN answer_rows
ON option_rows.id_no = answer_rows.id_no
AND option_rows.answer = answer_rows.answer
GROUP BY option_rows.id_no
ORDER BY option_rows.id_no;

Which gets:

哪个获得:

     ID_NO ANSWERS    ANSWER_DECODE                          
---------- ---------- ----------------------------------------
      1001 1|2        Apple Pie|Banana-Split                  
      1002 2|3        Banana-Split|Cream Tea                  
      1003 1|2|3      Apple Pie|Banana-Split|Cream Tea  

I've also changed your regex pattern so you don't have to append or strip the |.

我也改变了你的正则表达式模式,所以你不必附加或删除|。

#2


1  

Check out this compact solution:

看看这个紧凑的解决方案:

   with sample_data as
(
  select 'ala|ma|kota' options, '1|2' answers from dual
  union all
  select 'apples|oranges|bacon', '1|2|3' from dual
  union all
  select 'a|b|c|d|e|f|h|i','1|3|4|5|8' from dual
)
select answers, options,
regexp_replace(regexp_replace(options,'([^|]+)\|([^|]+)\|([^|]+)','\' || replace(answers,'|','|\')),'[|]+','|') answer_decode
from sample_data;

Output:

输出:

  ANSWERS   OPTIONS              ANSWER_DECODE
--------- -------------------- ---------------------------
1|2       ala|ma|kota          ala|ma
1|2|3     apples|oranges|bacon apples|oranges|bacon
1|3|4|5|8 a|b|c|d|e|f|h|i      a|c|d|f|h|i

#3


0  

I've written a close solution in MySQL (don't have Oracle installed right now) - but I've written what needs to be changed in order for the query to work in Oracle.

我在MySQL中编写了一个紧密的解决方案(现在没有安装Oracle) - 但我已经编写了需要更改的内容,以便查询在Oracle中运行。

Also, the ugliest part of my code will be much better looking in Oracle since it has a much better INSTR function.

此外,我的代码中最丑陋的部分在Oracle中会更好看,因为它具有更好的INSTR功能。

The idea is to do a CROSS JOIN with a list of numbers (1 to 10 in order to support up to 10 options per survey), and break-down the OPTIONS field into different rows... (you do this by using both the list of numbers and Oracle's INSTR function, see the comments).

我们的想法是使用一个数字列表进行CROSS JOIN(1到10,以便每次调查最多支持10个选项),并将OPTIONS字段分解为不同的行...(通过同时使用两个行来实现数字列表和Oracle的INSTR函数,请参阅注释)。

From there you filter out the rows that were not selected and group everything back together.

从那里,您筛选出未选中的行,并将所有内容重新组合在一起。

-- I've used GROUP_CONCAT in MySQL, but in Oracle you'll have to use WM_CONCAT
select ID_NO, ANSWERS, group_concat(broken_down_options,'|') `OPTIONS`
from (
    select your_table.ID_NO, your_table.ANSWERS, 
            -- Luckily, you're using ORACLE so you can use an INSTR function that has the "occurrence" parameter
            -- INSTR(string, substring, [position, [occurrence]])
            -- use the nums.num field as input for the occurrence parameter
            -- and just put '1' under "position"
            case when nums.num = 1 
                then substr(your_table.`OPTIONS`, 1, instr(your_table.`OPTIONS`, '|') - 1)
                when nums.num = 2
                then substr(substr(your_table.`OPTIONS`, instr(your_table.`OPTIONS`, '|') + 1), 1, instr(substr(your_table.`OPTIONS`, instr(your_table.`OPTIONS`, '|') + 1), '|') - 1)
                else substr(your_table.`OPTIONS`,  length(your_table.`OPTIONS`) - instr(reverse(your_table.`OPTIONS`), '|') + 2) end broken_down_options
    from (select 1 num union all
        select 2 num union all
        select 3 num union all
        select 4 num union all
        select 5 num union all
        select 6 num union all
        select 7 num union all
        select 8 num union all
        select 9 num union all
        select 10 num
        ) nums 
        CROSS JOIN
        (select 1001 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '1|2' ANSWERS union
        select 1002 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '2|3' ANSWERS union
        select 1003 ID_NO, 'Apple Pie|Banana-Split|Cream Tea' `OPTIONS`, '1|2|3' ANSWERS
        ) your_table
    -- for example: 2|3 matches 2 and 3 but not 1
    where your_table.ANSWERS like concat(concat('%',nums.num),'%')
) some_query
group by ID_NO, ANSWERS

#4


0  

Create a stored predure and do below steps

创建存储的predure并执行以下步骤

  • Declare an array of your size.
  • 声明一个你的大小的数组。
  • Get option data from first row. Use a regex or level to extract values between pipes and then store them in array. Note: This will be only a one time itwration. So you dont need to repeat it for every row.
  • 从第一行获取选项数据。使用正则表达式或级别在管道之间提取值,然后将它们存储在数组中。注意:这只是一次性的。所以你不需要为每一行重复它。
  • Now in a loop, for each row, select answers and use array values to assign the values of answers
  • 现在循环中,对于每一行,选择答案并使用数组值来分配答案的值