How to count the number of words in a delimited string in Oracle SQL

Date: 2022-09-13 11:41:12

For example, I have a table called 'Table1' with a column called 'country'. I want to count the distinct words in a string. Below is my data for the column 'country':

country:
"japan singapore japan chinese chinese chinese"

Expected output: in the data above, japan appears twice, singapore once, and chinese three times. I want to count each distinct word once, so japan counts as one, singapore as one, and chinese as one. Hence the output will be 3. Please help me.

ValueOfWord: 3

3 solutions

#1

Firstly, it is bad design to store multiple values in a single column as a delimited string. You should consider normalizing the data as a permanent solution.

With the denormalized data, you could do it in a single SQL statement using REGEXP_SUBSTR:

SELECT COUNT(DISTINCT(regexp_substr(country, '[^ ]+', 1, LEVEL))) as "COUNT"
FROM table_name
  CONNECT BY LEVEL <= regexp_count(country, ' ')+1 
/

Demo:

SQL> WITH sample_data AS
       ( SELECT 'japan singapore japan chinese chinese chinese' str FROM dual
       )
     -- end of sample_data mocking real table
     SELECT COUNT(DISTINCT(regexp_substr(str, '[^ ]+', 1, LEVEL))) as "COUNT"
     FROM sample_data
       CONNECT BY LEVEL <= regexp_count(str, ' ')+1
     /

     COUNT
----------
         3

See "Split single comma delimited string into rows in Oracle" to understand how the query works.


UPDATE

When the table holds multiple delimited strings, you need to control the number of rows generated by the CONNECT BY clause; otherwise the hierarchy is built across all rows of the table and produces many redundant rows.

See "Split comma delimited strings in a table in Oracle" for more ways of doing the same task.

Setup

Let's say you have a table with 3 rows like this:

SQL> CREATE TABLE t(country VARCHAR2(200));

Table created.

SQL> INSERT INTO t VALUES('japan singapore japan chinese chinese chinese');

1 row created.

SQL> INSERT INTO t VALUES('singapore indian malaysia');

1 row created.

SQL> INSERT INTO t VALUES('french french french');

1 row created.

SQL> COMMIT;

Commit complete.

SQL> SELECT * FROM t;

COUNTRY
---------------------------------------------------------------------------
japan singapore japan chinese chinese chinese
singapore indian malaysia
french french french

  • Using REGEXP_SUBSTR and REGEXP_COUNT:

We expect the output to be 6, since the three rows contain 6 unique words.

SQL> SELECT COUNT(DISTINCT(regexp_substr(t.country, '[^ ]+', 1, lines.column_value))) count
       FROM t,
         TABLE (CAST (MULTISET
         (SELECT LEVEL FROM dual
                 CONNECT BY LEVEL <= regexp_count(t.country, ' ')+1
         ) AS sys.odciNumberList ) ) lines
     /

     COUNT
----------
         6

There are other ways to get the same result:

  • Using XMLTABLE:

SQL> SELECT COUNT(DISTINCT(country)) COUNT
     FROM
       (SELECT trim(COLUMN_VALUE) country
       FROM t,
         xmltable(('"'
         || REPLACE(country, ' ', '","')
         || '"'))
       )
     /

     COUNT
----------
         6

  • Using MODEL clause:

SQL> WITH
       model_param AS
       (
        SELECT country AS orig_str ,
               ' '
               || country
               || ' '                                    AS mod_str ,
               1                                         AS start_pos ,
               LENGTH(country)                           AS end_pos ,
               (LENGTH(country) -
               LENGTH(REPLACE(country, ' '))) + 1        AS element_count ,
               0                                         AS element_no ,
               ROWNUM                                    AS rn
        FROM   t )
     SELECT COUNT(DISTINCT(SUBSTR(mod_str, start_pos, end_pos-start_pos))) count
     FROM (
           SELECT *
           FROM   model_param
           MODEL PARTITION BY (rn, orig_str, mod_str)
           DIMENSION BY (element_no)
           MEASURES (start_pos, end_pos, element_count)
           RULES ITERATE (2000)
           UNTIL (ITERATION_NUMBER+1 = element_count[0])
         ( start_pos[ITERATION_NUMBER+1] =
                   instr(cv(mod_str), ' ', 1, cv(element_no)) + 1,
           end_pos[ITERATION_NUMBER+1] =
                   instr(cv(mod_str), ' ', 1, cv(element_no) + 1) )
         )
     WHERE element_no != 0
     /

     COUNT
----------
         6

#2

Is the whole string stored in a single entry?

If not, try:

SELECT COUNT(*)
    FROM (SELECT DISTINCT T.country FROM Table1 T)

If yes, I would write an external program, for example in Java, to parse the string and return the result you want.

Create a set of strings.

Use JDBC to retrieve the records, and use split to break each string into tokens on the ' ' delimiter. For every token, if it is not already in the set, add it to the set.

When parsing finishes, the size of the set is the value you want.
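
A minimal sketch of the set-based approach might look like the following. The class name DistinctWordCounter and both method names are hypothetical, chosen only for illustration; the table Table1 and the column country come from the question. How the JDBC Connection is obtained is left out, since it depends on your environment.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DistinctWordCounter {

    // Counts the distinct space-separated tokens across all given strings.
    static int countDistinctWords(Iterable<String> rows) {
        Set<String> seen = new HashSet<>();
        for (String row : rows) {
            if (row == null) {
                continue; // a NULL column value contributes no words
            }
            // Split on runs of whitespace so repeated spaces do not yield empty tokens.
            for (String token : row.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    seen.add(token); // the set silently ignores tokens it already holds
                }
            }
        }
        return seen.size();
    }

    // Hypothetical helper: reads Table1.country over an already open JDBC connection.
    static int countDistinctWordsInTable(Connection conn) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT country FROM Table1")) {
            while (rs.next()) {
                rows.add(rs.getString(1));
            }
        }
        return countDistinctWords(rows);
    }

    public static void main(String[] args) {
        List<String> sample =
            Arrays.asList("japan singapore japan chinese chinese chinese");
        System.out.println(countDistinctWords(sample)); // prints 3
    }
}
```

Because Set.add ignores elements that are already present, the explicit "if it is not in the set" check from the description is not needed. Note that this pulls every row to the client, so the pure SQL approaches are likely preferable for large tables.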

#3

Split the string on the space delimiter and count the distinct words:

SELECT COUNT(DISTINCT regexp_substr(col, '[^ ]+', 1, LEVEL))
  FROM T
CONNECT BY LEVEL <= regexp_count(col, ' ')+1

To count the distinct words for each string separately:

SELECT col,
       COUNT(DISTINCT regexp_substr(col, '[^ ]+', 1, LEVEL))
  FROM T
CONNECT BY LEVEL <= regexp_count(col, ' ')+1
 GROUP BY col

