String Tokenizer: split a string on commas, ignoring commas inside double quotes

Date: 2021-10-21 03:54:21

I have a string like the one below:

value1, value2, value3, value4, "value5, 1234", value6, value7, "value8", value9, "value10, 123.23"

If I tokenize the string above, I get comma-separated tokens. But I would like to tell the string tokenizer to ignore commas inside double quotes when it splits. How can I do this?
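
To make the problem concrete, here is a minimal sketch of what a plain comma tokenizer does with this input. The class and method names (`NaiveTokenizeDemo`, `naiveTokens`) are made up for illustration and are not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NaiveTokenizeDemo {

    // Splits on every comma, with no awareness of double quotes.
    static List<String> naiveTokens(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "value1, value2, value3, value4, \"value5, 1234\", "
                + "value6, value7, \"value8\", value9, \"value10, 123.23\"";
        List<String> tokens = naiveTokens(input);
        System.out.println(tokens.size()); // 12, although there are only 10 fields
        System.out.println(tokens.get(4)); // "value5  (opening quote kept, value cut short)
        System.out.println(tokens.get(5)); // 1234"
    }
}
```

The quoted values are cut at their inner comma, which is the behavior the answers below work around.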

Thanks in advance

Shashi

6 Solutions

#1


6  

Use a CSV parser like OpenCSV to take care of things like commas in quoted elements, values that span multiple lines etc. automatically. You can use the library to serialize your text back as CSV as well.

import java.io.StringReader;
import com.opencsv.CSVReader; // OpenCSV 3+; older releases used au.com.bytecode.opencsv

String str = "value1, value2, value3, value4, \"value5, 1234\", " +
        "value6, value7, \"value8\", value9, \"value10, 123.23\"";

CSVReader reader = new CSVReader(new StringReader(str));

// readNext() returns one parsed line at a time, or null at end of input
String[] tokens;
while ((tokens = reader.readNext()) != null) {
    System.out.println(tokens[0]); // value1
    System.out.println(tokens[4]); // value5, 1234
    System.out.println(tokens[9]); // value10, 123.23
}

#2


2  

You just need one line and the right regex:

String[] values = input.replaceAll("^\"", "").split("\"?(,|$)(?=(([^\"]*\"){2})*[^\"]*$) *\"?");

This also neatly trims off the wrapping double quotes for you, including the final quote!

Note: there is an interesting edge case when the first term is quoted. It requires an extra step: trimming the leading quote with replaceAll().

Here's some test code:

String input= "\"value1, value2\", value3, value4, \"value5, 1234\", " +
    "value6, value7, \"value8\", value9, \"value10, 123.23\"";
String[] values = input.replaceAll("^\"", "").split("\"?(,|$)(?=(([^\"]*\"){2})*[^\"]*$) *\"?");
for (String s : values)
    System.out.println(s);

Output:

value1, value2
value3
value4
value5, 1234
value6
value7
value8
value9
value10, 123.23

#3


2  

I'm allergic to regex; why not double-split as someone suggested?

    String str = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";
    boolean quoted = false;
    for(String q : str.split("\"")) {
        if(quoted)
            System.out.println(q.trim());
        else
            for(String s : q.split(","))
                if(!s.trim().isEmpty())
                    System.out.println(s.trim());
        quoted = !quoted;
    }

#4


1  

You can use several approaches:

  1. Write code that searches for commas and maintains a state recording whether a particular comma is inside quotes or not.
  2. Tokenize by the double-quote symbol, and then tokenize the strings in the resulting array by the comma symbol (make sure you only tokenize the strings at indexes 0, 2, 4, etc., since they were not inside double quotes in the original string).
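
The second approach could be sketched roughly as follows. This is only an illustration: the names `DoubleSplit` and `split` are made up, and it assumes quotes are balanced and never escaped inside a value. Empty fields are dropped.

```java
import java.util.ArrayList;
import java.util.List;

class DoubleSplit {

    // Assumes balanced, unescaped double quotes in the input.
    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        String[] chunks = input.split("\"");
        for (int i = 0; i < chunks.length; i++) {
            if (i % 2 == 1) {
                // odd index: text that sat between a pair of quotes, keep it whole
                fields.add(chunks[i].trim());
            } else {
                // even index: text outside quotes, split it on commas
                for (String s : chunks[i].split(",")) {
                    if (!s.trim().isEmpty()) {
                        fields.add(s.trim());
                    }
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String f : split("value1, \"value5, 1234\", value6")) {
            System.out.println(f);
        }
        // prints:
        // value1
        // value5, 1234
        // value6
    }
}
```

The index parity also holds when the input starts with a quote, because String.split keeps the leading empty chunk in that case.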

#5


1  

Without any third-party library dependency, the following code can also parse the fields as required:

import java.util.*;

public class CSVSpliter {

  public static void main (String [] args) {
    String inputStr = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";

    List<String> splitStringList = new ArrayList<String> ();
    boolean insideDoubleQuotes = false;
    StringBuilder field = new StringBuilder ();

    for (int i = 0; i < inputStr.length(); i++) {
        char c = inputStr.charAt (i);
        if (c == '"' && !insideDoubleQuotes) {
            insideDoubleQuotes = true;
        } else if (c == '"' && insideDoubleQuotes) {
            // closing quote: the quoted field is complete
            insideDoubleQuotes = false;
            splitStringList.add (field.toString().trim());
            field.setLength(0);
        } else if (c == ',' && !insideDoubleQuotes) {
            // a comma outside quotes ends an unquoted field; nothing is
            // added when the comma directly follows a quoted field
            if (field.toString().trim().length() > 0) {
                splitStringList.add (field.toString().trim());
            }
            // clear the field for next word
            field.setLength(0);
        } else {
            field.append (c);
        }
    }
    // add a trailing unquoted field, which has no comma after it
    if (field.toString().trim().length() > 0) {
        splitStringList.add (field.toString().trim());
    }
    for (String str: splitStringList) {
        System.out.println ("Split fields: "+str);
    }
  }
}

This will give the following output:

Split fields: value1
Split fields: value2
Split fields: value3
Split fields: value4
Split fields: value5, 1234
Split fields: value6
Split fields: value7
Split fields: value8
Split fields: value9
Split fields: value10, 123.23

#6


0  

String delimiter = ",";

String v = "value1, value2, value3, value4, \"value5, 1234\", value6, value7, \"value8\", value9, \"value10, 123.23\"";

String[] a = v.split(delimiter + "(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)");
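
The lookahead in this pattern accepts a comma as a separator only when an even number of double quotes follows it up to the end of the input, meaning the comma sits outside any quoted value. The split itself leaves the surrounding spaces and quotes on each token, so a cleanup pass is still needed. Here is a rough sketch of one way to do that; the names `LookaheadSplit` and `split` are made up for illustration, and balanced, unescaped quotes are assumed.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSplit {

    // A comma counts as a separator only if an even number of double
    // quotes follows it up to the end of the input.
    private static final String SEPARATOR = ",(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)";

    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        for (String raw : input.split(SEPARATOR)) {
            String field = raw.trim();
            // strip one pair of surrounding quotes, if present
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                field = field.substring(1, field.length() - 1);
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String f : split("value4, \"value5, 1234\", value6")) {
            System.out.println(f);
        }
        // prints:
        // value4
        // value5, 1234
        // value6
    }
}
```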
