Java: splitting a comma-separated string but ignoring commas in quotes

Time: 2022-08-22 00:19:44

I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas, but I need to ignore commas in quotes. How can I do this? It seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or part of commonly used libraries like Apache Commons.)

the above string should split into:

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

9 answers

#1


375  

Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ //   start non-capturing group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.
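For readers who find the lookahead hard to follow, the rule "split on a comma only if an even number of quotes follows it" can also be written out by hand. The following is only a sketch of that rule, not a replacement for the regex: the class and method names (QuoteParitySplit, quotesAfter, split) are made up for illustration, and like the regex it rescans the rest of the string for every comma, so it is quadratic in the input length.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteParitySplit {
    // Counts the double quotes in s from index 'from' to the end of the string.
    static int quotesAfter(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Splits at every comma that is followed by an even number of quotes
    // (zero counts as even), which is what the lookahead expresses.
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAfter(s, i + 1) % 2 == 0) {
                tokens.add(s.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(s.substring(start)); // last token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(line)) {
            System.out.println("> " + t);
        }
    }
}
```

Both versions assume the quotes in the input are balanced and never escaped.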

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

#2


37  

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:

// requires java.util.ArrayList and java.util.List
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    char c = input.charAt(current);
    if (c == '\"') {
        inQuotes = !inQuotes; // toggle state
    } else if (c == ',' && !inQuotes) {
        result.add(input.substring(start, current));
        start = current + 1;
    }
}
// add the last token after the loop, so an input ending in a comma yields a final empty token
result.add(input.substring(start));

If you don't care about preserving the commas inside the quotes, you could simplify this approach (no start index to track, no special handling of the last token) by replacing the commas inside quotes with something else and then splitting at commas:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));

#3


20  

http://sourceforge.net/projects/javacsv/

https://github.com/pupi1985/JavaCSV-Reloaded (fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)

http://opencsv.sourceforge.net/

CSV API for Java

Can you recommend a Java library for reading (and possibly writing) CSV files?

Java lib or app to convert CSV to XML file?

#4


4  

I would not recommend the regex answer from Bart; I find a parsing solution better in this particular case (as F* proposed). I tried the regex solution and my own parsing implementation, and found that:

  1. Parsing is much faster than splitting with the lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
  2. The regex split, as called below without a limit argument, does not return the empty string after the last comma. That was not in the original question though; it was my own requirement.

My solution and test below.

String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;

start = System.nanoTime(); 
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
    switch (c) {
    case ',':
        if (inQuotes) {
            b.append(c);
        } else {
            tokensList.add(b.toString());
            b = new StringBuilder();
        }
        break;
    case '\"':
        inQuotes = !inQuotes;
        // intentional fall-through: the quote itself is appended by the default branch
    default:
        b.append(c);
        break;
    }
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;

System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);

Of course, you are free to change the switch to else-ifs in this snippet if its ugliness bothers you. Note the lack of a break after the quote case: the fall-through is intentional, so the quote character is still appended to the current token. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
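A side note on the trailing empty string mentioned in the list above: String#split(String) discards trailing empty strings, while the two-argument overload with a negative limit keeps them (which is why the accepted answer passes -1). A minimal sketch to check this with the same regex and test string; the class name SplitLimitDemo and the REGEX constant are just illustrative:

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same lookahead regex as in the accepted answer.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static void main(String[] args) {
        String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";

        // Default limit (0): trailing empty strings are removed from the result.
        String[] withoutLimit = tested.split(REGEX);
        // Negative limit: the pattern is applied as often as possible and
        // trailing empty strings are kept.
        String[] withLimit = tested.split(REGEX, -1);

        System.out.println(withoutLimit.length + " tokens: " + Arrays.toString(withoutLimit));
        System.out.println(withLimit.length + " tokens: " + Arrays.toString(withLimit));
    }
}
```

This only addresses the empty-token behaviour; it says nothing about the speed difference measured above.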

#5


2  

Try a lookaround like (?!\"),(?!\"). This should match commas (,) that are not surrounded by quotes (").

#6


2  

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

#7


2  

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

final static private Pattern splitSearchPattern = Pattern.compile("[\",]"); 
private List<String> splitByCommasNotInQuotes(String s) {
    if (s == null)
        return Collections.emptyList();

    List<String> list = new ArrayList<String>();
    Matcher m = splitSearchPattern.matcher(s);
    int pos = 0;
    boolean quoteMode = false;
    while (m.find())
    {
        String sep = m.group();
        if ("\"".equals(sep))
        {
            quoteMode = !quoteMode;
        }
        else if (!quoteMode && ",".equals(sep))
        {
            int toPos = m.start(); 
            list.add(s.substring(pos, toPos));
            pos = m.end();
        }
    }
    if (pos < s.length())
        list.add(s.substring(pos));
    return list;
}

(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
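One possible take on that exercise, as a rough sketch only. It assumes a backslash escapes whatever character follows it (inside or outside quotes), which the question never specified, and it leaves the escapes in the returned tokens untouched. The class name EscapeAwareSplit is made up, and unlike the method above it uses a plain character scan and keeps a trailing empty token.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplit {
    // Splits on commas outside double quotes.
    // Assumption: a backslash escapes the next character, so \" does not toggle quote mode.
    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        boolean inQuotes = false;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, whatever it is
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                tokens.add(s.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(s.substring(start)); // last token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        // The raw input here is: a,"b\"c,d",e
        for (String t : split("a,\"b\\\"c,d\",e")) {
            System.out.println("> " + t);
        }
    }
}
```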

#8


0  

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.

After you split on comma, replace all mapped identifiers with the original string values.
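A rough sketch of how that could look, under a few assumptions of mine: quotes are balanced and unescaped, the input never contains the placeholder text itself, and a list indexed by the placeholder number stands in for the string-to-string map. The class name PlaceholderSplit and the exact placeholder format (__IDENTIFIER_n__, with a trailing __ added for easy matching) are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");
    private static final Pattern PLACEHOLDER = Pattern.compile("__IDENTIFIER_(\\d+)__");

    static List<String> split(String s) {
        // Step 1: pull out each quoted group and leave a numbered placeholder behind.
        List<String> quoted = new ArrayList<>();
        StringBuffer masked = new StringBuffer();
        Matcher m = QUOTED.matcher(s);
        while (m.find()) {
            quoted.add(m.group());
            m.appendReplacement(masked, "__IDENTIFIER_" + (quoted.size() - 1) + "__");
        }
        m.appendTail(masked);

        // Step 2: split on plain commas, then put the original quoted text back.
        List<String> tokens = new ArrayList<>();
        for (String part : masked.toString().split(",", -1)) {
            StringBuffer restored = new StringBuffer();
            Matcher p = PLACEHOLDER.matcher(part);
            while (p.find()) {
                String original = quoted.get(Integer.parseInt(p.group(1)));
                // quoteReplacement keeps any '$' or '\' in the quoted text literal
                p.appendReplacement(restored, Matcher.quoteReplacement(original));
            }
            p.appendTail(restored);
            tokens.add(restored.toString());
        }
        return tokens;
    }
}
```

Calling PlaceholderSplit.split on the string from the question should give the four tokens listed there.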

#9


-1  

I would do something like this:

// currentString and currentStringIndex are assumed to be defined by the surrounding code
boolean foundQuote = false;

if (currentString.charAt(currentStringIndex) == '"')
{
   foundQuote = true;
}

if (foundQuote)
{
   // do nothing
}
else
{
   String[] split = currentString.split(",");
}
