Parsing CSV input with a RegEx in Java

Time: 2022-06-01 01:42:51

I know, now I have two problems. But I'm having fun!

I started with this advice: don't try to split, but instead match on what is an acceptable field. I expanded from there to this expression:

final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
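
The pattern above can be exercised by iterating over the matcher's results. This is only a minimal sketch: the class and method names (`FieldMatchDemo`, `fields`) are made up for illustration, and the sample line is assumed to have no spaces after the commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldMatchDemo {
    // The pattern from the question: a quoted run, or text delimited by
    // commas or the start/end of the line.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Collects the full text of every match; quotes are still included here.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String f : fields(line)) {
            // Brackets make the empty field visible in the output.
            System.out.println("[" + f + "]");
        }
    }
}
```

Printing each match in brackets makes the empty field between the two adjacent commas visible.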

The expression looks like this without the annoying escaped quotes:

"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)

This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,

the quick,"brown, fox jumps",over,"the",,"lazy dog"

breaks down into

the quick
"brown, fox jumps"
over
"the"

"lazy dog"

Great! Now I want to drop the quotes, so I added the lookahead and lookbehind non-capturing groups like I was doing for the commas.

final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

Again, the expression is:

(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)

Instead of the desired result

the quick
brown, fox jumps
over
the

lazy dog

now I get this breakdown:

the quick
"brown
 fox jumps"
,over,
"the"
,,
"lazy dog"

What am I missing?
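
The breakdown can be reproduced, together with the alternative that produced each match and its offsets, by a sketch along these lines. `LookaroundTraceDemo` and `trace` are illustrative names only, and the sample line is assumed to have no spaces after the commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTraceDemo {
    // The lookaround version of the pattern from the question.
    static final Pattern FIELD =
            Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

    // Describes each match as "<alternative> <start>-<end> [<text>]".
    static List<String> trace(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Group 1 is non-null only when the quote-lookaround alternative matched.
            String branch = m.group(1) != null ? "quote" : "comma";
            out.add(branch + " " + m.start() + "-" + m.end() + " [" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String row : trace(line)) {
            System.out.println(row);
        }
    }
}
```

On this input the sketch reports `,over,` and `,,` as matches of the first alternative: each sits between the closing quote of one field and the opening quote of the next, and the lookarounds leave those quotes unconsumed.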

4 solutions

#1


8  

Operator precedence. Alternation (|) binds more loosely than anything else in the expression, so it helps to group each alternative explicitly and make clear which lookarounds belong to which branch.

Try:

(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)

#2


5  

(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)

This should do what you want.

Explanation:

(?:^|,)\s*

The match should start at a comma or at the beginning of the string. Also, ignore any whitespace at the start of the field.

Look ahead to see whether the rest starts with a quote:

(?:(?=")"([^"].*?)")

If it does, match non-greedily up to the next quote.

(?:(?!")(.*?))

If it does not begin with a quote, match non-greedily up to the next comma or the end of the string.

(?=,|$)

The match should be followed by a comma or the end of the string.
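
Put together in Java, the pieces explained above might be used roughly as follows. This is a sketch, not a definitive implementation: `QuotedOrBareDemo` and `fields` are hypothetical names, and the field value is read from group 1 (quoted branch) or group 2 (unquoted branch).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedOrBareDemo {
    // The full expression from this answer, escaped for a Java string literal.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|,)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Group 1 holds the text inside quotes; group 2 holds an unquoted field.
            String quoted = m.group(1);
            out.add(quoted != null ? quoted : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String f : fields(line)) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Because `[^"].*?` requires at least one character between the quotes, an empty quoted field (`""`) is not matched by the quoted branch.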

#3


4  

When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.

package regex.parser;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSVParser {

    /*
     * This Pattern will match on either quoted text (group 1, quotes excluded) or
     * text between commas (group 2), including whitespace, and accounting for
     * beginning and end of line. The trailing comma is consumed by a non-capturing
     * group instead of a lookahead.
     */
    private static final Pattern CSV_PATTERN =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");

    public String[] parse(String csvLine) {
        List<String> allMatches = new ArrayList<>();
        Matcher matcher = CSV_PATTERN.matcher(csvLine);
        while (matcher.find()) {
            // Group 1 is non-null only when the quoted alternative matched.
            String quoted = matcher.group(1);
            allMatches.add(quoted != null ? quoted : matcher.group(2));
        }
        return allMatches.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String lineInput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";

        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineInput);
        for (String s : myCSV.parse(lineInput)) {
            System.out.println(s);
        }
    }

}

#4


1  

I know this isn't what the OP wants, but for other readers, one of the String.replace methods could be used to strip the quotes from each element in the result array of the OP's current regex.
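
A minimal sketch of that idea, assuming the original pattern from the question and a sample line without spaces after the commas. `StripQuotesDemo` and `stripped` are illustrative names, not part of anyone's posted code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesDemo {
    // The OP's original pattern, which leaves the quotes in the matched text.
    static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    static List<String> stripped(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // String.replace(CharSequence, CharSequence) removes every quote character.
            out.add(m.group().replace("\"", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String f : stripped(line)) {
            System.out.println("[" + f + "]");
        }
    }
}
```

`String.replace` removes every quote in the element, not only the enclosing pair, which is fine for the sample line.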
