Splitting a CSV file with quotes as the text delimiter using String.split()

Time: 2021-10-23 03:05:23

I have a comma-separated file with many lines similar to the one below.

Sachin,,M,"Maths,Science,English",Need to improve in these subjects.

Quotes are used to escape the comma delimiter inside a field that holds multiple values.

How do I split the line above on the comma delimiter using String.split(), if that is possible at all?

3 answers

#1


147  

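import java.util.Arrays;
public class Main {
    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        // Split only on commas followed by an even number of quotes (commas outside quoted fields).
        String[] splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        System.out.println(Arrays.toString(splitted));
    }
}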

Output:

[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
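One caveat with the call above: String.split(regex) discards trailing empty strings, so a line that ends in empty fields loses columns. The following is a minimal sketch of one way around that, assuming you want to keep those fields. The class name QuotedSplitDemo, the helper splitLine and the sample line ending in two empty fields are made up for illustration; the regex is the one from the answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class QuotedSplitDemo {
    // Same lookahead as in the answer, compiled once so it can be reused for every line.
    static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: a limit of -1 tells split() to keep trailing empty fields.
    static String[] splitLine(String line) {
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    public static void main(String[] args) {
        // Illustrative line whose last two fields are empty.
        String line = "Sachin,,M,\"Maths,Science,English\",,";

        // Without a limit, the two trailing empty fields are dropped.
        System.out.println(line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)").length); // 4

        // With limit -1, all six fields are kept.
        System.out.println(splitLine(line).length); // 6
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```

Compiling the pattern once also avoids recompiling the regex for every line of the file.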

#2


10  

Because your requirements are not that complex, a custom method can do the job. In the measurement below it runs more than 20 times faster than the regex and produces the same result. The exact ratio varies with the data size and the number of rows parsed, and for more complicated problems regular expressions are a must.

import java.util.ArrayList;
import java.util.Arrays;

public class SplitTest {

    public static void main(String[] args) {

        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        String[] splitted = null;

        // Measure the regular expression
        long startTime = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        }
        long endTime = System.nanoTime();

        System.out.println("Took: " + (endTime - startTime));
        System.out.println(Arrays.toString(splitted));
        System.out.println("");

        // Measure the custom method
        ArrayList<String> sw = null;
        startTime = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            sw = customSplitSpecific(s);
        }
        endTime = System.nanoTime();

        System.out.println("Took: " + (endTime - startTime));
        System.out.println(sw);
    }

    // Splits on commas that are not inside double quotes; the quotes stay in the output.
    public static ArrayList<String> customSplitSpecific(String s) {
        ArrayList<String> words = new ArrayList<String>();
        boolean notInsideQuotes = true;
        int start = 0;
        // Scan every character, including the last one, so a trailing comma is handled too.
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && notInsideQuotes) {
                words.add(s.substring(start, i));
                start = i + 1;
            } else if (s.charAt(i) == '"') {
                notInsideQuotes = !notInsideQuotes;
            }
        }
        words.add(s.substring(start));
        return words;
    }
}

On my own computer this prints (times are in nanoseconds for 10 iterations each):

Took: 6651100
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]

Took: 224179
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
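These figures come from only 10 iterations on a cold JVM, and each s.split(...) call with this regex also compiles the pattern again, so the ratio probably overstates the steady-state difference. Below is a rough sketch of a somewhat fairer comparison, assuming a precompiled Pattern and a warm-up phase. The class name SplitTiming, the helper averageNanos and the iteration counts are made up for illustration, and a dedicated harness such as JMH would be more reliable than hand-rolled timing. No timings are shown because they depend on the machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitTiming {
    // Compiled once, so the measurement excludes regex compilation.
    static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hand-written splitter in the spirit of the answer: toggles a flag on each quote.
    static List<String> customSplit(String s) {
        List<String> words = new ArrayList<>();
        boolean insideQuotes = false;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            } else if (c == ',' && !insideQuotes) {
                words.add(s.substring(start, i));
                start = i + 1;
            }
        }
        words.add(s.substring(start));
        return words;
    }

    // Hypothetical helper: average nanoseconds per call over the given number of iterations.
    static long averageNanos(Runnable task, int iterations) {
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - begin) / iterations;
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        int warmup = 20_000;   // arbitrary warm-up count so the JIT can kick in
        int runs = 200_000;    // arbitrary measured count

        // Sanity check: both approaches should agree on this input.
        System.out.println(customSplit(s).equals(Arrays.asList(COMMA_OUTSIDE_QUOTES.split(s))));

        averageNanos(() -> COMMA_OUTSIDE_QUOTES.split(s), warmup);
        averageNanos(() -> customSplit(s), warmup);

        System.out.println("Regex (precompiled): "
                + averageNanos(() -> COMMA_OUTSIDE_QUOTES.split(s), runs) + " ns/call");
        System.out.println("Custom method:       "
                + averageNanos(() -> customSplit(s), runs) + " ns/call");
    }
}
```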

#3


6  

If your strings are all well-formed (every opening quote has a closing quote), it is possible with the following regular expression:

String[] res = str.split(",(?=([^\"]|\"[^\"]*\")*$)");

The expression ensures that a split occurs only at commas that are followed by an even number of quotes (possibly zero), which means the comma is not inside a quoted section.
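To make the even-quote rule concrete, here is a small sketch that applies it by hand and compares the result with the regex. The class EvenQuoteRule and its methods quotesAfter and splitByRule are hypothetical names used only for this illustration. Recounting the quotes for every comma is quadratic, so this is meant to explain the rule, not to replace the regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvenQuoteRule {
    // Counts the double quotes in s from index 'from' to the end of the string.
    static int quotesAfter(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Splits at every comma that has an even number of quotes after it.
    static List<String> splitByRule(String s) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAfter(s, i + 1) % 2 == 0) {
                fields.add(s.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(s.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        String str = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        String[] viaRegex = str.split(",(?=([^\"]|\"[^\"]*\")*$)");

        // The commas inside "Maths,Science,English" have one quote after them (odd), so they are skipped.
        System.out.println(splitByRule(str).equals(Arrays.asList(viaRegex))); // true
    }
}
```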

Nevertheless, it may be easier to use a simple non-regex parser.
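As a rough idea of what such a parser could look like, here is a minimal sketch of a character-by-character state machine. It makes two assumptions that are not in the question: the surrounding quotes are stripped from the returned fields, and a doubled quote ("") inside a quoted field stands for one literal quote, as in common CSV dialects. The names SimpleCsvLineParser and parseLine are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsvLineParser {
    // Parses one CSV line into fields. Commas inside double quotes do not split.
    // Assumptions: quotes are removed from the output, and "" inside quotes means a literal quote.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas included while inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        for (String field : parseLine(s)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike String.split() without a limit, this keeps trailing empty fields, because the last field is always added after the loop.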
