Parsing a table using regex - Java

Date: 2021-12-12 01:23:00

I'm parsing the following AWS instance cost table:

m1.small    1   1   1.7     1 x 160    $0.044 per Hour
m1.medium   1   2   3.75    1 x 410    $0.087 per Hour
m1.large    2   4   7.5     2 x 420    $0.175 per Hour
m1.xlarge   4   8   15      4 x 420    $0.35 per Hour

The costs are stored in a file, which I read line by line:

Scanner input = new Scanner(file);
String[] values;
while (input.hasNextLine()) {
    String line = input.nextLine();
    values = line.split("\\s+"); // <-- not what I want...
    for (String v : values)
        System.out.println(v);
}

However, that gives me:

m1.small
1
1
1.7
1
x
160
$0.044
per
Hour

which is not what I want. Correctly parsed values (with the right regex) would look like this:

['m1.small', '1', '1', '1.7', '1 x 160', '$0.044', 'per Hour']

What would be the right regex to obtain this result? You can assume the table will always follow the same pattern.

3 answers

#1

Split on one or more spaces, but only where the spaces appear in one of the contexts below.

DIGIT - SPACES - NOT "x"

or

NOT "x" - SPACES - DIGIT

    values = line.split("(?<=\\d)\\s+(?=[^x])|(?<=[^x])\\s+(?=\\d)");
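
To see the lookaround split in action, here is a minimal runnable sketch. The class name `LookaroundSplitDemo` and the helper `splitRow` are made up for illustration; only the regex comes from the answer. Note that it appears to rely on the single spaces around the "x" in the sample rows; with wider gaps there, the first alternative could still find a place to split.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Regex from the answer: whitespace with a digit on one side
    // and a character other than "x" on the other side.
    static final String SEPARATOR = "(?<=\\d)\\s+(?=[^x])|(?<=[^x])\\s+(?=\\d)";

    // Hypothetical helper: splits one table row into its fields.
    static String[] splitRow(String line) {
        return line.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String line = "m1.small    1   1   1.7     1 x 160    $0.044 per Hour";
        // prints [m1.small, 1, 1, 1.7, 1 x 160, $0.044, per Hour]
        System.out.println(Arrays.toString(splitRow(line)));
    }
}
```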

#2

Try this fiddle https://regex101.com/r/sP6zW5/1

([^\s]+)\s+(\d+)\s+(\d+)\s+([\d\.]+)\s+(\d+ x \d+)\s+(\$\d+\.\d+)\s+(per \w+)

Match the text against this pattern; the capture groups are your list.

I think using split is too complicated in your case. If the text always has the same layout, matching it against a pattern is simply the reverse of string formatting.
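
The answer gives only the pattern, so here is one way it might be used from Java. This is a sketch: the class name `RowPatternDemo` and the method `parseRow` are assumptions for illustration, and the pattern is the one above with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowPatternDemo {
    // The pattern from the answer, escaped for a Java string literal.
    static final Pattern ROW = Pattern.compile(
        "([^\\s]+)\\s+(\\d+)\\s+(\\d+)\\s+([\\d\\.]+)\\s+(\\d+ x \\d+)\\s+(\\$\\d+\\.\\d+)\\s+(per \\w+)");

    // Hypothetical helper: returns the seven captured fields,
    // or an empty list if the line does not match the pattern.
    static List<String> parseRow(String line) {
        Matcher m = ROW.matcher(line.trim());
        List<String> fields = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.add(m.group(i));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [m1.small, 1, 1, 1.7, 1 x 160, $0.044, per Hour]
        System.out.println(parseRow("m1.small    1   1   1.7     1 x 160    $0.044 per Hour"));
    }
}
```

Returning an empty list for a non-matching line is just one choice; throwing an exception would work equally well if malformed rows should not be skipped silently.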

#3

If you want to use a regular expression, you'd do this:

        String s = "m1.small    1   1   1.7     1 x 160    $0.044 per Hour";
        String spaces = "\\s+";
        String type = "(.*?)";
        String intNumber = "(\\d+)";
        String doubleNumber = "([0-9.]+)";
        String dollarNumber = "([$0-9.]+)";
        String aXb = "(\\d+ x \\d+)";
        String rest = "(.*)";

        Pattern pattern = Pattern.compile(type + spaces + intNumber + spaces + intNumber + spaces + doubleNumber
                + spaces + aXb + spaces + dollarNumber + spaces + rest);
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            String[] fields = new String[] { matcher.group(1), matcher.group(2), matcher.group(3), matcher.group(4),
                    matcher.group(5), matcher.group(6), matcher.group(7) };
            System.out.println(Arrays.toString(fields));
        }

Notice how I've broken up the regular expression to be readable. (As one long String, it is hard to read/maintain.) There's another way of doing it though. Since you know which fields are being split, you could just do this simple split and build a new array with the combined values:

        String[] allFields = s.split("\\s+");
        String[] result = new String[] { 
            allFields[0], 
            allFields[1],
            allFields[2],
            allFields[3],
            allFields[4] + " " + allFields[5] + " " + allFields[6],         
            allFields[7], 
            allFields[8] + " " + allFields[9] };
        System.out.println(Arrays.toString(result));
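
The split-and-merge idea can be applied to the whole table, as in the sketch below. It assumes every data row has exactly ten whitespace-separated tokens, as in the sample; the class name `TableSplitDemo` and the methods `mergeFields` and `parseTable` are hypothetical, and the table is passed as a string here rather than read from the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class TableSplitDemo {
    // Assumption: each row splits into exactly ten tokens on whitespace.
    static String[] mergeFields(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length != 10) {
            throw new IllegalArgumentException(
                "Expected 10 tokens but found " + t.length + ": " + line);
        }
        return new String[] {
            t[0], t[1], t[2], t[3],
            t[4] + " " + t[5] + " " + t[6], // e.g. "1 x 160"
            t[7],
            t[8] + " " + t[9]               // e.g. "per Hour"
        };
    }

    // Parses every non-blank line of the table text.
    static List<String[]> parseTable(String table) {
        List<String[]> rows = new ArrayList<>();
        try (Scanner input = new Scanner(table)) {
            while (input.hasNextLine()) {
                String line = input.nextLine();
                if (!line.trim().isEmpty()) {
                    rows.add(mergeFields(line));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String table =
            "m1.small    1   1   1.7     1 x 160    $0.044 per Hour\n"
          + "m1.medium   1   2   3.75    1 x 410    $0.087 per Hour\n";
        for (String[] row : parseTable(table)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The token-count check makes the fixed-index merge fail loudly on a row with an unexpected layout instead of producing shifted fields.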
