Having multiple regexes in a Java 8 stream to read text from a line

Time: 2023-01-27 23:49:20

I want to use more than one regex, as shown below. How can I add them to the flatMap step so that all matching values of each line are put into a List during a single read of the stream?

static String reTimeStamp="((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
// The quotes are matched but not captured, so that group 1 is the host name.
static String reHostName="host=\"((?:[a-z][a-z\\.\\d\\-]+)\\.(?:[a-z][a-z\\-]+))(?![\\w\\.])\"";
static String reServiceTime="service=(\\d+)ms";

// PatternStreamer is the helper class from the answer to the linked question.
private static final PatternStreamer quoteRegex1 = new PatternStreamer(reTimeStamp);
private static final PatternStreamer quoteRegex2 = new PatternStreamer(reHostName);
private static final PatternStreamer quoteRegex3 = new PatternStreamer(reServiceTime);


public static void main(String[] args) throws Exception {
    String inFileName = "Sample.log";
    String outFileName = "Sample_output.log";
    try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
        //stream.forEach(System.out::println);
        List<String> timeStamp = stream.flatMap(quoteRegex1::results)
                                    .map(r -> r.group(1))
                                    .collect(Collectors.toList());

        timeStamp.forEach(System.out::println);
        //Files.write(Paths.get(outFileName), dataSet);
    }
}

This question is an extension of "Match a pattern and write the stream to a file using Java 8 Stream".

1 answer

#1


3  

You can simply concatenate the streams:

String inFileName = "Sample.log";
String outFileName = "Sample_output.log";
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> Stream.concat(quoteRegex1.results(s),
                        Stream.concat(quoteRegex2.results(s), quoteRegex3.results(s))))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

Note, however, that this performs three separate searches through each line. That may not only cost performance; it also means that the order of the matches within one line will not reflect their order of occurrence. This doesn’t seem to be an issue with your patterns, but separate searches can even produce overlapping matches.
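
Both effects are easy to reproduce with plain java.util.regex. The following is only a minimal sketch: the class SeparateSearchDemo and the method findAllSeparately are hypothetical names, and the method imitates the concatenation above by running one complete search per pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the code in the question or the answer.
final class SeparateSearchDemo {
    /** Runs one full search per pattern and appends the matched text, pattern by pattern. */
    static List<String> findAllSeparately(String line, Pattern... patterns) {
        List<String> found = new ArrayList<>();
        for (Pattern p : patterns) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Order: "b" occurs first in the input but is reported last.
        System.out.println(findAllSeparately("b a",
                Pattern.compile("a"), Pattern.compile("b")));      // prints [a, b]
        // Overlap: both reported matches share the "ll" of "hello".
        System.out.println(findAllSeparately("hello",
                Pattern.compile("hell"), Pattern.compile("llo"))); // prints [hell, llo]
    }
}
```

The result is grouped by pattern rather than by position, which is exactly the ordering that Stream.concat produces for each line.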

The PatternStreamer of that linked answer also eagerly collects all matches of one string into an ArrayList before creating a stream. A Spliterator-based solution like the one in this answer is preferable.
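
For a single pattern, such a Spliterator-based stream might look roughly like the sketch below. It is not the code of either linked answer; the class name LazyMatches and its factory method of are assumptions made for illustration. The point is that find() runs only when the stream requests the next element, so no intermediate list is built.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, sketched for illustration only.
final class LazyMatches {
    /** Streams the matches of p in input; find() is called once per requested element. */
    static Stream<MatchResult> of(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        Spliterator<MatchResult> sp = new Spliterators.AbstractSpliterator<MatchResult>(
                input.length(), Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super MatchResult> action) {
                if (!m.find()) return false;
                // toMatchResult() takes a snapshot that stays valid after the next find()
                action.accept(m.toMatchResult());
                return true;
            }
        };
        return StreamSupport.stream(sp, false);
    }

    public static void main(String[] args) {
        Pattern serviceTime = Pattern.compile("service=(\\d+)ms");
        // findFirst() stops after one element, so the second match is never searched for.
        LazyMatches.of(serviceTime, "service=12ms service=7ms")
                .map(r -> r.group(1))
                .findFirst()
                .ifPresent(System.out::println); // prints 12
    }
}
```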

Since numerical group references preclude simply combining the patterns in a (pattern1|pattern2|pattern3) manner, truly streaming over the matches of multiple different patterns is a bit more elaborate:
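
To see why the numbered groups get in the way, consider the following small sketch. The class CombinedGroupsDemo, the method firstGroups and the two simplified patterns are made up for illustration and are not the patterns from the question. Each pattern exposes its value as group 1 on its own, but inside one alternation the groups are numbered across the whole expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class with simplified patterns.
final class CombinedGroupsDemo {
    /** Collects group(1) of every match; an entry is null where group 1 took no part in the match. */
    static List<String> firstGroups(Pattern p, String line) {
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) values.add(m.group(1));
        return values;
    }

    public static void main(String[] args) {
        String service = "service=(\\d+)ms"; // value is group 1 when used alone
        String host = "host=(\\w+)";          // value is group 1 when used alone
        Pattern combined = Pattern.compile(service + "|" + host);
        // In the alternation the host value has become group 2, so group(1) is null for it.
        System.out.println(firstGroups(combined, "host=web1 service=12ms")); // prints [null, 12]
    }
}
```

A uniform r.group(1) therefore no longer works once the patterns are merged into a single expression.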

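import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public final class MultiPatternSpliterator
extends Spliterators.AbstractSpliterator<MatchResult> {
    // convenience overload; compiles the pattern strings on every call
    public static Stream<MatchResult> matches(String input, String... patterns) {
        return matches(input, Arrays.stream(patterns)
                .map(Pattern::compile).toArray(Pattern[]::new));
    }
    public static Stream<MatchResult> matches(String input, Pattern... patterns) {
        return StreamSupport.stream(new MultiPatternSpliterator(patterns,input), false);
    }
    private final Pattern[] pattern;
    private final String input;
    private int pos; // end of the last reported match
    // matchers that currently hold a match, ordered by the start of that match
    private PriorityQueue<Matcher> pendingMatches;

    MultiPatternSpliterator(Pattern[] p, String inputString) {
        // the input length serves as the size estimate
        super(inputString.length(), ORDERED|NONNULL);
        pattern = p;
        input = inputString;
    }

    @Override
    public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(pendingMatches == null) {
            // first call: look up the first match of every pattern
            // (PriorityQueue rejects an initial capacity below 1)
            pendingMatches = new PriorityQueue<>(
                Math.max(1, pattern.length), Comparator.comparingInt(MatchResult::start));
            for(Pattern p: pattern) {
                Matcher m = p.matcher(input);
                if(m.find()) pendingMatches.add(m);
            }
        }
        MatchResult mr = null;
        do {
            Matcher m = pendingMatches.poll(); // matcher with the earliest pending match
            if(m == null) return false; // no pattern has a match left
            if(m.start() >= pos) { // does not overlap the last reported match
                mr = m.toMatchResult();
                pos = mr.end();
            }
            // continue behind the last reported match; a skipped (overlapping) match is dropped
            if(m.region(pos, m.regionEnd()).find()) pendingMatches.add(m);
        } while(mr == null);
        action.accept(mr);
        return true;
    }
}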

This facility allows matching multiple patterns in a (pattern1|pattern2|pattern3) fashion while still keeping the original groups of each pattern. So when searching for hell and llo in hello, it will find hell and not llo. One difference is that there is no guaranteed order if more than one pattern matches at the same position.

It can be used like this:

Pattern[] p = Stream.of(reTimeStamp, reHostName, reServiceTime)
        .map(Pattern::compile)
        .toArray(Pattern[]::new);
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> MultiPatternSpliterator.matches(s, p))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

While the overloaded method would allow calling MultiPatternSpliterator.matches(s, reTimeStamp, reHostName, reServiceTime) to create a stream directly from the pattern strings, this should be avoided within a flatMap operation, because it would recompile every regex for every input line. That’s why the code above compiles all patterns into an array first. Your original code does the same by instantiating the PatternStreamers outside the stream operation.
