How do I parse the following log?

I need to parse a log in the following format:

===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8

I only need to extract each item's title (the line right after ===== Item 5484/14800 =====) and the result.
So I need to keep only the line with the item title and the result for that title, and discard everything else.
The issue is that sometimes an item has notes (maximum 3) and sometimes the result is displayed without additional notes, which makes it tricky.
Any help would be appreciated. I'm writing the parser in Python and don't need the actual code, just some pointers on how I could achieve this.

LE: The result I'm looking for is to discard everything else and get something like:

('This is the item title','Foo')
then
('This is this items title','Bar')

9 solutions

#1

1) Loop through every line in the log

    a) If the line matches the appropriate regex:

      Display/store the next line as the item title.
      Look for the next line containing "Result XXXX."
      and parse out that result for inclusion in the
      result set.

EDIT: added a bit more now that I see the result you're looking for.
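
For what it's worth, the steps above could be sketched roughly like this in Python 3. The two patterns and the names parse_log and sample_log are my own assumptions based on the sample log in the question, not something the log format guarantees:

```python
import re

# Assumed patterns, inferred from the sample log in the question.
HEADER_RE = re.compile(r"^===== Item \d+/\d+.*=====$")
RESULT_RE = re.compile(r"Result (\w+)\.")


def parse_log(lines):
    """Return a list of (title, result) tuples from an iterable of log lines."""
    results = []
    title = None
    expect_title = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if HEADER_RE.match(line):
            # The line after a header is the item title.
            expect_title = True
        elif expect_title:
            title = line
            expect_title = False
        else:
            # Any other line: keep it only if it carries a result.
            m = RESULT_RE.search(line)
            if m and title is not None:
                results.append((title, m.group(1)))
    return results


sample_log = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7"""

for pair in parse_log(sample_log.splitlines()):
    print(pair)
```

Because the "Update" headers repeat the same title, the title is simply overwritten with the same value until the result line shows up.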

#2

I know you didn't ask for real code but this is too great an opportunity for a generator-based text muncher to pass up:

# data is a multiline string containing your log, but this
# function could be easily rewritten to accept a file handle.
def get_stats(data):

   title = ""
   grab_title = False

   for line in data.split('\n'):
      if line.startswith("====="):
         grab_title = True
      elif grab_title:
         grab_title = False
         title = line
      elif line.startswith("Test finished."):
         start = line.index("Result") + 7
         end   = line.index("Time")   - 2
         yield (title, line[start:end])


for d in get_stats(data):
   print(d)


# Returns:
# ('This is the item title', 'Foo')
# ('This is this items title', 'Bar')
# ('This is the title of this item', 'FooBar')

Hopefully this is straightforward enough. Do ask if you have questions on how exactly the above works.
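
The comment at the top of the snippet mentions that the function could take a file handle instead of a string. A possible Python 3 sketch of that variant follows; the name get_stats_from_file and the io.StringIO demo are assumptions for illustration, and any object that yields lines (such as an open file) should work the same way:

```python
import io


def get_stats_from_file(handle):
    """Yield (title, result) tuples from a file-like object, line by line."""
    title = ""
    grab_title = False
    for line in handle:
        line = line.rstrip("\r\n")  # lines read from a file keep their newline
        if line.startswith("====="):
            grab_title = True
        elif grab_title:
            grab_title = False
            title = line
        elif line.startswith("Test finished."):
            start = line.index("Result") + len("Result ")
            end = line.index(". Time")
            yield (title, line[start:end])


# Stand-in for open('your.log'); the log text is taken from the question.
demo = io.StringIO(
    "===== Item 5484/14800  =====\n"
    "This is this items title\n"
    "Info: some note\n"
    "Test finished. Result Bar. Time 4 secunds.\n"
    "Stats: CPU 0.9 MEM 4.7\n"
)
for pair in get_stats_from_file(demo):
    print(pair)
```

Iterating over the handle reads one line at a time, so the whole log never has to sit in memory.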

#3

Maybe something like (log.log is your file):

def doOutput(s): # process or store data
    print(s)

s=''
for line in open('log.log').readlines():
    if line.startswith('====='):
        if len(s):
            doOutput(s)
            s=''
    else:
        s+=line
if len(s):
    doOutput(s)

#4

I would recommend starting a loop that looks for the "===" in the line. Let that key you off to the Title which is the next line. Set a flag that looks for the results, and if you don't find the results before you hit the next "===", say no results. Else, log the results with the title. Reset your flag and repeat. You could store the results with the Title in a dictionary as well, just store "No Results" when you don't find the results between the Title and the next "===" line.

This looks pretty simple to do based on the output.
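
A rough Python 3 sketch of the flag-and-dictionary idea described above. The function name collect_results is hypothetical, and the sketch assumes titles are unique enough to serve as dictionary keys:

```python
def collect_results(lines):
    """Map each title to its result, or to "No Results" if none was found."""
    results = {}
    title = None
    want_title = False   # set right after a "=====" line
    found = False        # set once a result was seen for the current block
    for raw in lines:
        line = raw.strip()
        if line.startswith("====="):
            # Closing the previous block: record a placeholder if it had no result.
            if title is not None and not found:
                results.setdefault(title, "No Results")
            want_title = True
            found = False
        elif want_title:
            title = line
            want_title = False
        elif (title is not None and line.startswith("Test finished.")
              and "Result " in line):
            # Take the text between "Result " and the following period.
            results[title] = line.split("Result ", 1)[1].split(".", 1)[0]
            found = True
    # The last block has no following "=====" line, so close it here.
    if title is not None and not found:
        results.setdefault(title, "No Results")
    return results
```

Note that the "Update" blocks repeat the title without a result, so the placeholder is written first and then overwritten once the real result line appears.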

#5

Regular expression with group matching seems to do the job in python:

import re

data = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""


p = re.compile(r"^=====[^=]*=====\n(.*)$\nInfo: .*\n.*Result ([^.]*)\.",
               re.MULTILINE)
for m in p.finditer(data):
    print("title:", m.group(1), "result:", m.group(2))

If you need more info about regular expressions, check the Python docs.

#6

This is sort of a continuation of maciejka's solution (see the comments there). If the data is in the file daniels.log, then we could go through it item by item with itertools.groupby, and apply a multi-line regexp to each item. This should scale fine.

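import itertools, re

p = re.compile(r"Result ([^.]*)\.", re.MULTILINE)
with open('daniels.log') as log:
    # Group consecutive lines by whether they are item headers or not.
    for sep, item in itertools.groupby(log,
                                       lambda x: x.startswith('===== Item ')):
        if not sep:
            # First non-header line of the group is the title.
            title = next(item).strip()
            m = p.search(''.join(item))
            if m:
                print((title, m.group(1)))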

#7

You could try something like this (in C-like pseudocode, since I don't know Python):

string line=getline();
regex boundary="^===== [^=]+ =====$";
regex info="^Info: (.*)$";
regex test_data="Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$";
regex stats="Stats: (.*)$";
while(!eof())
{
  // sanity check
  test line against boundary, if they don't match, throw exception

  string title=getline();

  while(1)
  {  
    // end the loop if we finished the data
    if(eof()) break;

    line=getline();
    test line against boundary, if they match, break
    test line against info, if they match, load the first matched group into "info"
    test line against test_data, if they match, load the first matched group into "test_result", load the 2nd matched group into "result", load the 3rd matched group into "time"
    test line against stats, if they match, load the first matched group into "statistics"
  }

  // at this point you can use the variables set above to do whatever with a line
  // for example, you want to use title and, if set, test_result/result/time.

}

#8

Parsing is not done using regex. If you have reasonably well-structured text (which it looks like you do), you can use faster tests (e.g. line.startswith() or similar). A list of dictionaries seems to be a suitable data type for such key-value pairs. Not sure what else to tell you. This seems pretty trivial.

OK, so the regexp way proved to be more suitable in this case:

import re
re.findall("=\n(.*)\n", s)

is faster than list comprehensions

[item.split('\n', 1)[0] for item in s.split('=\n')]

Here's what I got:

>>> len(s)
337000000
>>> test(get1, s) #list comprehensions
0:00:04.923529
>>> test(get2, s) #re.findall()
0:00:02.737103

Lesson learned.

#9

Here's some not-so-good-looking Perl code that does the job. Perhaps you can find it useful in some way. It's a quick hack, and there are other ways of doing it (I feel that this code needs defending).

#!/usr/bin/perl -w
#
# $Id$
#

use strict;
use warnings;

my @ITEMS;
my $item;
my $state = 0;

open(FD, "< data.txt") or die "Failed to open file.";
while (my $line = <FD>) {
    $line =~ s/(\r|\n)//g;
    if ($line =~ /^===== Item (\d+)\/\d+/) {
        my $item_number = $1;
        if ($item) {
            # Just to make sure we don't have two lines in a row that seem to be headlines.
            # If we have an item but haven't set the title, it means two headlines matched in a row.
            die "Something seems to be wrong, better safe than sorry. Line $. : $line\n" if (not $item->{title});
            # If we have a new item number, add the previous item and create a new one.
            if ($item_number != $item->{item_number}) {
                push(@ITEMS, $item);
                $item = {};
                $item->{item_number} = $item_number;
            }
        } else {
            # First entry, don't have an item.
            $item = {}; # Create new item.
            $item->{item_number} = $item_number;
        }
        $state = 1;
    } elsif ($state == 1) {
        die "Data must start with a headline." if (not $item);
        # If we already have a title make sure it matches.
        if ($item->{title}) {
            if ($item->{title} ne $line) {
                die "Title doesn't match for item " . $item->{item_number} . ", line $. : $line\n";
            }
        } else {
            $item->{title} = $line;
        }
        $state++;
    } elsif (($state == 2) && ($line =~ /^Info:/)) {
        # Just make sure that for state 2 we have a line that matches Info.
        $state++;
    } elsif (($state == 3) && ($line =~ /^Test finished\. Result ([^.]+)\. Time \d+ secunds{0,1}\.$/)) {
        $item->{status} = $1;
        $state++;
    } elsif (($state == 4) && ($line =~ /^Stats:/)) {
        $state++; # After Stats we must have a new item or we should fail.
    } else {
        die "Invalid data, line $.: $line\n";
    }
}
# Need to take care of the last item too.
push(@ITEMS, $item) if ($item);
close FD;

# Loop our items and print the info we stored.
for $item (@ITEMS) {
    print $item->{item_number} . " (" . $item->{status} . ") " . $item->{title} . "\n";
}
