Conditionally searching a file in both directions

Time: 2023-01-11 08:54:08

I have a log file written to by several instances of a cgi script. I need to extract certain information, with the following typical workflow:

  1. search for the first occurrence of RequestString
  2. extract the PID from that log line
  3. search backwards for the first occurrence of PID<separator>ConnectionString, to identify the client that initiated the request
  4. do something with ConnectionString, then repeat the search starting after that RequestString

What is the best way to do this? I was thinking of writing a Perl script that caches the last N lines and then matches through those cached lines to perform the backward search for the ConnectionString.

Is there a better way to do this, such as an extended regex that would do exactly this?

Sample with line numbers for reference -- not part of the file:

1 date    pid1    ConnectionString1
2 date    pid2    ConnectionString2
3 date    pid3    ConnectionString3
4 date    pid2    SomeOutput2
5 date    pid2    SomeOutput2
6 date    pid4    ConnectionString4
7 date    pid3    SomeOutput3
8 date    pid4    RequestString4
9 date    pid1    SomeOutput1
10 date    pid1    ConnectionString1
11 date    pid1    RequestString1
12 date    pid5    RequestString5

When I grep through this sample file, I wish for the following to match:

  • line 8, paired with line 6
  • line 11, paired with line 10 (and not with line 1)

Specifically, the following shouldn't be matched:

  • line 12, because no matching ConnectionString with that pid is found (pid5)
  • line 1, because there is a new ConnectionString for that pid (line 10) before the next RequestString for that pid. (Imagine that the first connection attempt failed before logging the RequestString.)
  • any of the lines from pid2/pid3, because they don't have a RequestString logged.
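
The pairing rules above can also be sketched as a single forward pass, instead of searching backwards from each RequestString. This is only a sketch under assumptions the question does not confirm: fields are whitespace-separated in the order date, pid, message, the date contains no spaces, and the function name pair_requests is made up for illustration. It remembers the most recent ConnectionString per pid and reports a pair whenever a RequestString for a known pid shows up.

```shell
#!/bin/bash
# Sketch only: assumes each log line is "date pid message" with no spaces
# inside the date. pair_requests is a hypothetical helper name.
pair_requests() {
    awk '
        # Remember the newest ConnectionString (and its line number) per pid.
        $3 ~ /^ConnectionString/ { conn[$2] = $3; line[$2] = NR; next }

        # A RequestString is reported only if its pid has connected before.
        $3 ~ /^RequestString/ && ($2 in conn) {
            printf "%d\t%d\t%s\n", NR, line[$2], conn[$2]
        }
    ' "$1"
}
```

Each output line holds the request line number, the paired connection line number, and the connection message, separated by tabs. On the 12-line sample (without the reference line numbers) this reports request line 8 with connection line 6 and request line 11 with connection line 10, and nothing for pid5. Memory use grows with the number of distinct pids, not with the number of ConnectionStrings.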

I could imagine writing a regex, with the option for . to match \n:

((pid\d)\s*(ConnectionString\d))(?!\1).*\2\s*RequestString\d

and then using \3 to identify the client.

However, there are far more ConnectionStrings than RequestStrings (perhaps 1000 to 10000 times as many), so my intuition was to find the RequestString first and then search backwards.

I guess I could play with a lookbehind, (?<=...), but the distance between a ConnectionString and its RequestString is essentially arbitrary. Will that work well?

1 solution

#1

Something along these lines:

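#!/bin/bash
# Find and number all RequestStrings, then loop through them
grep -n RequestString file | while IFS=":" read -r n string; do
   echo "$n,$string"    # Debug: line number and matching line
   # Reverse the file up to line n and take the nearest Connection line
   head -n "$n" file | tail -r | grep -m1 Connection
done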

Output

4,RequestString 1
6189:Connection
7,RequestString 2
7230:Connection
9,RequestString 3
8280:Connection

with this input file

6189:Connection

RequestString 1
7229:Connection
7230:Connection
RequestString 2
8280:Connection
RequestString 3

Note: I used tail -r because OS X lacks tac, which I would have preferred.
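
The script above takes the nearest Connection line regardless of pid. A variant that also matches the pid, as the question asks, could look like the sketch below. It is not the answerer's code. It assumes whitespace-separated fields in the order date, pid, message, with no spaces inside the date. The helper names reverse_lines and find_connections are made up for illustration. It uses tac where it is installed and falls back to tail -r otherwise.

```shell
#!/bin/bash
# Sketch only: assumes each log line is "date pid message" with no spaces
# inside the date. Both function names are hypothetical.

# Print stdin in reverse line order: tac where available, else BSD tail -r.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac
    else
        tail -r
    fi
}

# For every RequestString line, print its line number followed by the
# nearest earlier ConnectionString line that carries the same pid.
find_connections() {
    local file=$1 n rest pid match
    grep -n 'RequestString' "$file" | while IFS=: read -r n rest; do
        # The pid is the second field of the matched log line.
        pid=$(printf '%s\n' "$rest" | awk '{ print $2 }')
        # Walk backwards from line n and stop at the first hit for this pid.
        match=$(head -n "$n" "$file" | reverse_lines |
                awk -v pid="$pid" '$2 == pid && $3 ~ /^ConnectionString/ { print; exit }')
        if [ -n "$match" ]; then
            printf '%s: %s\n' "$n" "$match"
        fi
    done
}
```

Calling find_connections logfile prints one line per paired request. Each request re-reads the file up to its own line, which is tolerable only because RequestStrings are rare compared with ConnectionStrings.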
