Creating a CSV file from a specific log file using a shell script

Time: 2022-10-29 15:29:18

I am trying to convert a specific log file into a CSV file using the sed, awk, and paste commands in Linux, so that I can plot it using gnuplot or MS Excel. However, I am not able to do it the way I want. Here is the sample log file:

Feb 15 13:57:08 Program1: The pool size: 100 [High: 80 Norm: 20 Low: 0]
Feb 15 13:58:53 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 13:58:54 Program3: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:56 Program4: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 13:58:58 Program1: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:59 Program5: The pool size: 300 [High: 100 Norm: 200 Low: 0]
Feb 15 13:59:05 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:11 Program2: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:12 Program2: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:13 Program1: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 14:00:16 Program4: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:17 Program2: The pool size: 100 [High: 50 Norm: 50 Low: 0]
Feb 15 14:02:28 Program5: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:02:31 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:11:01 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]

I am trying to convert the above data into a CSV file so that I have the data for each specific point in time. The output CSV I expect should be in the following format:

TimeStamp,Program1_Total,Program1_High,Program1_Norm,Program1_Low,Program2_Total,Program2_High,Program2_Norm,Program2_Low,Program3_Total,Program3_High,Program3_Norm,Program3_Low,Program4_Total,Program4_High,Program4_Norm,Program4_Low

Feb 15 13:57:08,100,80,20,0,0,0,0,0,0,0,0,0,0,0,0,0
Feb 15 13:58:53,100,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0
...
...

What have I tried?

I tried grepping for each program and creating separate, smaller files specific to that program, as follows:

grep "Program1" sample.log > Program1.log
grep "Program2" sample.log > Program2.log

I tried using the paste command to join them. However, what I cannot figure out is how to handle the timestamps in a better way.

Any help will be highly appreciated. Thanks in advance.

3 Solutions

#1



I think I found a one-liner for your task that uses only the shell and awk. Be advised that it's not pretty at all, and you need to add the header to your output file beforehand:

echo "TimeStamp,P1_Total,P1_High,P1_Norm,P1_Low,P2_Total,P2_High,P2_Norm,P2_Low,P3_Total,P3_High,P3_Norm,P3_Low,P4_Total,P4_High,P4_Norm,P4_Low,P5_Total,P5_High,P5_Norm,P5_Low" > final_output.txt

for i in $(seq 1 5)
do
    l=$((i-1))
    r=$((5-i))
    awk -v left_padd="$l" -v right_padd="$r" -v nb="$i" '
    $4 == "Program" nb ":" {
        gsub(/\]/, "", $14)    # strip the closing bracket from the Low value
        printf "%s %s %s", $1, $2, $3
        for (a = 0; a < left_padd; a++) printf ",\t 0,\t 0,\t 0,\t 0"
        printf ",\t %s,\t %s,\t %s,\t %s", $8, $10, $12, $14
        for (b = 0; b < right_padd; b++) printf ",\t 0,\t 0,\t 0,\t 0"
        printf "\n"
    }' sample.log
done >> final_output.txt

*** Please note that you must change the 5 in seq 1 5 to the number of Program# entries you want in your output file; I used 5 because that is what your example contains. You also need to change the 5 in r=$((5-i)) to the same value.
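
One way to avoid editing the count in two places is to keep it in a single variable and build the header from it. The following is only a minimal sketch and not part of the original answer; the variable names n and header are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical parameter: n is the number of Program# column groups.
n=5

header="TimeStamp"
i=1
while [ "$i" -le "$n" ]; do
    # Append the four columns for Program i, using the P#_ naming from the answer.
    header="${header},P${i}_Total,P${i}_High,P${i}_Norm,P${i}_Low"
    i=$((i + 1))
done

echo "$header"
```

With such a variable in place, the loop could presumably use seq 1 "$n" and r=$((n-i)) so that only one number needs changing.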

Explanation:

  • The for loop passes over the file once per iteration, searching for one Program# entry with awk.
  • The l variable holds how many groups of 0 values to add to the left of that program's columns.
  • The r variable does the same as l, except that it adds the 0 values to the right.
  • The nb variable stores the Program number, so awk knows which lines to look for in the input file.
  • awk simply prints the values you asked for from each matching Program# line, plus the leading and trailing 0 values (four 0s per Program#) for the other entries in the table.

Edit:

I used \t after each comma in the awk output so it's easier to read, but you may remove it so that you have only comma-separated values. For the same reason, I also changed the header convention in your question from Program#_Total to P#_Total.

*I do realize this is not optimal at all, since the file gets parsed once for each Program# entry and you also need to add the header to the output file yourself, yet it's the best I could come up with.
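
A single pass is possible if awk derives the column offset from the program number itself. The sketch below is an alternative to the loop above, not the answer's method. It assumes program names always have the form Program<N>: with N between 1 and n; the parameter n and the file name final_output.csv are placeholders chosen for illustration. It emits one row per log line, in log order, and writes the header itself.

```shell
#!/bin/sh
# Small sample input so the sketch is self-contained (lines taken from the question).
cat > sample.log <<'EOF'
Feb 15 13:57:08 Program1: The pool size: 100 [High: 80 Norm: 20 Low: 0]
Feb 15 13:58:54 Program3: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:59 Program5: The pool size: 300 [High: 100 Norm: 200 Low: 0]
EOF

# n is the number of Program# column groups (an assumed parameter name).
awk -v n=5 '
BEGIN {
    # Build the header once, before any input is read.
    printf "TimeStamp"
    for (p = 1; p <= n; p++)
        printf ",P%d_Total,P%d_High,P%d_Norm,P%d_Low", p, p, p, p
    printf "\n"
}
$4 ~ /^Program[0-9]+:$/ {
    prog = $4
    gsub(/[^0-9]/, "", prog)          # "Program3:" -> "3"
    prog += 0
    if (prog < 1 || prog > n) next    # ignore programs outside the table
    low = $14
    sub(/\]$/, "", low)               # strip the closing bracket
    printf "%s %s %s", $1, $2, $3
    for (p = 1; p <= n; p++) {
        if (p == prog) printf ",%s,%s,%s,%s", $8, $10, $12, low
        else           printf ",0,0,0,0"
    }
    printf "\n"
}' sample.log > final_output.csv

cat final_output.csv
```

Because each row is written as soon as its line is read, the rows stay in timestamp order, whereas the per-program loop groups them by program.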

#2



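Use cut with a space as the delimiter and keep only the fields you need. Once done, use sed to drop the trailing bracket and replace the spaces after the timestamp with commas.

# The 3g flag (replace the third match onward) is a GNU sed extension.
cut -d ' ' -f 1,2,3,8,10,12,14 sample.log | sed -e 's/]$//' -e 's/ /,/3g'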

By wrapping this in a while .. read loop, you can process the file line by line.
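
A rough sketch of such a loop, assuming the line layout from the question. Here read itself splits each line into fields, so cut is not needed inside the loop; this is a swapped-in variant rather than the answer's exact pipeline. The variable names and the output file per_line.csv are illustrative only, and the program name is kept as an extra column.

```shell
#!/bin/sh
# Two sample lines from the question, so the sketch is self-contained.
cat > sample.log <<'EOF'
Feb 15 13:57:08 Program1: The pool size: 100 [High: 80 Norm: 20 Low: 0]
Feb 15 13:58:53 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
EOF

# read assigns the 14 whitespace-separated fields to named variables;
# fields that are only labels go into throwaway variables (_a, _b, ...).
while read -r mon day clock prog _a _b _c total _d high _e norm _f low; do
    low=${low%\]}                     # drop the trailing "]"
    prog=${prog%:}                    # "Program1:" -> "Program1"
    printf '%s %s %s,%s,%s,%s,%s,%s\n' \
        "$mon" "$day" "$clock" "$prog" "$total" "$high" "$norm" "$low"
done < sample.log > per_line.csv

cat per_line.csv
```

This yields one compact row per log line; spreading the values into per-program columns would still need an extra step.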

#3



If Perl is an option, how about:

#!/bin/bash

perl -e '
while (<>) {
    if (/^(.{15}) Program(\d+): The pool size: (\d+) \[High: (\d+) Norm: (\d+) Low: (\d+)\]$/) {
        $timestamp = $1;
        $program = $2;
        $size = $3;
        $high = $4;
        $norm = $5;
        $low = $6;
        if (! defined $array{$timestamp}) {
            # it takes care of duplicate timestamps
            push(@timestamps, $timestamp);
        }
        $i = ($program - 1) * 4;
        @{$array{$timestamp}}[$i .. $i + 3] = ($size, $high, $norm, $low);
    }
}
foreach (@timestamps) {
    print "$_,", join(",", map {$_ + 0} @{$array{$_}}[0 .. 15]), "\n";
}' logfile

BTW, it looks like Program5 is excluded from your desired result. If you want to include it, just change the number 15 in the second-to-last line to 19.
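
For comparison, roughly the same grouping can be sketched in awk. This is an illustrative port under assumptions, not the answer's code: the parameter n (the number of programs, so n=5 would correspond to changing 15 to 19) and the file name merged.csv are hypothetical. The sample input also contains a made-up line that shares a timestamp with another one, purely to show the merge.

```shell
#!/bin/sh
# Illustrative input: the Program2 line at 13:58:54 is made up for this sketch.
cat > sample.log <<'EOF'
Feb 15 13:57:08 Program1: The pool size: 100 [High: 80 Norm: 20 Low: 0]
Feb 15 13:58:54 Program3: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:54 Program2: The pool size: 100 [High: 50 Norm: 50 Low: 0]
EOF

awk -v n=4 '
$4 ~ /^Program[0-9]+:$/ {
    ts = $1 " " $2 " " $3
    if (!(ts in seen)) {              # remember first-seen order of timestamps
        seen[ts] = 1
        order[++count] = ts
    }
    prog = $4
    gsub(/[^0-9]/, "", prog)          # "Program3:" -> "3"
    base = (prog - 1) * 4             # same offset as $i in the Perl version
    low = $14
    sub(/\]$/, "", low)               # strip the closing bracket
    cell[ts, base]     = $8
    cell[ts, base + 1] = $10
    cell[ts, base + 2] = $12
    cell[ts, base + 3] = low
}
END {
    for (k = 1; k <= count; k++) {
        ts = order[k]
        printf "%s", ts
        for (c = 0; c < n * 4; c++)
            printf ",%d", cell[ts, c] # unset cells print as 0
        printf "\n"
    }
}' sample.log > merged.csv

cat merged.csv
```

Lines that share a timestamp end up in a single row, and programs numbered above n are stored but never printed.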
