Parsing pseudo-HTML/XML log files into a data frame (Symantec Altiris) [R]

I've been asked to help parse some log files for a Symantec application (Altiris), and they were delivered to me in a pseudo-HTML/XML format. I've managed to use readLines() and grepl() to get the logs into a decent character vector and clean out the junk, but I can't get it into a data frame.

As of right now, an entry looks something like this (since I can't post real data), all in a character vector with structure chr[1:312]:

[310] "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"

I've had no luck with XML parsing, and the format looks more like HTML to me. When I tried htmlTreeParse(x), I just ended up with a massive pyramid of tags.

1 Solution

#1

If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.

Here's a two-element vector (instead of 312 in your case):

vec <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)

Convert it to a data.frame object:

df <- data.frame(vec, stringsAsFactors = FALSE)

Then extract each value by character position, using the start of each field name as an anchor:

library(stringr)
library(dplyr)

df %>%
  mutate(
    # Start position of each field name (and of the closing ">")
    severityStr = str_locate(vec, "severity")[, "start"],
    hostnameStr = str_locate(vec, "hostname")[, "start"],
    sourceStr = str_locate(vec, "source")[, "start"],
    moduleStr = str_locate(vec, "module")[, "start"],
    processStr = str_locate(vec, "process")[, "start"],
    pidStr = str_locate(vec, "pid")[, "start"],
    endStr = str_locate(vec, ">")[, "start"],
    # Each value starts after name + "='" and ends 4 characters
    # before the next field name (3 before the closing ">")
    severity = substr(vec, severityStr + 10, hostnameStr - 4),
    hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
    source = substr(vec, sourceStr + 8, moduleStr - 4),
    module = substr(vec, moduleStr + 8, processStr - 4),
    process = substr(vec, processStr + 9, pidStr - 4),
    pid = substr(vec, pidStr + 5, endStr - 3)) %>%
  # Keep only the extracted fields, dropping the helper position columns
  select(severity, hostname, source, module, process, pid)

Here's the resulting data frame:

  severity        hostname          source       module     process pid
1        4 computername125 PackageDownload herpderp.dll masterP.exe 234
2        5 computername126 PackageDownload herpderp.dll masterP.exe 235
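
To see where the offsets in the mutate() call come from, here is the severity extraction worked through for a single entry. This is only an illustrative sketch: it uses base R's regexpr() in place of str_locate(), and the variable names (x, sev_start, host_start, value_start, value_end) are made up for the example.

```r
x <- "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"

# Start positions of two consecutive field names
sev_start  <- regexpr("severity", x, fixed = TRUE)[1]
host_start <- regexpr("hostname", x, fixed = TRUE)[1]

# Skip past the field name, the equals sign, and the opening quote (10 characters)
value_start <- sev_start + nchar("severity='")

# Step back over the space, comma, and closing quote before the next field name
value_end <- host_start - 4

severity <- substr(x, value_start, value_end)
print(severity)
```

The same pattern gives the other offsets: the number added is the length of the field name plus two (for the equals sign and opening quote), so 10 for hostname, 8 for source and module, 9 for process, and 5 for pid.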

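This solution handles values of different lengths, because every offset is computed relative to where a field name starts rather than being hard-coded. For example, it would read pid correctly even if it's 95 (two digits instead of three). It does assume that the fields always appear in the same order and are separated by a comma and a single space.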
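
If the field order or spacing might vary, one alternative is to match every key='value' pair with a regular expression instead of counting characters. The following is a rough sketch in base R rather than the stringr/dplyr approach above. parse_entry() and fields are hypothetical names introduced here, and it assumes that values never contain a single quote.

```r
# Hypothetical helper: turn one log entry into a named character vector
parse_entry <- function(line) {
  # Match every key='value' pair in the line
  pairs <- regmatches(line, gregexpr("\\w+='[^']*'", line))[[1]]
  keys <- sub("='.*$", "", pairs)      # text before ='
  values <- sub("^\\w+='", "", pairs)  # drop the key and the opening quote
  values <- sub("'$", "", values)      # drop the closing quote
  setNames(values, keys)
}

vec <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='95' >"
)

# Fields to keep, in the desired column order
fields <- c("severity", "hostname", "source", "module", "process", "pid")

# Look fields up by name, so their order within a line does not matter
rows <- lapply(vec, function(line) parse_entry(line)[fields])
mat <- do.call(rbind, rows)
colnames(mat) <- fields

df <- as.data.frame(mat, stringsAsFactors = FALSE)
print(df)
```

All columns come back as character, so numeric fields such as severity and pid would still need as.integer() if you want to do arithmetic on them.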
