How to use RegEx to extract data from a JSON document

Time: 2021-10-06 15:06:01

I am no RegEx expert. I am trying to understand whether I can use RegEx to find a block of data in a JSON file.

My Scenario:

I am using an AWS RDS instance with enhanced monitoring. The monitoring data is sent to a CloudWatch log stream. I am trying to make the data posted to CloudWatch visible in the log management solution Loggly.

The ingestion is no problem, I can see the data in Loggly. However, the whole message is contained in one big blob field. The field content is a JSON document. I am trying to figure out if I can use RegEx to extract only certain parts of the JSON document.

Here is a sample extract from the JSON payload I am using:

{
    "engine": "MySQL",
    "instanceID": "rds-mysql-test",
    "instanceResourceID": "db-XXXXXXXXXXXXXXXXXXXXXXXXX",
    "timestamp": "2017-02-13T09:49:50Z",
    "version": 1,
    "uptime": "0:05:36",
    "numVCPUs": 1,
    "cpuUtilization": {
        "guest": 0,
        "irq": 0.02,
        "system": 1.02,
        "wait": 7.52,
        "idle": 87.04,
        "user": 1.91,
        "total": 12.96,
        "steal": 2.42,
        "nice": 0.07
    },
    "loadAverageMinute": {
        "fifteen": 0.12,
        "five": 0.26,
        "one": 0.27
    },
    "memory": {
        "writeback": 0,
        "hugePagesFree": 0,
        "hugePagesRsvd": 0,
        "hugePagesSurp": 0,
        "cached": 505160,
        "hugePagesSize": 2048,
        "free": 2830972,
        "hugePagesTotal": 0,
        "inactive": 363904,
        "pageTables": 3652,
        "dirty": 64,
        "mapped": 26572,
        "active": 539432,
        "total": 3842628,
        "slab": 34020,
        "buffers": 16512
    },

My Question

My question is: can I use RegEx to extract a subset of the document, for example CPU utilization or memory? If that is possible, how do I write the RegEx? Ideally, I could then use it to drill down into the extracted sub-document to get individual data elements as well.

Many thanks for your help.

1 solution

#1


0  

First I agree with Sebastian: A proper JSON parser is better.
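
For comparison, here is a minimal sketch of the parser route in Python. It assumes the blob can be handed to a script at all, which may not be the case inside Loggly. The payload is a shortened copy of the sample from the question, closed off so that it is valid JSON (the original extract is truncated).

```python
import json

# Shortened copy of the sample payload, closed off to make it valid JSON.
payload = """
{
    "engine": "MySQL",
    "instanceID": "rds-mysql-test",
    "cpuUtilization": {"guest": 0, "irq": 0.02, "total": 12.96},
    "memory": {"free": 2830972, "total": 3842628}
}
"""

doc = json.loads(payload)

# A whole sub-document is a plain dictionary lookup.
cpu = doc["cpuUtilization"]
memory = doc["memory"]

# Drilling down to an individual value is one more lookup.
cpu_total = cpu["total"]
memory_total = memory["total"]

print(cpu)
print(cpu_total, memory_total)
```

With a parser, key order, whitespace and quoting style in the payload no longer matter, which is exactly what the regex approach cannot guarantee.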

Anyway, sometimes the quick-and-dirty approach must be used. If your text layout will not change, then a regexp is simple:

E.g. "total": (\d+\.\d+) gets the CPU usage and "total": (\d\d\d+) gets the total memory usage. The second pattern requires at least 3 digits so that it does not match the first "total" entry (the CPU one); total memory will probably never be less than 100 :-).
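
As a quick check, the two patterns can be tried with Python's re module. This is only a sketch: the sample is a shortened version of the payload from the question, and Python's regex flavor is assumed to behave like Perl's for these simple patterns.

```python
import re

# Shortened sample in the same layout as the payload from the question.
sample = '''{
    "cpuUtilization": {
        "guest": 0,
        "irq": 0.02,
        "total": 12.96,
        "nice": 0.07
    },
    "memory": {
        "free": 2830972,
        "total": 3842628,
        "buffers": 16512
    }
}'''

# The first "total" followed by a decimal number is the CPU total.
cpu_match = re.search(r'"total": (\d+\.\d+)', sample)

# Requiring three or more digits skips "12.96" and lands on the memory total.
mem_match = re.search(r'"total": (\d\d\d+)', sample)

cpu_total = float(cpu_match.group(1))
mem_total = int(mem_match.group(1))

print(cpu_total, mem_total)
```

Both patterns depend on the exact spacing after the colon and on the order of the blocks, so they only hold while the layout stays the same.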

If changes are to be expected, make it a bit more robust: ["']total["']\s*:\s*(\d+\.\d+).
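
The sketch below tries this more tolerant pattern against a few layout variations. The variant strings are made up for illustration and are not taken from real Loggly or CloudWatch output.

```python
import re

# Accepts either quote style and any whitespace around the colon.
pattern = re.compile(r'''["']total["']\s*:\s*(\d+\.\d+)''')

# Hypothetical layout variations of the same key/value pair.
variants = [
    '"total": 12.96',
    '"total":12.96',
    "'total' :   12.96",
    '"total"\n    : 12.96',
]

values = [pattern.search(text).group(1) for text in variants]
print(values)

# The pattern still requires a decimal point, so an integer such as the
# memory total is not matched.
integer_match = pattern.search('"total": 3842628')
print(integer_match)
```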

It is also possible to match across the newline characters, like this: "cpuUtilization"\s*:\s*\{\s*\n.*\n\s*"irq"\s*:\s*(\d+\.\d+), which ties the match to the cpuUtilization block and makes it a bit more robust (this time for the irq value).
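
A sketch of the anchored pattern in Python follows, on a shortened sample. The second pattern, block_pattern, is an addition that is not part of the answer: it assumes the block contains no nested braces, in which case [^{}]* can capture the whole sub-document, which addresses the "subset of the document" part of the question.

```python
import json
import re

# Shortened sample in the same layout as the payload from the question.
sample = '''{
    "numVCPUs": 1,
    "cpuUtilization": {
        "guest": 0,
        "irq": 0.02,
        "system": 1.02,
        "total": 12.96
    },
    "memory": {
        "free": 2830972,
        "total": 3842628
    }
}'''

# Pattern from the answer: it expects "irq" on the second line of the block.
irq_pattern = r'"cpuUtilization"\s*:\s*\{\s*\n.*\n\s*"irq"\s*:\s*(\d+\.\d+)'
irq = float(re.search(irq_pattern, sample).group(1))

# Assumption, not from the answer: a block without nested braces can be
# captured whole, from its opening brace to the first closing brace.
block_pattern = r'"cpuUtilization"\s*:\s*(\{[^{}]*\})'
cpu_block = re.search(block_pattern, sample).group(1)

# The captured block is itself valid JSON, so it can be drilled into.
cpu = json.loads(cpu_block)

print(irq, cpu["total"])
```

If the keys inside cpuUtilization are ever reordered, irq_pattern stops matching, while the block capture keeps working as long as the block stays flat.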

And so on and so on.

As you can see, you quickly end up with very complex expressions. That approach is very fragile!

P.S. Depending on the exact regex flavor that Loggly uses, the details may differ. The examples above are based on Perl.
