How to parse Multiline block text if content differs from block to block using Python & regex?

Time: 2023-02-08 11:13:32

I have a configuration file that I need to parse. The idea is to put it in a dictionary at a later stage, using the regex groupings in Python.

The problem I'm facing is that not every block of text contains exactly the same lines. My regex works so far for the block with the most lines, but of course it only matches that single block. How do I do a multiline match if, for instance, some "set" lines are omitted in some blocks?

  • Do I need to break up the regex and use if/elif and true/false checks to work through this? That does not seem Pythonic, IMHO.

  • I'm fairly sure I will have to break up my big regex and work through it sequentially: if a line matches, then ..., else skip to the next regex-matching line.

  • I was thinking of putting every block, from edit to next, into a list element to be parsed separately. Or can I just do the whole thing in one go?

I have some ideas, but I would like a Pythonic way of doing it, please.

As always, your help is much appreciated. Thank you.

TEXT, where the block to match runs from edit to next. Not every block contains the same "set" statements:

edit "port11"
    set vdom "ACME_Prod"
    set vlanforward enable
    set type physical
    set device-identification enable
    set snmp-index 26
next
edit "port21"
    set vdom "ACME_Prod"
    set vlanforward enable
    set type physical
    set snmp-index 27
next
edit "port28"
    set vdom "ACME_Prod"
    set vlanforward enable
    set type physical
    set snmp-index 28
next
edit "port29"
    set vdom "ACME_Prod"
    set ip 174.244.244.244 255.255.255.224
    set allowaccess ping
    set vlanforward enable
    set type physical
    set alias "Internet-IRISnet"
    set snmp-index 29
next
edit "port20"
    set vdom "root"
    set ip 192.168.1.1 255.255.255.0
    set allowaccess ping https ssh snmp fgfm
    set vlanforward enable
    set type physical
    set snmp-index 39
next
edit "port25"
    set vdom "root"
    set allowaccess fgfm
    set vlanforward enable
    set type physical
    set snmp-index 40
next

CODE SNIPPET:

import re
file = "interfaces_2016_10_12.conf"

try:
    fileopen = open(file, 'r')
    output = open('output.txt', 'w+')
except IOError:
    exit("Input file does not exist, exiting script.")

# read whole config in 1 go instead of iterating line by line
text = fileopen.read()

# my verbose regex, verbose so it is more readable !

pattern = r'''^                 # raw string so backslashes reach the regex engine
\s+edit\s\"(.*)\"\n           # group(1) match int name
\s+set\svdom\s\"(.*)\"\n      # group(2) match vdom name
\s+set\sip\s(.*)\n            # group(3) match interface ip
\s+set\sallowaccess\s(.*)\n   # group(4) match allowaccess
\s+set\svlanforward\s(.*)\n   # group(5) match vlanforward
\s+set\stype\s(.*)\n          # group(6) match type
\s+set\salias\s\"(.*)\"\n     # group(7) match alias
\s+set\ssnmp-index\s\d{1,3}\n # match snmp-index but we don't need it
\s+next$'''                   # match end of config block

regexp = re.compile(pattern, re.VERBOSE | re.MULTILINE)

# For multiline regex matching use finditer():
for match in regexp.finditer(text):
    for z in range(1, 8):  # groups 1..7, restarted for every matched block
        print(match.group(z))

fileopen.close()  # always close file
output.close()    # always close file

2 answers

#1



Why use regex when the structure seems simple enough to parse line by line:

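from pprint import pprint

data = {}
with open(file, 'r') as fileopen:  # `file` is the path defined in the question
    for line in fileopen:
        words = line.strip().split()
        if not words:  # skip blank lines
            continue
        if words[0] == 'edit':  # Create a new block
            curr = data.setdefault(words[1].strip('"'), {})
        elif words[0] == 'set': # Write config to block
            # one value is stored as a string, several values as a list
            curr[words[1]] = words[2].strip('"') if len(words) == 3 else words[2:]
pprint(data)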

Output:


{'port11': {'device-identification': 'enable',
  'snmp-index': '26',
  'type': 'physical',
  'vdom': 'ACME_Prod',
  'vlanforward': 'enable'},
 'port20': {'allowaccess': ['ping', 'https', 'ssh', 'snmp', 'fgfm'],
  'ip': ['192.168.1.1', '255.255.255.0'],
  'snmp-index': '39',
  'type': 'physical',
  'vdom': 'root',
  'vlanforward': 'enable'},
  ...
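
One limitation of splitting on whitespace is that a quoted value containing a space would be broken into several words. The sketch below is a variation on the same line-by-line idea, not part of the answer above: it swaps in the standard library's `shlex.split` so quoted values stay in one piece. The function name `parse_config` and the alias value `"Internet IRISnet"` (with a space) are made up for illustration, and it assumes values never contain stray quotes or backslashes, which `shlex` would treat specially.

```python
import shlex

def parse_config(lines):
    """Parse 'edit ... set ... next' blocks into a nested dict."""
    data = {}
    curr = None
    for line in lines:
        # shlex.split honours double quotes, so "a b" stays a single token
        words = shlex.split(line)
        if not words:                      # skip blank lines
            continue
        if words[0] == 'edit':             # start a new block
            curr = data.setdefault(words[1], {})
        elif words[0] == 'set' and curr is not None:
            values = words[2:]
            # one value is stored as a string, several values as a list
            curr[words[1]] = values[0] if len(values) == 1 else values
        elif words[0] == 'next':           # close the current block
            curr = None
    return data

# Hypothetical sample: the alias contains a space to show the quoting behaviour
sample = '''\
edit "port29"
    set vdom "ACME_Prod"
    set ip 174.244.244.244 255.255.255.224
    set alias "Internet IRISnet"
next
'''
result = parse_config(sample.splitlines())
```

Here `result["port29"]["alias"]` stays a single string, while `ip` becomes a two-element list.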

#2



How about:


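import re

config = {}
with open('datafile') as f:
    text = f.read()
for block in re.split(r'\nnext\n', text):
    for cmd in block.split("\n"):
        cmd = cmd.strip().split()
        if not cmd:  # skip blank lines and the empty chunk after the last "next"
            continue
        if cmd[0] == 'edit':
            current = cmd[1].strip('"')
            config[current] = {}
        elif cmd[0] == 'set':
            config[current][cmd[1]] = cmd[2]  # keeps only the first value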

I think that's readable too, but I think the other answer is preferable (no regex). Upvoted it.
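
For readers who still want the regex route the question asks about, here is a minimal two-stage sketch, offered as one possible approach and not as either answerer's method. The first pattern captures each block from edit to next as a whole, and the second pulls out whatever "set" lines happen to be present, so omitted lines need no special handling. The names `BLOCK_RE`, `SET_RE` and `parse_blocks` are hypothetical, and the sketch assumes each "next" sits on its own line and that lines end in plain `\n`.

```python
import re

# Stage 1: one match per block, capturing the name and the raw body
BLOCK_RE = re.compile(
    r'^[ \t]*edit[ \t]+"([^"]*)"[ \t]*\n(.*?)^[ \t]*next[ \t]*$',
    re.MULTILINE | re.DOTALL,
)
# Stage 2: one match per "set" line inside a block body
SET_RE = re.compile(r'^[ \t]*set[ \t]+(\S+)[ \t]+(.*?)[ \t]*$', re.MULTILINE)

def parse_blocks(text):
    """Return {block name: {setting: value}} for every edit/next block."""
    config = {}
    for name, body in BLOCK_RE.findall(text):
        # Only the settings actually present in this block end up in the dict
        config[name] = {key: value.strip('"')
                        for key, value in SET_RE.findall(body)}
    return config

text = '''edit "port11"
    set vdom "ACME_Prod"
    set type physical
    set snmp-index 26
next
edit "port29"
    set vdom "ACME_Prod"
    set ip 174.244.244.244 255.255.255.224
    set alias "Internet-IRISnet"
next
'''
parsed = parse_blocks(text)
```

Multi-word values such as the `ip` line are kept as one string here; splitting them into a list is a separate choice left to the caller.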
