Python: parsing text and grouping it into separate parts

Date: 2023-01-11 15:31:49

I have text like the following:

  1. This is a first question and can go to multiple paragraphs. Multiple lines. etc.
    (1)First Option (2) Second Option (3) Third option (4) Fourth Option (5) None of these

  2. 8 × ? = 4888 ÷ 4
    (1) 150.75 (2) 125.75 (3) 125.05 (4) 152.75 (5) None of these

  3. (62.5 × 14 × 5) ÷ 25 + 41 =
    (1) 4 (2) 5 (3) 9 (4) 8 (5) 6

  4. (23 × 23 × 23 × 23 × 23 × 23)×
    (1) 32 (2) 30 (3) 9 (4) 7 (5) 11

I would like to parse this into separate parts so that I can loop over the questions and, for each question, loop over its answers. The rule is that every question starts with an integer at the beginning of a line (^) followed by a dot. Each answer is prefixed by an integer from 1 to 5 in parentheses, i.e. (1) through (5).

I would like to be able to use the parsed data like this, for example:

for item in parsed_data:
    print(item.text)
    for answer in item.answers:
        print(answer.text)

How can I do this with Python regular expressions?

1 solution

#1


Honestly, you can just use re.split() for this:

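import re

# text is the variable holding your text
text = text.strip()
# Split only on a number at the start of a line, so decimals like 150.75 are left alone
questions = re.split(r'(?m)^\s*\d+\.\s', text)
questions = [x.strip() for x in questions if x.strip()]
# Each entry becomes [question, option 1, option 2, ...]
final = [re.split(r'\(\d+\)', x) for x in questions]

for part in final:
    question = part[0].strip()
    print(question)
    for answer in part[1:]:
        print(answer.strip())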
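
If you want objects with .text and .answers, as in the loop from the question, the same re.split() idea can be wrapped in two small classes. This is only a minimal sketch: the names Question, Answer and parse_questions are made up for illustration, and it assumes Python 3.7+ for dataclasses. It also assumes a question number always sits at the start of a line and is followed by whitespace, so a decimal such as 150.75 is not mistaken for a question number.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:  # hypothetical container, not part of the original answer
    text: str


@dataclass
class Question:  # hypothetical container, not part of the original answer
    text: str
    answers: List[Answer] = field(default_factory=list)


# A question starts with an integer and a dot at the beginning of a line.
QUESTION_START = re.compile(r'^\s*\d+\.\s+', re.MULTILINE)
# An option is introduced by a digit from 1 to 5 in parentheses.
OPTION_MARK = re.compile(r'\([1-5]\)')


def parse_questions(text):
    """Return a list of Question objects parsed from the raw text."""
    parsed = []
    for chunk in QUESTION_START.split(text):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece before the first question
        parts = OPTION_MARK.split(chunk)
        # Collapse line breaks inside a multi-line question into single spaces.
        question_text = ' '.join(parts[0].split())
        answers = [Answer(p.strip()) for p in parts[1:]]
        parsed.append(Question(question_text, answers))
    return parsed


sample = """
  1. This is a first question and can go to multiple paragraphs. Multiple lines. etc.
    (1)First Option (2) Second Option (3) Third option (4) Fourth Option (5) None of these

  2. 8 × ? = 4888 ÷ 4
    (1) 150.75 (2) 125.75 (3) 125.05 (4) 152.75 (5) None of these
"""

parsed_data = parse_questions(sample)
for item in parsed_data:
    print(item.text)
    for answer in item.answers:
        print('   ', answer.text)
```

One limitation to keep in mind: if the question text itself contains something like (3), it will be treated as an option marker. Expressions such as (62.5 × 14 × 5) are fine, because the option pattern only matches a single digit from 1 to 5 between the parentheses.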
