Using a regex to split a string into an array to get key-value pairs

Date: 2021-10-12 21:19:14

I am parsing a text, but I can't capture one of the pieces when a space is missing (a missing space is valid input).
Edit: I have added colons to the free text.
Edit: This is an arbitrary text format in which key-value pairs can be written. Discarding element [0], the remaining elements of the array form a sequence of key, value, key, value. Values may span multiple lines.

This is the test case text:

:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text

This is what I want:

Array
(
    [0] => 
    [1] => part1
    [2] =>  only one \s removed:OK
    [3] => part2
    [4] => :text :with
new lines
on it
    [5] => noSpaceAfterThis
    [6] => 
    [7] => thisShoudBeAStandAlongText
    [8] => but: here there are more text
    [9] => part4
    [10] => :even more text
)

This is what I get:

Array
(
    [0] => 
    [1] => part1
    [2] =>  only one \s removed:OK
    [3] => part2
    [4] => :text :with
new lines
on it
    [5] => noSpaceAfterThis
    [6] => :thisShoudBeAStandAlongText but: here there are more text
    [7] => part4
    [8] => :even more text
)

And this is my testing code:

<?php
$text = '
:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text';

echo '<pre>';
// my effort so far:
$ret = preg_split('|\r?\n:([\w\d]+)(?:\r?\s)?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($ret);

// this one doesn't work either:
$ret = preg_split('|\r?\n:([\w\d]+)\r?\s?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($ret);

// for debugging: an extra capturing group
$ret = preg_split('|\r?\n:([\w\d]+)(\r?\s)?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
var_dump($ret);

1 answer

#1


3  

Another approach, using preg_match_all:

$pattern = '~(?<=^:|\n:)\S++|(?<=\s)(?:[^:]+?|(?<!\n):)+?(?= *+(?>\n:|$))~';
preg_match_all($pattern, $text, $matches);
echo '<pre>' . print_r($matches[0], true);

Pattern details:

# capture the first word of each line that begins with a colon #
(?<=^:|\n:)       # lookbehind: preceded by the beginning of the string
                  # and a colon, or by a newline and a colon
\S++              # one or more non-whitespace characters (possessive)

# capture the content up to the next line that begins with a colon #
(?<=\s)           # lookbehind: preceded by a whitespace character
(?:               # open a non-capturing group
   [^:]+?         # any characters other than a colon, one or more times (lazy)
  |               # OR
   (?<!\n):       # negative lookbehind: a colon not preceded by a newline
)+?               # close the non-capturing group,
                  # repeat one or more times (lazy)
(?= *+(?>\n:|$))  # lookahead: followed by zero or more spaces, then a newline
                  # with a colon in first position, or the end of the string

The advantage here is that the result contains no empty items.
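As a quick check, the pattern can be run against the test text from the question. This is only a sketch that reuses the pattern and text shown above; the variable name $tokens is my own. Note one consequence of dropping empty items: a key without a value (noSpaceAfterThis) is followed directly by the next key, so keys and values no longer strictly alternate.

```php
<?php
// Test text from the question (it starts with a newline, LF line endings assumed).
$text = '
:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text';

$pattern = '~(?<=^:|\n:)\S++|(?<=\s)(?:[^:]+?|(?<!\n):)+?(?= *+(?>\n:|$))~';
preg_match_all($pattern, $text, $matches);

// $tokens is a hypothetical name for the flat list of matches.
$tokens = $matches[0];

// Nine items: five keys and four values. There is no empty string
// standing in for the missing value of "noSpaceAfterThis".
print_r($tokens);
```

If you need to pair each key with its value by position, the preg_split variant is easier to work with, because it keeps the empty value.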

Or with preg_split:

$res = preg_split('~(?:\s*\n|^):(\S++)(?: )?~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

Explanation:

The goal is to split the text in two situations:

  • on a newline, when the first character of the next line is :
  • at the first space of a line, when the line begins with :

Thus the two split points surround the :word at the beginning of a line. The : and the space after the word must be removed, but the word itself must be preserved. This is why I use PREG_SPLIT_DELIM_CAPTURE to keep the word.
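To illustrate why the optional character after the word should be a literal space and not \s, here is a reduced comparison. The input $sample and both simplified patterns are made up for this sketch; they are not the patterns from the question or from this answer. With \s?, the newline after a value-less key is swallowed, so the next "\n:" can no longer match.

```php
<?php
// Hypothetical reduced input: key "a" has no value, key "b" has value "x".
$sample = "\n:a\n:b x";

// \s? also matches a newline, so it eats the "\n" that should start the next key.
$withS = preg_split('~\n:(\w+)\s?~', $sample, -1, PREG_SPLIT_DELIM_CAPTURE);

// A literal optional space leaves the newline alone.
$withSpace = preg_split('~\n:(\w+) ?~', $sample, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($withS);     // ['', 'a', ':b x']
print_r($withSpace); // ['', 'a', '', 'b', 'x']
```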

Pattern details:

(?:           # non-capturing group (everything inside is removed)
   \s*\n      # trim the trailing whitespace of the preceding line and the newline
  |           # OR
   ^          # the beginning of the string
)             # end of the non-capturing group
:             # remove the first character when it is a :
(\S++)        # keep the first word with DELIM_CAPTURE
(?: )?        # remove the first space if present
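Since the question says that, after discarding element [0], the array is a sequence of key, value, the flat result can be folded into an associative array. A minimal sketch, assuming keys are unique; parseKeyValues is a hypothetical helper name, not something from the question:

```php
<?php
// Hypothetical helper: fold the flat preg_split result into key => value.
function parseKeyValues(string $text): array
{
    $parts = preg_split('~(?:\s*\n|^):(\S++)(?: )?~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed with error code ' . preg_last_error());
    }

    $pairs = [];
    // $parts[0] is whatever precedes the first key; keys sit at odd indexes.
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        // Assumption: a repeated key simply overwrites the earlier value.
        $pairs[$parts[$i]] = $parts[$i + 1];
    }
    return $pairs;
}

// Test text from the question (LF line endings assumed).
$text = '
:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text';

$pairs = parseKeyValues($text);
print_r($pairs); // "noSpaceAfterThis" maps to an empty string
```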
