Regex question - one or more spaces outside of a quote-enclosed block of text

Time: 2023-02-08 11:13:20

I want to replace any run of more than one space with a single space, but leave the text between quotes untouched.

Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?

7 solutions

#1


4  

Here's another approach: it uses a lookahead to check that all quotation marks after the current position come in matched pairs, which means the current position is outside any quoted section.

text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");

If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.

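One way that adaptation might look, as a sketch rather than the answerer's own code: every run of text between quotes is rewritten so that it steps over backslash escapes, which keeps `\"` from being counted as a delimiter. The class name `LookaheadCollapse`, the `RUN` constant, and the decision to honor backslash escapes outside quotes as well as inside them are assumptions made for this example.

```java
import java.util.regex.Pattern;

class LookaheadCollapse {
    // A run containing no unescaped quote: either a character that is neither
    // a quote nor a backslash, or a backslash followed by any character.
    private static final String RUN = "(?:[^\"\\\\]|\\\\.)*+";

    // Two or more spaces, followed only by balanced pairs of unescaped quotes
    // up to the end of the input. (?s) lets "." match line breaks too.
    private static final Pattern SPACES_OUTSIDE_QUOTES = Pattern.compile(
        "(?s)  ++(?=(?:" + RUN + "\"" + RUN + "\")*+" + RUN + "$)");

    static String collapse(String text) {
        return SPACES_OUTSIDE_QUOTES.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Input characters: a  "b  \"  c"  d
        String text = "a  \"b  \\\"  c\"  d";
        System.out.println(collapse(text)); // prints: a "b  \"  c" d
    }
}
```

Each candidate run of spaces rescans the rest of the input, so the cost grows quadratically in the worst case; that is usually acceptable for short strings.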

#2


2  

When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:

("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

This matches either a quoted string or a run of two or more spaces. Because the quoted-string alternative is tried first and consumes the whole string, spaces inside quotes are never matched by the second alternative. Using this expression, you need to examine each match to determine whether it is a quoted string or a run of spaces and act accordingly:

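// Assumes `text` holds the input string.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );

StringBuffer replacementBuffer = new StringBuffer();

Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

while ( spaceOrStringMatcher.find() )
{
    // if the space group is the match
    if ( spaceOrStringMatcher.group( 2 ) != null )
    {
        // replace with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
    // quoted strings are skipped here; the next appendReplacement (or appendTail) copies them unchanged
}

spaceOrStringMatcher.appendTail( replacementBuffer );

String result = replacementBuffer.toString();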
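
On Java 9 or later, the same match-both idea can probably be written more compactly with `Matcher.replaceAll(Function<MatchResult, String>)`. The sketch below is not part of the original answer; the class name `FunctionalCollapse` and the method name `collapse` are made up for illustration. Quoted matches are passed through `Matcher.quoteReplacement` because the returned string is otherwise interpreted as a replacement template, where `\` and `$` are special.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionalCollapse {
    // Group 1: a quoted string with backslash escapes; group 2: two or more spaces.
    private static final Pattern QUOTED_OR_SPACES =
        Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)");

    static String collapse(String text) {
        return QUOTED_OR_SPACES.matcher(text).replaceAll(match ->
            match.group(2) != null
                ? " "                                        // run of spaces outside quotes
                : Matcher.quoteReplacement(match.group()));  // quoted string, kept verbatim
    }

    public static void main(String[] args) {
        String text = "blah    blah  \"boo   boo boo\"  blah  blah";
        System.out.println(collapse(text)); // prints: blah blah "boo   boo boo" blah blah
    }
}
```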

#3


0  

Text between quotes: are the quotes always within the same line, or can they span multiple lines?

#4


0  

Tokenize it and emit a single space between tokens. A quick google for "java tokenizer that handles quotes" turned up: this link

YMMV

edit: SO didn't like that link. Here's the google search link: google. It was the first result.

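The linked tokenizer is not available here, so the following is only a stand-in for the idea: a hand-written single-pass scanner that tracks whether it is inside quotes and drops repeated spaces outside them. The class name `ScannerCollapse` is hypothetical, and the sketch assumes plain `"` delimiters with no escaped quotes.

```java
class ScannerCollapse {
    static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inQuotes = false;      // true while between an opening and closing quote
        boolean lastWasSpace = false;  // true if the previous kept character was an unquoted space
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ' ' && !inQuotes) {
                if (lastWasSpace) {
                    continue;          // skip the second and later spaces of a run
                }
                lastWasSpace = true;
            } else {
                lastWasSpace = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "blah    blah  \"boo   boo boo\"  blah  blah";
        System.out.println(collapse(text)); // prints: blah blah "boo   boo boo" blah blah
    }
}
```

Unlike the regex answers, this version makes one pass over the input and needs no backtracking, at the cost of a few more lines.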

#5


0  

Personally, I don't use Java, but this RegExp could do the trick:

([^\" ])*(\\\".*?\\\")*

Trying the expression in RegexBuddy generates this code, which looks fine to me:

try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:

import re

text = """
este  es   un texto de   prueba "para ver  como se comporta  " la funcion   sobre esto
"para ver  como se comporta  " la funcion   sobre esto  "o sobre otro" lo q sea
"""

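ret = ""
print(text)

# Each match is a run of non-space, non-quote characters and/or quoted strings.
reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

for match in reobj.finditer(text):
    # Skip the empty matches the pattern produces at spaces.
    if match.group() != "":
        ret = ret + match.group() + "|"

print(ret)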

#6


0  

After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:

String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
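
The answer leaves the "parse out the quoted content" step open. One simple way to do it, sketched here under the assumption that quotes are balanced and never escaped, is to split on `"` and collapse spaces only in the even-numbered pieces, which lie outside quotes. The class name `SplitCollapse` is hypothetical.

```java
class SplitCollapse {
    static String collapse(String text) {
        // A limit of -1 keeps trailing empty pieces, so joining restores every quote.
        String[] parts = text.split("\"", -1);
        // Pieces at even indexes are outside quotes; odd indexes are quoted content.
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll(" {2,}", " ");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        String text = "ABC   DEF \"GHI   JKL\"  MNO";
        System.out.println(collapse(text)); // prints: ABC DEF "GHI   JKL" MNO
    }
}
```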

#7


0  

Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2) returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) 
    {
      if ( m.group( 2 ) != null ) 
      {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}
