Regex question - one or more spaces outside a quote-enclosed block of text

Time: 2023-02-08 11:13:20

I want to replace any occurrence of more than one space with a single space, but leave the text between quotes untouched.


Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?


7 solutions



Here's another approach, that uses a lookahead to determine that all quotation marks after the current position come in matched pairs.


text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");

If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.
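
As a rough sketch of what that adaptation might look like, the version below treats a backslash as escaping the next character, so \" inside a quoted section does not count as a closing quote. The backslash convention, the class name LookaheadCollapse and the helper names UNIT, OUTSIDE_QUOTES and collapse are assumptions for illustration, not part of the original answer.

```java
public class LookaheadCollapse {
    // A stretch of text with no unescaped quote; a backslash escapes the next character.
    private static final String UNIT = "[^\"\\\\]*+(?:\\\\.[^\"\\\\]*+)*+";

    // Two or more spaces, followed only by matched pairs of unescaped quotes up to the end.
    private static final String OUTSIDE_QUOTES =
        "  ++(?=(?:" + UNIT + "\"" + UNIT + "\")*+" + UNIT + "$)";

    static String collapse(String text) {
        return text.replaceAll(OUTSIDE_QUOTES, " ");
    }

    public static void main(String[] args) {
        // The Java literal below is the text: say   "a \"  b"   end
        System.out.println(collapse("say   \"a \\\"  b\"   end"));
    }
}
```

Keep in mind that the lookahead rescans the rest of the input for every run of spaces, so this style can get slow on very long strings.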




When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:


("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

This will match either a quoted string or a run of two or more spaces. Because the quoted-string alternative consumes the whole quoted section in one match, the spaces alternative never gets a chance to match spaces within quotes. Using this expression, you will need to examine each match to determine whether it is a quoted string or a run of spaces and act accordingly:


Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );

StringBuffer replacementBuffer = new StringBuffer();

Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

while ( spaceOrStringMatcher.find() ) {
    // if the space group is the match
    if ( spaceOrStringMatcher.group( 2 ) != null ) {
        // replace with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
}

spaceOrStringMatcher.appendTail( replacementBuffer );
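
If you are on Java 9 or later, the same idea can probably be written more compactly with Matcher.replaceAll and a replacement function. This is only a sketch: the class name AlternationCollapse and the method collapse are made up for illustration. Quoted strings are passed through Matcher.quoteReplacement so that any $ or \ inside them is not treated as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationCollapse {
    // Group 1: a quoted string (backslash escapes allowed). Group 2: two or more spaces.
    private static final Pattern QUOTED_OR_SPACES =
        Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)");

    static String collapse(String text) {
        return QUOTED_OR_SPACES.matcher(text).replaceAll(
            // Runs of spaces become one space; quoted strings are copied back unchanged.
            m -> m.group(2) != null ? " " : Matcher.quoteReplacement(m.group()));
    }

    public static void main(String[] args) {
        System.out.println(collapse("blah    blah  \"boo   boo boo\"  blah  blah"));
    }
}
```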



Regarding the text between quotes: are the quotes always on the same line, or can a quoted section span multiple lines?




Tokenize it and emit a single space between tokens. A quick google for "java tokenizer that handles quotes" turned up: this link



edit: SO didn't like that link. Here's the google search link: google. It was the first result.
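
Since the linked tokenizer isn't shown here, a hand-written scan over the characters can serve as a rough stand-in for the same idea. This is not the linked code: the class name ScanCollapse and the method collapse are hypothetical, and the sketch assumes quotes are never escaped.

```java
public class ScanCollapse {
    static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inQuotes = false;      // true while between an opening and closing quote
        boolean lastWasSpace = false;  // true if the last emitted char was a space outside quotes
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (!inQuotes && c == ' ' && lastWasSpace) {
                continue; // drop a repeated space outside quotes
            }
            out.append(c);
            lastWasSpace = !inQuotes && c == ' ';
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("a   b  \"c   d\"  e"));
    }
}
```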




Personally, I don't use Java, but this RegExp could do the trick:


([^\" ])*(\".*?\")*

RegexBuddy generates the following code for this expression, which looks fine to me:


try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:


import re

text = """
este  es   un texto de   prueba "para ver  como se comporta  " la funcion   sobre esto
"para ver  como se comporta  " la funcion   sobre esto  "o sobre otro" lo q sea
"""

ret = ""
print(text)

reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

# Join every non-empty match with "|" so the token boundaries are visible
for match in reobj.finditer(text):
    if match.group() != "":
        ret = ret + match.group() + "|"

print(ret)



After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:


String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
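
One simple way to do the "parse out the quoted content" step is to split on the quote character and only touch the pieces that fall outside quotes. This is a minimal sketch assuming quotes are balanced and never escaped; the class name SplitCollapse and the method collapse are made up for illustration.

```java
public class SplitCollapse {
    static String collapse(String text) {
        // A limit of -1 keeps trailing empty pieces, so the quotes can be restored exactly.
        String[] parts = text.split("\"", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('"'); // put back the quote that split() removed
            }
            // Even-indexed pieces lie outside quotes; odd-indexed pieces are quoted content.
            out.append(i % 2 == 0 ? parts[i].replaceAll("( )+", " ") : parts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("ABC   \"x   y\"   DEF"));
    }
}
```

With an unbalanced quote, everything after the last quote is treated as quoted and left alone.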



Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2) returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:


import java.util.regex.*;

public class Test {
  public static void main(String[] args) throws Exception {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) {
      if ( m.group( 2 ) != null ) {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}


