How to remove unbalanced/unpartnered double quotes (in Java)

Time: 2022-05-20 05:57:59

I thought I'd share this fairly tricky problem with everyone here. I am trying to remove unbalanced/unpaired double quotes from a string.

My work is in progress and I may be close to a solution, but I don't have a working one yet: I am still unable to delete the unpaired/unpartnered double quotes from the string.

Example Input

string1=injunct! alter ego."
string2=successor "alter ego" single employer"  "proceeding "citation assets"

Output Should be

string1=injunct! alter ego.
string2=successor "alter ego" single employer  proceeding "citation assets"

This problem sounds similar to "Using Java remove unbalanced/unpartnered parenthesis".

Here is my code so far (it doesn't delete all the unpaired double quotes):

private String removeUnattachedDoubleQuotes(String stringWithDoubleQuotes) {
    String firstPass = "";

    String openingQuotePattern = "\\\"[a-z0-9\\p{Punct}]";
    String closingQuotePattern = "[a-z0-9\\p{Punct}]\\\"";

    int doubleQuoteLevel = 0;
    for (int i = 0; i < stringWithDoubleQuotes.length() - 3; i++) {
        String c = stringWithDoubleQuotes.substring(i, i + 2);
        if (c.matches(openingQuotePattern)) {
            doubleQuoteLevel++;
            firstPass += c;
        }
        else if (c.matches(closingQuotePattern)) {
            if (doubleQuoteLevel > 0) {
                doubleQuoteLevel--;
                firstPass += c;
            }
        }
        else {
            firstPass += c;
        }
    }

    String secondPass = "";
    doubleQuoteLevel = 0;
    for (int i = firstPass.length() - 1; i >= 0; i--) {
        String c = stringWithDoubleQuotes.substring(i, i + 2);
        if (c.matches(closingQuotePattern)) {
            doubleQuoteLevel++;
            secondPass = c + secondPass;
        }
        else if (c.matches(openingQuotePattern)) {
            if (doubleQuoteLevel > 0) {
                doubleQuoteLevel--;
                secondPass = c + secondPass;
            }
        }
        else {
            secondPass = c + secondPass;
        }
    }

    String result = secondPass;

    return result;
}

2 Solutions

#1


You could use something like (Perl notation):

s/("(?=\S)[^"]*(?<=\S)")|"/$1/g;

Which in Java would be:

str.replaceAll("(\"(?=\\S)[^\"]*(?<=\\S)\")|\"", "$1");
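
To make the one-liner above easier to try, here is a minimal sketch that wraps it in a method and runs it on the two inputs from the question. The class name `UnpairedQuoteRemover` and the method name `removeUnpairedQuotes` are invented for illustration; only the regex and the `$1` replacement come from the answer.

```java
import java.util.regex.Pattern;

class UnpairedQuoteRemover {
    // Alternative 1 (captured as group 1): a quote followed by a non-space
    // character, a run of non-quote characters, and a quote preceded by a
    // non-space character.
    // Alternative 2: any other double quote, treated as unpaired.
    private static final Pattern QUOTES =
            Pattern.compile("(\"(?=\\S)[^\"]*(?<=\\S)\")|\"");

    static String removeUnpairedQuotes(String input) {
        // "$1" puts a captured pair back unchanged. For a lone quote,
        // group 1 did not participate, so nothing is written in its place.
        return QUOTES.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeUnpairedQuotes("injunct! alter ego.\""));
        System.out.println(removeUnpairedQuotes(
                "successor \"alter ego\" single employer\"  \"proceeding \"citation assets\""));
    }
}
```

On the question's inputs this should reproduce the expected output listed there. Note that the rule is purely whitespace-based: a quote counts as an opener only if a non-space character follows it, and as a closer only if a non-space character precedes it.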

#2


It could probably be done in a single regex if there is no nesting.
There is a roughly defined concept of delimiters, and it is possible to 'bias'
those rules to get a better outcome.
It all depends on what rules are set forth. This regex takes into account
three possible scenarios, in order:

  1. Valid Pair
  2. Invalid Pair (with bias)
  3. Invalid Single

It also doesn't match a quoted span across a line break, but it does handle multiple
lines combined in a single string. To change that, remove \n wherever you see it.

Raw find regex (global context), shortened:

(?:("[a-zA-Z0-9\p{Punct}][^"\n]*(?<=[a-zA-Z0-9\p{Punct}])")|(?<![a-zA-Z0-9\p{Punct}])"([^"\n]*)"(?![a-zA-Z0-9\p{Punct}])|")

replacement grouping

$1$2 or \1\2

Expanded raw regex:

(?:                            // Grouping
                                  // Try to line up a valid pair
   (                                 // Capt grp (1) start 
     "                               // "
      [a-zA-Z0-9\p{Punct}]              // 1 of [a-zA-Z0-9\p{Punct}]
      [^"\n]*                           // 0 or more characters other than " or \n
      (?<=[a-zA-Z0-9\p{Punct}])         // 1 of [a-zA-Z0-9\p{Punct}] behind us
     "                               // "
   )                                 // End capt grp (1)

  |                               // OR, try to line up an invalid pair
       (?<![a-zA-Z0-9\p{Punct}])     // Bias, not 1 of [a-zA-Z0-9\p{Punct}] behind us
     "                               // "
   (  [^"\n]*  )                        // Capt grp (2) - 0 or more characters other than " or \n
     "                               // "
       (?![a-zA-Z0-9\p{Punct}])      // Bias, not 1 of [a-zA-Z0-9\p{Punct}] ahead of us

  |                               // OR, this single " is considered invalid
     "                               // "
)                               // End Grouping

Perl test case (I don't have Java):

$str = '
string1=injunct! alter ego."
string2=successor "alter ego" single employer "a" free" proceeding "citation assets"
';

print "\n'$str'\n";

$str =~ s
/
  (?:
     (
       "[a-zA-Z0-9\p{Punct}]
        [^"\n]*
        (?<=[a-zA-Z0-9\p{Punct}])
       "
     )
   |
       (?<![a-zA-Z0-9\p{Punct}])
       " 
     (  [^"\n]*  )
       " (?![a-zA-Z0-9\p{Punct}])
   |
       "
  )
/$1$2/xg;

print "\n'$str'\n";

Output

'
string1=injunct! alter ego."
string2=successor "alter ego" single employer "a" free" proceeding "citation assets"
'

'
string1=injunct! alter ego.
string2=successor "alter ego" single employer "a" free proceeding "citation assets"
'
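
Since the test case above is Perl only, here is a rough sketch of how the same regex might be ported to Java. The names `BiasedQuoteCleaner`, `EDGE`, and `clean` are hypothetical; the pattern and the `$1$2` replacement are taken from the answer. One caveat: Java's `\p{Punct}` and Perl's `\p{Punct}` are not guaranteed to cover exactly the same characters, so results may differ on some inputs.

```java
import java.util.regex.Pattern;

class BiasedQuoteCleaner {
    // Characters allowed to sit directly inside a valid quote pair.
    private static final String EDGE = "[a-zA-Z0-9\\p{Punct}]";

    private static final Pattern QUOTES = Pattern.compile(
            "(?:"
            // Group 1: valid pair, kept whole.
            + "(\"" + EDGE + "[^\"\\n]*(?<=" + EDGE + ")\")"
            // Group 2: invalid pair (with bias), only the inner text is kept.
            + "|(?<!" + EDGE + ")\"([^\"\\n]*)\"(?!" + EDGE + ")"
            // Invalid single quote, dropped.
            + "|\""
            + ")");

    static String clean(String input) {
        // A group that did not participate in the match contributes nothing.
        return QUOTES.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String str = "string1=injunct! alter ego.\"\n"
                + "string2=successor \"alter ego\" single employer \"a\" free\" "
                + "proceeding \"citation assets\"\n";
        System.out.print(clean(str));
    }
}
```

The second alternative is what distinguishes this from the simpler answer: for an input such as `foo " bar " baz`, both quotes are removed as an invalid pair while the text between them is kept.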
