I need to wrap each word from a list of Unicode words in {} wherever it appears in a Unicode string. Here is my code:

var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|tw|two two|two|twöu|three|föur)(?=\\W|$)", "gi");
alert(txt.replace(re, '$1{$2}'));

It returns:

¿{One};{one} {one}é {two two} {two two} {two} {tw}ö {tw}öu {three};;{tw}ä;{föur}?

But it should be:

¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?

What am I doing wrong?

3 solutions

#1

The Problem

What am I doing wrong?

Unfortunately, the answer is that you are doing nothing wrong. JavaScript is.

The problem is that JavaScript does not support Unicode regular expressions as they are spelled out in The Unicode Standard.

There is, however, a rather nice library called XRegExp, which has a Unicode plugin that helps a great deal. I recommend it, albeit with several notable caveats. You need to know what it can do, and what it cannot.

What It Does

Corrects various bugs and inconsistencies in JavaScript implementations, including in its split function.

Supports the BMP code points covered by the 6.1 release of the Unicode Character Database, from January 2012.

Correctly ignores case, space, hyphen-minuses, and underscores in Unicode property names, per The Standard — something which even Java gets wrong.

Supports the Unicode General Categories like \p{L} for letters and \p{Sc} for currency symbols.

Supports the standard full property names, like \p{Letter} for \p{L} and \p{Currency_Symbol} for \p{Sc}.

Supports the Unicode Script properties, like \p{Latin}, \p{Greek}, and \p{Common}.

Supports the Unicode Block properties, like \p{InBasic_Latin} and \p{InMathematical_Alphanumeric_Symbols}.

Supports the other 9 Unicode properties needed for level-1 compliance: \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{Any}, \p{ASCII}, and \p{Assigned}.

Supports named captures instead of just numbered ones, using standard notation to do so: (?<NAME>⋯) to declare a named group, \k<NAME> to backref it by name, and use ${NAME} in the replacement pattern (and in general access it using result.NAME in your code). This is the same syntax used by Perl 5.10, Java 7, .ɴᴇᴛ, and several other languages. It makes writing complex regexes a lot easier by letting you name parts instead of just numbering them, so that when you move stuff around you don’t have to recalculate the numbered variables.

Supports /s ᴀᴋᴀ (?s) mode so that dot matches any single code point, rather than anything except for a linebreak sequence. Most other regex engines support this mode.

Supports /x ᴀᴋᴀ (?x) mode so that whitespace and comments are ignored (if unescaped). Most regex engines support this mode. It is absolutely indispensable for creating legible — and hence, maintainable — patterns.

Supports embedded comments even when not in /x mode using the standard (?#⋯) notation to do so (such as seen in Perl). This lets you put comments in individual regex pieces without going all the way to /x mode, which is often important in developing more complex patterns, by allowing you to build them up piece-wise.

Supports extensibility, so that you can add new token types if you want, such as \a to mean the ALERT character, or the POSIXish character classes.
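
Several of the notations listed above also exist natively in engines that implement ES2018, which postdates this answer. The sketch below assumes such an engine (a recent Node.js or browser) and deliberately does not load XRegExp, so it says nothing about XRegExp's own API; the helper names are made up for illustration. Two differences from the notation described above are worth noting: native patterns need the u flag for \p{…}, and a named group is written as $<NAME> in the replacement string.

```javascript
// Assumes an ES2018-capable engine; XRegExp itself is not used here.

// \p{L} (any letter) and \p{Sc} (currency symbol) need the "u" flag natively.
function countLetters(text) {
  const letters = text.match(/\p{L}/gu);
  return letters === null ? 0 : letters.length;
}

function hasCurrencySymbol(text) {
  return /\p{Sc}/u.test(text);
}

// Named captures: (?<NAME>...) declares the group, $<NAME> refers to it
// in the replacement string.
function swapDate(isoDate) {
  return isoDate.replace(
    /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u,
    "$<day>/$<month>/$<year>"
  );
}

// The "s" flag makes dot match line breaks as well.
function dotCrossesLines(text) {
  return /a.b/s.test(text);
}

console.log(countLetters("tw\u00f6u 42"));        // 4
console.log(hasCurrencySymbol("cost: \u20ac5"));  // true
console.log(swapDate("2012-01-31"));              // 31/01/2012
console.log(dotCrossesLines("a\nb"));             // true
```

Block properties such as \p{InBasic_Latin}, loose matching of property names, and /x mode are not part of this native support, so the sketch does not attempt them.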

What It Doesn’t

You should be careful, however, about the things that it does not do:

Does not support full Unicode, but only code points from Plane 0. This is a forbidden restriction, as The Unicode Standard requires that there be no difference between astral and non-astral code points in a regular expression. Even Java did not get this right until JDK7. (However, the v2.1.0 development version does support full Unicode.)

Does not support \X for grapheme clusters, or \R for linebreak sequences.

Does not support two-part properties, like \p{GC=Letter}, \p{Block=Phonetic_Extensions}, \p{Script=Greek}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.

It does not update the character class shortcuts to operate per the requirements of UTS#18. Standard JavaScript only allows \s to match the Unicode \p{White_Space} property; it does not allow \d to match \p{Nd} (although some old browsers will do that anyway!) nor \w to match [\p{Alphabetic}\pM\p{Nd}\p{Pc}], let alone providing Unicode-aware versions of \b and \B, all of which are part of the requirements for supporting Unicode Regular Expressions.

It does not support some commonly used properties. In practice, the one that is missing is \p{digit}, and perhaps also the rather useful \p{Dash}, \p{Math}, \p{Diacritic}, and \p{Quotation_Mark} properties.

Has no support for grapheme clusters such as using \X or even via (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). This is a really big deal.
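
The practical effect of the ASCII-only shortcuts mentioned above is easy to check. The following sketch uses only built-in JavaScript regexes without the u flag; it shows why the pattern in the question treats é and ö as word separators. The variable names are illustrative.

```javascript
// Natively, \w is [A-Za-z0-9_] and \W is everything else,
// so accented letters count as non-word characters.
const accented = "\u00e9"; // é

const isWordChar = /\w/.test(accented);    // false
const isNonWordChar = /\W/.test(accented); // true

// \b is defined in terms of \w, so it sees a boundary between "e" and "é".
const boundaryInsideWord = /\bone\b/.test("one\u00e9"); // true

// \d stays ASCII-only as well: ARABIC-INDIC DIGIT THREE is \p{Nd} but not \d.
const matchesArabicIndicDigit = /\d/.test("\u0663"); // false

console.log(isWordChar, isNonWordChar, boundaryInsideWord, matchesArabicIndicDigit);
```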

Workarounds

Here are a few workarounds to handle a few of the places where the library doesn’t follow The Unicode Standard:

For the missing \w, you can use [\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers, as they’re not \p{Nd}-type numbers which are the only ones that count as alphanumeric.

For the missing \W, you can therefore use the set-complement of the previous one, so [^\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers.

Since \b is really the same as (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), you could plug that \w definition into that sequence to create a Unicode-aware version of \b — provided that JavaScript supported all four directions of lookaround, which when last I checked, it did not. You have to have both positive and negative lookbehind, not just lookahead, to do this correctly. Javascript neglects to support those, at least as far as I can see.

Since \B is really the same as (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), you could do the same, but subject to the same conditions.

For the missing \X, you can get sorta close by using \P{M}\p{M}*, but that incorrectly splits up CRLF constructs and allows marks on the same, all of which is really quite wrong.

For the missing \R, you can construct a work-around using (?:\r\n|[\n-\r\u0085\u2028\u2029]).
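
As a sketch of the \b emulation described above: engines that implement ES2018 (which postdates this answer) support lookbehind and \p{…} natively under the u flag, so the four-way lookaround can be written out directly. This assumes such an engine and does not use XRegExp. The block property \p{InEnclosedAlphanumerics} is left out because native regexes have no block properties, and the names WORD_CHAR, BOUNDARY, and wrapWords are made up for illustration.

```javascript
// Assumes an ES2018-capable engine (lookbehind + \p{...} with the "u" flag).

// Approximate Unicode-aware \w from the workaround, minus the block property.
const WORD_CHAR = "[\\p{L}\\p{Nl}\\p{Nd}\\p{M}]";

// \b spelled out with all four lookarounds, using WORD_CHAR instead of \w.
const BOUNDARY =
  "(?:(?<=" + WORD_CHAR + ")(?!" + WORD_CHAR + ")" +
  "|(?<!" + WORD_CHAR + ")(?=" + WORD_CHAR + "))";

// Hypothetical helper: wrap each listed word in {} when it stands alone.
// The words are assumed to contain no regex metacharacters.
function wrapWords(text, words) {
  const re = new RegExp(BOUNDARY + "(" + words.join("|") + ")" + BOUNDARY, "giu");
  return text.replace(re, "{$1}");
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
const words = ["one", "tw", "two two", "two", "twöu", "three", "föur"];

console.log(wrapWords(txt, words));
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```

Because the boundary is zero-width on both sides, no separator is consumed and the replacement only needs the single captured word.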

Summary

The conclusion is that JavaScript’s regexes are completely unsuited for Unicode work. However, the XRegExp plugin moves closer to making that feasible. If you can live with its restrictions, this is probably easier than switching to a different but Unicode-aware programming language. It’s certainly better than being unable to use Unicode regexes even at all.

However, it is still a rather long way from meeting even the most basic requirements (Level 1 support) for Unicode regexes as spelled out in the standard. Someday you are going to want to be able to match characters whether they have accent marks on them or not, or which sit in the Mathematical Alphanumeric Symbols block, or which use the Unicode case-mapping and case-folding definitions, or which follow The Unicode Standard for alphanumeric sorts or for line- and word-breaking, and you cannot do any of those things in JavaScript even with the plug-in.

So you might wish to consider using a language that is compliant with The Unicode Standard if you actually need to handle Unicode. Javascript just doesn’t manage that.

#2

Firstly, unless the regex is dynamic, please use the /.../gi notation.

The reason it returns the wrong value is that \W in JavaScript is really just [^0-9a-zA-Z_]. Accented characters like é are not considered word characters. You need to exclude them manually.

var re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;
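
For completeness, here is that pattern applied to the string from the question. This is a sketch that assumes the replacement string "$1{$2}" (no space between the two references), which is what the expected output in the question implies.

```javascript
var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";

// The accented letters that occur in the input are listed explicitly,
// because \W alone would treat them as separators.
var re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;

var result = txt.replace(re, "$1{$2}");
console.log(result);
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```

The character class only covers the accented letters named in it, so any other non-ASCII letter in the input would still be treated as a separator.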

#3

Try this:

var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|two two|two|twöu|three|föur)(?=[^a-zé]|$)", "gi");
alert(txt.replace(re, '$1{$2}'));

Let me know in case it doesn't work...

