What is the cross-platform regex for removing line breaks?

Time: 2022-05-31 02:38:30

I am sure this has been asked before, but I cannot find it.

Basically, assuming you are parsing a text file of unknown origin and want to replace line breaks with some other delimiter, is this the best regex, or is there another?

(\r\n)|(\n)|(\r)

5 Solutions

#1


34  

Fletcher - this did get asked once before.

Here you go: Regular Expression to match cross platform newline characters

  • Spoiler Alert!

The regex I use when I want to be precise is "\r\n?|\n".
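
A minimal sketch of how that pattern might be applied in JavaScript, assuming the goal is to swap each line break for a single delimiter. The helper name replaceLineBreaks is made up for illustration and is not from the answer:

```javascript
// Hypothetical helper: replace every CRLF, lone CR, or lone LF with `delimiter`.
function replaceLineBreaks(text, delimiter) {
  // "\r\n?" consumes a CR plus an optional LF, so a Windows CRLF counts as one break.
  return text.replace(/\r\n?|\n/g, delimiter);
}

const mixed = "unix\nwindows\r\nclassic mac\rend";
console.log(replaceLineBreaks(mixed, "|"));
// unix|windows|classic mac|end
```

Because the CR alternative comes first and takes the LF with it, a CRLF pair never produces two delimiters.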

#2


20  

Do check whether your regex engine supports \R as a shorthand character class; if it does, you will not need to be concerned with the various Unicode newline / linefeed combos. If implemented correctly, \R matches all the various ASCII or Unicode line endings transparently.
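
One way to run that check, sketched for JavaScript only. The function name supportsBackslashR is hypothetical. It relies on the "u" flag, under which JavaScript rejects unknown escapes instead of quietly accepting them:

```javascript
// Hypothetical probe: does this engine treat \R as "any line break"?
function supportsBackslashR() {
  try {
    // With the "u" flag an unrecognised escape is a SyntaxError.
    return new RegExp("\\R", "u").test("\r\n");
  } catch (e) {
    return false; // \R is not a recognised escape in this engine
  }
}

console.log(supportsBackslashR());

// Without the "u" flag, \R is just an escaped letter R, which is easy to miss:
console.log(/\R/.test("R"));  // true
console.log(/\R/.test("\n")); // false
```

The last two lines are the reason for probing rather than assuming: a pattern containing \R can compile without error and still match nothing useful.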

In Unicode you need to detect NEL (OS/390 line ending, \x85), LS (Line Separator, \x2028) and PS (Paragraph Separator, \x2029) if you want to be completely cross-platform these days.

It is debatable whether LS, NEL, and PS should be treated as line breaks, line endings, or white space. The XML 1.0 standard, for example, does not recognize NEL as a line break character. ECMAScript treats LS and PS as line breaks but NEL as whitespace. Perl Unicode regexes will treat VT, FF, CR, CRLF, NEL, LS and PS as line breaks for the purpose of the ^ and $ regex metacharacters.

The Unicode Implementation Guide (section 5.8 and table 5.3) is probably the best bet for a definitive treatment of what a "newline" is.

If you are only concerned with ASCII and the DOS/Windows/Unix/classic Mac variants, the regex equivalent to \R is (?>\r\n|[\r\n])

In Unicode, the equivalent to \R is (?>\r\n|\n|\x0b|\f|\r|\x85|\x2028|\x2029). The \x0b in there is a vertical tab; once again, this may or may not fit your definition of what a line break is, but it does match the recommendation of the Unicode Implementation Guide. (FF, or \x0C is not included in the regex since a Form Feed is a new page, not a new line in the definition.)
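
A rough JavaScript rendering of that alternation, under a few assumptions. JavaScript has no atomic groups, so the (?>...) wrapper is dropped, and LS and PS are written as \u2028 and \u2029 because JavaScript's \x takes only two hex digits. The constant name ANY_LINE_BREAK is made up for this sketch. It mirrors the pattern as quoted, \f included; drop \f if a form feed is not a line break for your purposes:

```javascript
// CRLF comes first so it counts as one break; then each single-character break.
// \v = VT (U+000B), \f = FF (U+000C), \x85 = NEL, \u2028 = LS, \u2029 = PS.
const ANY_LINE_BREAK = /\r\n|[\n\v\f\r\x85\u2028\u2029]/g;

const sample = "a\r\nb\u0085c\u2028d\u2029e\vf";
console.log(sample.replace(ANY_LINE_BREAK, "|"));
// a|b|c|d|e|f
```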

#3


2  

The regex to find any Unicode line terminator should be (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}]) rather than as drewk wrote it, at least in Perl. This is taken directly from the Perl 5.10.0 documentation (it was removed in later versions). Note the braces after \x: U+2029 is \x{2029}, but \x2029 is an ASCII space (U+0020) followed by a digit 2 and a digit 9. Also, \n outside a character class is not guaranteed to match \x{0a}.
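
The same trap exists in JavaScript, where \x always takes exactly two hex digits. A short sketch to illustrate (JavaScript rather than Perl, so the escape spellings differ):

```javascript
// \x2029 is parsed as \x20 (a space) followed by the literal digits "29".
console.log(/\x2029/.test(" 29"));       // true
console.log(/\x2029/.test("\u2029"));    // false

// \u2029 (four hex digits), or \u{2029} with the "u" flag, names U+2029 itself.
console.log(/\u2029/.test("\u2029"));    // true
console.log(/\u{2029}/u.test("\u2029")); // true
```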

#4


1  

If your platform does not support the \R class suggested by @dawg above, you may still be able to build a pretty elegant and robust solution, provided it supports negative lookaround or character class subtraction (e.g. in Java, class subtraction uses the syntax [x&&[^y]]).

In most regular expression grammars, the dot character is defined to mean "any character except the newline character" (see for example, for JavaScript, here). So you can find a newline by matching something with the following two characteristics:

  1. not (any character except the newline character) → the newline character; and
  2. is whitespace

I'm currently working in JavaScript, which AFAIK has neither the \R shorthand nor character class subtraction, but I can still use negative lookahead to get what I want. The following regular expression matches all newlines:

/((?!.)\s)+/g

And the following JavaScript code, at least when run in Chrome 42.0.2311.90m on Windows 7, wipes out all the kinds of newlines that JavaScript (i.e. the "ECMAScript" mentioned in @dawg's third paragraph) recognizes:

var input = "hello\r\n\f\v\u2028\u2029 world";
var output = input.replace(/((?!.)\s)+/g, "");
document.write(output); // hello world
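
To see exactly which characters that pattern removes, it can help to print the result with JSON.stringify rather than writing it to the page. A small sketch, assuming Node.js or a browser console. Since the JavaScript dot excludes only \n, \r, U+2028 and U+2029, the \f and \v in the sample input are not treated as newlines and stay in the output:

```javascript
const input = "hello\r\n\f\v\u2028\u2029 world";
const output = input.replace(/((?!.)\s)+/g, "");

// JSON.stringify makes the surviving control characters visible.
console.log(JSON.stringify(output));
// "hello\f\u000b world"
```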

#5


0  

Just replace /[\r\n]+/g with an empty string "".

It'll replace all \r and \n no matter what order they appear in the string.
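
One thing to watch if the replacement is a delimiter rather than an empty string: the + makes a whole run of line breaks collapse into a single match, so blank lines disappear. A quick comparison with the pattern from answer #1, as a sketch:

```javascript
const text = "a\r\n\r\nb"; // two CRLF breaks, i.e. one blank line between a and b

console.log(text.replace(/[\r\n]+/g, "|"));  // a|b   (run collapsed into one)
console.log(text.replace(/\r\n?|\n/g, "|")); // a||b  (one delimiter per break)
```

When replacing with "" the two patterns give the same result, so the difference only matters when empty lines need to be preserved as empty fields.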
