Should regex be used in a parser for an interpreter or compiler?

Time: 2022-09-03 20:45:12

When parsing a grammar, should RegEx be used to match the parts of the grammar that can be expressed as regular languages, or should the current parser design be used exclusively?

For example, the EBNF grammar for JSON can be expressed as:

object ::= '{' '}' | '{' members '}';
members ::= pair | pair ',' members;
pair ::= string ':' value;
array ::= '[' ']' | '[' elements ']';
elements ::= value | value ',' elements;
value ::= string | number | object | array | 'true' | 'false' | 'null';

So the grammar would need to be matched using some type of parser (such as a recursive descent parser or an ad hoc parser), but the grammar for some of the values (such as number) can be expressed as a regular language, as in this RegEx pattern for number:

-?\d+(\.\d+)?([eE][+-]?\d+)?

Given this example, assuming one is creating a recursive descent JSON parser... should the number be matched via the recursive descent technique or should the number be matched via RegEx since it can be matched easily using RegEx?
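
To make the hybrid option concrete, here is a minimal sketch in Python (an assumption; the question names no language). It handles only a subset of the grammar above (arrays and numbers, no whitespace), and the function names `parse_value` and `parse_array` are hypothetical. The recursive part is hand-written, and the number is matched with the pattern from the question. Note that this pattern also accepts leading zeros such as `007`, which the JSON specification does not allow.

```python
import re

# The number pattern from the question, compiled once and reused.
NUMBER = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")

def parse_value(text, pos):
    """Parse a number or a (possibly nested) array starting at text[pos].

    Returns (value, next_pos). Only a subset of JSON is handled.
    """
    if pos < len(text) and text[pos] == "[":
        return parse_array(text, pos)
    m = NUMBER.match(text, pos)  # the regex handles the regular part
    if m is None:
        raise ValueError(f"expected a value at offset {pos}")
    return float(m.group()), m.end()

def parse_array(text, pos):
    """array ::= '[' ']' | '[' elements ']' (no whitespace handling)."""
    pos += 1  # skip '['
    items = []
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    while True:
        value, pos = parse_value(text, pos)  # recursion handles nesting
        items.append(value)
        if pos < len(text) and text[pos] == ",":
            pos += 1
        elif pos < len(text) and text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at offset {pos}")
```

For example, `parse_value("[1,[2.5,-3e2],[]]", 0)` returns the nested list together with the offset just past the closing bracket.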

1 solution

#1


This is a very broad and opinion-based question. To my knowledge, though, you usually want a parser to be as fast as possible and to have the smallest possible memory footprint, especially if it needs to parse in real time (on demand).

A RegEx will certainly do the job, but it is like shooting a fly with a nuclear weapon!

This is why many parsers are written in a low-level language like C: to take advantage of string pointers and avoid the overhead of high-level languages like Java, with its immutable strings, garbage collector, and so on.

However, this depends heavily on your use case and cannot really be answered in a generic way. You should weigh the developer convenience of using RegEx against the performance of the parser.

One additional consideration is that you usually want the parser to indicate where a syntax error occurred and what type of error it is. A RegEx will simply fail to match, and you will have a hard time finding out why it stopped in order to display a proper error message. With an old-school hand-written parser, you can stop parsing as soon as you encounter a syntax error, and you know precisely what did not match and where.
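
The difference in error reporting can be sketched as follows. This is an illustration in Python under my own assumptions, not a reference implementation; `scan_number` is a hypothetical helper that follows the same number syntax as the pattern in the question. Unlike the RegEx, it commits once it has seen a `.` or an exponent marker, so it can say what was expected and at which offset.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")
ASCII_DIGITS = "0123456789"

def scan_number(text, pos=0):
    """Scan a number by hand and return the index just past it.

    Raises ValueError naming what was expected and where.
    """
    def digits(i, what):
        # Consume one or more digits starting at i, or report the failure.
        start = i
        while i < len(text) and text[i] in ASCII_DIGITS:
            i += 1
        if i == start:
            raise ValueError(f"expected {what} at offset {i}")
        return i

    i = pos
    if i < len(text) and text[i] == "-":
        i += 1
    i = digits(i, "digit")
    if i < len(text) and text[i] == ".":
        i = digits(i + 1, "digit after '.'")
    if i < len(text) and text[i] in "eE":
        i += 1
        if i < len(text) and text[i] in "+-":
            i += 1
        i = digits(i, "digit in exponent")
    return i

# The RegEx only says "no match" for the whole string ...
print(NUMBER_RE.fullmatch("12.e5"))  # None
# ... while the hand-written scanner says what went wrong and where.
try:
    scan_number("12.e5")
except ValueError as err:
    print(err)  # expected digit after '.' at offset 3
```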

In your specific case of JSON parsing with RegEx used only for numbers, I suppose you are probably already using a high-level language, so what many implementations do is rely on the language's native number parsing. Just extract the value (string, number, ...) using the delimiters and let the programming language throw an exception if the number fails to parse.
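
A minimal sketch of that last approach, assuming Python as the host language: `read_number` and `DELIMITERS` are hypothetical names, and the delimiter set is my guess at what can follow a number in JSON. One caveat worth knowing: Python's `float()` is more permissive than JSON and also accepts inputs such as `inf`, `nan`, or `1_000`, so a strict parser would still need an extra check.

```python
# Characters that can follow a number in JSON text (an assumption for this sketch).
DELIMITERS = ",]} \t\r\n"

def read_number(text, pos):
    """Slice from pos up to the next delimiter and let float() validate it.

    Returns (value, next_pos); raises ValueError with the offending token.
    """
    end = pos
    while end < len(text) and text[end] not in DELIMITERS:
        end += 1
    token = text[pos:end]
    try:
        return float(token), end
    except ValueError:
        # Re-raise with position information for a usable error message.
        raise ValueError(f"invalid number {token!r} at offset {pos}") from None
```

Calling `read_number("[12.5,3]", 1)` yields the parsed value and the offset of the comma that ended the token, while a token like `1.2.3` makes `float()` raise and the error is reported with its offset.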
