Regexp to search/replace only text, not HTML attributes

Time: 2022-09-13 09:40:18

I'm using JavaScript to apply a regular expression. Given that I'm working with well-formed source, I want to remove any space before [,.] and keep exactly one space after [,.], except when the [,.] is part of a number. So I use:

text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');

The problem is that this also replaces text inside HTML tag attributes. For example, my text is (always wrapped in a tag):

<p>Test,and test . Again <img src="xyz.jpg"> ...</p>

Now it adds a space, producing src="xyz. jpg", which is not what I want. How can I rewrite my regular expression? What I want is:

<p>Test, and test. Again <img src="xyz.jpg"> ...</p>

Thanks!

6 answers

#1


4  

You can use a lookahead to make sure the match isn't occurring inside a tag:

text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');

The usual warnings apply regarding CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values. But I suspect your real problems will arise from the vagaries of "plain" text; HTML's not even in the same league. :D
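
To make the effect of the lookahead concrete, here is a small check that runs both expressions on the sample from the question. The variable names are only illustrative, and the last case is one hand-picked instance of the "angle brackets in attribute values" caveat, not an exhaustive list of failure modes.

```javascript
// Sample text from the question.
var sample = '<p>Test,and test . Again <img src="xyz.jpg"> ...</p>';

var original  = / *(,|\.) *([^ 0-9])/g;            // regex from the question
var lookahead = /(?![^<>]*>) *([.,]) *([^ \d])/g;  // regex from this answer

var withOriginal  = sample.replace(original, '$1 $2');
var withLookahead = sample.replace(lookahead, '$1 $2');

// The original regex reaches into the tag and alters the attribute value.
console.log(withOriginal.indexOf('src="xyz. jpg"') !== -1);         // true
// With the lookahead, the attribute is left alone and the text is still fixed.
console.log(withLookahead.indexOf('src="xyz.jpg"') !== -1);         // true
console.log(withLookahead.indexOf('Test, and test. Again') !== -1); // true

// Caveat: a "<" inside an attribute value stops the lookahead before it
// reaches the closing ">", so the comma in the attribute is treated as text.
var tricky = '<img alt="a,b<c" src="x.jpg">';
console.log(tricky.replace(lookahead, '$1 $2'));
// → <img alt="a, b<c" src="x.jpg">
```

The lookahead `(?![^<>]*>)` only asks "is the next angle bracket a `>`?", which is why it works on simple markup and breaks as soon as brackets appear where it does not expect them.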

#2


1  

Do not try to rewrite your expression to do this. You won’t succeed and will almost certainly forget about some corner cases. In the best case, this will lead to nasty bugs and in the worst case you will introduce security problems.

Instead, when you’re already using JavaScript and have well-formed code, use a genuine XML parser to loop over the text nodes and only apply your regex to them.
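
A minimal sketch of that idea, assuming the code runs in a browser where `DOMParser` and `NodeFilter` are available. The function names are hypothetical. It parses in HTML mode rather than XML mode, because the sample in the question has an unclosed `<img>`; for strictly well-formed XML, `'application/xml'` could be tried instead.

```javascript
// The punctuation fix from the question, applied to a single string.
function fixPunctuationText(s) {
    return s.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
}

// Hypothetical helper: parse the markup, fix only the text nodes,
// and serialize the result again. Requires a browser environment.
function fixPunctuationInMarkup(markup) {
    var doc = new DOMParser().parseFromString(markup, 'text/html');
    // Visit text nodes only; attributes are never touched.
    var walker = doc.createTreeWalker(doc.body, NodeFilter.SHOW_TEXT);
    var node;
    while ((node = walker.nextNode())) {
        node.nodeValue = fixPunctuationText(node.nodeValue);
    }
    return doc.body.innerHTML;
}
```

In a browser console, calling `fixPunctuationInMarkup('<p>Test,and test . Again <img src="xyz.jpg"></p>')` should fix the sentence while leaving the `src` attribute as it was. Note that the serializer may normalize the markup (for example entity escaping), so the output is not guaranteed to be byte-identical outside the text nodes.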

#3


1  

If you can access that text through the DOM, you can do this:

function fixPunctuation(elem) {
    // check that the parameter is an ELEMENT_NODE
    if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
    var children = elem.childNodes, node;
    // iterate the child nodes of the element node
    for (var i=0; children[i]; ++i) {
        node = children[i];
        // check the child’s node type
        switch (node.nodeType) {
        case Node.ELEMENT_NODE:
            // call fixPunctuation if it’s also an ELEMENT_NODE
            fixPunctuation(node);
            break;
        case Node.TEXT_NODE:
            // fix punctuation if it’s a TEXT_NODE
            node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
            break;
        }
    }
}

Now just pass the DOM node to that function like this:

fixPunctuation(document.body);
fixPunctuation(document.getElementById("foobar"));
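
One possible refinement, sketched here as an assumption rather than as part of the answer above: when the walk starts at `document.body`, it will also rewrite the contents of SCRIPT and STYLE elements, which is rarely wanted. The hypothetical variant below skips them and uses the numeric `nodeType` values (1 for elements, 3 for text) so it does not depend on a global `Node` object.

```javascript
var ELEMENT_NODE = 1, TEXT_NODE = 3;
// Elements whose text content should not be re-punctuated (an assumption).
var SKIPPED = { SCRIPT: true, STYLE: true };

// Hypothetical variant of fixPunctuation that leaves SCRIPT/STYLE alone.
function fixPunctuationSkipping(elem) {
    if (!elem || elem.nodeType !== ELEMENT_NODE) return;
    if (SKIPPED[String(elem.nodeName).toUpperCase()]) return;
    var children = elem.childNodes;
    for (var i = 0; i < children.length; ++i) {
        var node = children[i];
        if (node.nodeType === ELEMENT_NODE) {
            // recurse into child elements
            fixPunctuationSkipping(node);
        } else if (node.nodeType === TEXT_NODE) {
            // same replacement as in the question, applied to text only
            node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
        }
    }
}
```

It is called the same way, for example `fixPunctuationSkipping(document.body);`.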

#4


0  

HTML is not a "regular language", so regex is not the optimal tool for parsing it. You might be better off using an HTML parser like this one to get at the attribute, and then applying a regex to do something with the value.

Enjoy!

#5


0  

As stated above and many times before, HTML is not a regular language and thus cannot be parsed with regular expressions.

You will have to do this recursively; I'd suggest crawling the DOM object.

Try something like this...

function regexReplaceInnerText(curr_element) {
    if (curr_element.childNodes.length <= 0) { // termination case:
                                               // no children; this is a "leaf node"
        if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) { // node is text; not an empty tag like <br />
            if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") { // node isn't just white space
                                                                     // (you can skip this check if you want)
                var text = curr_element.data;
                text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                curr_element.data = text;
            }
        }
    } else {
        // recursive case:
        // this isn't a leaf node, so we iterate over all children and recurse
        for (var i = 0; curr_element.childNodes[i]; i++) {
            regexReplaceInnerText(curr_element.childNodes[i]);
        }
    }
}
// then get the element whose children's text nodes you want to be regex'd
regexReplaceInnerText(document.getElementsByTagName("body")[0]);
// or if you don't want to do the whole document...
regexReplaceInnerText(document.getElementById("ElementToRegEx"));

#6


0  

Don't parse HTML with regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first and then use an XML parser.
