How can I use a regular expression to remove an entire HTML tag (and its content) by its class?

Date: 2022-11-28 09:21:38

I am not very good with Regex but I am learning.

I would like to remove an HTML tag by its class name. This is what I have so far:

<div class="footer".*?>(.*?)</div>

The first .*? is there because the tag might contain other attributes, and the second because the div might contain other HTML.

What am I doing wrong? I have tried a lot of variations without success.

Update

The DIV can contain multiple lines, and I am using Perl regexes.

8 answers

#1


13  

You will also want to allow for other attributes before class in the div tag:

<div[^>]*class="footer"[^>]*>(.*?)</div>

Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?
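
As a rough sketch of how this pattern could be applied in Perl (the sample markup and the variable names $html and $clean are made up for illustration): using {} as the delimiters avoids having to escape the slash in </div>, the i flag makes the match case-insensitive, and the s flag lets the dot cross newlines. It assumes the footer contains no nested div.

```perl
use strict;
use warnings;

# Hypothetical sample input: the footer spans several lines and has
# an extra attribute before class.
my $html = <<'HTML';
<p>Body</p>
<DIV id="f" class="footer">
  Copyright
  <a href="/about">About</a>
</DIV>
<p>After</p>
HTML

# s{...}{} deletes each match: i = case-insensitive, s = dot matches
# newlines, g = every occurrence. Only safe when the footer holds no
# nested <div>.
(my $clean = $html) =~ s{<div[^>]*class="footer"[^>]*>.*?</div>}{}gis;

print $clean;
```

The copy-then-substitute form keeps the original string untouched, which makes it easy to compare before and after.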

Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:

<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>

Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
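
To make the failure concrete, here is a small Perl sketch that runs the lazy pattern over the nested structure above and counts the tags left behind. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>
HTML

# The lazy .*? stops at the first </div>, which closes the inner
# <div>Hi!</div>, not the footer itself.
(my $broken = $html) =~ s{<div[^>]*class="footer"[^>]*>.*?</div>}{}gis;

# Count what is left: the footer's own </div> is now orphaned.
my $opens  = () = $broken =~ /<div\b/g;
my $closes = () = $broken =~ m{</div>}g;

print "opening tags: $opens, closing tags: $closes\n";
```

With this input the result has one opening div and two closing ones, so the document is no longer well formed.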

Pseudocode that should map closely to XML::DOM:

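document = //load document
divs = document.getElementsByTagName("div");
for(div in divs) {
    if(div.getAttribute("class") == "footer") {
        parent = div.getParent();
        // Move each child in front of the div so the content survives.
        // Iterate over a copy, since moving nodes changes the child list.
        for(child in copyOf(div.getChildren())) {
            // filter node types?
            parent.insertBefore(child, div);   // (node to insert, reference node)
        }
        parent.removeChild(div);   // skip the loop above to drop the content too
    }
}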


Perl has libraries for this, such as HTML::DOM and XML::DOM, and .NET has built-in libraries for DOM parsing.

#2


17  

As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );

for my $node ( $tree->findnodes( '//*[@class="footer"]' ) ) {
    $node->replace_with_content;   # delete element, but not the children
}

print $tree->as_HTML;
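
If the goal is to drop the element together with its content, as the question title asks, a variant along these lines might work. It is a sketch that assumes HTML::TreeBuilder::XPath is installed from CPAN; the sample markup is invented, and delete is the HTML::Element method that removes a node along with its descendants.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document with a nested div inside the footer.
my $html = '<html><body><p>Body</p>'
         . '<div class="footer"><div>Hi!</div></div>'
         . '</body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);   # parse a string instead of a file

# delete() removes the element and everything below it, unlike
# replace_with_content, which keeps the children.
$_->delete for $tree->findnodes('//div[@class="footer"]');

my $out = $tree->as_HTML;
print $out, "\n";
```

Because the parser tracks nesting, the inner div goes away with the footer and nothing is left dangling.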

#3


1  

In Perl you need the /s modifier, otherwise the dot won't match a newline.
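
A minimal Perl sketch of the difference, using the pattern from the question against a made-up multi-line footer:

```perl
use strict;
use warnings;

# Hypothetical input whose footer content spans several lines.
my $html = "<div class=\"footer\">\nline one\nline two\n</div>";

# Without /s the dot stops at each newline, so the pattern cannot
# reach the closing tag.
my $without_s = $html =~ m{<div class="footer".*?>(.*?)</div>}  ? 1 : 0;

# With /s the dot also matches newlines.
my $with_s    = $html =~ m{<div class="footer".*?>(.*?)</div>}s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```

This prints "without /s: 0, with /s: 1".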

That said, using a proper HTML or XML parser to remove unwanted parts of an HTML file is much more appropriate.

#4


0  

It partly depends on the exact regex engine you are using (which language, etc.). One possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case-insensitive:

<div class=\"footer\".*?>(.*?)<\/div>

Otherwise, please say which language or platform you are using: .NET, Java, Perl ...

#5


0  

Try this:

<(\w+)[^>]*class="footer"[^>]*>([\s\S]*?)</(\w+)>

Your biggest problem is going to be nested tags. For example:

<div class="footer"><b></b></div>

The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.

#6


0  

This will be tricky because of the greediness of regular expressions. (Note that my examples may be specific to Perl, but greediness is a general issue with REs.) If the second quantifier is a greedy .* rather than the lazy .*?, it will match as much as possible before the last </div>, so if you have the following:

<div class="SomethingElse"><div class="footer"> stuff </div></div>

The expression will match:

<div class="footer"> stuff </div></div>

which is not likely what you want.
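
As a quick check of how greedy and lazy quantifiers differ on the input above, here is a small Perl sketch. The variable names are mine, and the greedy variant simply swaps .*? for .* in the pattern from the question.

```perl
use strict;
use warnings;

my $html = '<div class="SomethingElse"><div class="footer"> stuff </div></div>';

# Greedy .* runs to the last </div> in the input.
my ($greedy) = $html =~ m{(<div class="footer".*>.*</div>)}s;

# Lazy .*? stops at the first </div> it can reach.
my ($lazy)   = $html =~ m{(<div class="footer".*?>.*?</div>)}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy match swallows the outer closing tag as well, while the lazy one ends at the footer's own </div>. The lazy form in turn ends too early as soon as the footer contains a nested div.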

#7


0  

<div[^>]*class="footer"[^>]*>(.*?)</div>

This worked for me, but I needed to put backslashes before the special characters:

<div[^>]*class=\"footer\"[^>]*>(.*?)<\/div>

#8


-3  

Why not

<div class="footer".*?</div>

I'm not a regex guru either, but I don't think you need to specify the closing bracket of the opening div tag.
