Regex to extract the HTML, leaving the text behind

Time: 2022-09-13 11:10:46

I have this piece of HTML:

<div class="embed">
<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>
Some text I don't want
</div>

This is how it is being inserted into the HTML:

<div class="embed"><?php echo $item['embed_html']; ?></div>

This is what $item['embed_html'] is echoing out:

<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>Some text I don't want

So I don't want to parse the whole document, just this specific string.

Don't worry, this isn't "outside user" inputted HTML, before anyone points out the security issues with allowing raw code on to a page...

I need to extract the HTML but leave the text behind (so it would look like this):

<div class="embed">
<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>
</div>

There are multiple different embed codes, so I guess what I'm asking is: what is the best way to remove text that is not wrapped in an HTML element (between < and >)? Any of <img, <p, <div, <iframe, <object, <embed, <video etc. may be used in this section. If any text is added to it that is not wrapped in a tag, it should be removed from the string.

I don't want to wrap the offending text in a tag, I want to completely remove it. In a way, the reverse of strip_tags()

1 solution

#1


This is a simple regex that would do what you want in 99% of cases:

<[^>]+>

All it does, though, is match XML/HTML tags. That's it. There's no clean way of telling it to only match text inside the DOM subtree of a certain node (such as <div class="embed">). For that you would need to use a context-free parser, such as a DOM parser.

Your sample input would be matched into:

{
    '<div class="embed">',
    '<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456">',
    '</iframe>',
    '</div>'
}
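
To turn those matches back into a string, one option is to concatenate them in order. The following is a minimal sketch, not part of the original answer: it is written in Python purely for illustration (the pattern itself is the one given above), and the names TAG_RE, keep_tags_only and embed_html are made up for this example.

```python
import re

# The pattern from the answer: "<", one or more non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def keep_tags_only(fragment: str) -> str:
    """Return the tags found in `fragment`, joined in order; all text is dropped."""
    return "".join(TAG_RE.findall(fragment))


# Hypothetical stand-in for the string held in $item['embed_html'].
embed_html = (
    '<iframe width="300" height="200" frameborder="0" allowfullscreen="" '
    'src="http://www.youtube.com/embed/123456"></iframe>Some text I don\'t want'
)

print(keep_tags_only(embed_html))
```

Note that this removes every piece of text, including text nested inside elements: a fragment like <p>hi</p> comes back as <p></p>. That is fine for embed codes such as the iframe above, but it may be too aggressive if wrapped text has to survive.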

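Given input like <!-- <foo> --> input text, however, the match would start inside the comment (it would capture <!-- <foo>), so markup ends up being extracted despite being technically commented out. Removing all occurrences of the regex <!--.*?--> beforehand should solve that, though (let . match newlines if comments can span several lines).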
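
As a rough illustration of that two-step approach, the sketch below strips comments first and only then collects tags. It is an assumption-laden example in Python rather than the poster's own code; COMMENT_RE, TAG_RE, tags_without_comments and the sample string are hypothetical names and data chosen for the demonstration.

```python
import re

# Non-greedy comment pattern from the answer; DOTALL lets "." cross line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def tags_without_comments(fragment: str) -> list[str]:
    """Remove HTML comments, then return every remaining tag match."""
    uncommented = COMMENT_RE.sub("", fragment)
    return TAG_RE.findall(uncommented)


sample = "<!-- <foo> --><b>input text</b>"

# Without the pre-pass, part of the comment leaks into the matches.
print(TAG_RE.findall(sample))
# With the pre-pass, only the real tags remain.
print(tags_without_comments(sample))
```

This still treats the input as flat text, so a literal "<!--" inside an attribute value or a script block would confuse it; that is one more reason the answer leans towards a real parser.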

Anyway, in general you're best off using a DOM parser for anything XML/HTML.
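
For comparison, here is a hedged sketch of a parser-based variant. It uses Python's standard-library html.parser, which is an event-driven parser rather than a full DOM, so treat it as an approximation of the advice above and not as the answer's own method. It keeps text that is nested inside an element and drops only text sitting at the top level of the fragment. The class TopLevelTextStripper, the function strip_top_level_text and the VOID set are names invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not increase the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TopLevelTextStripper(HTMLParser):
    """Re-emit tags and nested text; skip text that is outside every element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0      # number of currently open non-void elements
        self.parts = []     # pieces of the output, in document order

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # raw tag as written
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")
        if tag not in VOID:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth > 0:  # text wrapped in some element: keep it
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def strip_top_level_text(fragment: str) -> str:
    parser = TopLevelTextStripper()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_top_level_text("loose<p>kept <b>text</b></p>tail"))
```

Comments are dropped as a side effect, because the sketch does not override handle_comment. It assumes reasonably well-formed input: unclosed elements would leave the depth counter above zero and let trailing text through.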
