RegEx: match text that is not inside an HTML tag or part of one

How can I match all content outside of HTML tags?

My pseudo-HTML is:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

I used this regular expression:

(?<=^|>)[^><]+?(?=<|$)

which gives me: "aaa bbb ccc ddd"

All I need is a way to ignore the HTML tags together with their content, so that it returns: "bbb ccc"
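To make the starting point reproducible, here is a minimal sketch that runs the expression above against the pseudo-HTML in JavaScript. The variable names are only illustrative, the `g` flag is added so that every match is returned, and the lookbehind assumes an ES2018-capable engine.

```javascript
// Pseudo-HTML from the question, with the line breaks kept as "\n".
const html = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';

// The question's expression; the g flag is an addition so all matches are found.
const textRuns = /(?<=^|>)[^><]+?(?=<|$)/g;

// Trim each matched run and drop the ones that are only whitespace.
const matches = (html.match(textRuns) || [])
  .map((run) => run.trim())
  .filter((run) => run.length > 0);

console.log(matches.join(' ')); // aaa bbb ccc ddd
```

The expression only checks that a run of text sits between a ">" and a "<", so it cannot tell text inside an element ("aaa", "ddd") from text between elements ("bbb", "ccc").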

3 solutions

#1

Regexes are a clunky and unreliable way to work on markup. I would suggest using a DOM parser such as SimpleHtmlDom:

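include 'simple_html_dom.php'; // Simple HTML DOM library
// Print the text of every hyperlink on the page. find() returns an array of
// elements and accepts selectors such as 'a.pretty'; see the docs.
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext, "\n";
}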

If you want to do that on the client, you can use a library such as jQuery like so:

$('a').each(function() {
    alert($(this).text());
});
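Neither snippet returns "bbb ccc" by itself, so here is a rough sketch of how the DOM approach could be applied to the question. It assumes the markup has already been parsed and sits inside some container element; `topLevelText` is a hypothetical helper name, not part of jQuery or the DOM API.

```javascript
// Collect only the text nodes that are direct children of `container`,
// skipping any text nested inside child elements such as <h1> or <div>.
function topLevelText(container) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE
  return Array.from(container.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.textContent.trim())
    .filter((text) => text.length > 0)
    .join(' ');
}
```

In a browser this could be called as `topLevelText(document.getElementById('content'))`, where `content` is an assumed id for the element that wraps the pseudo-HTML. For the markup in the question, the direct text nodes are "bbb" and "ccc".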

#2

Look for an appropriate regex that matches complete tags (e.g. in a library such as http://regexlib.com/) and remove them with the substitution operator s///. Then use whatever is left.
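The same idea can be sketched in JavaScript, where `String.prototype.replace` plays the role of s///. The two patterns below are my own assumptions rather than expressions taken from regexlib.com, and `textOutsideElements` is a hypothetical helper name.

```javascript
// Returns the text that is left once elements and stray tags are removed.
function textOutsideElements(markup) {
  return markup
    // 1. Remove paired elements together with their content, e.g. <h1>aaa</h1>.
    .replace(/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>[\s\S]*?<\/\1\s*>/g, ' ')
    // 2. Remove any remaining single tags, e.g. <img src="bla" />.
    .replace(/<[^>]+>/g, ' ')
    // 3. Collapse the leftover whitespace.
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(textOutsideElements(sample)); // bbb ccc
```

This is only good enough for simple, flat markup like the example. Nested elements with the same name, comments, or a ">" inside an attribute value will confuse it, which is the usual argument for a real parser.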

#3

Thanks, everybody.

Combining both expressions into one is dirty work, and I would like the opposite of what it outputs:

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

As pseudo string:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh

To keep things simple, I use this tool.
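One way to get the opposite output from the combined expression is to delete everything it matches and keep the rest. The sketch below does that in JavaScript on the pseudo string above; the variable names and the added `g` flag are assumptions, while the expression itself is unchanged.

```javascript
// Pseudo string from the post, one entry per line.
const pseudo = [
  '<h1>aaa</h1>',
  'bbb <img src="bla" /> ccc',
  '<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>',
  '<div>dsada</div> hbhgjh',
].join('\n');

// The combined expression from the post, with the g flag added.
const tagsAndElements = /(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)/g;

// Deleting every match leaves the text the expression does not cover.
const remainder = pseudo
  .replace(tagsAndElements, ' ')
  .replace(/\s+/g, ' ')
  .trim();

console.log(remainder); // bbb ccc jhgvjhgjh zhg zt hbhgjh
```

This relies on "." not matching line breaks. If a single tag such as `<img ... />` and a later closing tag end up on the same line, the first alternative pairs them and swallows the text in between.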
