How can I strip everything from an HTML string except the <a> tags, keeping the tags and their data intact, using a regular expression?

First, to those more experienced than I am: this has to be done with a regex. Because of an unusual situation, I have no access to a DOM parser.

So I have a full HTML/XHTML string and would like to strip everything from it except the links. Basically, only the <a> tags are important. The tags need to keep all of their information (href, target, class, etc.), and it should work whether the tag is self-terminating or has a separate end tag, i.e. <a /> or <a></a>.

Thanks for any HELP guys!

3 answers

#1

Of course you have the possibility to parse HTML in a Firefox extension. Have a look at HTML to DOM, especially the second and third way.

It might seem to be more complex, but it is less error prone than a regular expression.

As soon as you have a reference to the parsed content, all you have to do is to call ref.getElementsByTagName('a') and you are done.

#2

result = subject.match(/<a[^<>]*?(?:\/>|>(?:(?!<\/a>).)*<\/a>)/ig);

gets you an array of all <a> tags in the HTML source (even self-closed tags which are illegal but which you specifically asked for). Is that sufficient?

Explanation:

<a         # Match <a
[^<>]*?    # Match any characters besides angle brackets, as few as possible
(?:        # Now either match
 />        # /> (self-closed tag)
|          # or 
 >         # a closing angle bracket
 (?:       # followed by...
  (?!</a>) # (if we're not at the closing tag)
  .        # any character
 )*        # any number of times
 </a>      # until the closing tag
)
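For anyone who wants to try it, here is a minimal sketch of the pattern in use. The helper name `extractLinks` and the sample markup are made up for illustration. The pattern also carries two tweaks that are my own assumptions, not part of the regex above: `\b` after `<a` so that tags such as `<abbr>` are not picked up, and `[\s\S]` in place of `.` so that link text may span several lines.

```javascript
// Hypothetical helper; the pattern is the one explained above with two
// assumed tweaks: \b after <a (skips <abbr>, <article>, ...) and [\s\S]
// instead of . (lets the link text contain newlines).
const LINK_PATTERN = /<a\b[^<>]*?(?:\/>|>(?:(?!<\/a>)[\s\S])*<\/a>)/gi;

function extractLinks(html) {
  // With the g flag, match() returns null when nothing matches,
  // so fall back to an empty array.
  return html.match(LINK_PATTERN) || [];
}

// Made-up sample input covering the cases from the question.
const sample = [
  '<p>Intro <a href="/one" class="x">one</a> text',
  '<abbr title="t">abbr</abbr> <a href="/two" />',
  '<A HREF="/three" target="_blank">multi',
  'line</A></p>',
].join('\n');

const links = extractLinks(sample);
console.log(links.join('\n'));
```

Joining the resulting array gives a string that contains only the links. Nested or malformed markup can still confuse the pattern, which is the usual caveat with regex-based HTML handling.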

#3

The regex will look something like this:

/\<\a.*[\/]{0,1}>(.*<\/\a>){0,1}/gm
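One thing worth checking before using a pattern like this is how the greedy `.*` behaves when a line holds more than one link. The sketch below runs the pattern exactly as posted against a made-up line, then compares it with a lazy variant. The lazy variant is only my assumption of one possible adjustment, not the answerer's regex, and it can still overrun when a self-closed tag is followed by a later `</a>` on the same line.

```javascript
// The pattern from this answer, copied verbatim.
const greedy = /\<\a.*[\/]{0,1}>(.*<\/\a>){0,1}/gm;

// Assumed lazy variant for comparison (hypothetical, not from the answer):
// [^>]* stays inside the opening tag, .*? stops at the first </a>.
const lazy = /<a\b[^>]*>(?:.*?<\/a>)?/g;

// Made-up input with two links on one line.
const line = '<p><a href="1">one</a> and <a href="2">two</a></p>';

const greedyMatches = line.match(greedy) || [];
const lazyMatches = line.match(lazy) || [];

// The greedy pattern returns a single match running from the first <a
// to the last > on the line; the lazy one returns each link separately.
console.log(greedyMatches);
console.log(lazyMatches);
```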
