How do I remove all HTML tags from a string without knowing which tags are in it? (duplicate)

Time: 2022-08-27 16:19:50

Is there any easy way to remove all HTML tags or ANYTHING HTML related from a string?

For example:

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)";

The above should really be:

"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"

3 Solutions

#1


150  

You can use a simple regex like this:

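// Requires: using System.Text.RegularExpressions;
public static string StripHTML(string input)
{
   // Lazily match everything from a '<' up to the next '>' and remove it
   return Regex.Replace(input, "<.*?>", String.Empty);
}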

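Be aware that this solution has its own flaws: for example, a '>' inside an attribute value ends the match too early, and entities such as &nbsp; are left undecoded. See "Remove HTML tags in String" for more information (especially the comments by @mehaase).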
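
To make that caveat concrete, here is a small sketch of one input that trips up the pattern. The class name StripHtmlDemo and the sample strings are made up for illustration; only the regex itself comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Same pattern as the StripHTML method in the answer
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    public static void Main()
    {
        // Plain tags are removed as expected
        Console.WriteLine(StripHTML("<b>bold</b> text"));            // prints: bold text

        // A '>' inside an attribute value ends the lazy match too early,
        // so the tail of the tag leaks into the result
        Console.WriteLine(StripHTML("<a title=\"a > b\">link</a>")); // prints:  b">link
    }
}
```

For input you control and know to be simple, the regex is usually enough; for arbitrary HTML, a real parser is the safer choice.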

Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: "HTML agility pack - removing unwanted tags without removing content?"

#2


30  

You can parse the string with the HTML Agility Pack and read the InnerText of the document node.

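    // Requires the HtmlAgilityPack package: using HtmlAgilityPack;
    HtmlDocument htmlDoc = new HtmlDocument();
    // Regular (non-verbatim) string literal, so \" is a valid escape for a quote
    htmlDoc.LoadHtml("<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)");
    // InnerText drops the tags; entities such as &nbsp; may remain and need separate decoding
    string result = htmlDoc.DocumentNode.InnerText;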

#3


2  

You can run the code below on your string to get the full text without the HTML parts.

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)".Replace("&nbsp;", string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);
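
Removing only &nbsp; leaves other entities (such as &amp;) untouched and can glue words together. As a rough sketch of a more general variant, the hypothetical helper ToPlainText below strips tags with the same regex, decodes entities with System.Net.WebUtility.HtmlDecode, and collapses the leftover whitespace. The helper and class names are assumptions made for this example, not part of any answer here.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToTextDemo
{
    // Hypothetical helper: strip tags, decode entities, normalize whitespace
    public static string ToPlainText(string html)
    {
        // 1. Drop anything that looks like a tag (same pattern as above)
        string text = Regex.Replace(html, "<.*?>", String.Empty);
        // 2. Decode entities after stripping, so an encoded "&lt;b&gt;" survives as literal text
        text = WebUtility.HtmlDecode(text);
        // 3. &nbsp; decodes to U+00A0; turn it into an ordinary space
        text = text.Replace('\u00A0', ' ');
        // 4. Collapse runs of whitespace and trim the ends
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)";
        Console.WriteLine(ToPlainText(title));
    }
}
```

On the title from the question this should print "Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series, )". The dangling ", " before the closing parenthesis is ordinary text rather than markup, so it would need its own cleanup step to match the desired output exactly.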
