如何将Html转换为纯文本?

时间:2022-10-30 13:11:11

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

我将Html片段存储在一个表中。不是整个页面,没有标签之类的,只是基本的格式。

I would like to be able to display that Html as text only, no formatting, on a given page (actually just the first 30 - 50 characters but that's the easy bit).

我希望能够将Html作为文本显示,在给定的页面上没有格式(实际上只是第一个30 - 50个字符,但这是很简单的一点)。

How do I place the "text" within that Html into a string as straight text?

如何将Html中的“文本”作为纯文本放置到字符串中?

So this piece of code.

这段代码。

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Becomes:

就变成:

Hello World. Is there anyone out there?

你好,世界外面有人吗?

16 个解决方案

#1


18  

If you are talking about tag stripping, it is relatively straight forward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags you can accomplish that with a regular expression:

如果您讨论的是标签剥离,那么如果您不需要担心

<[^>]*>

If you do have to worry about <script> tags and the like then you'll need something a bit more powerful then regular expressions because you need to track state, omething more like a Context Free Grammar (CFG). Althought you might be able to accomplish it with 'Left To Right' or non-greedy matching.

如果您确实需要担心

If you can use regular expressions there are many web pages out there with good info:

如果你可以使用正则表达式,有很多网页有很好的信息:

If you need the more complex behaviour of a CFG I would suggest using a third party tool, unfortunately I don't know of a good one to recommend.

如果您需要更复杂的CFG行为,我建议您使用第三方工具,不幸的是,我不知道有什么好的工具可以推荐。

#2


76  

The free and open source HtmlAgilityPack has in one of its samples a method that converts from HTML to plain text.

免费和开放源码的HtmlAgilityPack在一个示例中有一个方法,可以将HTML转换为纯文本。

var plainText = ConvertToPlainText(string html);

Feed it an HTML string like

给它一个HTML字符串

<b>hello world!</b><br /><i>it is me! !</i>

< b > hello world !< / b > < br / > <我> 这是我!! < / i >

And you'll get a plain text result like:

你会得到一个纯文本结果,比如:

hello world!
it is me!

#3


27  

I could not use HtmlAgilityPack, so I wrote a second best solution for myself

我不能使用HtmlAgilityPack,所以我为自己编写了第二个最佳解决方案。

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text); 
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}

#4


19  

HTTPUtility.HTMLEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

htmlencode()用于将HTML标记编码为字符串。它能帮你减轻所有的负担。从MSDN文档:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as &lt; and &gt; for HTTP transmission.

如果在HTTP流中传递了空格和标点等字符,则在接收端可能会被误解。HTML编码将HTML中不允许的字符转换为字符-实体等价物;HTML解码反向编码。例如,当嵌入文本块时,字符 <和> 被编码为<和比;对于HTTP传输。

HTTPUtility.HTMLEncode() method, detailed here:

HTTPUtility.HTMLEncode()方法,详细的:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

用法:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();

#5


11  

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumple upon this question.

要添加到vfilby的答案,只需在代码中执行RegEx替换;不需要新的类。如果像我这样的新手遇到这个问题。

using System.Text.RegularExpressions;

Then...

然后……

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}

#6


5  

There not a method with the name 'ConvertToPlainText' in the HtmlAgilityPack but you can convert a html string to CLEAR string with :

HtmlAgilityPack中没有一个名为“ConvertToPlainText”的方法,但是可以将一个html字符串转换为CLEAR string:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace("&nbsp", "");

Thats works for me. BUT I DONT FIND A METHOD WITH NAME 'ConvertToPlainText' IN 'HtmlAgilityPack'.

这为我工作。但是我没有在HtmlAgilityPack中找到一个名为ConvertToPlainText的方法。

#7


5  

Three Step Process for converting HTML into Plain Text

将HTML转换为纯文本的三步骤过程

First You need to Install Nuget Package For HtmlAgilityPack Second Create This class

首先,您需要为HtmlAgilityPack第二创建这个类安装Nuget包。

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

By using above class with reference to Judah Himango's answer

通过使用上面提到的关于犹大Himango的回答

Third you need to create the Object of above class and Use ConvertHtml(HTMLContent) Method for converting HTML into Plain Text rather than ConvertToPlainText(string html);

第三,需要创建上面类的对象,并使用ConvertHtml(HTMLContent)方法将HTML转换为纯文本,而不是ConvertToPlainText(string HTML);

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

#8


3  

I think the easiest way is to make a 'string' extension method (based on what user Richard have suggested):

我认为最简单的方法是创建一个“string”扩展方法(根据用户Richard的建议):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
        {
            var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
            return reg.Replace(HTMLText, "");
        }
}

Then just use this extension method on any 'string' variable in your program:

然后在程序中的任何“字符串”变量上使用这个扩展方法:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert html formated comments to plain text so it will be displayed correctly on a crystal report, and it works perfect!

我使用这个扩展方法将html格式的注释转换为纯文本,这样它就会正确地显示在水晶报表上,并且它工作得非常完美!

#9


3  

The simplest way I found:

我找到的最简单的方法是:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

HtmlFilter类位于Microsoft.TeamFoundation.WorkItemTracking.Controls.dll中

The dll can be found in folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

动态链接库可以在这样的文件夹中找到:%ProgramFiles%\Common Files%\ microsoft shared\Team Foundation Server\14.0\ \ \ \ \ \ \ \

In VS 2015, the dll also requires reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

在VS 2015中,dll还需要参考microsoft.teamfoundation.workitemtrack . common。dll文件,位于同一文件夹中。

#10


2  

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

如果您有具有HTML标记的数据,并且想要显示它,以便人们可以看到这些标记,请使用HttpServerUtility::HtmlEncode。

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

如果您有包含HTML标记的数据,并且希望用户看到呈现的标记,则按原样显示文本。如果文本表示整个web页面,请使用IFRAME。

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

如果您有带有HTML标记的数据,并且想去掉这些标记,只显示未格式化的文本,请使用正则表达式。

#11


1  

It has limitation that not collapsing long inline whitespace, but it is definitely portable and respects layout like webbrowser.

它的限制是不压缩长内联空白,但它绝对是可移植性的,并且与webbrowser一样尊重布局。

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}

#12


0  

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

取决于你所说的“html”是什么意思。最复杂的情况是完整的网页。这也是最容易处理的,因为您可以使用文本模式的web浏览器。请参阅*列出web浏览器(包括文本模式浏览器)的文章。Lynx可能是最著名的,但是其他的一个可能更适合你的需要。

#13


0  

Here is my solution:

这是我的解决方案:

public string StripHTML(string html)
{
    var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, "")));
}

Example:

例子:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

#14


0  

I had the same question, just my html had a simple pre-known layout, like:

我也有同样的问题,我的html有一个简单的预先知道的布局,比如:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using such simple code:

所以我用了这么简单的代码:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Which outputs:

输出:

abc
def

#15


0  

Did not write but an using:

没有写,只是用了:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
  //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
  public static class HtmlToText {

    public static string Convert(string path) {
      HtmlDocument doc = new HtmlDocument();
      doc.Load(path);
      return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html) {
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc) {
      using (StringWriter sw = new StringWriter()) {
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
      }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      foreach (HtmlNode subnode in node.ChildNodes) {
        ConvertTo(subnode, outText, textInfo);
      }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText) {
      ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      string html;
      switch (node.NodeType) {
        case HtmlNodeType.Comment:
          // don't output comments
          break;
        case HtmlNodeType.Document:
          ConvertContentTo(node, outText, textInfo);
          break;
        case HtmlNodeType.Text:
          // script and style must not be output
          string parentName = node.ParentNode.Name;
          if ((parentName == "script") || (parentName == "style")) {
            break;
          }
          // get text
          html = ((HtmlTextNode)node).Text;
          // is it in fact a special closing node output as text?
          if (HtmlNode.IsOverlappedClosingElement(html)) {
            break;
          }
          // check the text is meaningful and not a bunch of whitespaces
          if (html.Length == 0) {
            break;
          }
          if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
            html = html.TrimStart();
            if (html.Length == 0) { break; }
            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
          }
          outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
          if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
            outText.Write(' ');
          }
          break;
        case HtmlNodeType.Element:
          string endElementString = null;
          bool isInline;
          bool skip = false;
          int listIndex = 0;
          switch (node.Name) {
            case "nav":
              skip = true;
              isInline = false;
              break;
            case "body":
            case "section":
            case "article":
            case "aside":
            case "h1":
            case "h2":
            case "header":
            case "footer":
            case "address":
            case "main":
            case "div":
            case "p": // stylistic - adjust as you tend to use
              if (textInfo.IsFirstTextOfDocWritten) {
                outText.Write("\r\n");
              }
              endElementString = "\r\n";
              isInline = false;
              break;
            case "br":
              outText.Write("\r\n");
              skip = true;
              textInfo.WritePrecedingWhiteSpace = false;
              isInline = true;
              break;
            case "a":
              if (node.Attributes.Contains("href")) {
                string href = node.Attributes["href"].Value.Trim();
                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                  endElementString = "<" + href + ">";
                }
              }
              isInline = true;
              break;
            case "li":
              if (textInfo.ListIndex > 0) {
                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
              } else {
                outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
              }
              isInline = false;
              break;
            case "ol":
              listIndex = 1;
              goto case "ul";
            case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
              endElementString = "\r\n";
              isInline = false;
              break;
            case "img": //inline-block in reality
              if (node.Attributes.Contains("alt")) {
                outText.Write('[' + node.Attributes["alt"].Value);
                endElementString = "]";
              }
              if (node.Attributes.Contains("src")) {
                outText.Write('<' + node.Attributes["src"].Value + '>');
              }
              isInline = true;
              break;
            default:
              isInline = true;
              break;
          }
          if (!skip && node.HasChildNodes) {
            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
          }
          if (endElementString != null) {
            outText.Write(endElementString);
          }
          break;
      }
    }
  }
  internal class PreceedingDomTextInfo {
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
      IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
  }
  internal class BoolWrapper {
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper) {
      return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper) {
      return new BoolWrapper { Value = boolWrapper };
    }
  }
}

#16


-4  

public static string StripTags2(string html) { return html.Replace("<", "<").Replace(">", ">"); }

公共静态字符串StripTags2(string html){返回html。替换(“<”、“<”)。替换(“>”,“>”);}

By this you escape all "<" and ">" in a string. Is this what you want?

这样,您就可以在字符串中转义所有的“<”和“>”。这是你想要的吗?

#1


18  

If you are talking about tag stripping, it is relatively straight forward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags you can accomplish that with a regular expression:

如果您讨论的是标签剥离,那么如果您不需要担心

<[^>]*>

If you do have to worry about <script> tags and the like then you'll need something a bit more powerful then regular expressions because you need to track state, omething more like a Context Free Grammar (CFG). Althought you might be able to accomplish it with 'Left To Right' or non-greedy matching.

如果您确实需要担心

If you can use regular expressions there are many web pages out there with good info:

如果你可以使用正则表达式,有很多网页有很好的信息:

If you need the more complex behaviour of a CFG I would suggest using a third party tool, unfortunately I don't know of a good one to recommend.

如果您需要更复杂的CFG行为,我建议您使用第三方工具,不幸的是,我不知道有什么好的工具可以推荐。

#2


76  

The free and open source HtmlAgilityPack has in one of its samples a method that converts from HTML to plain text.

免费和开放源码的HtmlAgilityPack在一个示例中有一个方法,可以将HTML转换为纯文本。

var plainText = ConvertToPlainText(string html);

Feed it an HTML string like

给它一个HTML字符串

<b>hello world!</b><br /><i>it is me! !</i>

< b > hello world !< / b > < br / > <我> 这是我!! < / i >

And you'll get a plain text result like:

你会得到一个纯文本结果,比如:

hello world!
it is me!

#3


27  

I could not use HtmlAgilityPack, so I wrote a second best solution for myself

我不能使用HtmlAgilityPack,所以我为自己编写了第二个最佳解决方案。

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text); 
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}

#4


19  

HTTPUtility.HTMLEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

htmlencode()用于将HTML标记编码为字符串。它能帮你减轻所有的负担。从MSDN文档:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as &lt; and &gt; for HTTP transmission.

如果在HTTP流中传递了空格和标点等字符,则在接收端可能会被误解。HTML编码将HTML中不允许的字符转换为字符-实体等价物;HTML解码反向编码。例如,当嵌入文本块时,字符 <和> 被编码为<和比;对于HTTP传输。

HTTPUtility.HTMLEncode() method, detailed here:

HTTPUtility.HTMLEncode()方法,详细的:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

用法:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();

#5


11  

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumple upon this question.

要添加到vfilby的答案,只需在代码中执行RegEx替换;不需要新的类。如果像我这样的新手遇到这个问题。

using System.Text.RegularExpressions;

Then...

然后……

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}

#6


5  

There not a method with the name 'ConvertToPlainText' in the HtmlAgilityPack but you can convert a html string to CLEAR string with :

HtmlAgilityPack中没有一个名为“ConvertToPlainText”的方法,但是可以将一个html字符串转换为CLEAR string:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace("&nbsp", "");

Thats works for me. BUT I DONT FIND A METHOD WITH NAME 'ConvertToPlainText' IN 'HtmlAgilityPack'.

这为我工作。但是我没有在HtmlAgilityPack中找到一个名为ConvertToPlainText的方法。

#7


5  

Three Step Process for converting HTML into Plain Text

将HTML转换为纯文本的三步骤过程

First You need to Install Nuget Package For HtmlAgilityPack Second Create This class

首先,您需要为HtmlAgilityPack第二创建这个类安装Nuget包。

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

By using above class with reference to Judah Himango's answer

通过使用上面提到的关于犹大Himango的回答

Third you need to create the Object of above class and Use ConvertHtml(HTMLContent) Method for converting HTML into Plain Text rather than ConvertToPlainText(string html);

第三,需要创建上面类的对象,并使用ConvertHtml(HTMLContent)方法将HTML转换为纯文本,而不是ConvertToPlainText(string HTML);

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

#8


3  

I think the easiest way is to make a 'string' extension method (based on what user Richard have suggested):

我认为最简单的方法是创建一个“string”扩展方法(根据用户Richard的建议):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
        {
            var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
            return reg.Replace(HTMLText, "");
        }
}

Then just use this extension method on any 'string' variable in your program:

然后在程序中的任何“字符串”变量上使用这个扩展方法:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert html formated comments to plain text so it will be displayed correctly on a crystal report, and it works perfect!

我使用这个扩展方法将html格式的注释转换为纯文本,这样它就会正确地显示在水晶报表上,并且它工作得非常完美!

#9


3  

The simplest way I found:

我找到的最简单的方法是:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

HtmlFilter类位于Microsoft.TeamFoundation.WorkItemTracking.Controls.dll中

The dll can be found in folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

动态链接库可以在这样的文件夹中找到:%ProgramFiles%\Common Files%\ microsoft shared\Team Foundation Server\14.0\ \ \ \ \ \ \ \

In VS 2015, the dll also requires reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

在VS 2015中,dll还需要参考microsoft.teamfoundation.workitemtrack . common。dll文件,位于同一文件夹中。

#10


2  

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

如果您有具有HTML标记的数据,并且想要显示它,以便人们可以看到这些标记,请使用HttpServerUtility::HtmlEncode。

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

如果您有包含HTML标记的数据,并且希望用户看到呈现的标记,则按原样显示文本。如果文本表示整个web页面,请使用IFRAME。

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

如果您有带有HTML标记的数据,并且想去掉这些标记,只显示未格式化的文本,请使用正则表达式。

#11


1  

It has limitation that not collapsing long inline whitespace, but it is definitely portable and respects layout like webbrowser.

它的限制是不压缩长内联空白,但它绝对是可移植性的,并且与webbrowser一样尊重布局。

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}

#12


0  

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

取决于你所说的“html”是什么意思。最复杂的情况是完整的网页。这也是最容易处理的,因为您可以使用文本模式的web浏览器。请参阅*列出web浏览器(包括文本模式浏览器)的文章。Lynx可能是最著名的,但是其他的一个可能更适合你的需要。

#13


0  

Here is my solution:

这是我的解决方案:

public string StripHTML(string html)
{
    var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, "")));
}

Example:

例子:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

#14


0  

I had the same question, just my html had a simple pre-known layout, like:

我也有同样的问题,我的html有一个简单的预先知道的布局,比如:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using such simple code:

所以我用了这么简单的代码:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Which outputs:

输出:

abc
def

#15


0  

Did not write but an using:

没有写,只是用了:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
  //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
  public static class HtmlToText {

    public static string Convert(string path) {
      HtmlDocument doc = new HtmlDocument();
      doc.Load(path);
      return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html) {
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc) {
      using (StringWriter sw = new StringWriter()) {
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
      }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      foreach (HtmlNode subnode in node.ChildNodes) {
        ConvertTo(subnode, outText, textInfo);
      }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText) {
      ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      string html;
      switch (node.NodeType) {
        case HtmlNodeType.Comment:
          // don't output comments
          break;
        case HtmlNodeType.Document:
          ConvertContentTo(node, outText, textInfo);
          break;
        case HtmlNodeType.Text:
          // script and style must not be output
          string parentName = node.ParentNode.Name;
          if ((parentName == "script") || (parentName == "style")) {
            break;
          }
          // get text
          html = ((HtmlTextNode)node).Text;
          // is it in fact a special closing node output as text?
          if (HtmlNode.IsOverlappedClosingElement(html)) {
            break;
          }
          // check the text is meaningful and not a bunch of whitespaces
          if (html.Length == 0) {
            break;
          }
          if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
            html = html.TrimStart();
            if (html.Length == 0) { break; }
            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
          }
          outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
          if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
            outText.Write(' ');
          }
          break;
        case HtmlNodeType.Element:
          string endElementString = null;
          bool isInline;
          bool skip = false;
          int listIndex = 0;
          switch (node.Name) {
            case "nav":
              skip = true;
              isInline = false;
              break;
            case "body":
            case "section":
            case "article":
            case "aside":
            case "h1":
            case "h2":
            case "header":
            case "footer":
            case "address":
            case "main":
            case "div":
            case "p": // stylistic - adjust as you tend to use
              if (textInfo.IsFirstTextOfDocWritten) {
                outText.Write("\r\n");
              }
              endElementString = "\r\n";
              isInline = false;
              break;
            case "br":
              outText.Write("\r\n");
              skip = true;
              textInfo.WritePrecedingWhiteSpace = false;
              isInline = true;
              break;
            case "a":
              if (node.Attributes.Contains("href")) {
                string href = node.Attributes["href"].Value.Trim();
                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                  endElementString = "<" + href + ">";
                }
              }
              isInline = true;
              break;
            case "li":
              if (textInfo.ListIndex > 0) {
                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
              } else {
                outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
              }
              isInline = false;
              break;
            case "ol":
              listIndex = 1;
              goto case "ul";
            case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
              endElementString = "\r\n";
              isInline = false;
              break;
            case "img": //inline-block in reality
              if (node.Attributes.Contains("alt")) {
                outText.Write('[' + node.Attributes["alt"].Value);
                endElementString = "]";
              }
              if (node.Attributes.Contains("src")) {
                outText.Write('<' + node.Attributes["src"].Value + '>');
              }
              isInline = true;
              break;
            default:
              isInline = true;
              break;
          }
          if (!skip && node.HasChildNodes) {
            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
          }
          if (endElementString != null) {
            outText.Write(endElementString);
          }
          break;
      }
    }
  }
  internal class PreceedingDomTextInfo {
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
      IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
  }
  internal class BoolWrapper {
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper) {
      return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper) {
      return new BoolWrapper { Value = boolWrapper };
    }
  }
}

#16


-4  

public static string StripTags2(string html) { return html.Replace("<", "<").Replace(">", ">"); }

公共静态字符串StripTags2(string html){返回html。替换(“<”、“<”)。替换(“>”,“>”);}

By this you escape all "<" and ">" in a string. Is this what you want?

这样,您就可以在字符串中转义所有的“<”和“>”。这是你想要的吗?