How can I convert HTML to plain text in C#?

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
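
The approach described in EDIT 2 might look roughly like the sketch below. The class and method names (LynxTextDumper, RunAndCapture, Dump) are hypothetical and not from the original post, and the sketch assumes that lynx is on the PATH or that its full path is passed in as lynxPath.

```csharp
using System;
using System.Diagnostics;

public static class LynxTextDumper
{
    // Runs an executable and returns everything it writes to standard output.
    public static string RunAndCapture(string fileName, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,  // lets us read the child's stdout
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }

    // Hypothetical wrapper: lynxPath is an assumption (defaults to "lynx" on the PATH).
    public static string Dump(string htmlPathOrUrl, string lynxPath = "lynx")
    {
        return RunAndCapture(lynxPath, "-dump \"" + htmlPathOrUrl + "\"");
    }
}
```

A call such as LynxTextDumper.Dump(@"C:\temp\page.html") would then return the rendered text. Because a new process is spawned for every call, this fits the "called only occasionally" case better than a hot path.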

18 solutions

#1

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

#2

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, mentioned by others in answers to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

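// Small but important modification to the class at https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
using System;
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
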
public static class HtmlToText
{

    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc (HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html= html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
                            {
                                endElementString =  "<" + href + ">";
                            }  
                        }
                        isInline = true;
                        break;
                    case "li": 
                        if(textInfo.ListIndex>0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++); 
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol": 
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}
internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace {get;set;}
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}
internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper{ Value = boolWrapper };
    }
}

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be transformed into:

Whatever Inc. 


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: 

1.  Please confirm this is your email by replying. 
2.  Then perform this step. 

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please: 

*   a point. 
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>. 

Sincerely, 

The whatever.com team 


Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.


            Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                Please confirm this is your email by replying.

                Then perform this step.


            Please solve this . Then, in any order, could you please:

                a point.

                another point, with a hyperlink.


            Sincerely,


            The whatever.com team

        Ph: 000 000 000
        mail: whatever st

#3

You could use this:

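// Requires: using System.Text.RegularExpressions; and using System.Web; (for HttpUtility)
public static string StripHTML(string HTMLText, bool decode = true)
{
    // Remove anything that looks like a tag, then optionally decode entities such as &amp;
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}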

Updated

Thanks for the comments; I have updated the function to improve it.

#4

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

A sample on SO:

HTML Agility pack - parsing tables

#5

Because I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks big, but it works fine.

#6

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.

#7

Assuming you have well-formed HTML, you could also try an XSL transform.

Here's an example:

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    public static XslCompiledTransform _transformer;
    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                // method="text" so that characters such as & are written as-is rather than re-escaped
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""text"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}

#8

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
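
A minimal sketch of that idea, using only regular expressions and System.Net.WebUtility, is shown below. The class name SimpleHtmlToText and the choice of "- " as the list marker are assumptions made for illustration. It does not handle tables, nested lists, or pre blocks, and regex-based handling of HTML is fragile on malformed input.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        string text = html;
        // Drop script and style blocks entirely, including their contents.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Collapse source whitespace (including newlines) to single spaces.
        text = Regex.Replace(text, @"\s+", " ");
        // Line break for <br>, blank line after each paragraph.
        text = Regex.Replace(text, @"<br\s*/?>\s*", "\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"\s*</p\s*>\s*", "\n\n", RegexOptions.IgnoreCase);
        // A dash on a new line for each list item, blank line after each list.
        text = Regex.Replace(text, @"\s*<li\b[^>]*>\s*", "\n- ", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"\s*</(ul|ol)\s*>\s*", "\n\n", RegexOptions.IgnoreCase);
        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Never leave more than one blank line in a row.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        // Decode entities such as &amp; last, so decoded '<' is not mistaken for a tag.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

For example, SimpleHtmlToText.Convert("<p>Hello <b>world</b></p><ul><li>one</li><li>two &amp; three</li></ul>") yields "Hello world", a blank line, then "- one" and "- two & three" on separate lines.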

#9

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.

Instead, I used this utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

#10

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

#11

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

#12

-1

I don't know C#, but there is a fairly small and easy-to-read Python html2text script here: http://www.aaronsw.com/2002/html2text/

#13

-1

I recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will, of course, need to be valid XML first.

#14

-1

Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

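// Assumes a Windows Forms WebBrowser and a reference to Microsoft.mshtml (using mshtml;).
public string StripHTML(WebBrowser webp)
{
    try
    {
        // The original snippet used an undefined 'doc'; here it is taken from the control.
        IHTMLDocument2 doc = webp.Document.DomDocument as IHTMLDocument2;
        if (doc == null)
        {
            return "";
        }
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;

        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception)
    {
        // Errors are swallowed; the method falls through and returns an empty string.
    }
    return "";
}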

#15

-2

In GeneXus you can do this with a regex:

&pattern = '<[^>]+>'

&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")

#16

-3

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

#17

-3

If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
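
As a small illustration of what this call does and does not do, the sketch below wraps it in a hypothetical helper (the class and method names are made up for this example). Note that HtmlDecode only turns character references such as &amp;amp; back into characters; it leaves tags in place, so on its own it is not an HTML-to-text converter.

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Decodes HTML character references; markup such as <p> is left untouched.
    public static string DecodeEntities(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For instance, DecodeEntities("Fish &amp;amp; chips") returns "Fish & chips", while DecodeEntities("&lt;p&gt;a &amp;lt; b&lt;/p&gt;") style input that already contains real tags, such as "<p>a &lt; b</p>", comes back as "<p>a < b</p>" with the tags still present.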

#18

-4

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own.
