Converting a Word document to HTML

Time: 2022-10-30 13:06:21

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?

11 answers

#1


17  

For converting .docx file to HTML format, you can use OpenXmlPowerTools. Make sure to add a reference to OpenXmlPowerTools.dll.

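using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;   // WordprocessingDocument lives in this namespace
using OpenXmlPowerTools;

// DocxFilePath and HTMLFilePath are placeholders for your input and output paths.
byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     // Work on an in-memory copy, opened as editable, so the file on disk is left untouched.
     memoryStream.Write(byteArray, 0, byteArray.Length);
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          // The converter returns the whole page as an XHTML element tree.
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}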

#2


4  

You can try Microsoft.Office.Interop.Word. Note that COM interop requires Word itself to be installed on the machine.

    using System;
    using Word = Microsoft.Office.Interop.Word;

    public static void ConvertDocToHtml(object Sourcepath, object TargetPath)
    {
        object Unknown = Type.Missing;
        object format = Word.WdSaveFormat.wdFormatHTML;
        object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;

        // Starting Word through COM interop requires Word to be installed.
        Word._Application newApp = new Word.Application();
        try
        {
            Word.Documents d = newApp.Documents;
            Word.Document od = d.Open(ref Sourcepath, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown, ref Unknown);

            // Save the document that was just opened instead of relying on ActiveDocument.
            od.SaveAs(ref TargetPath, ref format,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown);

            od.Close(ref doNotSave, ref Unknown, ref Unknown);
        }
        finally
        {
            // Always shut down the Word process, even if the conversion fails.
            newApp.Quit(ref doNotSave, ref Unknown, ref Unknown);
        }
    }

#3


2  

We can use the Open XML SDK and OpenXmlPowerTools to convert a Word document to HTML.

Install the required packages

Install-Package DocumentFormat.OpenXml

Install-Package OpenXmlPowerTools

Add Reference

Right-click your project in Solution Explorer, then choose Add >> Reference and select System.Drawing and WindowsBase.

Use the code below

using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
using System.Drawing.Imaging;

namespace WordToHTML
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] byteArray = File.ReadAllBytes("kk.docx");

            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
                {
                    int imageCounter = 0;
                    HtmlConverterSettings settings = new HtmlConverterSettings()
                    {
                        PageTitle = "My Page Title",
                        ImageHandler = imageInfo =>
                        {
                            DirectoryInfo localDirInfo = new DirectoryInfo("img");
                            if (!localDirInfo.Exists)
                                localDirInfo.Create();
                            ++imageCounter;
                            string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                            ImageFormat imageFormat = null;
                            if (extension == "png")
                            {
                                // PNG images are re-encoded and saved as GIF here.
                                extension = "gif";
                                imageFormat = ImageFormat.Gif;
                            }
                            else if (extension == "gif")
                                imageFormat = ImageFormat.Gif;
                            else if (extension == "bmp")
                                imageFormat = ImageFormat.Bmp;
                            else if (extension == "jpeg")
                                imageFormat = ImageFormat.Jpeg;
                            else if (extension == "tiff")
                            {
                                extension = "gif";
                                imageFormat = ImageFormat.Gif;
                            }
                            else if (extension == "x-wmf")
                            {
                                extension = "wmf";
                                imageFormat = ImageFormat.Wmf;
                            }
                            if (imageFormat == null)
                                return null;

                            string imageFileName = "img/image" +
                                imageCounter.ToString() + "." + extension;
                            try
                            {
                                imageInfo.Bitmap.Save(imageFileName, imageFormat);
                            }
                            catch (System.Runtime.InteropServices.ExternalException)
                            {
                                return null;
                            }
                            XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageFileName),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                            return img;
                        }
                    };
                    XElement html = HtmlConverter.ConvertToHtml(doc, settings);
                    File.WriteAllText("kk.html", html.ToStringNewLineOnAttributes());
                }
            }
        }
    }
}

See this blog post for a working solution.

#4


1  

I think this will depend on the version of the Word document. If you have them in docx format, I believe they are stored within the file as XML data (but it is so long since I looked at the specification I am perfectly happy to be corrected on that).
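
To make the point about .docx concrete: a .docx file is a ZIP archive whose entries are mostly XML parts, so it can be inspected without Word. The following is a minimal sketch, not a converter. The class name DocxInspector and both method names are made up for illustration, and BuildSamplePackage only creates a stand-in archive with a single entry, not a valid Word document.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class DocxInspector
{
    // Returns the names of the entries stored inside the package.
    // A .docx file is a ZIP archive, so ZipArchive can read it directly.
    public static List<string> ListParts(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries.Select(entry => entry.FullName).ToList();
        }
    }

    // Builds a tiny stand-in archive for demonstration purposes only.
    // It contains one XML entry and is NOT a valid Word document.
    public static MemoryStream BuildSamplePackage()
    {
        var stream = new MemoryStream();
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.CreateEntry("word/document.xml");
            using (var writer = new StreamWriter(entry.Open()))
            {
                writer.Write("<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"/>");
            }
        }
        stream.Position = 0;
        return stream;
    }
}
```

Passing a stream opened with File.OpenRead on a real .docx file to ListParts should list entries such as word/document.xml. The older binary .doc format is not a ZIP archive, so this approach does not apply to it.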

#5


0  

You will need to have MS Word installed to do this, I believe.

Check out this article for details on the implementation.

#6


0  

According to this Stack Overflow question, this isn't possible with Word Viewer. You need Word itself installed in order to use COM interop.
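
Since COM interop only works when Word is registered on the machine, it can help to check for that before attempting a conversion. The sketch below assumes that Word registers the ProgID "Word.Application", which is the ProgID Word automation normally uses. The class and method names are hypothetical.

```csharp
using System;

public static class WordDetector
{
    // Returns true when the "Word.Application" COM ProgID is registered,
    // which is what Word automation needs in order to start Word.
    public static bool IsWordAutomationAvailable()
    {
        try
        {
            // GetTypeFromProgID returns null when the ProgID is not registered.
            return Type.GetTypeFromProgID("Word.Application") != null;
        }
        catch (PlatformNotSupportedException)
        {
            // COM is a Windows feature; other platforms cannot automate Word.
            return false;
        }
    }
}
```

If this returns false, an interop-based approach will fail at startup, and a library that reads the file format directly is the safer choice.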

#7


0  

If you're open to not using C#, you could print to file using PrimoPDF (which would turn the .doc into a .pdf) and then use a PDF-to-HTML converter to go the rest of the way. After that you can edit the HTML however you like.

#8


0  

I answered a similar question, "Convert Word to HTML then render HTML on webpage". You might find it helpful if you are still working on this. There is a freely distributed DLL for this, and I posted the link there.

#9


0  

I wrote Mammoth for .NET, which is a library that converts docx files to HTML, and is available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
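
To illustrate what mapping by semantic information means, here is a small self-contained sketch of the idea. It is not Mammoth's implementation or API: the class StyleMapper, its style table, and the method RenderParagraph are all hypothetical, and Mammoth's real default mappings may differ.

```csharp
using System.Collections.Generic;
using System.Net;

public static class StyleMapper
{
    // Hypothetical style map for illustration only.
    private static readonly Dictionary<string, string> TagForStyle = new Dictionary<string, string>
    {
        { "Heading 1", "h1" },
        { "Heading 2", "h2" },
    };

    // Wraps the paragraph text in the tag mapped to its Word style name,
    // falling back to <p> for styles that have no mapping.
    public static string RenderParagraph(string styleName, string text)
    {
        string tag = "p";
        string mapped;
        if (styleName != null && TagForStyle.TryGetValue(styleName, out mapped))
        {
            tag = mapped;
        }
        // Encode the text so characters like < and & stay valid in HTML.
        return "<" + tag + ">" + WebUtility.HtmlEncode(text) + "</" + tag + ">";
    }
}
```

With this sketch, a paragraph styled "Heading 1" becomes an h1 element regardless of its font or size in Word, which is why the output is clean but not a visual copy.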

#10


0  

GemBox works pretty well. It even converts images in the Word document to base64-encoded strings in img tags.
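
For readers unfamiliar with base64 images in img tags, the technique is a data URI: the image bytes are encoded as text and placed directly in the src attribute. The sketch below shows only that technique and is not GemBox's API. The class DataUri and the method ToImgTag are made-up names.

```csharp
using System;

public static class DataUri
{
    // Builds an <img> tag whose src is a data URI holding the image bytes,
    // so the HTML needs no separate image file.
    public static string ToImgTag(byte[] imageBytes, string contentType)
    {
        string base64 = Convert.ToBase64String(imageBytes);
        return "<img src=\"data:" + contentType + ";base64," + base64 + "\" />";
    }
}
```

The trade-off is that the HTML file becomes self-contained but larger, since base64 encoding grows the image data by roughly a third.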

#11


-1  

Using the document conversion tools available in OpenOffice.org is probably the only possible option - the .doc format is only designed to be opened via Microsoft products so any libraries dealing with it will need to have reverse engineered the entire format.
