How do I convert .docx to HTML using ASP.NET?

Date: 2022-04-02 08:48:30

Word 2007 saves its documents in .docx format which is really a zip file with a bunch of stuff in it including an xml file with the document.
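To make the "zip file with an XML file inside" point concrete, here is a minimal sketch of pulling the main document XML out of a .docx. It assumes .NET 4.5 or later (with a reference to System.IO.Compression.FileSystem on .NET Framework) and assumes the body sits at the conventional entry name word/document.xml; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class DocxReader
{
    // Returns the raw XML of the main document part,
    // or null if the archive has no word/document.xml entry.
    public static string ReadDocumentXml(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            // Assumption: the main part lives at the conventional location.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null) return null;

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Strictly speaking, the main part is located through the package relationships rather than a fixed name, so a library such as the Open XML SDK is likely the safer route for anything beyond a quick experiment.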

I want to be able to take a .docx file and drop it into a folder in my asp.net web app and have the code open the .docx file and render the (xml part of the) document as a web page.

I've been searching the web for more information on this but so far haven't found much. My questions are:

  1. Would you (a) use XSLT to transform the XML to HTML, (b) use the XML manipulation libraries in .NET (such as XDocument and XElement in 3.5) to convert it to HTML, or (c) do something else?
  2. Do you know of any open source libraries/projects that have done this that I could use as a starting point?
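For a feel of what option (b) might involve, here is a rough sketch using XDocument. It is not a full converter: it only walks the paragraphs, concatenates their text runs, and maps the style id "Heading1" to an h1 tag, ignoring tables, images, lists and inline formatting. It assumes .NET 4.0 or later (for WebUtility.HtmlEncode), and the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public static class DocxToHtmlSketch
{
    // Namespace of the WordprocessingML main document part.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Takes the contents of word/document.xml and returns very simple HTML.
    public static string ParagraphsToHtml(string documentXml)
    {
        XDocument doc = XDocument.Parse(documentXml);
        var html = new StringBuilder();

        foreach (XElement p in doc.Descendants(W + "p"))
        {
            // Concatenate the text runs (w:t) inside this paragraph.
            string text = string.Concat(p.Descendants(W + "t").Select(t => t.Value));

            // w:pStyle holds the paragraph style id, if one is set.
            string style = (string)p.Elements(W + "pPr")
                                    .Elements(W + "pStyle")
                                    .Attributes(W + "val")
                                    .FirstOrDefault();

            // Assumption: "Heading1" is the id of the Heading 1 style.
            string tag = style == "Heading1" ? "h1" : "p";

            html.Append("<").Append(tag).Append(">")
                .Append(WebUtility.HtmlEncode(text))
                .Append("</").Append(tag).Append(">\n");
        }

        return html.ToString();
    }
}
```

Style ids can differ between templates and language versions of Word, so a real implementation would probably resolve them through word/styles.xml instead of hard-coding "Heading1".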

Thanks!

5 answers

#1


4  

Try this post? I'm not sure, but it might be what you are looking for.

#2


3  

I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
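As a rough idea of what the server-side call could look like, here is a short sketch based on my reading of the Mammoth .NET package's documented usage (a DocumentConverter with a ConvertToHtml method that returns a result carrying Value and Warnings). Treat those names as assumptions to check against the version you install; the wrapper class is made up for illustration.

```csharp
using System;
using Mammoth;

public static class MammothExample
{
    // Converts the .docx at docxPath and returns the generated HTML fragment.
    public static string ConvertDocx(string docxPath)
    {
        var converter = new DocumentConverter();
        var result = converter.ConvertToHtml(docxPath);

        // Warnings describe things Mammoth could not map, such as unrecognised styles.
        foreach (var warning in result.Warnings)
        {
            Console.WriteLine(warning);
        }

        return result.Value;
    }
}
```

The returned value is an HTML fragment rather than a complete page, so in an ASP.NET app it would typically be written into an existing page or control.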

#3


2  

Word 2007 has an API that you can use to convert to HTML. Here's an article that talks about it: http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find more documentation on the API; as far as I remember, it includes a convert-to-HTML function.

#4


1  

This PHP code converts a .docx file to plain text:

function read_file_docx($filename) {

    // Bail out early if the file is missing (the original echoed "sucess" here).
    if (!$filename || !file_exists($filename)) {
        return false;
    }

    // A .docx file is a zip archive; the body text lives in word/document.xml.
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }
    $content = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($content === false) {
        return false;
    }

    // Separate table cells with a space and paragraphs with a line break.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    // Drop the remaining XML tags, then every character outside a basic ASCII
    // whitelist (note that this also removes accented and non-Latin text).
    $stripped_content = strip_tags($content);
    $stripped_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $stripped_content);

    echo nl2br($stripped_content);
}

#5


0  

I'm using Interop. It is somewhat problematic, but it works fine in most cases.

using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

This method returns the list of paths to the HTML-converted documents:

public List<string> GetHelpDocuments()
    {
        List<string> lstHtmlDocuments = new List<string>();
        string[] validExtensions = { ".doc", ".docx" };

        // helpFolderPath is an assumed field holding the folder passed to the
        // HelpDocuments constructor; the original called Directory.GetFiles(""), which throws.
        foreach (string _sourceFilePath in Directory.GetFiles(helpFolderPath))
        {
            string extension = System.IO.Path.GetExtension(_sourceFilePath).ToLowerInvariant();
            if (!validExtensions.Contains(extension)) continue;

            sourceFilePath = _sourceFilePath;
            destinationFilePath = System.IO.Path.ChangeExtension(_sourceFilePath, ".html");

            // Convert when no HTML copy exists yet, or when the source was modified after it.
            // (The original compared creation times with !=, which is almost always true.)
            if (!System.IO.File.Exists(destinationFilePath))
            {
                ConvertToHtml();
            }
            else if (System.IO.File.GetLastWriteTime(sourceFilePath) > System.IO.File.GetLastWriteTime(destinationFilePath))
            {
                System.IO.File.Delete(destinationFilePath);
                ConvertToHtml();
            }

            lstHtmlDocuments.Add(destinationFilePath);
        }
        return lstHtmlDocuments;
    }

And this one converts a document to HTML:

private void ConvertToHtml()
    {
        IsError = false;
        if (!System.IO.File.Exists(sourceFilePath)) return;

        Microsoft.Office.Interop.Word.Application docApp = null;
        try
        {
            docApp = new Microsoft.Office.Interop.Word.Application();
            docApp.Visible = true;
            docApp.DisplayAlerts = WdAlertLevel.wdAlertsNone;

            // Open the source document and save it again in Word's HTML format.
            object fileFormat = WdSaveFormat.wdFormatHTML;
            var doc = docApp.Documents.Open(sourceFilePath);
            doc.SaveAs2(destinationFilePath, fileFormat);
        }
        catch
        {
            IsError = true;
        }
        finally
        {
            try
            {
                // docApp is still null if Word failed to start.
                if (docApp != null) docApp.Quit(SaveChanges: false);
            }
            catch { }
            finally
            {
                // Caution: this kills every WINWORD process on the machine,
                // including instances that belong to other users or requests.
                Process[] wProcess = Process.GetProcessesByName("WINWORD");
                foreach (Process p in wProcess)
                {
                    p.Kill();
                }
            }
            if (docApp != null) Marshal.ReleaseComObject(docApp);
            docApp = null;
            GC.Collect();
        }
    }

Killing the Word process is not fun, but I can't leave it hanging there and blocking others, right?

On the web page, I render the HTML in an iframe.

There is a dropdown that contains the list of help documents. Each item's value is the path to the HTML version of a document, and its text is the name of the document.

private void BindHelpContents()
    {
        HelpDocuments hDoc = new HelpDocuments(Server.MapPath("~/HelpDocx/docx/"));
        List<string> lstHelpDocuments = hDoc.GetHelpDocuments();
        int index = 1;
        ddlHelpDocuments.Items.Insert(0, new ListItem { Value = "0", Text = "---Select Document---", Selected = true });

        foreach (string strHelpDocument in lstHelpDocuments)
        {
            // Value is the full path to the HTML file; Text is the file name without ".html".
            ddlHelpDocuments.Items.Insert(index, new ListItem { Value = strHelpDocument, Text = System.IO.Path.GetFileNameWithoutExtension(strHelpDocument) });
            index++;
        }
        FetchDocuments();

    }

When the selected index changes, the document is rendered in the frame:

    protected void RenderHelpContents(object sender, EventArgs e)
    {
        try
        {
            if (ddlHelpDocuments.SelectedValue == "0") return;
            string strHtml = ddlHelpDocuments.SelectedValue;
            string newaspxpage = strHtml.Replace(Server.MapPath("~/"), "~/");
            string pageVirtualPath = VirtualPathUtility.ToAbsolute(newaspxpage);// 
            documentholder.Attributes["src"] = pageVirtualPath;
        }
        catch
        {
            lblGError.Text = "Selected document doesn't exist, please refresh the page and try again. If that doesn't help, please contact Support";
        }
    }
