How do I load the text of an MS Word document in C# (.NET)?

Time: 2022-10-30 15:31:41

How do I load an MS Word document (.doc and .docx) into memory (a variable) without doing this?

wordApp.Documents.Open

I don't want to open MS Word; I just want the text inside.

You gave me an answer for DOCX, but what about DOC? I want a free, high-performance solution, not one that opens 12,000 instances of Word to process all of them. :( Aspose is a commercial product, and $900 is way too much for what I do.

7 answers

#1


4  

You can use wordconv.exe, which is part of the Office Compatibility Pack, to convert from doc to docx.

http://www.microsoft.com/downloads/details.aspx?familyid=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en

Just call the command like so: "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme InputFile OutputFile

I'm not sure whether you need Word installed for it to run, but it does work. I use it locally as a Windows shell command to convert old Office files to the 2007 format whenever I want.
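
A minimal C# sketch of calling the converter from code rather than from a shell, assuming the install path quoted above and that the two switches behave as described in this answer. The class and method names (`DocConverter`, `BuildArguments`, `Convert`) are made up for illustration, and the meaning of the exit code is not stated here, so checking that the output file exists afterwards is probably the safer test.

```csharp
using System;
using System.Diagnostics;

static class DocConverter
{
    // Path quoted in the answer; an assumption about the local installation.
    const string WordConvPath =
        @"C:\Program Files\Microsoft Office\Office12\wordconv.exe";

    // Builds the argument string: -oice -nme "input" "output".
    // Paths are quoted so that spaces in file names survive.
    public static string BuildArguments(string inputDoc, string outputDocx)
    {
        return $"-oice -nme \"{inputDoc}\" \"{outputDocx}\"";
    }

    // Starts the converter without a console window, waits for it to
    // finish, and returns its exit code.
    public static int Convert(string inputDoc, string outputDocx)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = WordConvPath,
            Arguments = BuildArguments(inputDoc, outputDocx),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

For a large batch you would call `DocConverter.Convert` once per file; each call is a short-lived converter process, not a full Word instance.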

#2


2  

For docx-formatted Word documents, I found this interesting article on The Code Project:

Using DocxToText to Extract Text from DOCX Files

In the article, the author discusses stripping out just the words themselves.

For your doc (non-docx) Word documents, other than using the Office APIs and (in the background) spawning an instance of Word, you could try shelling out to one of the many Doc2Docx converters on the market and then applying the above process to both.

#3


2  

If you are dealing with docx, you can do this without any interop with Word. A .docx file is actually a ZIP archive that contains XML files, and you can read the XML directly. Please refer to the links below:

http://conceptdev.blogspot.com/2007/03/open-docx-using-c-to-extract-text-for.html

Office (2007) Open XML File Formats
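
As a rough sketch of the ZIP-plus-XML approach in C#, assuming .NET 4.5 or later (for `System.IO.Compression`) and that the body text lives in the `word/document.xml` entry, which is the usual location but is strictly found through the package relationships. `DocxText` and `Extract` are illustrative names, not part of any library.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class DocxText
{
    // WordprocessingML main namespace used by <w:p> and <w:t> elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph in word/document.xml,
    // one paragraph per line. The caller keeps ownership of the stream.
    public static string Extract(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml, LoadOptions.PreserveWhitespace);

                // Concatenate the <w:t> runs inside each <w:p> paragraph.
                var paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }

    // Convenience overload for a file on disk.
    public static string Extract(string path)
    {
        using (FileStream file = File.OpenRead(path))
            return Extract(file);
    }
}
```

Usage would be something like `string text = DocxText.Extract("input.docx");`. The sketch skips headers, footers, footnotes, tabs and line breaks, and text inside text boxes may appear twice because those paragraphs are nested inside other paragraphs.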

#4


1  

I recently did some research on this topic. It turns out that to manipulate Word files programmatically without opening Word itself, you need some very expensive tools.

There's an article over at Code Project on manipulating Word that you might find useful. The author built a C# COM wrapper for dealing with calls to Word. It looks like it actually pops open the Word application, though.

This post over at the Neowin forums looks promising too. It includes quite a few P/Invoke calls for the purpose of text extraction.

Maybe if you could find a way to keep the window hidden, it would be acceptable.

#5


0  

Aspose has a component to read, modify and write Word documents. Here is the product link: Aspose.Words for .NET and Java

Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®. Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market.

#6


0  

With docxtemplater, you can easily get the full text of a Word document (works with docx only).

Here's the code (Node.js):

var DocxTemplater = require('docxtemplater');
var doc = new DocxTemplater().loadFromFile("input.docx");
var result = doc.getFullText();

This is just three lines of code and doesn't depend on any Word instance (all plain JS).

#7


-1  

I don't mean to be antagonistic, but why?

I've extracted data from Word documents on Linux servers using Word2X or AbiWord, and depending on the number and variety of documents, there will always be errors in the extraction. The more bullets, page breaks, document sections and other "special" features there are, the worse it gets.

I understand there are options now to automate OpenOffice to process documents, but my advice is, if you can, just use Word to process Word documents.
