Converting doc/docx to semantic HTML

Time: 2022-10-30 12:10:47

I would like to convert doc/docx documents to semantic HTML.

Some wishes/requirements:

  1. Semantic HTML such that headers in the document are <h1>, <h2> etc., tables are <table> and so forth.

  2. It should preferably handle headings, lists, tables and images. Graphs and math formulas are a nice extra.

• It doesn't have to convert straight from doc/docx to HTML; it could use an intermediary format, such as XML or DocBook.

• It should work programmatically, and with a large number of documents.

The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a lot of documents. It is more of a proof of concept.

5 solutions

#1


1  

There's a tool called upCast which is able to convert Word documents into XML.

#2


2  

"headers in the document are <h1>, <h2> etc." I think this is impossible, because MS Word only writes down the result: paragraphs (<p>) with different styles, just like printed text on paper. The original information is not recorded.
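
For .docx files at least, this can be checked directly: the package is a zip archive, and each paragraph in word/document.xml may carry a style id in a w:pStyle element. The sketch below lists those ids alongside the paragraph text. The function name paragraph_styles is made up for illustration, and ids such as Heading1 are only what Word's default English template typically uses; other templates may name them differently. It does not cover the binary .doc format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_styles(docx_file):
    """Return a list of (style_id, text) pairs, one per paragraph.

    docx_file may be a path or a binary file-like object.
    style_id is None when the paragraph has no explicit w:pStyle.
    (paragraph_styles is a hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    results = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The style reference, if any, sits at w:p/w:pPr/w:pStyle/@w:val.
        style = para.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Paragraph text is spread over one or more w:t elements.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        results.append((style_id, text))
    return results
```

If the ids that come back look like heading styles, a converter can map them to <h1>, <h2> and so on; if every paragraph comes back with None, the document really does contain only direct formatting.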

Your other wishes are achievable. There are two commercial tools that can do this (don't trust the free tools or online tools; they don't do the real work):

1 Word Cleaner by Zapadoo www.zapadoo.com
2 HTML Cleaner for Word by wonder Studio www.htmlcleaner.com

I prefer the second one, which was released just last year. You can try them both.

#3


1  

docx4j (for docx only, not doc) writes clean HTML output. You'd need to change things a bit if you wanted <h1> instead of <p class="h1">, but it's open source, so you can do that.
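
If changing the converter itself is not an option, the same result can be approximated by post-processing its output. Here is a minimal sketch in Python, assuming the HTML contains paragraphs written exactly as <p class="h1"> through <p class="h6"> with no other attributes. Those class names are taken from the example above and may not match what docx4j actually emits, and promote_headings is a made-up helper name. A regular expression is not a full HTML parser, so treat this as a starting point only.

```python
import re

# Matches <p class="hN">...</p> for N in 1..6, with no other attributes.
# Assumption: this is the shape of the converter's heading paragraphs.
HEADING_PARAGRAPH = re.compile(r'<p\s+class="h([1-6])"\s*>(.*?)</p>', re.DOTALL)


def promote_headings(html: str) -> str:
    """Rewrite <p class="hN">text</p> as <hN>text</hN>; leave everything else alone."""
    return HEADING_PARAGRAPH.sub(r"<h\1>\2</h\1>", html)
```

For example, promote_headings('<p class="h1">Intro</p><p>Text</p>') gives '<h1>Intro</h1><p>Text</p>'.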

#4


1  

I wrote a utility which implements the requirements you listed, excluding images, graphs and maths formulas. It's beta quality (i.e., it works on my machine). I published it at http://www.modeltext.com/word

#5


0  

Just more ideas.

Use Gmail to convert Word docs

http://www.oreillynet.com/mac/blog/2006/05/use_gmail_to_convert_word_docs.html
