How do I convert a Word document to very simple HTML?

Time: 2022-10-30 13:06:39

Every now and then I receive a Word document that I have to display as a web page. I currently use Django's flatpages for this, pasting in the HTML that MS Word generates, but that HTML is quite messy. Is there a better way, using Python, to generate very simple HTML?

6 Solutions

#1


6  

A good solution is to upload the document to Google Docs and export the HTML version from there. (There must be an API for that?)

The export already does a lot of cleanup; Beautiful Soup can then be used to make any further changes, as appropriate. Beautiful Soup is the most powerful and elegant HTML parsing library on the planet.

This is a known standard workflow at journalism companies.

#2


4  

I found this web page: http://www.textfixer.com/html/convert-word-to-html.php

It converts formatted text to simple HTML markup, preserving bold, italics, links, and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.
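Since the question asks for Python, the same idea (keep bold, italics, links, and paragraphs; drop everything else) can be sketched with the standard library alone. This is only a minimal sketch, not how the site above works: the names `ALLOWED_TAGS`, `SimpleHtmlCleaner`, and `simplify_html` are made up for illustration, and the choice of which tags to keep is an assumption you would adjust.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the only tags worth keeping. Any other tag is
# dropped, but the text inside it is preserved.
ALLOWED_TAGS = {"p", "b", "strong", "i", "em", "a"}
# Tags whose *content* is dropped as well (Word embeds large <style> blocks).
SKIPPED_TAGS = {"style", "script"}


class SimpleHtmlCleaner(HTMLParser):
    """Collects a simplified copy of the HTML fed to it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return
        if tag == "a":
            # Keep only the link target; discard every other attribute.
            href = dict(attrs).get("href")
            if href:
                self.parts.append('<a href="%s">' % escape(href, quote=True))
                return
        self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data, quote=False))


def simplify_html(html):
    """Return html with only the allowed tags and no styling attributes."""
    cleaner = SimpleHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling `simplify_html()` on the markup Word produces strips the `class` and `style` attributes and the `<span>` and `<font>` wrappers while leaving the text, emphasis, and links in place. Comments (including Word's conditional comments) are ignored because the parser's default comment handler does nothing.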

#3


3  

My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

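from urllib.parse import urlencode
from urllib.request import Request, urlopen

def decruft(html):
    # POST the raw HTML to the WordOff API and return the cleaned markup.
    # Assumes the API accepts and returns UTF-8.
    data = urlencode({'html': html}).encode('utf-8')
    req = Request('http://wordoff.org/api/clean', data)
    with urlopen(req) as response:
        return response.read().decode('utf-8')

# Goes on your flatpages model class:
def save(self, **kwargs):
    if not self.pk:  # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(**kwargs)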

#4


2  

It depends on how much formatting and how many images you're dealing with. I do one of a couple of things:

  • Google Docs: Probably the closest you'll get to the original formatting while still producing usable HTML.
  • Markdown: Abandon the formatting. Paste the text into a plain text editor, run it through Markdown, and fix the rest by hand.

#5


2  

You can also use Abiword/wvWare to convert the Word document to XHTML and then, if you need to, parse it with BeautifulSoup/ElementTree/etc. to preprocess it. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML.

I should mention that Abiword can be run from the command line, so it's easy to integrate into an automated process.
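A rough sketch of that pipeline in Python might look like the following. It rests on several assumptions: that the `abiword` binary is on your PATH, that `--to=html` writes an HTML file next to the input, and that the output is well-formed enough for ElementTree to parse. The function names and the attribute list are hypothetical and only illustrate the idea.

```python
import subprocess
import xml.etree.ElementTree as ET

# Assumption: these attributes carry only presentation and can be dropped.
PRESENTATIONAL_ATTRS = {"style", "class", "lang", "dir"}


def convert_with_abiword(doc_path):
    """Run AbiWord without a GUI to convert doc_path.

    Assumes `abiword` is installed and that `--to=html` produces an
    .html file alongside the input document.
    """
    subprocess.run(["abiword", "--to=html", doc_path], check=True)


def strip_presentational_attrs(xhtml):
    """Return the XHTML string with presentational attributes removed."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        for name in list(element.attrib):
            if name in PRESENTATIONAL_ATTRS:
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

You would call `convert_with_abiword("report.doc")`, read the resulting file, and pass its contents to `strip_presentational_attrs()`. Real converter output usually declares the XHTML namespace and may use named entities, both of which need extra handling with ElementTree; in that case BeautifulSoup is the more forgiving choice.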

#6


2  

Word 2010 has the ability to "save as filtered web page". This eliminates the overwhelming majority of the extra HTML that Word inserts.
