How can I inspect an HTML document programmatically?

I have a database full of small HTML documents and I need to programmatically insert several of them into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along the lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser that I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

5 solutions

#1

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to pull out of the HTML whatever you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.

#2

If the HTML is "well-formed XML" (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
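
A minimal sketch of that idea, using only the JDK's built-in DOM parser rather than Xerces directly. The class name XhtmlRuns, the Run type and the extractRuns method are made up for illustration; only <b> and <strong> are tracked, and the input is assumed to be a well-formed fragment with a single root element and no DOCTYPE or named entities such as &nbsp;.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: flattens well-formed XHTML into (text, bold) runs.
public class XhtmlRuns {

    /** One piece of text plus the bold state in effect where it appears. */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Parses the markup as XML and returns its text runs in document order. */
    public static List<Run> extractRuns(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            List<Run> runs = new ArrayList<>();
            walk(doc.getDocumentElement(), false, runs);
            return runs;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Ordinary tag-soup HTML ends up here: it is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Depth-first walk that carries the inherited bold flag down the tree.
    private static void walk(Node node, boolean bold, List<Run> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) { // skip whitespace-only text nodes
                out.add(new Run(text, bold));
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            boolean nowBold = bold
                    || name.equalsIgnoreCase("b")
                    || name.equalsIgnoreCase("strong");
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                walk(child, nowBold, out);
            }
        }
    }
}
```

Each Run could then be fed to the target document in order, along the lines of the setBold/insert pseudocode in the question. Handling style attributes would mean reading them in the ELEMENT_NODE branch and carrying more state than a single boolean.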

#3

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

#4

You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.

#5

Check out the Flying Saucer XHTML renderer: it renders well-formed XHTML files to PDF and lets you control the output using CSS.
