What is the best way to convert a PDF file to HTML in ASP.NET?

My users will select a PDF document on their machine and upload it to my website, where I will convert it into an HTML document for display on the site. The document will be stored in a database after conversion.

What's the best way to convert a PDF to HTML?

I have been handed a requirement where a user would create a "news" story as a PDF and then upload it to the server, where it will be converted to HTML and displayed on the website.

6 Solutions

#1

Any document creation software that can save documents as PDF can save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and your requirements stem from a desire to make uploading these documents as simple as possible to the user.

There are numerous conversion packages that can probably do this for you, however when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
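
To make the image-source check concrete, here is a minimal sketch in Python (used only for brevity; the same idea carries over to ASP.NET). It assumes the converter emits plain relative file references in `<img src>` and that the images have already been copied to a folder on your server. The function name `rewrite_image_sources` and the `/uploads/42/` path are hypothetical, not part of any conversion package.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches the src attribute of an <img> tag: prefix, quote character, value.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_image_sources(html, base_url):
    """Point every <img src> at base_url, keeping only the image file name."""
    base = base_url.rstrip("/")

    def fix(match):
        prefix, quote, src = match.groups()
        # Drop whatever folder the converter used and keep the file name.
        filename = posixpath.basename(urlparse(src).path)
        return f"{prefix}{quote}{base}/{filename}{quote}"

    return IMG_SRC.sub(fix, html)


converted = '<p><img alt="x" src="story_files/image001.png"></p>'
print(rewrite_image_sources(converted, "/uploads/42/"))
```

A regular expression is enough for a sketch like this; for real converter output, a proper HTML parser would be the safer choice.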

I would like to suggest an alternative that you can take to your team: implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as a PDF first and then upload it. The process becomes much smoother for your users, and you get the posts in a form that doesn't require you to spend thousands of dollars on developing or buying conversion code.

The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
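
For a feel of what the server would have to handle, here is a rough sketch of the XML-RPC request a MetaWeblog client sends for `metaWeblog.newPost`. It is written in Python purely for illustration and only builds the payload; nothing is sent over the network. The blog id, user name, password and post content are made-up values, and `build_new_post_request` is a hypothetical helper.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, body_html, publish=True):
    """Build the XML-RPC payload for metaWeblog.newPost (no network call)."""
    # In the MetaWeblog API the post is a struct; "description" holds the body.
    post = {"title": title, "description": body_html}
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


payload = build_new_post_request(
    "1", "alice", "secret", "Council news", "<p>Story text with markup.</p>"
)
print(payload)
```

The point is that the post already arrives as HTML in the `description` field, so no PDF conversion step is needed on the server.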

#2

I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back as a PDF. PDF files often contain binary elements such as images, so you may be better off converting the file to ASCII with an encoding such as Base64. That way you will have an ASCII string you can save into a text field in the DB and later decode back into the original file. Could you expand more on the main requirement?
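
As a minimal sketch of that round trip (Python here for brevity; in .NET, `Convert.ToBase64String` and `Convert.FromBase64String` play the same role), the two helper names below are hypothetical:

```python
import base64


def pdf_bytes_to_text(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII string that fits in a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def text_to_pdf_bytes(encoded: str) -> bytes:
    """Decode the stored string back into the original PDF bytes."""
    return base64.b64decode(encoded)


raw = b"%PDF-1.4\n\x00\xff\x10binary"  # stand-in for an uploaded file
stored = pdf_bytes_to_text(raw)
assert text_to_pdf_bytes(stored) == raw
```

Keep in mind that Base64 output is about a third larger than the input (4 characters for every 3 bytes), so a binary column avoids that overhead if your database supports one.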

#3

My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...

I would recommend that you stay away from converting the PDF to/from HTML (because unless you can find a commercial solution it will be nigh on impossible) and instead do as has already been mentioned and store it as an encoded Base64 string, or BLOB or some other binary format in the database, and then display it to the user with some sort of PDF view plugin for the browser.

#4

All it took was a simple Google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.

So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.

#5

Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the text content to the database. You could then let users search the content and download the PDF.
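
A minimal sketch of that storage layout, using Python and an in-memory SQLite database only for illustration. The table and column names are hypothetical, and the extracted text is assumed to come from whatever PDF library you use (iTextSharp in the suggestion above); no extraction happens in this code.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one table for uploaded stories."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stories ("
        "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
        "pdf BLOB NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save_story(conn, filename, pdf_bytes, extracted_text):
    """Store the original PDF bytes next to the text extracted from them."""
    cur = conn.execute(
        "INSERT INTO stories (filename, pdf, body) VALUES (?, ?, ?)",
        (filename, pdf_bytes, extracted_text),
    )
    conn.commit()
    return cur.lastrowid


def search_stories(conn, term):
    """Return (id, filename) for stories whose text contains the term."""
    rows = conn.execute(
        "SELECT id, filename FROM stories WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = open_store()
story_id = save_story(conn, "budget.pdf", b"%PDF-1.4 ...", "Council approves new budget")
print(search_stories(conn, "budget"))
```

A `LIKE` query is fine for a small site; a full-text index would be the next step if the archive grows.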

#6

You should look into DynamicPDF. They have a converter (currently in beta) that serves exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).

Ref: http://www.dynamicpdf.com/
