Docx to PDF using headless OpenOffice is too slow

Time: 2022-01-06 00:29:06

I've been using PHPWord to generate docx files, and it has been working great. But now I also need to make some of those files available as PDFs.

After some research I found PyODConverter, which uses OOo. It seemed like a good option since I don't want to depend on third-party web services. I tried it on my machine and it works fine, so I set it up on my server as well. It took a little longer, but I managed to get it working there too.

There is, however, a bad issue. On the server the conversion takes about 21 seconds, while on my machine it takes no longer than 2. :( This is far too slow for my needs, so I've been trying to spot what might be causing the delay. Starting OpenOffice in headless mode with socket creation is okay. So I've been going through the Python script to find out which instruction is slowing things down. I've narrowed it down to this line:

context = resolver.resolve("uno:socket,host=127.0.0.1,port=8100;urp;StarOffice.ComponentContext")

This is the call that takes about 20 seconds to execute. Here is the code around it:

import uno
from com.sun.star.connection import NoConnectException

# Get the local UNO context and a resolver for reaching the running office instance.
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
try:
    # This is the call that blocks for about 20 seconds on the server.
    context = resolver.resolve("uno:socket,host=127.0.0.1,port=8100;urp;StarOffice.ComponentContext")
except NoConnectException:
    raise DocumentConversionException("failed to connect to OpenOffice.org on port %s" % port)
self.desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)

Any clues on what might be causing this delay? I've ruled out the document I'm trying to convert, since these operations occur before it is touched. Could it be a problem with 'uno'? Or maybe a missing library that causes useless probing during the resolve() operation?

Any ideas are welcome. :)

Best regards, Restless

3 Solutions

#1


3  

I managed to eliminate the delay by using pipes instead of sockets for the connection.

context = resolver.resolve("uno:pipe,name=myuser_OOffice;urp;StarOffice.ComponentContext")
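For the pipe connection to work, the office process has to be started listening on the same pipe name that the script resolves. A minimal sketch of how the two strings relate, assuming the usual `-accept` start-up option of OpenOffice.org 3.x (LibreOffice spells it `--accept`); the helper names `accept_string` and `uno_url` are hypothetical, and `myuser_OOffice` is simply the pipe name used above:

```python
def accept_string(connection):
    """Value for the office '-accept=' start-up option (assumed OOo 3.x syntax)."""
    return "%s;urp;StarOffice.ServiceManager" % connection


def uno_url(connection):
    """URL to pass to resolver.resolve() for the same connection."""
    return "uno:%s;urp;StarOffice.ComponentContext" % connection


pipe = "pipe,name=myuser_OOffice"
sock = "socket,host=127.0.0.1,port=8100"

# Server side: start the office listening on the pipe.
print('soffice -headless -accept="%s"' % accept_string(pipe))
# Client side: the URL the conversion script resolves.
print(uno_url(pipe))
```

Swapping `pipe` for `sock` gives the socket variant from the question, so both transports can be compared with the same script.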

I still have one problem, though: the user executing the Python script must be the same user that started OOo for everything to work. Usually that would not be much of an issue, but I'm trying to run Python from my web application and I still haven't managed to get it working. I'm trying something like this:

exec('sudo -u#1000 -s python path/to/DocumentConverter.py filename.docx filename.pdf');

I'm getting nothing back from this, and I don't understand why. Maybe the user running exec() (www-data) does not have permission to execute sudo?
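PHP's exec() returns only the last line of standard output, so anything sudo prints on standard error is lost unless it is redirected. One way to find out what is going on is to capture the exit status and stderr of the command. A minimal Python sketch of that idea; `run_and_report` is a hypothetical helper, and the `-n` flag in the commented call (fail instead of prompting for a password) is an addition that is not part of the original command:

```python
import subprocess


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stdout, proc.stderr


# Example call, to be run as the web server user (www-data):
# code, out, err = run_and_report(
#     ["sudo", "-n", "-u", "#1000", "python",
#      "path/to/DocumentConverter.py", "filename.docx", "filename.pdf"])
# print("exit code:", code)
# print("stderr:", err)
```

If the exit code is non-zero and stderr complains about a required password, the likely cause is that www-data has no matching passwordless rule in sudoers.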

#2


2  

Perhaps the name resolver on the server doesn't know localhost (which would be very odd, but 20 seconds does sound like a DNS timeout). You could try replacing it with 127.0.0.1.

Alternatively, perhaps it's doing the lookup fine, getting both IPv6 and IPv4 addresses back for localhost, trying to make the connection via IPv6 and failing (i.e. the component may not support IPv6, or doesn't bind to that interface by default) and only then falling back to IPv4. In that case, the remedy would be the same: replace localhost with 127.0.0.1.
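Both theories can be checked from Python without involving uno at all. The sketch below lists the addresses the resolver returns for a host, in order, and times a plain TCP connection attempt; the function names are hypothetical, and port 8100 is the one from the question:

```python
import socket
import time

FAMILIES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolve_addresses(host, port):
    """Return (family, address) pairs in the order the resolver offers them."""
    infos = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    return [(FAMILIES.get(family, str(family)), sockaddr[0])
            for family, _type, _proto, _name, sockaddr in infos]


def time_connect(host, port, timeout=30.0):
    """Time one TCP connection attempt; returns (seconds, error or None)."""
    start = time.time()
    try:
        socket.create_connection((host, port), timeout).close()
        error = None
    except OSError as exc:
        error = exc
    return time.time() - start, error


# Example, with the office already listening on port 8100:
# for host in ("localhost", "127.0.0.1"):
#     print(host, resolve_addresses(host, 8100), time_connect(host, 8100))
```

If `localhost` lists an IPv6 address first and its connection attempt is the slow one, that would support the IPv6 fallback explanation.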

#3


2  

It's a pity that OpenOffice is so heavy. I was considering it too, but then I found a lighter solution: AbiWord.

I had to generate previews of the first 4 pages of an uploaded document. This is what I did:

abiword document.doc --to=ps --exp-props="pages:1-4"
gs -q -dNOPAUSE -dBATCH -dTextAlphaBits=4  -dGraphicsAlphaBits=4 -r72 -sDEVICE=pnggray -sOutputFile=preview%d.png document.ps

So you could get a recent AbiWord and try something like this:

abiword document.docx --to=pdf
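If the conversion is triggered from a script rather than typed by hand, passing the arguments as a list avoids shell quoting problems with uploaded file names. A minimal Python sketch, assuming `abiword` is on the PATH; both function names are hypothetical:

```python
import subprocess


def abiword_to_pdf_command(source):
    """Argument list equivalent to 'abiword <source> --to=pdf'."""
    return ["abiword", source, "--to=pdf"]


def convert_to_pdf(source):
    """Run the conversion; raises CalledProcessError if abiword exits non-zero."""
    subprocess.check_call(abiword_to_pdf_command(source))
```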
