Mismatched fonts when converting a PDF to JPEG with ImageMagick on Ubuntu?

Time: 2022-05-11 09:00:23

I am using this command to convert a PDF to a set of JPEG files:

convert -strip -quality 100 -alpha off \
        -density 165% -scene 1 tmp3GtW_h.pdf /tmp/a1.jpg

Here is the original PDF:

[Image: screenshot of the original PDF]

The font is thinner and more akin to Helvetica.

Here is the outcome:

[Image: screenshot of the converted JPEG]

The font in the output JPEG file is different and thicker.

The convert command shows this warning:

   **** Warning:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Microsoft? PowerPoint? 2013 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

The version of convert is:

$ convert --version
Version: ImageMagick 6.8.9-7 Q16 x86_64 2014-12-30 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC OpenMP
Delegates: jng jpeg png x xml zlib

Ghostscript version is:

$ gs --version
9.10

My questions are

1) How can I resolve this issue?

2) How can I tell what font the PDF file is using?

3) How can I tell what fonts are available to convert and gs?

EDIT: Found an answer to question 2. Here is the outcome from the pdffonts command:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Intro Black Italic                   Type 1            WinAnsi          no  no  no     145  0
Intro Regular                        Type 1            WinAnsi          no  no  no     147  0
Intro Black Inline Caps              Type 1            WinAnsi          no  no  no     388  0
ABCDEE+Segoe UI                      TrueType          WinAnsi          yes yes no    2233  0
ABCDEE+Segoe UI,Italic               CID TrueType      Identity-H       yes yes yes   2607  0
ABCDEE+Segoe UI,Italic               TrueType          WinAnsi          yes yes no    2612  0
Intro Bold Italic                    Type 1            WinAnsi          no  no  no    3781  0

1 Answer

#1


If you want to know all relevant details about the fonts used by a PDF document, use

pdffonts the.pdf

The emb column shows yes or no to indicate whether each font is embedded.
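
As a rough sketch of how to pull only the non-embedded fonts out of that listing, the helper below filters pdffonts output with awk. The function name list_unembedded_fonts is made up for this example, and it assumes the column layout shown in the question: because font names may contain spaces, it counts from the right, where emb is the fifth field from the end.

```shell
#!/bin/sh
# Hypothetical helper: reads `pdffonts` output on stdin and prints only the
# rows whose "emb" column is "no".
# Assumption: the last five fields of each row are emb, sub, uni and the
# two-part object ID, as in the listing shown in the question.
list_unembedded_fonts() {
  # NR > 2 skips the header row and the dashed separator row;
  # NF >= 5 guards against short or empty lines.
  awk 'NR > 2 && NF >= 5 && $(NF-4) == "no"'
}
```

It would be used as `pdffonts the.pdf | list_unembedded_fonts`. For the PDF in the question this should leave only the four Intro rows.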

If a font is NOT embedded, you get exactly what you are seeing: the PDF renderer does not find the font in the file, so it uses a substitute font:

  1. If you are lucky, it finds a font on the local system with the same or a similar name, and the rendered pages look the way they did for the producer of the PDF (who must have had a font with that name installed on their system).

  2. If you are less lucky, it uses a substitute font that is not really suitable and doesn't look good or "right".

  3. If you are very unlucky, the substitution doesn't work at all and the page looks like garbage.

Either way, the document will most likely look different from viewer to viewer and from system to system, because each viewer uses a different algorithm to substitute missing fonts.

The pdffonts command has the -subst parameter. So

pdffonts -subst the.pdf

will report which substitute fonts could be used. Because Poppler, the library pdffonts is based on, uses FreeType as its font engine, the reported substitutions will likely hold for every viewer that also uses FreeType.

Acrobat for example does NOT use FreeType, but its own font rendering engine. So in Adobe Reader you'll likely get different substitution fonts.


Ghostscript:

The command

gs -h

will report (amongst other things) which directories it will use as its path to search for fonts.
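
The help text is long, so a small filter can make the font search directories easier to read. The sketch below is only an illustration: the function name gs_search_path is invented here, and it assumes the help output contains a "Search path:" line followed by indented lines of directories separated by colons, which may differ between Ghostscript versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `gs -h` output on stdin and prints one search
# directory per line.
# Assumption: directories follow a "Search path:" line, are indented, and are
# separated by " : " (the exact layout may vary between versions).
gs_search_path() {
  awk '
    /^Search path:/ { in_path = 1; next }
    in_path && /^[ \t]/ {
      n = split($0, parts, ":")
      for (i = 1; i <= n; i++) {
        gsub(/^[ \t]+|[ \t]+$/, "", parts[i])   # trim surrounding blanks
        if (parts[i] != "") print parts[i]
      }
      next
    }
    in_path { exit }   # first non-indented line ends the section
  '
}
```

It would be used as `gs -h | gs_search_path`.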

Any Ghostscript command you run can be amended by

-sFONTPATH=/path/to/dir:/path/to/other/dir

to tell Ghostscript to look in other directories for needed fonts for the duration of the current command.

ImageMagick:

This command

convert -list font

will report all fonts which ImageMagick has found on the system.
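
That list can be long. A minimal sketch for narrowing it to one family follows; the function name fonts_matching is made up for this example, and it assumes the usual layout in which each entry starts with an indented "Font:" line followed by detail lines such as family and glyphs. Keep in mind that this only shows what convert can use for its own text drawing, not what Ghostscript will use when rendering the PDF.

```shell
#!/bin/sh
# Hypothetical helper: reads `convert -list font` output on stdin and prints
# only the entries whose "Font:" line contains the given text
# (case-insensitive).
# Assumption: every entry begins with a "Font:" line, and "Path:" lines
# separate groups of entries.
fonts_matching() {
  awk -v pat="$1" '
    BEGIN { pat = tolower(pat) }
    /^[ \t]*Path:/ { show = 0; next }                       # group header
    /^[ \t]*Font:/ { show = (index(tolower($0), pat) > 0) }  # new entry
    show                                                     # print entry lines
  '
}
```

It would be used as `convert -list font | fonts_matching intro`.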


Update: (after update to question)

So, very clearly, four different Intro fonts are not embedded in the PDF. This is a very uncommon font, certainly not among the top 200 fonts used in PDFs worldwide. (I should know, because I've harvested 1,000,000 PDFs from the web and am currently building a statistical database of their various properties; there is not a single Intro font in there.)

Whoever created that PDF, or whichever software did so, clearly didn't have much of a clue about document processing: every other system, user or application that has to open, view or process the document will render the pages that use these fonts very differently from what its creator saw.

In order to process this PDF into images you should not rely on ImageMagick, but run Ghostscript directly:

  1. Locate the directories that contain the four Intro fonts.

  2. Run the Ghostscript command with the -sFONTPATH=... parameter as explained above.
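
The two steps above could be sketched roughly as follows. Everything here is an assumption to adapt rather than a tested recipe: the function names, the font directories searched, the file extensions, the 165 dpi resolution (mirroring the -density value from the question) and the output file pattern. The GS variable exists only so the command can be dry-run with echo.

```shell
#!/bin/sh
# Step 1 (hypothetical helper): print every directory under the given roots
# that holds a font file whose name contains the pattern (case-insensitive).
find_font_dirs() {
  pattern=$1; shift
  find "$@" -type f -iname "*${pattern}*" \
       \( -iname '*.otf' -o -iname '*.ttf' -o -iname '*.pfb' \) 2>/dev/null \
    | while IFS= read -r f; do dirname "$f"; done \
    | sort -u
}

# Step 2 (hypothetical helper): render the PDF to JPEG pages with Ghostscript
# directly, passing the font directories via -sFONTPATH.
# Arguments: input PDF, colon-separated font path, output file pattern.
render_pdf() {
  pdf=$1; fontpath=$2; out=$3
  "${GS:-gs}" -dNOPAUSE -dBATCH -dSAFER \
      -sDEVICE=jpeg -r165 -dJPEGQ=100 \
      -sFONTPATH="$fontpath" \
      -sOutputFile="$out" "$pdf"
}

# Example usage (assumed locations; adjust to your system):
#   dirs=$(find_font_dirs intro /usr/share/fonts "$HOME/.fonts" | paste -sd: -)
#   render_pdf tmp3GtW_h.pdf "$dirs" '/tmp/a%d.jpg'
```

The `%d` in the output pattern makes Ghostscript write one numbered JPEG per page. If find_font_dirs prints nothing, the Intro fonts are not installed in the searched locations, and Ghostscript will substitute again.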


Let me re-iterate:

  1. You cannot force convert, or even hint to it, to use a particular font when rendering PDF pages to raster images.

  2. This is because ImageMagick never gets to see the PDF itself. What ImageMagick receives is a raster image that Ghostscript has already produced.

  3. Once Ghostscript has finished its work, the damage is done, and convert cannot insert any 'font' into the raster data after the fact.

  4. The fonts that convert can use are only for its own drawing, writing, captioning and annotating operations.

  5. So you have to run Ghostscript directly and supply the -sFONTPATH=... argument.

  6. You have to find out for yourself where on your system the Intro font family is installed. I cannot do that for you, sorry.

Running convert -verbose will give you some insight into how exactly ImageMagick employs Ghostscript as its 'delegate' for PDF input processing, and which command line parameters it uses.
