How do I convert LaTeX to plain text (ASCII)?

Time: 2022-10-30 14:35:12

Scenario:
I have a document I created using LaTeX (my resume in this case). It compiles correctly with pdflatex and outputs exactly what I'd like. Now I need the same document converted to plain old ASCII.

Example:
I have seen this done (at least once) here, where the author has a PDF version and an ASCII version that matches the PDF version in almost every way, including margins, spacing and bullet points.

I realize this type of conversion cannot be exact due to limitations in the ASCII format, but a very close approximation does seem possible based on what I have found so far. What is the process for doing this?

14 answers

#1


16  

CatDVI can convert DVI to text and attempts to preserve the formatting.

#2


39  

Opendetex is available both for Windows and Linux (compiles fine on a Mac as well). It can be downloaded from http://code.google.com/p/opendetex/downloads/list

Usage: http://code.google.com/p/opendetex/wiki/Usage

Extract it to any directory of your choice. Say you extracted it to your Downloads directory.

Inside the extracted directory, create another directory with any name (this is optional but recommended). Let's say the directory is named “my_paper”. Put your paper in the “my_paper” directory, and assume the file is named project.tex.

Navigate to the extracted directory:

    cd ~/Downloads/opendetex

Run the command

    detex my_paper/project.tex > out.txt

Generic form (the -n option tells detex not to follow \input and \include commands):

    detex -n full_path_to_tex_file.tex > output_text_file.txt

#3


14  

You can try some of the programs proposed here:

TeX to ASCII

#4


8  

Another option is to use htlatex to create a web page from the LaTeX sources, then use the links text browser to convert it to plain text. In the past I used the command line

    links -dump -no-numbering -no-references input.html > output.txt

which gave a rather nice result. Of course, the output will match the rendered HTML rather than the original PDF, so it may not be exactly what you want.

#5


8  

You can also try Pandoc, which can convert LaTeX to many other formats. I suggest reading its documentation, since some tricky cases may require passing extra arguments.

#6


5  

If you are using pdflatex, you probably don't want to mess around with your package options to switch to latex to generate a DVI.

Instead, take your PDF file and convert that. This worked for my CV/resume made with the Curve package:

    pdftotext -layout MyResume.pdf

Note the -layout flag.
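
Depending on the build, pdftotext may write UTF-8, so bullets and ligatures can survive in the output even though the goal is plain ASCII. Here is a minimal sketch of a follow-up pass, assuming a POSIX shell with sed and grep. The function names ascii_fold and nonascii_lines are made up for illustration and are not part of pdftotext, and only a few characters are mapped:

```shell
#!/bin/sh
# Sketch only: ascii_fold and nonascii_lines are hypothetical helper names.

# Map a few common typographic characters to ASCII (reads stdin, writes stdout).
ascii_fold() {
    sed -e 's/•/*/g' \
        -e 's/ﬁ/fi/g' \
        -e 's/ﬂ/fl/g'
}

# List lines (with line numbers) that still contain non-ASCII bytes.
nonascii_lines() {
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
}

# Typical use, assuming pdftotext is installed ("-" sends the text to stdout):
#   pdftotext -layout MyResume.pdf - | ascii_fold > MyResume.txt
#   nonascii_lines MyResume.txt
```

Anything nonascii_lines still reports has to be fixed by hand or by adding another sed expression.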

#7


3  

My usual strategy is to use hyperlatex to turn it into a web page, and then copy and paste from a web browser. I find that this gives the best formatting.

I usually then have to go through and manually fix some line-wrapping...

#8


3  

Try the steps here: http://zanedp.livejournal.com/201222.html

Here is a sequence that converts my LaTeX file to plain text:

$ latex file.tex
$ catdvi -e 1 -U file.dvi | sed -re "s/\[U\+2022\]/*/g" | sed -re "s/([^^[:space:]])\s+/\1 /g" > file.txt

The -e 1 option tells catdvi to output ASCII. If you use 0 instead of 1, it will output Unicode, which includes all the special characters like bullets, em dashes, and Greek letters. It also includes ligatures for some letter combinations like "fi" and "fl", which you may not want, so use -e 1 instead. Use the -U option to make it print the Unicode value of unknown characters so that you can easily find and replace them.

The second part of the command finds the string [U+2022] which is used to designate bullet characters (•) and replaces them with an asterisk (*).

The third part eats up all the extra whitespace catdvi threw in to make the text full-justified while preserving spaces at the start of lines (indentation).

After running these commands, you would be wise to search the .txt file for the string [U+ to make sure no Unicode characters that can't be mapped to ASCII were left behind and fix them.
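
The cleanup and the final check described above can be wrapped in two small shell functions. This is a sketch under assumptions: the names clean_catdvi and leftover_markers are made up for illustration, and the whitespace pattern is a portable variant (POSIX character classes with sed -E) rather than the exact expression quoted in the answer.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of catdvi.

# Replace the [U+2022] bullet marker with "*", then collapse any run of
# whitespace that follows a non-space character into a single space.
# Whitespace at the start of a line (indentation) is left alone.
clean_catdvi() {
    sed -E -e 's/\[U\+2022\]/*/g' \
           -e 's/([^[:space:]])[[:space:]]+/\1 /g'
}

# Print every line (with its number) that still contains a "[U+" marker.
# Exits non-zero when nothing is found, as grep does.
leftover_markers() {
    grep -n -F '[U+' "$1"
}

# Typical use, assuming latex and catdvi are installed:
#   latex file.tex
#   catdvi -e 1 -U file.dvi | clean_catdvi > file.txt
#   leftover_markers file.txt
```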

#9


3  

When I needed to get the plain text from my TeX file for indexing and searching, I found LaTeX2RTF to be a good solution. It has an installer and GUI for Windows, and it produced an RTF file of my 50-page thesis that I could open in Word.

#10


3  

The solution that works best for me is the following. Assuming you have the latex document name (without extension) stored in ${BASENAME} you apply these 3 steps:

    htlatex ${BASENAME}.tex
    iconv -f iso-8859-1 -t utf-8 ${BASENAME}.html > ${BASENAME}-utf8.html
    html2markdown ${BASENAME}-utf8.html > ${BASENAME}.txt

Apparently, you need to have tex4ht and python-html2text installed.
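
The three steps can also be chained so that a failure stops the pipeline early. This is only a sketch: the wrapper name tex_to_markdown_text is made up, and it assumes htlatex, iconv and html2markdown are on the PATH and that htlatex writes ISO-8859-1 HTML, as the iconv step above implies.

```shell
#!/bin/sh
# Sketch only: tex_to_markdown_text is a hypothetical wrapper name.
# Usage: tex_to_markdown_text BASENAME   (file name without the .tex extension)
tex_to_markdown_text() {
    base=$1
    # Each step runs only if the previous one succeeded.
    htlatex "$base.tex" &&
    iconv -f iso-8859-1 -t utf-8 "$base.html" > "$base-utf8.html" &&
    html2markdown "$base-utf8.html" > "$base.txt"
}

# Example:
#   tex_to_markdown_text resume    # reads resume.tex, writes resume.txt
```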

#11


2  

I've tried LyX and it works pretty well. The only catch is that if your TeX file includes other TeX files, you will need to export each of them separately, unless I'm missing something.

#12


0  

You can import the file into LyX and use LyX's export-to-text feature.

It's kind of silly if you don't use LyX, but if you already have it, this is a very quick and easy solution. It gave good results for me, although to be fair my files are pretty simple. I'm not sure how well more elaborate files get converted.

#13


0  

Emacs has the commands iso-iso2tex and iso-tex2iso, which work very well, except that they don't convert single commands like \OE to Œ.

#14


0  

Pandoc lets you convert files from one format to another. Use the following pandoc command:

pandoc -s /path/to/foobar.tex -o foobar.txt

If you want your lines to break at a certain column, use the --columns flag. Use --columns 10000 to effectively disable line breaking.

You can change the output file given in -o foobar.txt to target a number of other formats, such as Markdown (.md). If you don't specify -o foobar.txt, pandoc prints HTML to standard output, which you can render in any online tool.
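
One caveat worth checking against the pandoc manual: as far as I know, pandoc picks the output format from the file extension and may treat .txt as Markdown, so a truly plain-text result can need the writer named explicitly. Here is a hedged sketch, assuming a pandoc version that supports -t plain and --columns. The wrapper name tex_to_plain and its default width of 72 are my own choices, not anything pandoc defines:

```shell
#!/bin/sh
# Sketch only: tex_to_plain is a hypothetical wrapper, not a pandoc command.
# Usage: tex_to_plain INPUT.tex OUTPUT.txt [COLUMNS]
tex_to_plain() {
    # -t plain selects pandoc's plain-text writer;
    # --columns sets the width at which lines are wrapped (default here: 72).
    pandoc -s "$1" -t plain --columns="${3:-72}" -o "$2"
}

# Example:
#   tex_to_plain /path/to/foobar.tex foobar.txt 80
```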

To install pandoc, follow the official documentation.
