How to correct the character encoding of a file?

Date: 2023-01-06 19:30:48

I have an ANSI encoded text file that should not have been encoded as ANSI as there were accented characters that ANSI does not support. I would rather work with UTF-8.

Can the data be decoded correctly or is it lost in transcoding?

What tools could I use?

Here is a sample of what I have:

Ã§ Ã©

I can tell from context (cafÃ© should be café) that these should be these two characters:

ç é

12 Answers

#1


19  

EDIT: A simple possibility to eliminate before getting into more complicated solutions: have you tried setting the character set to UTF-8 in the text editor in which you're reading the file? This could just be a case of somebody sending you a UTF-8 file that you're reading in an editor set to, say, cp1252.

Just taking the two examples, this is a case of utf8 being read through the lens of a single-byte encoding, likely one of iso-8859-1, iso-8859-15, or cp1252. If you can post examples of other problem characters, it should be possible to narrow that down more.

As visual inspection of the characters can be misleading, you'll also need to look at the underlying bytes: the § you see on screen might be either 0xa7 or 0xc2a7, and that will determine the kind of character set conversion you have to do.
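One way to check those underlying bytes, sketched in Python (the helper name and the choice of § as the probe character are just for illustration):

```python
def section_sign_origin(raw: bytes) -> str:
    """Report whether a section sign in the raw bytes is the UTF-8
    pair 0xC2 0xA7 or a lone 0xA7 from a single-byte encoding."""
    idx = raw.find(b"\xa7")
    if idx > 0 and raw[idx - 1] == 0xC2:
        return "utf-8 pair 0xc2 0xa7"
    if idx >= 0:
        return "lone 0xa7 (single-byte encoding)"
    return "no 0xa7 byte found"

# The same on-screen character, stored two different ways:
print(section_sign_origin("§".encode("utf-8")))   # utf-8 pair 0xc2 0xa7
print(section_sign_origin("§".encode("cp1252")))  # lone 0xa7 (single-byte encoding)
```

On a real file you would read the bytes with `open(path, "rb")` and run the same check; a hex viewer or `hexdump -C` gives you the same information.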

Can you assume that all of your data has been distorted in exactly the same way - that it's come from the same source and gone through the same sequence of transformations, so that for example there isn't a single é in your text, it's always Ã©? If so, the problem can be solved with a sequence of character set conversions. If you can be more specific about the environment you're in and the database you're using, somebody here can probably tell you how to perform the appropriate conversion.
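When that assumption holds, the "sequence of character set conversions" is a single round-trip: re-encode the mojibake with the codec that misread it, then decode the resulting bytes as UTF-8. A minimal Python sketch, assuming cp1252 was the misreading:

```python
def fix_mojibake(text: str, misread_as: str = "cp1252") -> str:
    """Undo UTF-8 text that was decoded with the wrong single-byte codec."""
    return text.encode(misread_as).decode("utf-8")

print(fix_mojibake("cafÃ©"))  # café
```

If the distortion happened more than once along the way, the round-trip has to be repeated once per misreading.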

Otherwise, if the problem characters are only occurring in some places in your data, you'll have to take it instance by instance, based on assumptions along the lines of "no author intended to put Ã§ in their text, so whenever you see it, replace it with ç". The latter option is more risky, firstly because those assumptions about the intentions of the authors might be wrong, secondly because you'll have to spot every problem character yourself, which might be impossible if there's too much text to visually inspect or if it's written in a language or writing system that's foreign to you.

#2


18  

Follow these steps with Notepad++

1- Copy the original text

2- In Notepad++, open new file, change Encoding -> pick an encoding you think the original text follows. Try as well the encoding "ANSI" as sometimes Unicode files are read as ANSI by certain programs

3- Paste

4- Then convert to Unicode by going again through the same menu: Encoding -> "Encode in UTF-8" (not "Convert to UTF-8"), and hopefully it will become readable

The above steps apply for most languages. You just need to guess the original encoding before pasting in notepad++, then convert through the same menu to an alternate Unicode-based encoding to see if things become readable.

Most languages exist in two forms of encoding:

1- The old legacy ANSI (ASCII-based) form, only 8 bits, was used initially by most computers. 8 bits allowed only 256 possibilities: 128 of them were the regular Latin and control characters, and the final 128 values were read differently depending on the PC's language settings.

2- The new Unicode standard gives a unique code to each character in all currently known languages, with plenty more to come. If a file is Unicode, it should be understood on any PC with the language's font installed. Note that UTF-8 covers the same full range of characters as UTF-16 and UTF-32; it just sticks to single bytes for Latin characters to save disk space.

#3


8  

When you see character sequences like Ã§ and Ã©, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:

U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here

tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.

The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.

I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or more conveniently, with a program that includes a table mapping between the offending sequences and the desired characters, like these:

â€œ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote
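A table like that is straightforward to apply in Python; the entries below are illustrative, and in practice the table would grow as you spot more problem sequences:

```python
# Illustrative mojibake-to-intended-character table (extend as needed).
FIXES = {
    "â€œ": "\u201c",  # opening double curly quote
    "Ã©": "é",
    "Ã§": "ç",
}

def repair(text: str) -> str:
    """Apply every known substitution from the table, in order."""
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    return text

print(repair("cafÃ©"))  # café
```

The third-party ftfy library (`ftfy.fix_text`) automates exactly this kind of repair without a hand-built table.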

Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.

Hope that helps.

#4


6  

With vim from the command line:

vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename

#5


3  

Use iconv - see Best way to convert text files between character sets?

#6


2  

In the Sublime Text editor, File -> Reopen with Encoding -> choose the correct encoding.

Generally, the encoding is auto-detected, but if not, you can use the above method.

#7


1  

If you see question marks in the file, or if the accents are already lost, going back to UTF-8 will not help your cause. E.g. if café became cafe, changing the encoding alone will not help (and you'll need the original data).
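The difference between the two failure modes is easy to demonstrate in Python; `errors="replace"` below stands in for whatever a lossy converter did to the original file:

```python
s = "café"

# Mojibake: the bytes survive, they were just misread - reversible.
garbled = s.encode("utf-8").decode("cp1252")
print(garbled)                                   # cafÃ©
print(garbled.encode("cp1252").decode("utf-8"))  # café (recovered)

# Lossy conversion: the accented character is gone - not reversible.
lossy = s.encode("ascii", errors="replace").decode("ascii")
print(lossy)                                     # caf?
```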

Can you paste some text here, that'll help us answer for sure.

#8


0  

And then there is the somewhat older recode program.

#9


0  

There are programs, such as chardet, that try to detect the encoding of a file. Then you could convert it to a different encoding using iconv. But that requires that the original text is still intact and no information has been lost (for example, by removing accents or whole accented letters).
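chardet is a third-party Python library; `chardet.detect(raw_bytes)` returns its best guess plus a confidence score. If installing it is not an option, a crude stdlib-only fallback exploits the fact that legacy single-byte text is almost never valid UTF-8 by accident (the cp1252 fallback below is an assumption - substitute your likely legacy codec):

```python
def guess_encoding(raw: bytes) -> str:
    """Crude heuristic: text that decodes as strict UTF-8 almost
    certainly is UTF-8; otherwise assume a single-byte legacy codec."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed fallback; pick your likely legacy codec

print(guess_encoding("café".encode("utf-8")))   # utf-8
print(guess_encoding("café".encode("cp1252")))  # cp1252
```

Pure-ASCII input decodes under both and is reported as UTF-8, which is harmless since ASCII is a subset of UTF-8.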

#10


0  

I found a simple way to auto-detect file encodings: change the file to a text file (on a Mac, rename the file extension to .txt) and drag it to a Mozilla Firefox window (or use File -> Open). Firefox will detect the encoding - you can see what it came up with under View -> Character Encoding.

I changed my file's encoding using TextMate once I knew the correct encoding: File -> Reopen using encoding and choose your encoding. Then File -> Save As, change the encoding to UTF-8, and change the line endings to LF (or whatever you want).

#11


0  

On OS X, Synalyze It! lets you display parts of your file in different encodings (all of which are supported by the ICU library). Once you know what the source encoding is, you can copy the whole file (bytes) via the clipboard and insert it into a new document where the target encoding (UTF-8 or whatever you like) is selected.

UnicodeChecker is also very helpful when working with UTF-8 or other Unicode representations.

#12


0  

I found this question when searching for a solution to a code page issue I had with Chinese characters, but in the end my problem was just an issue with Windows not displaying them correctly in the UI.

In case anyone else has that same issue, you can fix it simply by changing the locale in Windows to China and then back again.

I found the solution here:

http://answers.microsoft.com/en-us/windows/forum/windows_7-desktop/how-can-i-get-chinesejapanese-characters-to/fdb1f1da-b868-40d1-a4a4-7acadff4aafa?page=2&auth=1

Also upvoted Gabriel's answer, as looking at the data in Notepad++ was what tipped me off about Windows.
