Ruby 2: Detecting the encoding of binary ASCII-8BIT data

Time: 2022-05-07 10:57:09

I have to load some data from external sources. When I check the encoding, Ruby reports ASCII-8BIT, i.e. binary. However, some of the sources are actually encoded in ISO-8859-1 and some in UTF-8. When I try to convert the ISO-8859-1 data to UTF-8 directly, I get an error, but something like content.force_encoding('ISO-8859-1').encode('UTF-8') works fine.
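
To illustrate the behavior described above, here is a minimal sketch. The byte string is a made-up sample (the word "café" as ISO-8859-1 bytes), not data from the actual sources. It shows that transcoding straight from ASCII-8BIT raises, while relabeling the bytes first succeeds.

```ruby
# Hypothetical sample: "café" as ISO-8859-1 bytes, tagged as binary.
# String#b returns a copy of the string with encoding ASCII-8BIT.
content = "caf\xE9".b

# Direct transcoding fails: bytes >= 0x80 in ASCII-8BIT have no
# defined mapping to UTF-8.
error = nil
begin
  content.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  error = e
end

# force_encoding only relabels the bytes (no conversion happens);
# encode then transcodes from that label to UTF-8.
# dup keeps the original binary string untouched.
utf8 = content.dup.force_encoding('ISO-8859-1').encode('UTF-8')
```

The relabeling step is only correct when the bytes really are ISO-8859-1, which is why the source encoding has to be known or detected first.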

However, this doesn't work the other way round: when I try to encode the UTF-8 data to ISO, I end up with broken characters.

So, is there a way to detect the "underlying" encoding of the ASCII-8BIT data, and then convert it to UTF-8?

1 solution

#1


I had a quick google and found the Charlock Holmes gem by Brian Lopez. It looks like it does the detection process you're after.

https://github.com/brianmario/charlock_holmes
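
If only UTF-8 and ISO-8859-1 are expected, as in the question, a simpler standard-library heuristic may be enough. The sketch below is not the gem's approach, and to_utf8 is a hypothetical helper name: it treats the bytes as UTF-8 when they form valid UTF-8 and otherwise assumes ISO-8859-1.

```ruby
# Hypothetical helper, not part of Charlock Holmes.
# Assumption: the input is either UTF-8 or the fallback encoding.
def to_utf8(bytes, fallback: Encoding::ISO_8859_1)
  # Relabel a copy as UTF-8 and check whether the bytes are well-formed.
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: relabel as the fallback and transcode to UTF-8.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

from_latin1 = to_utf8("caf\xE9".b)      # ISO-8859-1 bytes for "café"
from_utf8   = to_utf8("caf\xC3\xA9".b)  # UTF-8 bytes for "café"
```

This works because non-ASCII ISO-8859-1 text rarely happens to be valid UTF-8, but it is a guess rather than a guarantee, and it cannot tell ISO-8859-1 apart from other single-byte encodings such as Windows-1252. For a wider range of encodings, a detector like the gem above is the better fit.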
