Force encoding from unknown to UTF-8 or any encoding in R?

Time: 2023-01-06 08:18:12

I am reading data from an old proprietary database. Unfortunately, for some strings (not all), Encoding(mychar_vector) returns "unknown". I am using a wrapper around a closed-source C HLI (host language interface), so there is probably not much I can do about that; if I am wrong, I would be glad to be corrected.

However, apart from a few replacements I had to make with gsub (see my related question), the strings in the vector look fine. I would love to regain control of the encoding. Is there a way to forcefully set the encoding to UTF-8? I tried:

Encoding(mychar_vector) <- "UTF-8"
# or
mychar_vector <- enc2utf8(mychar_vector)

Neither of these worked: checking immediately afterwards still returned "unknown". I also looked into iconv, but there is apparently no way to convert from "unknown" to UTF-8, since there is no mapping.
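One possible explanation, offered as an assumption because the actual data is not shown: R never marks ASCII-only strings with an encoding, so Encoding() keeps reporting "unknown" for them even after Encoding<- or enc2utf8. The sketch below uses made-up values (ascii_only and cafe_bytes are illustrative names) to show that behaviour next to a string that does contain non-ASCII bytes.

```r
# ASCII-only strings have the same bytes in every supported encoding,
# so R leaves them unmarked regardless of what is requested.
ascii_only <- c("abc", "plain text")
Encoding(ascii_only) <- "UTF-8"
print(Encoding(ascii_only))
print(Encoding(enc2utf8(ascii_only)))

# A string built from raw bytes: "caf" followed by the two-byte
# UTF-8 sequence 0xC3 0xA9 (e with acute accent).
cafe_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
Encoding(cafe_bytes) <- "UTF-8"   # declares the encoding, bytes are unchanged
print(Encoding(cafe_bytes))
```

If the elements that stay "unknown" are plain ASCII, nothing is wrong with them: they are already valid UTF-8 and can be mixed freely with the marked elements.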

Is there a way to tell R that only UTF-8 characters are involved, so that the encoding can be set to UTF-8? Note that some elements of the vector are already UTF-8.
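A minimal sketch of one way to approach this, under stated assumptions: the vector below is invented test data, and the elements that are not valid UTF-8 are assumed to be Latin-1, which may not be true for the real database. It checks each element with validUTF8 (base R, available since R 3.3.0), converts only the invalid ones with iconv, and then declares the whole vector as UTF-8.

```r
# Hypothetical stand-in for the vector returned by the database wrapper.
mychar_vector <- c(
  "plain ascii",
  rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))),  # UTF-8 bytes
  rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))         # Latin-1 bytes
)

# Which elements are already valid UTF-8 byte sequences?
is_utf8 <- validUTF8(mychar_vector)
print(is_utf8)

fixed <- mychar_vector
# Convert only the invalid elements; "latin1" is an assumption about the source.
fixed[!is_utf8] <- iconv(mychar_vector[!is_utf8], from = "latin1", to = "UTF-8")

# Every element is now valid UTF-8, so declaring the encoding is safe.
# ASCII-only elements will still report "unknown".
Encoding(fixed) <- "UTF-8"
print(Encoding(fixed))
print(all(validUTF8(fixed)))
```

Declaring an encoding with Encoding<- never changes bytes, so it should only be applied to elements that validUTF8 accepts; for the rest, a real conversion with a known source encoding is needed.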

2 answers

#1



When I have dealt with files that were not properly UTF-8 encoded, I have had great success using iconv to forcefully convert the file, simply by running a bash command in my rmarkdown notebook:

iconv -c -t UTF-8 myfile.txt > Ratebeer-myfile.txt

You could also try this, where file.txt is your original file and file-iconv.txt is the converted file:

iconv -f iso-8859-1 -t UTF-8 file.txt > file-iconv.txt

Verify the encoding with (-I is the macOS spelling; GNU file on Linux uses -i):

file -I file-iconv.txt
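Since the question concerns strings that are already inside R, the same two steps can be sketched with R's own iconv function instead of the command-line tool. This is an illustration under assumptions: the temporary file and its contents are invented, and "latin1" is only a guess at a source encoding. The sub = "" argument plays a role similar to the -c flag by dropping bytes that cannot be converted.

```r
# Write a small Latin-1 file to a temporary path (stand-in for file.txt).
# Bytes spell "caf", 0xE9 (e with acute accent in Latin-1), " ok", newline.
src <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x20, 0x6f, 0x6b, 0x0a)), src)

# readLines does no conversion here; the bytes arrive unchanged.
lines_in <- readLines(src, warn = FALSE)

# Comparable to: iconv -f iso-8859-1 -t UTF-8
lines_utf8 <- iconv(lines_in, from = "latin1", to = "UTF-8")

# Comparable to: iconv -c -t UTF-8 (bytes that are invalid as UTF-8 are dropped)
lines_clean <- iconv(lines_in, from = "UTF-8", to = "UTF-8", sub = "")

# Verify inside R instead of calling the file utility.
print(validUTF8(lines_utf8))
print(Encoding(lines_utf8))
print(validUTF8(lines_clean))
```

Dropping bytes loses information, so the variant with an explicit source encoding is usually preferable whenever that encoding can be determined.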

Let me know whether this helps.

#2



If you can query the data source so that it returns delimited, table-like input instead of a string, you can use read.table, which accepts an explicit encoding argument. This common usage works well:

read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "UTF-8")
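One detail worth knowing, as documented in ?read.table: the encoding argument only marks the strings that are read as Latin-1 or UTF-8 and does not re-encode them, while fileEncoding is the argument that re-encodes the input. A minimal sketch with an invented temporary file (csv_path and its contents are illustrative) that already contains UTF-8 bytes:

```r
# Hypothetical one-column file: header "name", one row with UTF-8 bytes
# ("caf" followed by 0xC3 0xA9).
csv_path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("name\n"), as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a))),
  csv_path
)

# encoding = "UTF-8" declares the bytes to be UTF-8 without converting them.
df_marked <- read.table(csv_path, header = TRUE, stringsAsFactors = FALSE,
                        encoding = "UTF-8")

print(Encoding(df_marked$name))
print(validUTF8(df_marked$name))
```

If the file is in some other encoding, marking it as UTF-8 would mislabel the bytes; in that case fileEncoding, or a later call to iconv with the real source encoding, is the safer route.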
