Write UTF-8 files from R

Date: 2023-01-06 08:13:30

While R seems to handle Unicode characters well internally, I am not able to write a data frame containing such characters to a UTF-8 text file. Is there any way to force this?

data.frame(c("hīersumian","ǣmettigan"))->test
write.table(test,"test.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")

The output text file reads:

hiersumian <U+01E3>mettigan

I am using R version 3.0.2 in a Windows environment (Windows 7).

EDIT

It's been suggested in the answers that R is writing the file correctly in UTF-8, and that the problem lies with the software I'm using to view the file. Here's some code where I'm doing everything in R. I'm reading in a text file encoded in UTF-8, and R reads it correctly. Then R writes the file out in UTF-8 and reads it back in again, and now the correct Unicode characters are gone.

read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
myinputfile[1,1]
write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
myoutputfile[1,1]

Console output:

> read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
> myinputfile[1,1]
[1] hīersumian
Levels: hīersumian ǣmettigan
> write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
> read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
> myoutputfile[1,1]
[1] <U+FEFF>hiersumian
Levels: <U+01E3>mettigan <U+FEFF>hiersumian
> 
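
The <U+FEFF> at the start of the last result is a byte order mark (BOM) that has been read in as part of the data. One way to see what is really in a file, independent of any viewer or console, is to look at its raw bytes. The following is only a sketch: the temporary file and its contents are made up for illustration and stand in for myoutputfile.txt.

```r
# Hypothetical stand-in for the file in question: a UTF-8 file that
# starts with a byte order mark (bytes EF BB BF).
path <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
writeBin(c(bom, charToRaw("h\u012bersumian\n")), path)

# Read the whole file as raw bytes; no decoding happens here.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Detect a leading BOM and drop it if present.
has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
if (has_bom) bytes <- bytes[-(1:3)]

# Declare the remaining bytes to be UTF-8 text.
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"

has_bom
utf8ToInt(txt)[2]   # code point of the second character
```

If the second code point comes back as 299 (U+012B), the macron is present in the file and any loss is happening when the text is read or displayed, not when it is written.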

2 Answers

#1


7  

This "answer" is meant more to show that something odd is going on behind the scenes:

It seems "hīersumian" does not even make it into the data frame intact: the "ī" is converted to "i" in every case.

options("encoding" = "native.enc")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
#             a
# 1 hiersumian 

options("encoding" = "UTF-8")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
#             a
# 1 hiersumian 

options("encoding" = "UTF-16")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
#             a
# 1 hiersumian 
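
A likely explanation is that the literal is converted to the native encoding when it is typed, before data.frame() ever sees it. A commonly used way around that is to spell the characters as \u escapes, so the source text is plain ASCII. This is a minimal sketch under that assumption, not a confirmed diagnosis; the variable name words is made up here.

```r
# Build the strings from Unicode escapes (U+012B = "i with macron",
# U+01E3 = "ae with macron") so nothing depends on the console encoding.
words <- c("h\u012bersumian", "\u01e3mettigan")
t1 <- data.frame(a = words, stringsAsFactors = FALSE)

# How R has marked each string internally.
Encoding(t1$a)

# Code point of the second character of the first word.
utf8ToInt(t1$a[1])[2]
```

A result of 299 (U+012B) for the last line means the macron survived inside the data frame, even if the console cannot display it.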

The following sequence successfully writes "ǣmettigan" to the text file:

t2 <- data.frame(a = c("ǣmettigan"), stringsAsFactors=F)

getOption("encoding")
# [1] "native.enc"

Encoding(t2[,"a"]) <- "UTF-16"

write.table(t2,"test.txt",row.names=F,col.names=F,quote=F)

This does not work with the "encoding" option set to "UTF-8" or "UTF-16", and specifying "fileEncoding" as well leads either to defective output or to no output at all.

Somewhat disappointing, as until now I had managed to fix every Unicode issue somehow.
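
For comparison, a workaround often suggested for this kind of problem is to convert the text to UTF-8 explicitly and then write the bytes through a connection without letting R re-encode them. The sketch below uses only base R; the temporary file and the variable names are assumptions for illustration, and it has not been checked against R 3.0.2 on Windows 7.

```r
words <- c("h\u012bersumian", "\u01e3mettigan")
out <- tempfile(fileext = ".txt")

# Open in binary mode so no newline translation or re-encoding is applied.
con <- file(out, open = "wb")
# enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes as they are.
writeLines(enc2utf8(words), con, useBytes = TRUE)
close(con)

# Read the lines back and mark them as UTF-8.
back <- readLines(out, encoding = "UTF-8")
identical(back, words)
```

The point of useBytes = TRUE is that writeLines() skips its usual translation to the native encoding, which is where characters without a native equivalent tend to be replaced.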

#2


1  

I may be missing something OS-specific, but data.table appears to have no problem with this (or perhaps more likely it's an update to R internals since this question was originally posed):

t1 = data.table(a = c("hīersumian", "ǣmettigan"))
tmp = tempfile()
fwrite(t1, tmp)
system(paste('cat', tmp))
# a
# hīersumian
# ǣmettigan
fread(tmp)
#             a
# 1: hīersumian
# 2:  ǣmettigan
