Why does R read the UTF-8 header as text?

Time: 2022-06-22 22:19:43

I saved an Excel table as text (*.txt). Unfortunately, Excel doesn't let me choose the encoding, so I have to open the file in Notepad (which opens it as ANSI) and save it again as UTF-8. Then, when I read it in R:

data <- read.csv("my_file.txt",header=TRUE,sep="\t",encoding="UTF-8")

it shows the name of the first column beginning with "X.U.FEFF.". I know these are the bytes reserved to tell any program that the file is in UTF-8 format, so they shouldn't appear as text! Is this a bug, or am I missing some option? Thanks in advance!

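A quick way to confirm that the re-saved file really does begin with a UTF-8 byte order mark is to look at its first three bytes, which should be EF BB BF. A minimal sketch, using the file name from the question:

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
readBin("my_file.txt", what = "raw", n = 3)
# Expected output if a BOM is present: [1] ef bb bf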

3 solutions

#1 (score: 9)

So I was going to give you instructions on how to manually open the file and check for and discard the BOM, but then I noticed this (in ?file):

As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).

which means that if you have a sufficiently new R interpreter,

read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)

should do what you want.

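For the call in the question, that would look something like the following (an untested sketch that simply swaps the original encoding="UTF-8" argument for fileEncoding="UTF-8-BOM"):

data <- read.csv("my_file.txt", header=TRUE, sep="\t", fileEncoding="UTF-8-BOM")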

#2 (score: 1)

Most of the arguments in read.csv are dummy arguments -- including fileEncoding.

Use read.table instead:

 read.table("my_file.txt", header=TRUE, sep="\t", fileEncoding="UTF-8")

#3 (score: 0)

Possible solution from the comments:

Try it with the read.csv argument check.names=FALSE. Note that if you use this, you will not be able to directly reference columns with the $ notation, unless you surround the name in quotes. For instance: yourdf$"first col".

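Put together, that might look like the following sketch (the data frame name yourdf and the column name "first col" are just the placeholders used in the answer above):

# check.names=FALSE keeps the column names exactly as they appear in the file
# instead of passing them through make.names().
yourdf <- read.csv("my_file.txt", header=TRUE, sep="\t",
                   encoding="UTF-8", check.names=FALSE)
# A non-syntactic name has to be quoted after $, or accessed with [[ ]]:
yourdf$"first col"
yourdf[["first col"]]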
