How do you remove columns from a data frame?

Not so much 'How do you...?' but more 'How do YOU...?'

If someone gives you a file with 200 columns, and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame dat with columns col1, col2, through col200. If you only wanted columns 1-100, 125-135 and 150-200, you could:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print out the names that way you have spaces everywhere and have to put in the commas manually... Is there a command that will give you "col1","col2","col3",... as its output so you can easily grab what you want?
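
For names that follow a pattern like the ones here, one possible way to get such a vector without typing anything out is to build it with paste0(). This is only a sketch on a toy stand-in for the 200-column data frame; keep and dat_small are illustrative names, not part of the original data.

```r
# Toy 200-column data frame standing in for the real file (hypothetical data)
dat <- as.data.frame(matrix(0, nrow = 3, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Build the names to keep from the numeric ranges instead of typing them out
keep <- paste0("col", c(1:100, 125:135, 150:200))
dat_small <- dat[, keep]

ncol(dat_small)  # 162
```

Because keep is an ordinary character vector, it can be reused with any of the name-based approaches listed above.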

11 solutions

#1

I use data.table's := operator to delete columns instantly regardless of the size of the table.

DT[,coltodelete:=NULL]

or

DT[,c("col1","col20"):=NULL]

or

DT[,(125:135):=NULL]

or

DT[,(variableHoldingNamesOrNumbers):=NULL]

Any solution using <- or subset will copy the whole table. data.table's := operator merely modifies the internal vector of pointers to the columns, in place. That operation is therefore (almost) instant.

#2

To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but fewer than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.

For more than that, I'll use subset, with negative names (!):

subset(mtcars, , -c(mpg, cyl, disp, hp))

#3

For clarity, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data that meets given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.

#4

Use read.table with "NULL" entries in colClasses to avoid reading the unwanted columns in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

    x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)
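
For a wide file like the 200-column one in the question, the colClasses vector can be built programmatically rather than typed out. The following is a sketch under assumptions: the file, its column names, and the variables keep, classes and small are made up for illustration.

```r
# Hypothetical 200-column file written to a temp path for illustration
dat <- as.data.frame(matrix(1:400, nrow = 2, ncol = 200))
names(dat) <- paste0("col", 1:200)
tmp <- tempfile()
write.table(dat, tmp, row.names = FALSE)

# Start by skipping every column, then re-enable the ones to keep
keep <- c(1:100, 125:135, 150:200)
classes <- rep("NULL", 200)
classes[keep] <- NA  # NA tells read.table to detect the type automatically

small <- read.table(tmp, header = TRUE, colClasses = classes)
unlink(tmp)

dim(small)  # 2 162
```

The skipped columns are never turned into R vectors, so nothing has to be dropped afterwards.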

#5

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when regular expressions are involved.

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than reading it into RAM at all. I have done this with C, and it can be blisteringly fast.

#6

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
b=rnorm(100),
c=rnorm(100),
d=rnorm(100),
e=rnorm(100),
f=rnorm(100),
g=rnorm(100))

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g

Removing columns "c" and "g"

df[,-c(3,7)]

This is especially useful if you have data frames that are large or have long column names that you don't want to type. It also helps when the column names follow a pattern, because then you can use seq() to build the indices to remove.
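
As a small sketch of the seq() idea, assuming the unwanted columns sit at every second position (the data frame and the names drop_idx and df_odd are made up for illustration):

```r
# Toy data frame with ten columns, named V1..V10 by default (hypothetical data)
df <- as.data.frame(matrix(rnorm(30), nrow = 3, ncol = 10))

# Positions 2, 4, 6, 8, 10
drop_idx <- seq(2, ncol(df), by = 2)

# Negative indices drop those columns
df_odd <- df[, -drop_idx]

names(df_odd)
```

Positional indices break silently if the column order changes, so this is best kept for quick interactive work.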

RE: Your edit

You don't necessarily have to type quotes around each string, or commas between them, to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))

#7

Just addressing the edit.

@nzcoops, you do not need the column names in a comma delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The , just separates arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error-prone to select from the elements of names(dat) than to turn its output into a comma-separated string you can cut and paste from.

Say we want columns col1 and col3. Subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always print the vector to the screen in quotes ("):

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. Far better to work with objects that return what you want and use standard subsetting routines to keep what you need.
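
As a rough sketch of working with names(dat) directly, assuming the columns to keep can be identified by the numeric suffix of their names (the extra column other and the variables num and keep are made up for illustration):

```r
dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3, other = 1:3)

# Strip the "col" prefix and convert the rest to a number;
# names without a numeric suffix become NA
num <- suppressWarnings(as.integer(sub("^col", "", names(dat))))

# Logical vector marking the columns whose number is wanted
keep <- !is.na(num) & num %in% c(1, 3)

dat[, keep]
```

Nothing is printed, copied or pasted here; the selection is computed from the names themselves.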

#8

If you already have a vector of names, which there are several ways to create, you can easily use the subset function to keep or drop columns.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a vector of column names which is pre-created. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names
df1 <- as.matrix(as.character(names(df)))

#retaining only the name values you want to keep
KEEP <- as.vector(df1[c(1:3,5,6),])

#subsetting the initial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
            a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733

#9

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

I thought it was really clever to make a list of columns "not to include".

#10

The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)

Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")

#11

You can use the setdiff function:

If there are more columns to keep than to delete: suppose you want to delete two columns, say col1 and col2, from a data.frame DT. You can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]

If there are more columns to delete than to keep: Suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]
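
A short sketch of the setdiff approach on a made-up four-column data frame (kept and one are illustrative names). It also shows one caveat with plain data frames: when only a single column survives, `[` returns a vector unless drop = FALSE is given.

```r
DT <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9, col4 = 10:12)

# Drop col1 and col2 by keeping the set difference of the names
kept <- DT[, setdiff(names(DT), c("col1", "col2"))]
names(kept)

# With one remaining column, drop = FALSE keeps the result a data frame
one <- DT[, setdiff(names(DT), c("col1", "col2", "col3")), drop = FALSE]
class(one)
```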
