Conditionally assigning one variable the value of one of two other variables

Time: 2021-06-04 14:18:20

I want to create a new variable that is equal to the value of one of two other variables, conditional on the values of still other variables. Here's a toy example with fake data.

Each row of the data frame represents a student. Each student can be studying up to two subjects (subj1 and subj2), and can be pursuing a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, about 50 subjects, and students can have up to five majors/minors.

   ID  subj1 degree1  subj2 degree2
1   1    BUS      BA   <NA>    <NA>
2   2    SCI      BA    ENG      BA
3   3    BUS      MN    ENG      BA
4   4    SCI      MN    BUS      BA
5   5    ENG      BA    BUS      MN
6   6    SCI      MN   <NA>    <NA>
7   7    ENG      MN    SCI      BA
8   8    BUS      BA    ENG      MN
...

Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject with degree equal to "BA". I tried the following code:

df$major[df$degree1 == "BA"] = df$subj1
df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2

Unfortunately, I got an error message:

> df$major[df$degree1 == "BA"] = df$subj1
Error in df$major[df$degree1 == "BA"] = df$subj1 : 
  NAs are not allowed in subscripted assignments

I assume this means that a vectorized assignment can't be used if the assignment evaluates to NA for at least one row.

I feel like I must be missing something basic here, but the code above seemed like the obvious thing to do and I haven't been able to come up with an alternative.

In case it would be helpful in writing an answer, here's sample data, created using dput(), in the same format as the fake data listed above:

structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L, 
2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L
), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L, 
NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L, 
2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L, 
NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"), 
    degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA, 
    2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA", 
    "MN"), class = "factor")), .Names = c("ID", "subj1", "degree1", 
"subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")

2 solutions

#1


28  

Your original method of assignment is failing for at least two reasons.

1) A problem with the subscripted assignment df$major[df$degree1 == "BA"] <-. Using == can produce NA, which is what prompted the error. From ?"[<-": "When replacing (that is using indexing on the lhs of an assignment) NA does not select any element to be replaced. As there is ambiguity as to whether an element of the rhs should be used or not, this is only allowed if the rhs value is of length one (so the two interpretations would have the same outcome)." There are many ways to get around this, but I prefer using which:

df$major[which(df$degree1 == "BA")] <-

The difference is that == returns TRUE, FALSE and NA, while which() returns only the indices of the elements that are TRUE:

> df$degree1 == "BA"
 [1] FALSE    NA  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

> which(df$degree1 == "BA")
 [1]  3  4  5  8  9 10 11 12 13 14 15 16 17 18 19 20
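The rule quoted from the help page can be seen on a toy vector. This is only a minimal sketch; the names x, idx and msg are made up for illustration and are not part of the question's data.

```r
# Toy vector and a logical index that contains an NA (hypothetical names).
x <- c(10, 20, 30)
idx <- c(TRUE, NA, FALSE)

# A length-one right-hand side is allowed: the NA simply selects nothing.
x[idx] <- 0

# A longer right-hand side is rejected, so capture the error message
# instead of letting it stop the script.
msg <- tryCatch({
  x[idx] <- c(1, 2)
  "no error"
}, error = function(e) conditionMessage(e))

# which() drops the NA, so only position 1 is replaced here.
x[which(idx)] <- 99
```

After running this, msg holds the error text and x has been changed only at position 1.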

2) When you perform a subscripted assignment, the right hand side needs to fit into the left hand side sensibly (this is the way I think of it). This can mean left and right hand sides of equal length, which is what your example seems to imply. Therefore, you would need to subset the right hand side of the assignment as well:

df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
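To see why the right-hand side needs the same subset, here is a small sketch on made-up vectors (subj, deg, keep, wrong and right are illustrative names, not from the question). Without the subset, R fills the selected positions with the first values of the full vector, so the subjects come from the wrong rows.

```r
# Four hypothetical students; only rows 2 and 4 have a "BA".
subj <- c("BUS", "SCI", "ENG", "SCI")
deg  <- c("MN", "BA", "MN", "BA")
keep <- which(deg == "BA")

# Unsubsetted right-hand side: the two selected slots receive the FIRST
# two values of subj. R normally warns about the length mismatch, which
# is silenced here so the sketch runs quietly.
wrong <- rep(NA_character_, length(subj))
suppressWarnings(wrong[keep] <- subj)

# Subsetting both sides with the same index keeps rows aligned.
right <- rep(NA_character_, length(subj))
right[keep] <- subj[keep]
```

Here wrong ends up with "BUS" in position 2, which is row 1's subject, while right has "SCI" in both selected positions.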

I hope that clarifies why your original attempt produced an error.

Using ifelse, as suggested by @DavidRobinson, is a good way of doing this type of assignment. My take on it:

df$major2 <- ifelse(df$degree1 == "BA", df$subj1, ifelse(df$degree2 == "BA",
  df$subj2,NA))

This is equivalent to

df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <- 
  df$subj2[which(df$degree1 != "BA" & df$degree2 == "BA")]

Depending on the depth of the nested ifelse statements, another approach might be better for your real data.
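One such alternative, sketched here under assumptions rather than taken from the question, is to loop over the subject/degree column pairs and fill in only the rows that have not been matched yet. The helper name first_ba_major, its n_pairs argument, and the subj<i>/degree<i> naming pattern are hypothetical. Note one deliberate difference from the nested ifelse above: a missing degree1 does not block a later "BA" from being picked up, which may or may not be what you want.

```r
# Hypothetical helper: return the first subject whose degree is "BA",
# scanning columns subj1/degree1, subj2/degree2, ... up to n_pairs.
first_ba_major <- function(df, n_pairs) {
  major <- rep(NA_character_, nrow(df))
  done  <- rep(FALSE, nrow(df))          # rows that already have a major
  for (i in seq_len(n_pairs)) {
    deg  <- as.character(df[[paste0("degree", i)]])
    subj <- as.character(df[[paste0("subj", i)]])
    hit  <- which(!done & deg == "BA")   # which() drops NA comparisons
    major[hit] <- subj[hit]
    done[hit]  <- TRUE
  }
  major
}

# Made-up data in the same layout as the question's example.
toy <- data.frame(
  subj1   = c("BUS", "SCI", "ENG", NA),
  degree1 = c("BA",  "MN",  "MN",  NA),
  subj2   = c(NA,    "ENG", "SCI", "BUS"),
  degree2 = c(NA,    "BA",  "MN",  "BA"),
  stringsAsFactors = TRUE
)
toy$major <- first_ba_major(toy, n_pairs = 2)
```

With five majors/minors you would call it with n_pairs = 5 instead of nesting five ifelse calls.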


EDIT:

I was going to write a third reason for the original code failing (namely that df$major wasn't yet assigned), but it works for me without having to do that. This was a problem I remember having in the past, though. What version of R are you running? (2.15.0 for me.) This step is not necessary if you use the ifelse() approach. Your solution is fine when using [, although I would have chosen

df$major <- NA

To get the character values of the subjects, instead of the factor level index, use as.character() (which for factors is equivalent to and calls levels(x)[x]):

df$major[which(df$degree1 == "BA")] <- as.character(df$subj1)[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <- 
  as.character(df$subj2)[which(df$degree1 != "BA" & df$degree2 == "BA")]

Same for the ifelse() way:

df$major2 <- ifelse(df$degree1 == "BA", as.character(df$subj1),
  ifelse(df$degree2 == "BA", as.character(df$subj2), NA))
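A short sketch of why as.character() matters, using a made-up factor (f, cond, codes and labels are illustrative names): ifelse() drops the factor attributes, so what comes back is the underlying integer level code rather than the label.

```r
# Hypothetical factor; levels are sorted alphabetically: BUS, ENG, SCI.
f <- factor(c("SCI", "BUS", "ENG"))
cond <- c(TRUE, FALSE, TRUE)

# Passing the factor directly yields the level codes, not the labels.
codes <- ifelse(cond, f, NA)

# Converting first keeps the subject names.
labels <- ifelse(cond, as.character(f), NA)

# For a factor, as.character(f) gives the same result as levels(f)[f].
same <- identical(as.character(f), levels(f)[f])
```

Here codes holds numbers where labels holds "SCI" and "ENG".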

#2


7  

In general, the ifelse function is the right choice for these situations, something like:

df$major = ifelse((!is.na(df$degree1) & df$degree1 == "BA") &
                  (is.na(df$degree2) | df$degree2 != "BA"),
                  as.character(df$subj1), as.character(df$subj2))

However, the exact condition depends on what you want to happen when both df$degree1 and df$degree2 are "BA".
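As one possible way to settle that case, here is a sketch that follows the question's rule (the first subject with a "BA" wins) and uses %in%, which returns FALSE rather than NA when the degree is missing. The vectors below are made up, and preferring subj1 on a tie is an assumption about the desired behaviour.

```r
# Made-up data: row 1 has "BA" in both slots, row 2 has a missing degree1.
subj1 <- c("BUS", "SCI", "ENG", "SCI")
deg1  <- c("BA",  NA,    "MN",  "BA")
subj2 <- c("ENG", "BUS", NA,    "ENG")
deg2  <- c("BA",  "BA",  NA,    "MN")

# %in% never returns NA, so no extra is.na() checks are needed.
first  <- deg1 %in% "BA"
second <- !first & deg2 %in% "BA"

# subj1 wins when both degrees are "BA"; rows with no "BA" stay NA.
major <- ifelse(first, subj1, ifelse(second, subj2, NA_character_))
```

If you would rather take subj2 on a tie, swap the roles of first and second.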
