How do I copy values from one data column into another while matching on other columns in R?

Time: 2022-04-08 04:21:54

I've searched a number of places (*, r-blogger, etc), but haven't quite found a good option for doing this in R. Hopefully someone has some ideas.

I have a set of environmental sampling data. The data includes a variety of fields (visit date, region, location, sample medium, sample component, result, etc.).

Here's a subset of the pertinent fields. This is where I start...

visit_date   region    location     media      component     result
1990-08-20   LAKE      555723       water       Mg            *Nondetect
1999-07-01   HILL      432422       water       Ca            3.2
2010-09-12   LAKE      555723       water       pH            6.8
2010-09-12   LAKE      555723       water       Mg            2.1
2010-09-12   HILL      432423       water       pH            7.2
2010-09-12   HILL      432423       water       N             0.8
2010-09-12   HILL      432423       water       NH4          112

What I hope to reach is a table/dataframe like this:

visit_date   region    location     media      component     result        pH
1990-08-20   LAKE      555723       water       Mg            *Nondetect  *Not recorded
1999-07-01   HILL      432422       water       Ca            3.2         *Not recorded
2010-09-12   LAKE      555723       water       pH            6.8         6.8
2010-09-12   LAKE      555723       water       Mg            2.1         6.8
2010-09-12   HILL      432423       water       pH            7.2         7.2
2010-09-12   HILL      432423       water       N             0.8         7.2
2010-09-12   HILL      432423       water       NH4          112          7.2

I attempted to use the method here -- R finding rows of a data frame where certain columns match those of another -- but unfortunately didn't get the result I wanted. Instead, the pH column was either my pre-populated value (-999) or NA, not the pH value for that particular visit date when one was collected. Since the result data set is around 500k records, I'm using unique(tResults$pH) to check the values of the pH column.

Here's that attempt. res is the original result data.frame and component would be the pH result subset (the pH sample results from the main results table).

keys <- c("region", "location", "visit_date", "media")

tResults <- data.table(res, key=keys)
tComponent <- data.table(component, key=keys)

tResults[tComponent, pH>0]

I've attempted using match, merge, and within on the original data frame without success. Since then I've generated a subset for the components (pH in this example) where I copied over the results column to a new "pH" column, thinking I could match the keys and update a new "pH" column in the main result set.

Since not all result values are numeric (some are values like *Not recorded), I attempted to substitute numeric placeholders such as -888 so that I could force at least the result and pH columns to be numeric. Aside from the dates, which are POSIXct values, the remaining columns are character columns. The original data frame was created using stringsAsFactors=FALSE.

Once I can do this, I'll be able to generate similar columns for other components that can be used to populate and calculate other values for a given sample. At least that's my goal.

So I'm stumped on this one. In my mind it should be easy but I'm certainly NOT seeing it!

Your help and ideas are certainly welcome and appreciated!

1 solution

#1


# df1 is your original data set (a data.frame)
df1$phtem<-with(df1,ifelse(component=="pH",result,NA))

library(data.table)
library(zoo) # locf function

setDT(df1)[,pH:=na.locf(phtem,na.rm = FALSE)]
    visit_date region location media component     result phtem  pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect    NA  NA
2: 1999-07-01   HILL   432422 water        Ca        3.2    NA  NA
3: 2010-09-12   LAKE   555723 water        pH        6.8   6.8 6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1    NA 6.8
5: 2010-09-12   HILL   432423 water        pH        7.2   7.2 7.2
6: 2010-09-12   HILL   432423 water         N        0.8    NA 7.2
7: 2010-09-12   HILL   432423 water       NH4        112    NA 7.2

# You can delete phtem if you don't need it.
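
One caveat with the fill above: na.locf carries the last non-NA value forward down the whole column, so a visit with no pH sample would inherit the pH of whichever visit precedes it in row order. Below is a minimal sketch of the same idea restricted to each sample group. The three-row table and the column names pH_ungrouped and pH_grouped are made up for illustration, and it assumes the pH row comes first within its group.

```r
library(data.table)
library(zoo)  # na.locf

# Hypothetical toy data: the last visit has no pH sample at all
df1 <- data.table(
  visit_date = c("2010-09-12", "2010-09-12", "2011-05-01"),
  region     = c("LAKE", "LAKE", "HILL"),
  location   = c("555723", "555723", "432422"),
  media      = "water",
  component  = c("pH", "Mg", "Ca"),
  result     = c("6.8", "2.1", "3.2")
)
keys <- c("region", "location", "visit_date", "media")

# Keep the result only on pH rows, NA elsewhere
df1[, phtem := ifelse(component == "pH", result, NA_character_)]

# Ungrouped fill: the value leaks from the LAKE visit into the HILL visit
df1[, pH_ungrouped := na.locf(phtem, na.rm = FALSE)]

# Grouped fill: the value is carried forward only within one visit
df1[, pH_grouped := na.locf(phtem, na.rm = FALSE), by = keys]
```

In this toy table the third row gets "6.8" in pH_ungrouped but stays NA in pH_grouped.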

Edit:

library(data.table)
setDT(df1)[,pH:=result[component=="pH"],by="region,location,visit_date,media"]
df1

   visit_date region location media component     result  pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect  NA
2: 1999-07-01   HILL   432422 water        Ca        3.2  NA
3: 2010-09-12   LAKE   555723 water        pH        6.8 6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1 6.8
5: 2010-09-12   HILL   432423 water        pH        7.2 7.2
6: 2010-09-12   HILL   432423 water         N        0.8 7.2
7: 2010-09-12   HILL   432423 water       NH4        112 7.2
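
Since the question started from keyed tables, the same result can also be sketched as an update join: pull the pH rows into their own table and join them back onto the main table by the key columns. This is only a sketch, not the answer's method. The name ph is made up for illustration, the data is retyped from the question's sample, and it assumes at most one pH reading per region/location/visit_date/media combination (with several, the last matching one would be kept).

```r
library(data.table)

# Sample data retyped from the question (all columns kept as character)
df1 <- data.table(
  visit_date = c("1990-08-20", "1999-07-01", rep("2010-09-12", 5)),
  region     = c("LAKE", "HILL", "LAKE", "LAKE", "HILL", "HILL", "HILL"),
  location   = c("555723", "432422", "555723", "555723",
                 "432423", "432423", "432423"),
  media      = "water",
  component  = c("Mg", "Ca", "pH", "Mg", "pH", "N", "NH4"),
  result     = c("*Nondetect", "3.2", "6.8", "2.1", "7.2", "0.8", "112")
)
keys <- c("region", "location", "visit_date", "media")

# One row per visit that has a pH reading
ph <- df1[component == "pH", c(keys, "result"), with = FALSE]
setnames(ph, "result", "pH")

# Update join: rows of df1 matching ph on the keys get that visit's pH;
# rows without a match are left as NA
df1[ph, pH := i.pH, on = keys]

# Optional: use the question's placeholder for visits with no pH sample
df1[is.na(pH), pH := "*Not recorded"]
```

Because the join matches on the key columns rather than on row order, it does not matter where the pH row sits within a visit.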
