R statistics: how to grep for an exact word

Time: 2022-10-20 09:21:55

I'd like to grep for "nitrogen" in the following character vector and get back only the entry that is exactly "nitrogen", not the other entries that merely contain it (e.g. "nitrogen fixation"):

varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")

I tried something like this:

grepl(pattern= "![[:space:]]nitrogen![[:space:]]", varnames)

But this doesn't work. I hope somebody can help me with this.

3 Answers

#1


13  

To get the indices of the elements that are exactly equal to "nitrogen" you could use

which(varnames == "nitrogen")

Depending on what you want to do, you might not even need the 'which', as varnames == "nitrogen" gives a logical vector of TRUE/FALSE. If you just want to do something like replace all of the occurrences of "nitrogen" with "oxygen", this should suffice

varnames[varnames == "nitrogen"] <- "oxygen"
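As a small illustration of the difference between the logical vector and the indices, here is a sketch run on the varnames vector from the question. The %in% variant is an addition that is not in the answer above; on this data it gives the same result as ==, and differs only in returning FALSE rather than NA for missing values.

```r
varnames <- c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation",
              "total dissolved nitrogen", "total nitrogen")

# Logical vector: TRUE only where the whole string equals "nitrogen"
is_exact <- varnames == "nitrogen"

# Integer positions of the TRUE entries
idx <- which(is_exact)

# Assumption for illustration: a copy of the vector with a missing value
with_na <- c(varnames, NA)
eq_na <- with_na == "nitrogen"     # last element is NA
in_na <- with_na %in% "nitrogen"   # last element is FALSE
```

which() drops NA entries, so which(eq_na) and which(in_na) return the same indices; the difference only matters when the logical vector is used directly for subsetting or assignment.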

#2


14  

Although Dason's answer is easier, you could do an exact match using grep via:

varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")

grep("^nitrogen$",varnames,value=TRUE)
[1] "nitrogen"

grep("^nitrogen$",varnames)
[1] 1
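The anchors are what make this an exact match: ^ ties the pattern to the start of the string and $ to the end. A word-boundary pattern is not enough for this data, because "nitrogen" is a complete word in every entry. The sketch below contrasts the two; the \\b variant with perl = TRUE is an added comparison, not part of the answer above.

```r
varnames <- c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation",
              "total dissolved nitrogen", "total nitrogen")

# Anchored pattern: the whole string must be "nitrogen"
anchored <- grepl("^nitrogen$", varnames)
# TRUE FALSE FALSE FALSE FALSE

# Word-boundary pattern: "nitrogen" only has to appear as a whole word
word_bound <- grepl("\\bnitrogen\\b", varnames, perl = TRUE)
# TRUE TRUE TRUE TRUE TRUE
```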

#3


2  

Or use fixed = TRUE if you want to match the literal string (regexlessly):

v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100, replace = TRUE, prob = c(.8, .1, .1))
grep("nitrogen", v, fixed = TRUE)
# [1]   3   4   5   6   7   8   9  11  12  13  14  16  19  20  21  22  23  24  25
# [20]  26  27  29  31  32  35  36  38  39  40  41  43  44  46  47  48  49  50  51
# [39]  52  53  54  56  57  60  61  62  65  66  67  69  70  71  72  73  74  75  76
# [58]  78  79  80  81  82  83  84  85  86  87  88  89  91  92  93  94  95  96  97
# [77]  98  99 100
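One caveat worth illustrating: fixed = TRUE turns off regex interpretation, but it still matches substrings. It picks out only "nitrogen" in v because no other element of v contains that string. Run on the varnames vector from the question, the same call matches every entry, as this sketch shows (the "a.b" example is made up for illustration):

```r
varnames <- c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation",
              "total dissolved nitrogen", "total nitrogen")

# Literal substring match: every entry contains "nitrogen"
sub_idx <- grep("nitrogen", varnames, fixed = TRUE)
# 1 2 3 4 5

# fixed = TRUE treats regex metacharacters as ordinary characters
dot_fixed <- grepl(".", c("a.b", "ab"), fixed = TRUE)  # TRUE FALSE
dot_regex <- grepl(".", c("a.b", "ab"))                # TRUE TRUE
```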

I'm not sure about the speed issues; I like to test stuff before claiming that approach A is faster than approach B. But in theory, at least from my experience, indexing/binary operators should be the fastest, so I vote for @Dason's approach. Also note that regex matching is generally slower than grepping with fixed = TRUE.

A little proof is attached below. Note that this is a lame test: system.time should be put inside replicate to get (more) accurate differences, you should take outliers into account, etc. But this one does suggest that you should use which! =)

(a0 <- system.time(replicate(1e5, grep("^nitrogen$", v))))
# user  system elapsed 
# 5.700   0.023   5.724  
(a1 <- system.time(replicate(1e5, grep("nitrogen", v, fixed = TRUE))))
# user  system elapsed 
# 1.147   0.020   1.168 
(a2 <- system.time(replicate(1e5, which(v == "nitrogen"))))
# user  system elapsed 
# 1.013   0.020   1.033 
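A rough sketch of the more careful variant mentioned above, with system.time inside replicate so that each approach gets several independent timings and the median can damp outliers. The helper name time_batches, the seed, and the batch sizes are arbitrary choices made for illustration; no timings are shown because they depend on the machine.

```r
set.seed(1)  # arbitrary seed, only so the sample is reproducible
v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100,
            replace = TRUE, prob = c(.8, .1, .1))

# Hypothetical helper: run f() n_inner times per batch and record the
# elapsed time of each of n_runs batches
time_batches <- function(f, n_runs = 10, n_inner = 1000) {
  replicate(n_runs,
            system.time(for (i in seq_len(n_inner)) f())[["elapsed"]])
}

t_regex <- time_batches(function() grep("^nitrogen$", v))
t_fixed <- time_batches(function() grep("nitrogen", v, fixed = TRUE))
t_which <- time_batches(function() which(v == "nitrogen"))

# Median elapsed time per batch for each approach
medians <- sapply(list(regex = t_regex, fixed = t_fixed, which = t_which),
                  median)
print(medians)
```

All three calls return the same indices on v, so the comparison is like for like.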
