Extracting e-mail addresses from strings using R

Date: 2022-09-13 16:02:08

These are five Twitter user descriptions. The idea is to extract the e-mail address from each string.

This is the code I've tried. It works, but there is probably a better way. I'd rather avoid unlist() and do it in one go with a regex. I've seen similar questions for Python/Perl/PHP but not for R. I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it. If it works, though, it would of course help.

ds <- c(
    "#MillonMusical | #PromotorMusical | #Diseñador | Contacto :        ezequielife@gmail.com | #Instagram : Ezeqielgram | 01-11-11 |           @_MillonMusical @flowfestar",
    "LipGLosSTudio by: SAndry RUbio           Maquilladora PRofesional estudiande de diseño profesional de maquillaje     artistico lipglosstudio@hotmail.com/",
    "Medico General Barranquillero   radicado con su familia en Buenos Aires para iniciar Especialidad       Medico Quirurgica. email jaenpavi@hotmail.com",
    "msn = rdt031169@hotmail.comskype = ronaldotorres-br",
    "Aguante piscis /       manuarias17@gmail.com  buenos aires"
    )

ds <- unlist(strsplit(ds, ' '))
ds <- ds[grep("mail.", ds)]

> print(ds)
[1] "\t\tezequielife@gmail.com"  "lipglosstudio@hotmail.com/"
[3] "jaenpavi@hotmail.com"       "rdt031169@hotmail.comskype"
[5] "/\t\tmanuarias17@gmail.com"

It would be nice to separate this one, "rdt031169@hotmail.comskype", perhaps by requiring the match to end in .com or .com.ar, which would make sense for what I'm working on.

1 solution

#1


5  

Here's one alternative:

> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "ezequielife@gmail.com"     "lipglosstudio@hotmail.com" "jaenpavi@hotmail.com"      "rdt031169@hotmail.com"    
[5] "manuarias17@gmail.com" 
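
One thing to keep in mind with this call: regexpr() reports only the first match in each string. If a description could contain more than one address, gregexpr() may be a better fit. Here is a minimal sketch on made-up input (the vector x and its addresses are hypothetical, not taken from the question):

```r
# Hypothetical input: one string with a single address, one with two
x <- c("write to ana@gmail.com today",
       "bob@hotmail.com or carl@gmail.com")

# Same idea as the pattern above, with a plain "@"
pattern <- "[[:alnum:]]+@[[:alpha:]]+\\.com"

# regexpr(): only the first match in each string
first <- regmatches(x, regexpr(pattern, x))
print(first)
# [1] "ana@gmail.com"   "bob@hotmail.com"

# gregexpr(): every match, returned as a list with one element per string
all_matches <- regmatches(x, gregexpr(pattern, x))
print(all_matches)
```

The list returned in the second case can be flattened with unlist() if a plain character vector is enough.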

Based on @Frank's comment, if you want to keep the country identifier after .com, as in your .com.ar example, try this:

> ds <- c(ds, "fulanito13@somemail.com.ar")  # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "ezequielife@gmail.com"      "lipglosstudio@hotmail.com"  "jaenpavi@hotmail.com"       "rdt031169@hotmail.com"     
[5] "manuarias17@gmail.com"      "fulanito13@somemail.com.ar"
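
Two caveats may matter on real data. First, regmatches() drops strings that contain no match, so the result can be shorter than the input and no longer line up with it. Second, [[:alnum:]]+ stops at dots and underscores, so an address like john.doe@... would come back as doe@... only. The sketch below is one possible way around both; extract_email() is a hypothetical helper name, the widened character class is an assumption about which local parts you expect, and the sample strings other than the first are made up.

```r
# Hypothetical helper: one result per input string, NA where nothing matches.
# The local part also allows ".", "_" and "-" (an assumption, not a full
# e-mail grammar); the domain part is kept as in the pattern above.
extract_email <- function(x,
                          pattern = "[[:alnum:]._-]+@[[:alpha:]]+\\.com(\\.[a-z]{2})?") {
  m <- regexpr(pattern, x)            # position of first match, -1 if none
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)    # fill in only the strings that matched
  out
}

x <- c("msn = rdt031169@hotmail.comskype = ronaldotorres-br",
       "john.doe@somemail.com.ar",
       "no address here")

res <- extract_email(x)
print(res)
# [1] "rdt031169@hotmail.com"    "john.doe@somemail.com.ar" NA
```

Because the output has the same length as the input, it can be stored directly as a new column next to the descriptions.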
