Matching Unicode emoji in a Python regular expression

I need to extract the text between a number and an emoji in a string.

example text:

blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv

output:

extract1
extract2

The regex I wrote extracts the text between two numbers. I need to change it so that the closing delimiter is a Unicode emoji character instead of a digit.

(?<=[\s][\d])(.*?)(?=[\d])

Please suggest a Python-friendly method. I need it to work with all emoji, not only the ones given in the example.

https://regex101.com/r/uT1fM0/1

3 solutions

#1

Since there are many emoji with different Unicode code points, you have to specify them explicitly in your regex, or, if they fall within a specific range, you can use a character class. Here the two symbols are \u263a (the code point of ☺; the ☺️ in your text is that character followed by the variation selector \ufe0f) and \U0001f645. Since \U0001f645 is greater than \u263a, the two can also be put in one range. Listing them explicitly:

In [70]: import re

In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv'

In [72]: regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)')

In [74]: regex.findall(s)
Out[74]: [' extract1  ', ' extract2 ']

Or, if you want to match more emoji, you can use a character range (a good reference that lists the code point ranges of different emoji is http://apps.timwhitlock.info/emoji/tables/unicode):

In [75]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [76]: regex.findall(s)
Out[76]: [' extract1  ', ' extract2 ']

Note that in the second case you have to make sure that every character within that range is an emoji you want to match. The range \u263a-\U0001f645 is very wide and also covers many non-emoji characters, such as CJK ideographs.
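The caveat about the wide range can be checked directly. This is a minimal sketch, assuming Python 3; the sample characters are picked only for illustration and are written as escapes so the snippet does not depend on the file encoding:

```python
import re

# The wide range used in the example above.
wide = re.compile(r'[\u263a-\U0001f645]')

# Illustrative sample characters (an assumption, not from the question).
samples = {
    'white smiling face': '\u263a',             # U+263A
    'face with no good gesture': '\U0001f645',  # U+1F645
    'CJK ideograph': '\u4e2d',                  # U+4E2D, not an emoji
    'Latin letter': 'a',                        # U+0061, below the range
}

# True where the range matches the character.
matches = {name: bool(wide.search(ch)) for name, ch in samples.items()}
for name, hit in matches.items():
    print(name, hit)
```

The CJK ideograph is matched as well, so text containing such characters would be cut at the wrong place.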

Here is another example:


In [77]: s = "blah 4 xzuyguhbc ???? ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 ???? bjvcvvv"

In [78]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [79]: regex.findall(s)
Out[79]: [' xzuyguhbc ', ' extract1  ', ' extract2 ']
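One way to narrow the range is to list a few emoji blocks instead of one span. The following is only a sketch, assuming Python 3: the name EMOJI_CLASS and the choice of blocks are my own assumptions, and the list is not exhaustive, so it will miss emoji outside these blocks.

```python
import re

# Assumed, non-exhaustive set of Unicode blocks that contain emoji.
EMOJI_CLASS = (
    "["
    "\u2600-\u27bf"          # Miscellaneous Symbols, Dingbats
    "\U0001f300-\U0001f5ff"  # Miscellaneous Symbols and Pictographs
    "\U0001f600-\U0001f64f"  # Emoticons
    "\U0001f680-\U0001f6ff"  # Transport and Map Symbols
    "\U0001f900-\U0001f9ff"  # Supplemental Symbols and Pictographs
    "]"
)

# A number, optional whitespace, the captured text, optional whitespace, an emoji.
regex = re.compile(r"\d+\s*(.*?)\s*" + EMOJI_CLASS)

# The example string, with the two symbols written as escapes.
s = ("blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f "
     "jbjhcb 6 extract2 \U0001f645 bjvcvvv")

print(regex.findall(s))  # ['extract1', 'extract2']
```

The extra \s* on both sides of the group trims the surrounding spaces that the earlier patterns keep in their results.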

#2

Here's my stab at a solution; I'm not sure it will work in all circumstances. The trick is to convert all Unicode emoji into plain ASCII escape sequences, which can be done by following this post. You can then match the emoji like any other text. Note that it won't work if the literal strings \u or \U appear in the text you are searching.

Example: copy your string into a file, let's call it emo. In a terminal:

Chip chip@ 03:24:33@ ~: cat emo | python *.py
blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv\n
------------------------
[' extract1  ', ' extract2 ']

where the *.py file is:

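import fileinput
import re

# Python 3: reads stdin or the files named on the command line, line by line.
for line in fileinput.input():
    # Turn every non-ASCII character into literal \xNN, \uXXXX or \UXXXXXXXX text.
    teststring = line.encode('unicode-escape').decode('ascii')
    print(teststring)
    print("------------------------")
    # Capture the text after "whitespace + digit" up to the next \u or \U escape.
    m = re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', teststring)
    print(m)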
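The same idea can be tried on an in-memory string, which also shows a limit of the approach. This is a sketch under the assumption of Python 3; extract_before_escape is a hypothetical helper name, not part of the answer above.

```python
import re


def extract_before_escape(text):
    """Return the text between 'whitespace + digit' and the next \\u or \\U escape."""
    # Non-ASCII characters become literal \xNN, \uXXXX or \UXXXXXXXX sequences.
    escaped = text.encode('unicode-escape').decode('ascii')
    return re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', escaped)


sample = ("blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f "
          "jbjhcb 6 extract2 \U0001f645 bjvcvvv")
print(extract_before_escape(sample))  # [' extract1  ', ' extract2 ']

# Any character escaped as \uXXXX or \UXXXXXXXX ends the capture, emoji or not:
print(extract_before_escape("id 7 price \u4e2d"))  # [' price ']
```

So this variant treats every character above U+00FF as a delimiter, which may or may not be what is wanted.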

#3

This may or may not work, depending on your needs. If you know the emoji ahead of time, though, it will probably work; you just need a list of the emoji to expect.

Anyway, without more information, this is what I'd do.

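#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

# Capture everything after a digit up to (not including) one of the known emoji.
# Inside [...] the characters are simply listed; '|' and a second '^' would be literals.
my_regex = re.compile(r'\d\s*([^☺️🙅]+)')

string = "blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"

m = my_regex.findall(string)
if m:
    print(m)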
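If the expected emoji are kept in a list, the pattern can be built from that list instead of being typed by hand. A minimal sketch, assuming Python 3; KNOWN_EMOJI and its two entries are hypothetical and stand in for whatever list you actually have:

```python
import re

# Hypothetical list of the emoji to expect, written as escapes.
KNOWN_EMOJI = ["\u263a\ufe0f", "\U0001f645"]

# Build an alternation, longest entries first so multi-code-point emoji win.
delimiters = "|".join(
    re.escape(e) for e in sorted(KNOWN_EMOJI, key=len, reverse=True)
)

# A digit, optional whitespace, the captured text, optional whitespace, a known emoji.
pattern = re.compile(r"\d\s*(.*?)\s*(?:" + delimiters + ")")

text = ("blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f "
        "jbjhcb 6 extract2 \U0001f645 bjvcvvv")

print(pattern.findall(text))  # ['extract1', 'extract2']
```

An alternation is used here rather than a negated character class because some emoji, such as ☺️, consist of more than one code point.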
