scikit-learn: do not split hyphenated words during tokenization

Time: 2022-02-18 20:27:04

I am using the CountVectorizer and don't want to separate hyphenated words into different tokens. I have tried passing different regex patterns into the token_pattern argument, but haven't been able to get the desired result.

Here's what I have tried:

from sklearn.feature_extraction.text import CountVectorizer

pattern = r''' (?x)         # set flag to allow verbose regexps
([A-Z]\.)+          # abbreviations (e.g. U.S.A.)
| \w+(-\w+)*        # words with optional internal hyphens
| \$?\d+(\.\d+)?%?  # currency & percentages
| \.\.\.            # ellipses '''

text = 'I hate traffic-ridden streets.'
vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)

I have also tried to use nltk's regexp_tokenize as suggested in an earlier question, but its behaviour seems to have changed as well.

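A minimal sketch (not from the original question) of how nltk's regexp_tokenize can be called with the same kind of pattern: regexp_tokenize relies on re.findall under the hood, so the capturing groups used above change what it returns, and rewriting them as non-capturing (?:...) groups keeps whole tokens.

from nltk.tokenize import regexp_tokenize

nltk_pattern = r'''(?x)          # verbose mode: whitespace and comments inside the pattern are ignored
      (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*               # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?         # currency and percentages
    | \.\.\.                     # ellipsis
'''

print(regexp_tokenize('I hate traffic-ridden streets.', nltk_pattern))
# expected: ['I', 'hate', 'traffic-ridden', 'streets']  -- the trailing '.' matches no alternative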

1 solution

#1

There are a couple of things to note. The first is that adding all of those spaces, line breaks and comments to your pattern string makes all of those characters part of your regular expression. See here:

>>> import re
>>> re.match("[0-9]", "3")
<_sre.SRE_Match object at 0x104caa920>
>>> re.match("[0-9] #a", "3")            # no output: returns None because " #a" is part of the pattern
>>> re.match("[0-9] #a", "3 #a")
<_sre.SRE_Match object at 0x104caa718>

The second is that you need to escape special sequences when constructing your regex pattern within a string. For example, pattern = "\w" really needs to be pattern = "\\w". Once you account for those things, you should be able to write the regex for your desired tokenizer. For example, if you just want to add in hyphens, something like this will work:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> pattern = "(?u)\\b[\\w-]+\\b"
>>> 
>>> text = 'I hate traffic-ridden streets.'
>>> vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
>>> analyze = vectorizer.build_analyzer()
>>> analyze(text)
[u'hate', u'traffic-ridden', u'streets']
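
A hedged follow-up sketch, not part of the original answer: CountVectorizer also accepts a callable through its tokenizer parameter, which sidesteps the string-escaping issue because the regex is compiled up front; the names below are illustrative.

import re
from sklearn.feature_extraction.text import CountVectorizer

# Pre-compiled regex; \b[\w-]+\b keeps internal hyphens inside a single token.
token_re = re.compile(r"(?u)\b[\w-]+\b")

# A callable tokenizer replaces token_pattern (newer scikit-learn versions may
# warn that the unused default token_pattern is ignored); stop-word filtering
# is still applied after tokenization.
vectorizer = CountVectorizer(stop_words='english', tokenizer=token_re.findall)
analyze = vectorizer.build_analyzer()
print(analyze('I hate traffic-ridden streets.'))
# expected: ['hate', 'traffic-ridden', 'streets']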
