How to speed up NE recognition with Stanford NER using python nltk

Time: 2022-10-18 18:13:48

First I tokenize the file content into sentences and then call Stanford NER on each of the sentences. But this process is really slow. I know it would be faster if I called it on the whole file content, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER

This is probably due to calling st.tag() for each sentence, but is there any way to make it run faster?


EDIT

The reason I want to tag sentences separately is that I want to write the sentences to a file (like a sentence index), so that given the NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also doing lemmatizing here).

file format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)

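For example, one hypothetical way to write such a line per sentence (tab-separated; `indexed_sentences` is assumed to be a list of (original, NE-and-lemmatized) pairs collected in the tagging loop, and the file name is made up for illustration):

with open("sentence_index.tsv", "w") as out:
    for sent_number, (orig_sentence, ne_lemmatized) in enumerate(indexed_sentences):
        # one line per sentence: index, original text, processed text
        out.write("%d\t%s\t%s\n" % (sent_number, orig_sentence, ne_lemmatized))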

3 Answers

#1 (8 votes)

StanfordNERTagger provides a tag_sents() function, see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent) for filename in filelist for sent in sent_tokenize(open(filename).read())]
>>> st.tag_sents(tokenized_sents)
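If you still need the per-sentence index, a minimal sketch (assuming filelist holds paths to readable text files and reusing the st tagger from above) is to collect all sentences first, make a single tag_sents() call, and then zip the results back to their positions:

from nltk.tokenize import sent_tokenize, word_tokenize

all_sentences = []  # original sentences from all files, in order
for filename in filelist:
    with open(filename) as f:
        all_sentences.extend(sent_tokenize(f.read()))

tokenized_sents = [word_tokenize(sent) for sent in all_sentences]
tagged_sents = st.tag_sents(tokenized_sents)  # one batched call instead of one call per sentence

for j, (orig_sentence, ne_tags) in enumerate(zip(all_sentences, tagged_sents)):
    # j is the sentence index, orig_sentence the raw text, ne_tags its (token, tag) pairs
    pass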

#2 (4 votes)

You can use the Stanford NER server. The speed will be much faster.

Install sner:

pip install sner

Run the NER server:

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost', port=9199)  # connect to the NERServer started above
print(tagger.get_entities(test_string))

The result of this code is:

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]
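Because the model stays loaded in the server, calling the tagger once per sentence is now cheap. A small sketch of the original per-sentence loop against the server (assuming filecontent has already been read, and reusing NLTK's sent_tokenize):

from nltk.tokenize import sent_tokenize
from sner import Ner

tagger = Ner(host='localhost', port=9199)
for j, sent in enumerate(sent_tokenize(filecontent)):
    ne_tags = tagger.get_entities(sent)  # list of (token, tag) pairs returned by the server
    print(j, ne_tags)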

For more detail, see https://github.com/caihaoyu/sner

#3 (1 vote)

First download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml


Let's say you put the download at /User/username/stanford-corenlp-full-2015-04-20.

This Python code will run the pipeline:


import os

stanford_distribution_dir = "/User/username/stanford-corenlp-full-2015-04-20"
list_of_sentences_path = "/Users/username/list_of_sentences.txt"
stanford_command = "cd %s ; java -Xmx2g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -filelist %s -outputFormat json" % (stanford_distribution_dir, list_of_sentences_path)
os.system(stanford_command)

Here is some sample Python code for loading in a .json file for reference:


import json
sample_json = json.load(open("sample_file.txt.json"))

At this point sample_json will be a nice dictionary with all the sentences from the file in it.


for sentence in sample_json["sentences"]:
  tokens = []
  ner_tags = []
  for token in sentence["tokens"]:
    tokens.append(token["word"])
    ner_tags.append(token["ner"])
  print (tokens, ner_tags)

list_of_sentences.txt should be your list of files with sentences, something like:


input_file_1.txt
input_file_2.txt
...
input_file_100.txt
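Because of -ssplit.eolonly, each of those input files must contain one sentence per line. A minimal sketch for producing them (and list_of_sentences.txt) with NLTK's sent_tokenize; the file names are made up for illustration, and absolute paths are used because the Java command changes directory first:

import os
from nltk.tokenize import sent_tokenize

input_paths = []
for i, filename in enumerate(filelist, start=1):
    with open(filename) as f:
        sentences = sent_tokenize(f.read())
    out_path = os.path.abspath("input_file_%d.txt" % i)
    with open(out_path, "w") as out:
        out.write("\n".join(sentences))  # one sentence per line
    input_paths.append(out_path)

with open("/Users/username/list_of_sentences.txt", "w") as out:
    out.write("\n".join(input_paths))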

So each input file (which should have one sentence per line) will generate input_file.txt.json once the Java command is run, and that .json file will contain the NER tags. You can just load the .json for each input file and easily get (sentence, NER tag sequence) pairs. You can experiment with "text" as an alternative output format if you like that better, but "json" will create a .json file that you can load with json.load(...), and then you'll have a dictionary you can use to access the sentences and annotations.
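To tie this back to the indexing requirement from the question, here is a sketch that collects (sent_number, sentence, NER tag sequence) across all files; it assumes the input_paths list prepared above, that the .json outputs are readable from the current directory (you may need to adjust the paths depending on where CoreNLP writes them), and sentence_index.tsv is a made-up name:

import json

sent_number = 0
with open("sentence_index.tsv", "w") as out:
    for path in input_paths:  # the one-sentence-per-line files prepared above
        with open(path + ".json") as f:
            doc = json.load(f)
        for sentence in doc["sentences"]:
            tokens = [t["word"] for t in sentence["tokens"]]
            ner_tags = [t["ner"] for t in sentence["tokens"]]
            out.write("%d\t%s\t%s\n" % (sent_number, " ".join(tokens), " ".join(ner_tags)))
            sent_number += 1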

This way you'll only load the pipeline once for all the files.

