How to extract numbers written as words from a string

Does anyone have an idea where to start? For example, extract "two" from "I have two apples". I'm looking in the direction of NLP or QDA. Any leads on how to go about it would be appreciated.

2 solutions

#1

You might be interested in the Stanford NER system. It identifies numeric entities.

You can try it here: http://nlp.stanford.edu:8080/corenlp/

#2

What about this:

(((?:sixty|seventy|eighty|ninety|fourteen|sixteen|seventeen|eighteen|nineteen|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|twenty|thirty|forty|fifty|hundred|thousand|million|billion|trillion|and)[,   -]*)+)

The words have to be out of numeric order because a regex alternation takes the first alternative that matches, not the longest one. So sixty needs to come before six, fourteen before four, etc.
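
To see why the order matters, here is a minimal sketch in Python (my choice of language; the answer only gives the pattern itself). The two shortened patterns and the names WRONG_ORDER and RIGHT_ORDER are made up for illustration and are not part of the regex above.

```python
import re

# Hypothetical, shortened alternations used only to show the ordering effect.
WRONG_ORDER = re.compile(r"(?:six|sixty|four|fourteen)")  # short words first
RIGHT_ORDER = re.compile(r"(?:sixty|six|fourteen|four)")  # long words first

text = "sixty fourteen"

# With the short words first, the engine stops at "six" and "four"
# and never tries the longer alternatives.
wrong = WRONG_ORDER.findall(text)

# With the long words first, the full words are matched.
right = RIGHT_ORDER.findall(text)

print(wrong)  # ['six', 'four']
print(right)  # ['sixty', 'fourteen']
```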

Demo: Regexr

This regex may work better; it manages to ignore the trailing space:

((\b(?:fourty|sixty|seventy|eighty|ninety|fourteen|sixteen|seventeen|eighteen|nineteen|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fifteen|twenty|thirty|forty|fifty|hundred|thousand|million|billion|trillion|and)\b[ ,    -]*)+(?=\W|$)+)

Also, you'll notice fourty at the beginning of this regex. It's a really common misspelling of forty, so I thought it might be useful to you. You can remove it if you like.
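
For anyone who wants to try the second pattern from code, here is a rough Python sketch. It is a lightly tidied version rather than the exact regex above: I'm assuming the repeated spaces inside the character class are just copy/paste residue, I dropped the quantifier on the lookahead since repeating a lookahead changes nothing, and I added case-insensitive matching. The names NUMBER_WORDS, PATTERN and extract_number_words are hypothetical.

```python
import re

# Same word list as the second regex above (including the "fourty" misspelling).
NUMBER_WORDS = (
    "fourty|sixty|seventy|eighty|ninety|fourteen|sixteen|seventeen|eighteen|"
    "nineteen|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    "thirteen|fifteen|twenty|thirty|forty|fifty|hundred|thousand|million|"
    "billion|trillion|and"
)

# One or more number words separated by spaces, commas or hyphens.
# The lookahead makes the match end right after the last word, so a
# trailing separator is not included in the result.
PATTERN = re.compile(
    r"(?:\b(?:" + NUMBER_WORDS + r")\b[ ,-]*)+(?=\W|$)",
    re.IGNORECASE,
)

def extract_number_words(text):
    """Return every run of number words found in text."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(extract_number_words("I have two apples"))
# ['two']
print(extract_number_words("She paid one hundred and twenty-five dollars."))
# ['one hundred and twenty-five']
```

One thing to watch out for: because "and" is in the word list, a plain "and" in ordinary text (as in "apples and pears") is also returned as a match, so you may want to filter those out afterwards.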
