Relation extraction through chunking with NLTK

Date: 2023-02-08 23:47:01

I am trying to figure out how to use NLTK's cascading chunker as per Chapter 7 of the NLTK book. Unfortunately, I'm running into a few issues when performing non-trivial chunking measures.


Let's start with this phrase:


"adventure movies between 2000 and 2015 featuring performances by daniel craig"


I am able to find all the relevant NPs when I use the following grammar:


grammar = "NP: {<DT>?<JJ>*<NN.*>+}"


However, I am not sure how to build nested structures with NLTK. The book gives the following format, but there are clearly a few things missing (e.g. How does one actually specify multiple rules?):


grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

In my case, I'd like to do something like the following:


grammar = r"""
          MEDIA: {<DT>?<JJ>*<NN.*>+}
          RELATION: {<V.*>}{<DT>?<JJ>*<NN.*>+}
          ENTITY: {<NN.*>}
          """

Assuming that I'd like to use a cascaded chunker for my task, what syntax would I need to use? Additionally, is it possible for me to specify specific words (e.g. "directed" or "acted") when using a chunker?


1 answer

#1


3  

I can't comment on the relationship extraction part, not least because you don't give any details on what you want to do and what kind of data you have. So this is a rather partial answer.


a.) How does cascading chunking work in NLTK?

b.) Is it possible to treat the chunker like a context-free grammar, and if so, how?


As I understand the section "Building nested structure with cascaded chunkers" in the NLTK book, you can use it with a context-free grammar, but you have to apply it repeatedly to get the recursive structure. Each chunking pass is flat, but you can add chunks on top of chunks.
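To make "chunks on top of chunks" concrete, here is a minimal sketch using the phrase from the question. The POS tags are hand-assigned assumptions (a real tagger may well tag the lowercase name differently), and the two-stage grammar is only an illustration, not a recommended grammar for the task. Each labelled stage runs in order, so the `PP` stage can refer to the `NP` chunks built by the stage before it.

```python
import nltk

# Hand-tagged input (assumed tags; no tagger model is needed to run this).
tagged = [
    ("adventure", "NN"), ("movies", "NNS"), ("between", "IN"),
    ("2000", "CD"), ("and", "CC"), ("2015", "CD"),
    ("featuring", "VBG"), ("performances", "NNS"),
    ("by", "IN"), ("daniel", "NN"), ("craig", "NN"),
]

# Two cascaded stages. Stage 2 matches the <NP> chunks produced by stage 1.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # stage 1, flat noun phrases
  PP: {<IN><NP>}            # stage 2, preposition followed by an NP chunk
"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)


def chunk_texts(tree, label):
    """Return the words of every subtree carrying the given label."""
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == label)
    ]


np_chunks = chunk_texts(tree, "NP")
pp_chunks = chunk_texts(tree, "PP")
print(tree)
print(np_chunks)
print(pp_chunks)
```

If a later stage produces something an earlier stage should consume (as with `CLAUSE` inside `VP` in the book's grammar), `RegexpParser` also accepts a `loop` argument, e.g. `nltk.RegexpParser(grammar, loop=2)`, which runs the whole cascade more than once.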


c.) How can I use chunking to perform relation extraction?


I can't really speak to that, and anyway, as I said, you don't give any specifics; but if you're dealing with real text, my understanding is that hand-written rulesets for any task are useless unless you have a large team and a lot of time. Look into the probabilistic tools that come with the NLTK. It'll be a whole lot easier if you have an annotated training corpus.


Anyway, a couple more comments about the RegexpParser.


  1. You'll find a lot more usage examples at http://www.nltk.org/howto/chunk.html. (Unfortunately it's not a real how-to, but a test suite.)


  2. According to the examples on that page, you can specify multiple expansion rules like this:


    patterns = r"""NP: {<DT|PP\$>?<JJ>*<NN>}
        {<NNP>+}
        {<NN>+}
    """
    

    I should add that grammars can have multiple rules with the same left side. That should add some flexibility with grouping related rules, etc.
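The multi-rule form above can be tried on a small hand-tagged sentence. This is only a sketch: the sentence and its tags are made up for illustration, and the grammar is written as a raw string so that `\$` reaches the chunker unchanged. The rules under one label are applied in the order given, and tokens already inside a chunk are not chunked again.

```python
import nltk

# One label with several rules, tried in order within the same stage.
patterns = r"""NP: {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""

# Hypothetical example sentence with assumed POS tags.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("cat", "NN"),
    ("chased", "VBD"), ("John", "NNP"), ("Smith", "NNP"),
]

tree = nltk.RegexpParser(patterns).parse(tagged)

# Collect the words of each NP chunk, in document order.
np_chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(tree)
print(np_chunks)
```

Here the first rule picks up "the little cat" and the second picks up the proper-noun sequence; the third rule finds nothing left to chunk in this sentence.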

