Handling large structured data sets

Time: 2022-10-28 22:00:00

What I'm asking for is a methodology rather than a concrete solution. I will start by describing the situation I found challenging, and will then proceed to the question. Hope it makes more sense to do it this way.

I'm dealing with data extracted from natural language. This data must later be analysed against a sort of "knowledge base" (it's quoted because it's not really a knowledge base, I'll get to it later). The knowledge base is large; its volume, so far only theoretically but soon practically, will surpass what is possible to store in memory. My two concerns are these:

  • Moving the data over to a database server will mean a slowdown by some factor... well, I don't know what factor, but it could easily be several orders of magnitude. I.e. finding a piece of data in a native runtime object located in memory is significantly faster than querying the database.

  • The entire huge volume of data is not required at any one time. In fact, only a very small fraction is used, so perhaps some caching could help with the problem. I'm actually hoping that someone has already faced this problem and that caching was the right answer.

The "knowledge base" is so far just a complex data structure, which can be queried about in a similar way you would query a database by using some query language. I.e. it is not a simple lookup value by key operation, it requires multiple sub-queries to identify an object as matching the given criteria.

Just to give you a more concrete example of what I'm trying to do. Unlike langutils, I'm trying to come up with a parser, which I'm calling a "predictive parser", sorry if the term is already taken and means something else :) The main idea is that instead of assigning POS tags to words and then iteratively correcting the original assumption by applying a set of rules to the inferred information, I'm trying to do it in a way that, given a certain prefix, the engine would generate a continuation based on its "learned knowledge". I.e. suppose the knowledge base learned that the prefix "I could " is almost certainly followed by a verb phrase. The parser would then assume a verb phrase and parse it as such, unless it hits an error. The difficult part is finding the proper prefix. The bad thing is that prefixes like "I will " and "Thou shalt" will get equal priority, i.e. they would be checked for a match in the same order, whether random, alphabetical, etc. The idea is though that during the knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes would be looked up first, and the least likely prefixes would not even be loaded initially.

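A minimal sketch of what I have in mind (Python, purely illustrative; the `PrefixTable` name and the "VP"/"NP" labels are made up, not part of any existing library):

```python
from collections import Counter, defaultdict

class PrefixTable:
    """Hypothetical store of learned prefix -> continuation statistics."""

    def __init__(self):
        # prefix (tuple of words) -> Counter of observed continuation categories
        self.stats = defaultdict(Counter)

    def learn(self, prefix, continuation):
        """Record that `prefix` was observed to be followed by `continuation`."""
        self.stats[tuple(prefix)][continuation] += 1

    def predict(self, prefix):
        """Return continuation categories for `prefix`, most frequent first,
        so the parser tries the most likely hypothesis before the others."""
        counts = self.stats.get(tuple(prefix))
        if not counts:
            return []
        return [cat for cat, _ in counts.most_common()]

table = PrefixTable()
table.learn(["I", "could"], "VP")
table.learn(["I", "could"], "VP")
table.learn(["I", "could"], "NP")
print(table.predict(["I", "could"]))   # ['VP', 'NP'] -> try a verb phrase first
```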

The concept is somewhat similar to how a CPU cache works. So, if what I wrote is too long: I'm looking for a data structure that functions like a CPU cache, where what's currently cached resides in memory, and what isn't is stored in a database, in a file, etc.

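Something along these lines, roughly (a minimal Python sketch; the standard `shelve` module stands in for whatever database or file ends up holding the full data):

```python
import shelve
from collections import OrderedDict

class CachedStore:
    """Small in-memory LRU cache in front of a disk-backed key-value store."""

    def __init__(self, path, capacity=10000):
        self.backing = shelve.open(path)      # stand-in for a real database
        self.cache = OrderedDict()            # key -> value, kept in LRU order
        self.capacity = capacity

    def get(self, key):
        if key in self.cache:                 # cache hit: refresh LRU position
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.backing[key]             # cache miss: go to disk (KeyError if absent)
        self.cache[key] = value
        if len(self.cache) > self.capacity:   # evict the least recently used entry
            self.cache.popitem(last=False)
        return value

    def put(self, key, value):
        self.backing[key] = value
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def close(self):
        self.backing.close()
```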

PS. Sorry for my collection of tags. I feel like it doesn't really describe my question. You are welcome to adjust it if you know where the question belongs.

1 solution

#1

If we just consider this part:

The idea is though that during the knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes would be looked up first, and the least likely prefixes would not even be loaded initially.

then, if I understood you correctly, you're dealing with the task of handling n-grams. Since in your situation you're not putting any explicit limits on the prefixes, it can be assumed that the generally reasonable limits apply, and those are 4-5-word n-grams. There are a lot of such n-grams: from a real-world corpus you'd easily get gigabytes of data. But even if you limit yourself to only 3-grams, you'll still get at least a couple of gigabytes, unless you perform some clever pre-processing that somehow separates out the "good" n-grams. (Coupled with proper smoothing, this may be a feasible solution.)

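For reference, counting the n-grams themselves is trivial; the size is the only problem. A minimal sketch (Python, assuming a pre-tokenised corpus; a real corpus would be streamed rather than held in memory):

```python
from collections import Counter

def count_ngrams(tokens, n=3):
    """Count all n-grams (as tuples of words) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "i could do it if i could find the time".split()
counts = count_ngrams(tokens, n=3)
print(counts.most_common(3))   # the most frequent 3-grams and their counts
```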

The bad news about n-grams, besides their size, is that they are distributed according to Zipf's law, which basically means that caching won't be very useful.

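A back-of-the-envelope way to see this (an approximation, assuming a Zipf exponent of about 1, where the share of lookups covered by the k most frequent of N items is roughly H(k)/H(N)):

```python
from math import log

def zipf_coverage(cached, total):
    """Approximate share of lookups served from cache when item popularity
    follows Zipf's law with exponent 1: H(cached) / H(total), H(n) ~ ln(n) + 0.5772."""
    harmonic = lambda n: log(n) + 0.5772156649
    return harmonic(cached) / harmonic(total)

# Caching 1% of 100 million n-grams still leaves roughly a quarter of lookups as misses.
print(zipf_coverage(1_000_000, 100_000_000))   # ~0.76
```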

So, I'd just put the data into some fast database on the local machine (maybe some variant of dbm). If you can fit it all in memory, Memcached or Redis may be faster.

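For example, with Python's standard dbm module (just a sketch; values come back as bytes, and a real setup might use gdbm, LMDB, SQLite or Redis instead):

```python
import dbm

# Build the on-disk prefix table once, during knowledge acquisition.
with dbm.open("ngrams.db", "c") as db:
    db["i could"] = "VP:42,NP:3"         # prefix -> encoded continuation counts

# Later, look prefixes up without loading the whole table into memory.
with dbm.open("ngrams.db", "r") as db:
    if "i could" in db:
        print(db["i could"].decode())    # values are stored as bytes: "VP:42,NP:3"
```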
