Map-Reduce查询计数标签

I have a database of documents which are tagged with keywords. I am trying to find (and then count) the unique tags which are used alongside each other. So for any given tag, I want to know what tags have been used alongside that tag.

我有一个用关键字标记的文档数据库。我试图找到(然后计算)彼此并排使用的唯一标签。因此,对于任何给定的标签,我想知道哪些标签与该标签一起使用。

For example, if I had one document which had the tags [fruit, apple, plant] then when I query [apple] I should get [fruit, plant]. If another document has tags [apple, banana] then my query for [apple] would give me [fruit, plant, banana] instead.

例如,如果我有一个带有标签[水果,苹果,植物]的文件,那么当我查询[苹果]时,我应该得到[水果,植物]。如果另一个文件有标签[苹果,香蕉],那么我对[苹果]的查询会给我[水果,植物,香蕉]。

This is my map function which emits all the tags and their neighbours:

这是我的map函数,它会发出所有标签及其邻居:

function(doc) {
  if(doc.tags) {
    doc.tags.forEach(function(tag1) {
      doc.tags.forEach(function(tag2) {
        emit(tag1, tag2);
      });
    });
  }
}

So in my example above, it would emit

所以在我上面的例子中,它会发出

apple -- fruit
apple -- plant
apple -- banana
fruit -- apple
fruit -- plant
...

My question is: what should my reduce function be? The reduce function should essentially filter out the duplicates and group them all together.

我的问题是:我的减少功能应该是什么? reduce函数应该基本上过滤掉重复项并将它们组合在一起。

I have tried a number of different attempts, but my database server (CouchDB) keeps giving me a Error: reduce_overflow_error. Reduce output must shrink more rapidly.

我尝试了许多不同的尝试,但我的数据库服务器(CouchDB)一直给我一个错误:reduce_overflow_error。减少产量必须更快地缩小。

EDIT: I've found something that seems to work, but I'm not sure why. I see there is an optional "rereduce" parameter to the reduce function call. If I ignore these special cases, then it stops throwing reduce_overflow_errors. Can anyone explain why? And also, should I just be ignoring these, or will this bite me in the ass later?

编辑:我发现似乎有用的东西,但我不确定为什么。我看到reduce函数调用有一个可选的“rereduce”参数。如果我忽略这些特殊情况,那么它会停止抛出reduce_overflow_errors。有人可以解释为什么吗而且,我应该忽略这些,还是稍后会咬我的屁股?

function(keys, values, rereduce) {
  if(rereduce) return null; // Throws error without this.

  var a = [];
  values.forEach(function(tag) {
    if(a.indexOf(tag) < 0) a.push(tag);
  });
  return a;
}

2 个解决方案

#1

Your answer is nice, and as I said in the comments, if it works for you, that's all you should care about. Here is an alternative implementation in case you ever bump into performance problems.

你的答案很好,正如我在评论中所说,如果它适合你,那就是你应该关心的。如果您遇到性能问题,这是一个替代实现。

CouchDB likes tall lists, not fat lists. Instead of view rows keeping an array with every previous tag ever seen, this solution keeps the "sibling" tags in the key of the view rows, and then group them together to guarantee one unique sibling tag per row. Every row is just two tags, but there could be thousands or millions of rows: a tall list, which CouchDB prefers.

CouchDB喜欢高级列表,而不是胖列表。此解决方案将视图行中的“sibling”标记保留在视图行的键中,然后将它们组合在一起以保证每行一个唯一的兄弟标记,而不是查看行保持数组与之前看到的每个标记。每行只有两个标签,但可能有数千或数百万行:CouchDB更喜欢的高列表。

The main idea is to emit a 2-array of tag pairs. Suppose we have one doc, tagged fruit, apple, plant.

主要思想是发出2对标签对。假设我们有一个doc,标记水果,苹果,植物。

// Pseudo-code visualization of view rows (before reduce)
// Key         , Value
[apple, fruit ], 1
[apple, plant ], 1 // Basically this is every combination of 2 tags in the set.
[fruit, apple ], 1
[fruit, plant ], 1
[plant, apple ], 1
[plant, fruit ], 1

Next I tag something apple, banana.

接下来我标记苹果,香蕉。

// Pseudo-code visualization of view rows (before reduce)
// Key         , Value
[apple, banana], 1 // This is from my new doc
[apple, fruit ], 1
[apple, plant ], 1 // This is also from my new doc
[banana, apple], 1
[fruit, apple ], 1
[fruit, plant ], 1
[plant, apple ], 1
[plant, fruit ], 1

Why is the value always 1? Because I can make a very simple built-in reduce function: _sum to tell me the count of all tag pairs. Next, query with ?group_level=2 and CouchDB will give you unique pairs, with a count of their total.

为什么值总是1?因为我可以创建一个非常简单的内置reduce函数:_sum告诉我所有标记对的计数。接下来,使用?group_level = 2和CouchDB进行查询将为您提供唯一的对,并计算其总数。

A map function to produce this kind of view might look like this:

生成此类视图的map函数可能如下所示:

function(doc) {
  // Emit "sibling" tags, keyed on tag pairs.
  var tags = doc.tags || []
  tags.forEach(function(tag1) {
    tags.forEach(function(tag2) {
      if(tag1 != tag2)
        emit([tag1, tag2], 1)
    })
  })
}

#2

I have found a correct solution I am much happier with. The trick was that CouchDB must be set to reduce_limit = false so that it stops checking its heuristic against your query.

我找到了一个正确的解决方案,我感到非常高兴。诀窍是必须将CouchDB设置为reduce_limit = false,以便它停止针对您的查询检查其启发式。

You can set this via Futon on http://localhost:5984/_utils/config.html under the query_server_config settings, by double clicking on the value.

您可以通过在http:// localhost:5984 / _utils / config.html下在query_server_config设置下通过双击该值来设置此项。

Once that's done, here is my new map function which works better with the "re-reducing" part of the reduce function:

完成后,这是我的新map函数,它与reduce函数的“re-reduced”部分一起使用效果更好:

function(doc) {
  if(doc.tags) {
    doc.tags.forEach(function(tag1) {
      doc.tags.forEach(function(tag2) {
        emit(tag1, [tag2]); // Array with single value
      });
    });
  }
}

And here is the reduce function:

这是reduce函数:

function(keys, values) {
  var a = [];
  values.forEach(function(tags) {
    tags.forEach(function(tag) {
      if(a.indexOf(tag) < 0) a.push(tag);
    });
  });
  return a;
}

Hope this helps someone!

希望这有助于某人!

#1