MongoDB map/reduce array aggregation question

Date: 2021-07-31 16:30:51

I have a MongoDB collection whose documents use several levels of nesting, and I would like to extract from them a multidimensional array compiled from a subset of their fields. I have a solution that works for me right now, but I want to better understand this concept of 'idempotency' and its consequences for the reduce function.

Here is an example document:

{
  "host_name" : "gateway",
  "service_description" : "PING",
  "last_update" : 1305777787,
  "performance_object" : [
    [ "rta", 0.105, "ms", 100, 500, 0 ],
    [ "pl", 0, "%", 20, 60, 0 ]
  ]
}

And here are the map/reduce functions:

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      time: ts, 
      value: P[1] 
    } );
  });
}
var R = function(key,values) {
  var result = { 
    time: [], 
    value: [] 
  };
  values.forEach(function(V){
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Data is returned in a useful structure, which I reformat/sort with finalize for graphing.

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "time" : [
      [ 1306719302, 1306719601, 1306719903, ... ],
      [ 1306736404, 1306736703, 1306737002, ... ],
      [ 1306766401, 1306766701, 1306767001, ... ]
    ],
    "value" : [
      [ 122, 23423, 25654, ... ],
      [ 336114, 342511, 349067, ... ],
      [ 551196, 551196, 551196, ... ]
    ]
  }
}

Finally...

 [ [1306719302,122], [1306719601,23423], [1306719903,25654], ... ]
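
Roughly, the finalize step that turns the reduce output above into these pairs looks like the following (a simplified sketch rather than my exact code; it assumes every key has gone through at least one reduce call, so time and value already hold arrays of chunks):

var F = function(key, value) {
  // Flatten the 25-50 item chunks back into single flat arrays
  var times = [].concat.apply([], value.time);
  var vals  = [].concat.apply([], value.value);
  // Interleave as [time, value] pairs and sort by timestamp
  var pairs = [];
  for (var i = 0; i < times.length; i++) {
    pairs.push([ times[i], vals[i] ]);
  }
  return pairs.sort(function(a, b) { return a[0] - b[0]; });
};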

TL;DR: What is the expected behavior with the observed "chunking" of the array results?

I understand that the reduce function may be called multiple times on array(s) of emitted values, which is why there are several "chunks" of the complete arrays, rather than a single array. The array chunks are typically 25-50 items and it's easy enough to clean this up in finalize(). I concat() the arrays, interleave them as [time,value] and sort. But what I really want to know is if this can get more complex:

1) Is the observed chunking caused by my code, MongoDB's implementation, or the map/reduce algorithm itself?

2) Will there ever be deeper (recursive) nesting of array chunks in sharded configurations or even just because of my hasty implementation? This would break the concat() method.

3) Is there simply a better strategy for getting array results as shown above?

EDIT: Modified to emit arrays:

I took Thomas's advice and rewrote it to emit arrays. It absolutely makes no sense to split up the values.

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      value: [ ts, P[1] ] 
    } );
  });
}
var R = function(key,values) {
  var result = {
    value: [] 
  };
  values.forEach(function(V){
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Now the output is similar to this:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "value" : [
      [ [1306736404,336114],[1306736703,342511],[1306737002,349067], ... ],
      [ [1306766401,551196],[1306766701,551196],[1306767001,551196], ... ],
      [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
    ]
  }
}

And I used this finalize function to concatenate the array chunks and sort them.

...
var F = function(key,values) {
  // Flatten the per-chunk arrays into one array of [time, value] pairs
  // (Array.concat as a static method is a non-standard SpiderMonkey extension),
  // then sort by timestamp.
  return (Array.concat.apply([],values.value)).sort(function(a,b){ 
    if (a[0] < b[0]) return -1;
    if (a[0] > b[0]) return 1;
    return 0;
  });
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R,
  finalize: F
});

Which works nicely:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/mnt/bck"
  },
  "value" : [ [1306719302,122],[1306719601,122],[1306719903,122],, ... ]
}

I guess the only question that's gnawing at me is whether this Array.concat.apply([],values.value) can be trusted to clean up the output of reduce all of the time.

LAST EDIT: Much simpler...

I have modified the document structure since the original example given above, but the only effect on this example is that the map function becomes really simple.

I'm still trying to wrap my brain around why Array.prototype.push.apply(result, V.data) works so differently from result.push(V.data)... but it works.
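
The difference is easy to see with a throwaway snippet (nothing to do with MongoDB itself): push() appends its argument as a single element, adding another level of nesting, while push.apply() spreads the array's elements out as separate arguments, so each pair is appended individually.

var result = [];
result.push([ [1306719302, 122], [1306719601, 122] ]);
// result is now [ [ [1306719302,122], [1306719601,122] ] ] -- the whole chunk as one nested element

var result2 = [];
Array.prototype.push.apply(result2, [ [1306719302, 122], [1306719601, 122] ]);
// result2 is now [ [1306719302,122], [1306719601,122] ] -- each pair appended individually

Here are the simplified map/reduce/finalize functions: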

var M = function() {
  emit( { 
    host: this.host, 
    service: this.service, 
    metric: this.metric
  } , { 
    data: [ [ this.timestamp, this.data ] ] 
  } );
}
var R = function(key,values) {
  var result = [];
  values.forEach(function(V){
    // Append each pair from V.data individually instead of pushing V.data as one nested element
    Array.prototype.push.apply(result, V.data);
  });
  return { data: result };
}
var F = function(key,values) {
  return values.data.sort(function(a,b){
    return (a[0]<b[0]) ? -1 : (a[0]>b[0]) ? 1 : 0;
  });
}

It has the same output as shown just above the LAST EDIT heading.

Thanks, Thomas!

1 Answer

#1


  1. The "chunking" comes from your code: your reduce function's values parameter can contain either {time:<timestamp>,value:<value>} emitted from your map function, or {time:[<timestamps>],value:[<values]} returned from a previous call to your reduce function.

  2. I don't know if it will happen in practice, but it can happen in theory.

  3. Simply have your map function emit the same kind of objects that your reduce function returns, i.e. emit(<id>, {time: [ts], value: [P[1]]}), and change your reduce function accordingly, i.e. Array.push.apply(result.time, V.time) and similarly for result.value.

    Well, I actually don't understand why you're not using an array of time/value pairs instead of a pair of arrays, i.e. emit(<id>, { pairs: [ {time: ts, value: P[1]} ] }) or emit(<id>, { pairs: [ [ts, P[1]] ] }) in the map function, and Array.push.apply(result.pairs, V.pairs) in the reduce function. That way, you won't even need the finalize function (except maybe to "unwrap" the array from the pairs property: because the reduce function cannot return an array, you have to wrap it that way in an object).
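
    For illustration, a minimal sketch of that pairs-based approach might look like this (it reuses the field names from the question's original documents and is untested, so treat it as the shape of the idea rather than a drop-in implementation):

    var M = function() {
      var hn = this.host_name,
          sv = this.service_description,
          ts = this.last_update;
      this.performance_object.forEach(function(P){
        // one [time, value] pair per emit, already wrapped in an array
        emit({ host: hn, service: sv, metric: P[0] },
             { pairs: [ [ts, P[1]] ] });
      });
    }
    var R = function(key, values) {
      var result = { pairs: [] };
      values.forEach(function(V){
        // append each incoming pair individually; re-reduced chunks have the same shape as raw emits
        Array.prototype.push.apply(result.pairs, V.pairs);
      });
      return result;
    }
    var F = function(key, value) {
      // unwrap the array (reduce cannot return a bare array) and sort by timestamp
      return value.pairs.sort(function(a, b){ return a[0] - b[0]; });
    }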
