Diff() between two collections in MongoDB

Time: 2023-02-10 02:31:25

I have done some research. I apologize if this is a duplicate, but the solutions in other questions did not really fit my case, so I am asking a new question.

What is the best way to compare two collections with JavaScript?

I have thousands of these headers in this Mongo document format:

{
    "url": "google.com",
    "headers": {
        "location": "http://www.google.com/",
        "content-type": "text/html; charset=UTF-8",
        "date": "Mon, 25 Mar 2013 18:12:08 GMT",
        "expires": "Wed, 24 Apr 2013 18:12:08 GMT",
        "cache-control": "public, max-age=2592000",
        "server": "gws",
        "content-length": "219",
        "x-xss-protection": "1; mode=block",
        "x-frame-options": "SAMEORIGIN"
    }
}

I ran my scraper today. In the future I will run it again and store the results in a second collection. I would also like to compare three specific headers, namely server, x-aspnet-version, and x-powered-by, and detect whether there are any integer increments.

What is the best way to iterate through two collections and do a diff()?

Am I doing it right? Any suggestions would be really appreciated.

1 solution

#1


4  

A couple of suggestions:

You could use a combination of the url and the date accessed (or at least part of the datetime) as the _id for these documents, since from what I can tell you plan to scrape each url once a month.

Example:

{
    "_id": {
        "url": "www.google.com",
        "date": ISODate("2013-03-01"),
    },
    // Other attributes
}
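
As a rough sketch of how such an _id could be built in JavaScript, the helper below truncates the scrape time to the first day of its month (UTC), so two scrapes of the same url in the same month produce the same _id. The name `makeScrapeId` and the choice of month-level truncation are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: build a compound _id from a url and the scrape time,
// truncated to the first day of the month in UTC.
function makeScrapeId(url, scrapedAt) {
  const monthStart = new Date(
    Date.UTC(scrapedAt.getUTCFullYear(), scrapedAt.getUTCMonth(), 1)
  );
  // Keep the same field order (url, then date) in every document,
  // because embedded documents are compared field by field in order.
  return { url: url, date: monthStart };
}

const id = makeScrapeId("www.google.com", new Date("2013-03-25T18:12:08Z"));
console.log(JSON.stringify(id));
// {"url":"www.google.com","date":"2013-03-01T00:00:00.000Z"}
```

Because _id is unique within a collection, inserting a second document with the same url and month would be rejected, which is one way to catch accidental double scrapes.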

This pays dividends in performance, uniqueness, and querying (see this 4sq blog post). You could then query with something like:

db.collection.find({
    "_id": {
        "$gte": {
            "url": yourUrl,
            "date": rangeStart
        },
        "$lt": {
            "url": yourUrl,
            "date": rangeEnd
        }
    }
})
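
To make the bounds concrete, here is a minimal sketch that builds the filter object for one calendar month. The function name `monthRangeQuery` and the 1-based `month` parameter are assumptions for illustration. The resulting object is what you would pass to `find()`.

```javascript
// Hypothetical helper: build the _id range filter for one url and one
// calendar month (month is 1-12, dates are in UTC).
function monthRangeQuery(yourUrl, year, month) {
  const rangeStart = new Date(Date.UTC(year, month - 1, 1));
  // Date.UTC rolls month 12 over into January of the next year.
  const rangeEnd = new Date(Date.UTC(year, month, 1));
  return {
    _id: {
      $gte: { url: yourUrl, date: rangeStart },
      $lt: { url: yourUrl, date: rangeEnd },
    },
  };
}

const query = monthRangeQuery("www.google.com", 2013, 3);
console.log(JSON.stringify(query, null, 2));
```

The field order inside each bound (url, then date) should match the order used when the documents were stored, since embedded documents are compared field by field.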

This returns results nicely sorted by url and THEN by date, which seems to be just what you want. You could also use the _id index to run covered queries (over the _id field only) if you just want the set of all urls and months you have scraped, which would set you up nicely to go through each url one at a time.
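
Once documents are ordered by url and then date, consecutive scrapes of the same url sit next to each other, so walking the list once is enough to pair them up. The sketch below assumes the documents have already been fetched into an array in that order (for instance with an explicit sort on _id, since the order is not guaranteed without one). `consecutivePairs` is a hypothetical name.

```javascript
// Hypothetical helper: given documents sorted by _id.url and then _id.date,
// return [previous, current] pairs for consecutive scrapes of the same url.
function consecutivePairs(sortedDocs) {
  const pairs = [];
  for (let i = 1; i < sortedDocs.length; i++) {
    const prev = sortedDocs[i - 1];
    const curr = sortedDocs[i];
    // Only pair documents that belong to the same url.
    if (prev._id.url === curr._id.url) {
      pairs.push([prev, curr]);
    }
  }
  return pairs;
}

// Made-up sample data for illustration only.
const docs = [
  { _id: { url: "a.com", date: new Date("2013-03-01") } },
  { _id: { url: "a.com", date: new Date("2013-04-01") } },
  { _id: { url: "b.com", date: new Date("2013-03-01") } },
];
console.log(consecutivePairs(docs).length); // 1
```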

If there are specific attributes of the document that you want to compare (headers.server, for example) and a specific comparison you want to make (such as looking for any increment in version numbers), I would use a regex to grab the parts relevant to the version number (a quick and dirty one might simply retrieve all numeric elements) and graph them for each url. I assume this would let you visualize changes to server software over time. You could just as easily report whenever any of these attributes changes: scan them in date order and fire an event when two consecutive strings are not identical, perhaps reporting the change or just its numerical part.
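
A minimal sketch of the quick and dirty approach, assuming two scrapes of the same url are already in hand as plain objects in the format from the question. It pulls every run of digits out of the three headers of interest and reports which ones changed and whether the numeric part went up. The helper names (`versionParts`, `compareParts`, `diffHeaders`) and the sample header values are made up for illustration.

```javascript
// Headers the question asks about.
const WATCHED = ["server", "x-aspnet-version", "x-powered-by"];

// Quick and dirty: grab every run of digits from a header value,
// e.g. "Apache/2.2.22 (Ubuntu)" -> [2, 2, 22]. Missing headers give [].
function versionParts(headerValue) {
  const matches = String(headerValue || "").match(/\d+/g);
  return matches ? matches.map(Number) : [];
}

// Compare two numeric arrays component by component.
// Returns 1 if next is higher, -1 if lower, 0 if equal.
function compareParts(prev, next) {
  const len = Math.max(prev.length, next.length);
  for (let i = 0; i < len; i++) {
    const a = prev[i] || 0;
    const b = next[i] || 0;
    if (a !== b) return b > a ? 1 : -1;
  }
  return 0;
}

// Report every watched header whose string differs between two scrapes.
function diffHeaders(oldDoc, newDoc) {
  const changes = [];
  for (const name of WATCHED) {
    const before = (oldDoc.headers || {})[name];
    const after = (newDoc.headers || {})[name];
    if (before === after) continue;
    changes.push({
      url: newDoc.url,
      header: name,
      before: before,
      after: after,
      incremented: compareParts(versionParts(before), versionParts(after)) > 0,
    });
  }
  return changes;
}

// Made-up sample documents for illustration only.
const march = { url: "example.com", headers: { server: "Apache/2.2.21", "x-powered-by": "PHP/5.3.10" } };
const april = { url: "example.com", headers: { server: "Apache/2.2.22", "x-powered-by": "PHP/5.3.10" } };
console.log(diffHeaders(march, april));
```

Treating all digits as version components is crude: a value that contains other numbers would be misread, so a stricter regex per header may be worth it once you know what the values look like.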
