How to efficiently serve a large number of sitemaps in Django

Date: 2020-12-30 03:51:43

I have a site with about 150K pages in its sitemap. I'm using the sitemap index generator to make the sitemaps, but really, I need a way of caching them, because building the 150 sitemaps of 1,000 links each is brutal on my server.[1]

I COULD cache each of these sitemap pages with memcached, which is what I'm using elsewhere on the site... however, there are so many sitemaps that they would completely fill memcached, so that doesn't work.

What I think I need is a way to use the database as the cache for these, and to only generate them when there are changes to them (which, as a result of the sitemap index, means only changing the latest couple of sitemap pages, since the rest are always the same).[2] But, as near as I can tell, I can only use one cache backend with Django.

How can I have these sitemaps ready for when Google comes-a-crawlin' without killing my database or memcached?

Any thoughts?

[1] I've limited it to 1,000 links per sitemap page because generating the max, 50,000 links, just wasn't happening.

[2] For example, if I have sitemap.xml?page=1, page=2 ... sitemap.xml?page=50, I only really need to change sitemap.xml?page=50 until it is full with 1,000 links; then I can cache it pretty much forever, focus on page 51 until it's full, cache that forever, and so on.

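To make [2] concrete, here is a rough sketch of the kind of view I have in mind (not working code from my project; build_sitemap_page and last_page_number are hypothetical helpers): full pages get cached with no expiry, and only the newest, still-growing page expires quickly.

from django.core.cache import cache
from django.http import HttpResponse

PAGE_SIZE = 1000

def build_sitemap_page(page):
    # Hypothetical helper: build the <urlset> XML for one page of links,
    # e.g. by querying the database for that slice of objects.
    return "<urlset>...</urlset>"

def last_page_number():
    # Hypothetical helper: the number of the newest, still-growing page.
    return 50

def sitemap_page(request):
    page = int(request.GET.get("page", 1))
    key = "sitemap-page-%d" % page
    xml = cache.get(key)
    if xml is None:
        xml = build_sitemap_page(page)
        # Full pages never change again, so cache them with no expiry
        # (timeout=None); only the newest page gets a short timeout.
        timeout = 300 if page >= last_page_number() else None
        cache.set(key, xml, timeout)
    return HttpResponse(xml, content_type="application/xml")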

EDIT, 2012-05-12: This has continued to be a problem, and I finally ditched Django's sitemap framework after using it with a file cache for about a year. Instead I'm now using Solr to generate the links I need in a really simple view, and I'm then passing them off to the Django template. This greatly simplified my sitemaps, made them perform just fine, and I'm up to about 2,250,000 links as of now. If you want to do that, just check out the sitemap template - it's all really obvious from there. You can see the code for this here: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/src/tip/alert/casepage/sitemap.py

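If it helps, the shape of that simple view is roughly the sketch below (not the actual CourtListener code linked above; get_sitemap_urls stands in for the Solr query, and the sitemap template just loops over the links and emits the <url> entries):

from django.shortcuts import render

PAGE_SIZE = 1000  # links per sitemap page

def get_sitemap_urls(offset, limit):
    # Hypothetical stand-in for the Solr query: return one dict per URL
    # with whatever fields the sitemap template needs.
    return [{"location": "https://example.com/item/%d/" % i}
            for i in range(offset, offset + limit)]

def sitemap(request):
    page = int(request.GET.get("p", 1))
    urls = get_sitemap_urls(offset=(page - 1) * PAGE_SIZE, limit=PAGE_SIZE)
    return render(request, "sitemap.xml", {"urlset": urls},
                  content_type="application/xml")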

4 solutions

#1


9  

I had a similar issue and decided to use Django to write the sitemap files to disk in the static media and have the webserver serve them. I set it up to regenerate the sitemap every couple of hours, since my content wasn't changing more often than that, but how often you need to write the files will depend on your content.

I used a Django custom management command with a cron job, but curl with a cron job is easier.

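For reference, such a command could look roughly like the sketch below (not my actual code; the output directory, base URL, and page count are placeholders you'd adapt to your project). It just fetches each sitemap page from the running Django site and writes it into static media:

# management/commands/write_sitemaps.py -- a sketch, not the command I ran.
import urllib.request

from django.core.management.base import BaseCommand

OUTPUT_DIR = "/var/www/static/sitemaps"          # placeholder path served by the webserver
BASE_URL = "http://localhost:8000/sitemap.xml"   # placeholder URL of the Django sitemap view

class Command(BaseCommand):
    help = "Fetch each sitemap page from Django and write it to static media."

    def add_arguments(self, parser):
        parser.add_argument("--pages", type=int, default=150)

    def handle(self, *args, **options):
        for page in range(1, options["pages"] + 1):
            data = urllib.request.urlopen("%s?page=%d" % (BASE_URL, page)).read()
            with open("%s/sitemap-%d.xml" % (OUTPUT_DIR, page), "wb") as fh:
                fh.write(data)
            self.stdout.write("wrote sitemap-%d.xml" % page)

A crontab entry such as 0 */2 * * * python /path/to/manage.py write_sitemaps would then refresh the files every couple of hours.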

Here's how I use curl; I have Apache serve /sitemap.xml as a static file, not through Django:

curl -o /path/sitemap.xml http://example.com/generate/sitemap.xml

#2


8  

Okay - I have found some more info on this, and on what Amazon are doing with their 6 million or so URLs.

Amazon simply make a map for each day and add to it:

  1. new URLs
  2. updated URLs

So this means that they end up with loads of sitemaps - but the search bot will only look at the latest ones, as the updated dates are recent. I was under the impression that one should refresh a map, and not include a URL more than once. I think this is true. But Amazon get around this because the sitemaps are more of a log. A URL may appear in a later sitemap, as it may have been updated, but Google won't look at the older maps since they are out of date - unless of course it does a major re-index. This approach makes a lot of sense, as all you do is simply build a new map - say, each day, of new and updated content - and ping Google with it; thus Google only needs to index these new URLs.

This log approach is a cinch to code, as all you need is a static data-store model that stores the XML data for each map. Your cron job can build a map - daily or weekly - and then store the raw XML page in a blob field or what have you. You can then serve the pages straight from a handler, and the index map too.

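Roughly, what I am imagining looks like the sketch below (model and field names are just placeholders): the cron job inserts one row per generated map, and two tiny views serve the stored XML and the index.

# models.py (sketch): one row per generated sitemap "log entry"
from django.db import models
from django.http import Http404, HttpResponse

class SitemapPage(models.Model):
    generated_on = models.DateField(unique=True)
    xml = models.TextField()  # the raw <urlset> document built by the cron job

# views.py (sketch): serve a stored page, plus the index that lists them all
def sitemap_page(request, date):
    try:
        page = SitemapPage.objects.get(generated_on=date)
    except SitemapPage.DoesNotExist:
        raise Http404
    return HttpResponse(page.xml, content_type="application/xml")

def sitemap_index(request):
    entries = "".join(
        "<sitemap><loc>https://example.com/sitemap-%s.xml</loc></sitemap>"
        % p.generated_on.isoformat()
        for p in SitemapPage.objects.all()
    )
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
           '%s</sitemapindex>') % entries
    return HttpResponse(xml, content_type="application/xml")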

I'm not sure what others think, but this sounds like a very workable approach and a load off one's server, compared to rebuilding a huge map just because a few pages may have changed.

I have also considered that it may be possible to then crunch a week's worth of maps into a single weekly map, and 4 weeks of maps into a monthly one - so you end up with monthly maps, a map for each week in the current month, and then a map for each of the last 7 days. Assuming that the dates are all maintained, this will reduce the number of maps and tidy up the process - I'm thinking in terms of reducing 365 maps, one for each day of the year, down to 12.

Here is a PDF on sitemaps and the approaches used by Amazon and CNN:

http://www2009.org/proceedings/pdf/p991.pdf

#3


3  

I'm using the django-staticgenerator app to cache sitemap.xml to the filesystem, and I update that file when the data changes.

settings.py:

import os  # needed for os.path.join below; SITE_ROOT is assumed to be defined earlier in settings.py

STATIC_GENERATOR_URLS = (
    r'^/sitemap',
)
WEB_ROOT = os.path.join(SITE_ROOT, 'cache')

models.py:

from staticgenerator import quick_publish, quick_delete
from django.dispatch import receiver
from django.db.models.signals import post_save, post_delete
from django.contrib.sitemaps import ping_google

# "Page" is the model (defined elsewhere in this models.py) whose changes
# should invalidate the cached sitemap.
@receiver(post_delete)
@receiver(post_save)
def delete_cache(sender, **kwargs):
    # Only react when a Page instance was saved or deleted
    if sender == Page:
        # Drop the cached file; it will be regenerated on the next request
        quick_delete('/sitemap.xml')
        # You may republish the sitemap file immediately instead:
        # quick_publish('/', '/sitemap.xml')
        ping_google()

In my nginx configuration I point sitemap.xml at the cache folder, with the Django instance as a fallback:

location /sitemap.xml {
    root /var/www/django_project/cache;

    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;

    if (-f $request_filename/index.html) {
        rewrite (.*) $1/index.html break;
    }
    # If file doesn't exist redirect to django
    if (!-f $request_filename) {
        proxy_pass http://127.0.0.1:8000;
        break;
    }    
}

With this method, sitemap.xml is always kept up to date, and clients (like Google) always get the XML file served statically. That's cool, I think! :)

#4


0  

For those who (for whatever reason) would prefer to keep their sitemaps dynamically generated (e.g. for freshness or laziness): try django-sitemaps. It's a streaming version of the standard sitemaps and a drop-in replacement, with much faster response times, and it uses waaaaay less memory.
