How to download Wikipedia articles and store them in a database

Date: 2023-01-14 23:21:24

I have a web application where users (mainly English learners or children) can search licensed articles that already exist in my database. The articles can be filtered by category, tag, and difficulty.

So I am thinking of adding articles from Wikipedia to the database and updating them in my database once in a while, but I am not sure what the best way to do that would be. My understanding is that I need to download compressed dump files each time and then decompress them to get the articles in XML format. Then I can add them to the database according to their tags? Is there a way to have this update automatically? I read the article on data dumps but am not sure how to get started.

http://en.wikipedia.org/wiki/Wikipedia:Database_download#SQL_schema
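For reference, here is a minimal sketch of that dump workflow in Python, assuming the bzip2-compressed pages-articles dump has already been downloaded from dumps.wikimedia.org (the file name and XML namespace below are illustrative; the namespace version varies between dumps):

```python
# Stream-parse a Wikipedia pages-articles dump without decompressing
# it to disk first. Assumes the .xml.bz2 dump was already downloaded.
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local file
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace version varies by dump

with bz2.open(DUMP_PATH, "rb") as f:
    # iterparse streams the file, so the whole dump never sits in memory
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            # ... insert title/text into your database here ...
            print(title)
            elem.clear()  # free the parsed children to keep memory low
```

Automating the update could then be a scheduled job (e.g. cron) that fetches the latest dump and re-runs this import.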

1 answer

#1 (score: -2)

Perhaps it would be better to merely crawl and index Wikipedia. Then you could store a search index of the pages you care about in a system such as Apache Solr. If you do that, be sure to be polite about the rate of your requests.
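For instance, a rough sketch of a polite fetch-and-index loop, assuming a local Solr core named "articles", the pysolr client, and Wikipedia's public REST summary endpoint (the page list, field names, and one-second delay are illustrative, not prescriptive):

```python
import time
import requests
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles", always_commit=True)
titles = ["Cat", "Dog", "Bird"]  # hypothetical list of pages you care about

for title in titles:
    # Wikipedia's REST API returns a short summary for a page
    resp = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}",
        headers={"User-Agent": "MyReaderApp/1.0 (contact@example.com)"},
        timeout=10,
    )
    data = resp.json()
    solr.add([{"id": title, "title": data["title"], "body": data.get("extract", "")}])
    time.sleep(1)  # be polite: throttle requests to the live site
```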

That avoids the storage overhead and requires no effort to keep the content updated. Only the links need to be refreshed (probably much less frequently).

If you don't wish to filter what people find, then you could probably just sign up for Google's search API and save yourself the crawler time and effort...
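A sketch of that alternative, using the Google Custom Search JSON API (the API key and engine ID are placeholders you get from the Google developer console; the search engine can be restricted to en.wikipedia.org):

```python
import requests

API_KEY = "YOUR_API_KEY"      # hypothetical credential
ENGINE_ID = "YOUR_ENGINE_ID"  # hypothetical custom search engine id

def wiki_search(query: str):
    # Query the Custom Search JSON API and return (title, link) pairs
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return [(item["title"], item["link"]) for item in resp.json().get("items", [])]

print(wiki_search("photosynthesis simple explanation"))
```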
