The right solution for storing/accessing large amounts of data

Time: 2022-10-03 19:20:38

I wrote a program which crawls websites, processes HTML pages, and stores the results in a MySQL database. By 'results' I mean the HTML contents, all the links with their attributes, and the various errors recorded when the crawler couldn't fetch a page. I use this program for analytical purposes.

Everything works fine, but the main problem is that the data takes far too much disk space. For every 100,000 websites crawled (20 pages per site maximum) I end up with 5 MySQL tables totaling ~60 GB, and I need to process 20-30 times more websites, i.e. on the order of 1.2-1.8 TB in total.

Of course I cannot process that much data on my home PC at once, so I am forced to work on only small chunks of it, which is time-consuming and inefficient.

So I am seeking advice or a solution that would:
1) give the same flexibility in accessing the data that a relational DB does
2) allow smart and efficient storage of the data

2 Solutions

#1

Score: 2

I doubt a different storage engine will be much more efficient than that: if you store everything in one table, without any indexes, and use natural primary keys, then almost no storage overhead is incurred, and even if you do add a bit of structure, it should still remain sane.

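As a rough sketch of that suggestion (the table layout, column names, and connection details below are hypothetical, not from the original post), a single wide InnoDB table keyed only by its natural primary key, the page URL, might look like this:

```python
# Hypothetical single-table layout: no secondary indexes, natural primary key.
import pymysql  # assumes the PyMySQL driver; any MySQL client works the same way

DDL = """
CREATE TABLE IF NOT EXISTS pages (
    url         VARCHAR(500)  NOT NULL,  -- natural primary key; very long URLs
                                         -- may need a prefix or hash key on old MySQL
    fetched_at  DATETIME      NOT NULL,
    http_status SMALLINT      NULL,
    error       VARCHAR(255)  NULL,      -- fetch error, if any
    body        MEDIUMBLOB    NULL,      -- raw HTML
    links       MEDIUMTEXT    NULL,      -- extracted links and attributes, e.g. as JSON
    PRIMARY KEY (url)                    -- the only index on the table
) ENGINE=InnoDB;
"""

conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="crawl")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
```

Keeping one row per page and avoiding secondary indexes is what keeps the on-disk size close to the raw data size.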

My guess would be that your problem is the sheer amount of data you collect, so you probably want to discard considerable portions of your sample data before storing it: for example, you may want to boil the page source down to a bunch of (normalized) keywords, and skip heavy content (images, etc.) and material that doesn't interest you (e.g. CSS stylesheets, JavaScript, and so on).

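A minimal sketch of that kind of pre-processing, using only the Python standard library (the class and function names are illustrative, not from the original program):

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping heavy/uninteresting elements."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def keywords(html: str, top_n: int = 50):
    """Boil a page down to its top_n normalized (lower-cased) keywords."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    return Counter(words).most_common(top_n)

# Storing only these keyword counts (plus whatever metadata you need) instead of
# the full page source is what shrinks the tables; images, CSS and JavaScript
# are never stored at all.
```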

#2

Score: 1

You may want to look into the InnoDB data compression option.

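A hedged sketch of what that could look like; the table and connection details are placeholders. ROW_FORMAT=COMPRESSED requires innodb_file_per_table to be enabled (the default on current MySQL versions), and KEY_BLOCK_SIZE chooses the compressed page size in KB:

```python
import pymysql  # assumes the PyMySQL driver; names below are illustrative

COMPRESSED_DDL = """
CREATE TABLE IF NOT EXISTS pages_compressed (
    url   VARCHAR(500) NOT NULL,
    body  MEDIUMBLOB   NULL,
    PRIMARY KEY (url)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
"""

# An existing table can also be converted in place (this rebuilds the table):
ALTER_EXISTING = "ALTER TABLE pages ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"

conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="crawl")
with conn.cursor() as cur:
    cur.execute(COMPRESSED_DDL)
conn.commit()
```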

There are also BI products like the column-oriented Infobright that transparently use compression.
