We are running a custom OpenX ad server on a MySQL database, and it gets approximately 1 million clicks per day. We need to store all this click information and show statistics based on it.
Right now, all the click information is aggregated every 2 days and the individual click records are deleted. But we want to provide our affiliates with a new feature that lets them set a dynamic tracking ID (TID) and track their clicks and conversions based on it.
So, the problem is that our click table will grow by a minimum of 1 million entries a day, and we need to be able to search this table and show all the clicks for one user for a specific period of time, grouped by the TID I mentioned above, or search by the TID.
I had a look at MySQL partitioning and it seems like a good solution, but I'm not sure whether it will still work well on a HUGE database (maybe billions of entries).
What do you think would be the correct approach for this issue?
EDIT:
Based on your answers, I'm now thinking of a mixed solution.
We already have a "LIVE" table from which the entries are deleted when the clicks are aggregated at maintenance time, which looks something like this:
Table: clicks
viewer_id | ... | date_time | affiliate_id | ... | tid
(I skipped the columns which are unimportant at this point)
At maintenance time, I can move everything to another monthly table which looks almost the same, say Table: clicks_2012_11, which has indexes for date_time, affiliate_id and tid and is partitioned by the affiliate_id.
So now, when an affiliate wants to see his statistics for the past 2 months, I know I have to look inside the Table: clicks_2012_10 and the Table: clicks_2012_11 (I will have the time range limited to a maximum of 2 months). Because I have the tables partitioned by affiliate_id, only the needed partitions will be searched from the 2 tables and I can now list all the TIDs which had any activity in the past 2 months.
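As a rough illustration of the lookup described above, here is a minimal Python sketch that works out which monthly tables cover a date range and builds one query counting clicks per TID for a single affiliate. The `clicks_YYYY_MM` naming follows the example above; the function names, the `%s` placeholder style (as used by common Python MySQL drivers) and the exact query shape are assumptions for illustration, not part of the actual schema.

```python
from datetime import date


def monthly_tables(start, end, prefix="clicks"):
    """Return the monthly table names covering [start, end], oldest first."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def tid_stats_query(tables):
    """Build one parameterized query that counts clicks per tid across tables.

    Each per-table SELECT filters on affiliate_id, so a table partitioned by
    affiliate_id only needs to read the matching partition.
    """
    per_table = "\nUNION ALL\n".join(
        f"SELECT tid FROM {t} "
        "WHERE affiliate_id = %s AND date_time >= %s AND date_time < %s"
        for t in tables
    )
    return (
        "SELECT tid, COUNT(*) AS clicks FROM (\n"
        f"{per_table}\n"
        ") AS c GROUP BY tid"
    )


if __name__ == "__main__":
    tables = monthly_tables(date(2012, 10, 15), date(2012, 11, 20))
    print(tables)
    print(tid_stats_query(tables))
```

When executing the query, the three parameters (affiliate id, range start, range end) have to be supplied once per table, in that order.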
What do you think about this approach? Are there any obvious issues? Am I overcomplicating things without a solid reason?
2 Solutions
#1
2
There is nothing inherent in big (even "huge") tables that makes MySQL fail. Big tables are mostly a problem in terms of:
- disk space
- cache usage (you are likely not to be able to run in memory)
- maintenance (schema changes, rebuilds, ...)
You need to address all of these.
Partitioning is mainly useful for bulk data maintenance such as dropping entire partitions. It is certainly not a best practice to partition big tables by default on just some column. Partitioning is always introduced for a specific reason.
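To make the "dropping entire partitions" point concrete, here is a small Python sketch that generates the MySQL statements for adding and dropping one monthly partition. It assumes a table partitioned with `PARTITION BY RANGE (TO_DAYS(date_time))` and no `MAXVALUE` partition; the `pYYYY_MM` partition naming and the helper names are hypothetical.

```python
from datetime import date


def partition_name(month_start):
    """Hypothetical naming scheme: p2012_10 for October 2012."""
    return f"p{month_start.year:04d}_{month_start.month:02d}"


def first_of_next_month(month_start):
    """Return the first day of the month after month_start."""
    if month_start.month == 12:
        return date(month_start.year + 1, 1, 1)
    return date(month_start.year, month_start.month + 1, 1)


def add_month_partition_sql(table, month_start):
    """Statement adding a partition for rows dated before the next month.

    Assumes RANGE partitioning on TO_DAYS(date_time), with the new partition
    being later than all existing ones.
    """
    upper = first_of_next_month(month_start).isoformat()
    return (
        f"ALTER TABLE {table} ADD PARTITION ("
        f"PARTITION {partition_name(month_start)} "
        f"VALUES LESS THAN (TO_DAYS('{upper}')))"
    )


def drop_month_partition_sql(table, month_start):
    """Statement removing a whole month of rows by dropping its partition."""
    return f"ALTER TABLE {table} DROP PARTITION {partition_name(month_start)}"


if __name__ == "__main__":
    print(add_month_partition_sql("clicks", date(2012, 12, 1)))
    print(drop_month_partition_sql("clicks", date(2012, 10, 1)))
```

Dropping a partition discards its rows as a unit instead of deleting them one by one, which is the maintenance benefit mentioned above.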
#2
1
Optimizing for insertion and optimizing for retrieval are usually mutually exclusive. You might be better off with two tables:
- live data: no (or minimal) keys, MyISAM to remove transaction overhead, etc.
- historical data: indexed up the wazoo, with data moved over from the live data on a periodic basis.
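A minimal sketch of the periodic move between the two tables, using Python's built-in `sqlite3` as a stand-in so that it runs anywhere. This is an assumption made purely for illustration: the real setup is MySQL, and the table names `clicks_live` and `clicks_history`, the column types and the index choices are hypothetical. Storage engines are not modelled here; note that with a non-transactional engine such as MyISAM the copy and the delete would not be atomic.

```python
import sqlite3

SCHEMA = """
-- Live table: no indexes, so inserts stay cheap.
CREATE TABLE clicks_live (
    viewer_id TEXT, date_time TEXT, affiliate_id INTEGER, tid TEXT
);
-- History table: indexed for the reporting queries.
CREATE TABLE clicks_history (
    viewer_id TEXT, date_time TEXT, affiliate_id INTEGER, tid TEXT
);
CREATE INDEX idx_hist_aff_date ON clicks_history (affiliate_id, date_time);
CREATE INDEX idx_hist_tid ON clicks_history (tid);
"""


def move_live_to_history(conn, cutoff):
    """Copy rows older than cutoff into history, then delete them from live.

    Both statements run in one transaction. Returns the number of rows copied.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT INTO clicks_history (viewer_id, date_time, affiliate_id, tid) "
            "SELECT viewer_id, date_time, affiliate_id, tid "
            "FROM clicks_live WHERE date_time < ?",
            (cutoff,),
        )
        moved = cur.rowcount
        conn.execute("DELETE FROM clicks_live WHERE date_time < ?", (cutoff,))
    return moved


def demo():
    """Load three sample clicks, move the October ones, return the counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO clicks_live VALUES (?, ?, ?, ?)",
        [
            ("v1", "2012-10-30 10:00:00", 7, "campaign-a"),
            ("v2", "2012-10-31 23:59:59", 7, "campaign-b"),
            ("v3", "2012-11-01 00:00:01", 8, "campaign-a"),
        ],
    )
    moved = move_live_to_history(conn, "2012-11-01 00:00:00")
    live = conn.execute("SELECT COUNT(*) FROM clicks_live").fetchone()[0]
    hist = conn.execute("SELECT COUNT(*) FROM clicks_history").fetchone()[0]
    conn.close()
    return moved, live, hist


if __name__ == "__main__":
    print(demo())
```

The timestamps are stored as ISO-formatted strings here, so plain string comparison against the cutoff orders them correctly; in MySQL a `DATETIME` column would be used instead.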