MySQL indexing - what is the best practice for this table and these queries?

Time: 2022-05-19 21:03:27

I have this table (500,000 rows):

CREATE TABLE IF NOT EXISTS `listings` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` tinyint(1) NOT NULL DEFAULT '1',
  `hash` char(32) NOT NULL,
  `source_id` int(10) unsigned NOT NULL,
  `link` varchar(255) NOT NULL,
  `short_link` varchar(255) NOT NULL,
  `cat_id` mediumint(5) NOT NULL,
  `title` mediumtext NOT NULL,
  `description` mediumtext,
  `content` mediumtext,
  `images` mediumtext,
  `videos` mediumtext,
  `views` int(10) unsigned NOT NULL,
  `comments` int(11) DEFAULT '0',
  `comments_update` int(11) NOT NULL DEFAULT '0',
  `editor_id` int(11) NOT NULL DEFAULT '0',
  `auther_name` varchar(255) DEFAULT NULL,
  `createdby_id` int(10) NOT NULL,
  `createdon` int(20) NOT NULL,
  `editedby_id` int(10) NOT NULL,
  `editedon` int(20) NOT NULL,
  `deleted` tinyint(1) NOT NULL,
  `deletedon` int(20) NOT NULL,
  `deletedby_id` int(10) NOT NULL,
  `deletedfor` varchar(255) NOT NULL,
  `published` tinyint(1) NOT NULL DEFAULT '1',
  `publishedon` int(11) unsigned NOT NULL,
  `publishedby_id` int(10) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `hash` (`hash`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

I'm thinking of restricting every query to publishedon BETWEEN x AND y (the whole site shows only one month of records).

At the same time, I want to add published, cat_id, and source_id to the WHERE clause alongside publishedon.

Something like this:

SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5))

That query has been fast enough so far without any index, but once I added ORDER BY publishedon it became too slow, so I created this index:

CREATE INDEX `listings_pcs` ON listings(
    `publishedon` DESC,
    `published` ,
    `cat_id` ,
    `source_id`
)

It worked, and ORDER BY publishedon became fast. Now I want to order by views, like this:

SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC

This is the EXPLAIN output: [image]. This query is too slow because of ORDER BY views DESC.

Then I tried dropping the old index and adding this one:

CREATE INDEX `listings_pcs` ON listings(
    `publishedon` DESC,
    `published` ,
    `cat_id` ,
    `source_id`,
    `views` DESC
)

It is also too slow.

What if I use just a single-column index on publishedon? What about a single composite index on (cat_id, source_id, views, publishedon)?

I can change the query's conditions, such as the one-month range on publishedon, if a better indexing method depends on other columns.

What about an index on (cat_id, source_id, publishedon, published)? But in some cases I will filter on source_id only.

What is the best indexing scheme for this table?

4 answers

#1


2  

One important general note on why your query isn't getting any faster despite your attempts: DESC in index definitions was not supported by MySQL when this was written. Versions before 8.0 parse the keyword but ignore it and build an ascending index. See this SO thread, and the source from which it comes.
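
One way to check whether a given server honors DESC in an index definition is to create a throwaway index and inspect it with SHOW INDEX. This is only a sketch: the index name listings_pub_desc is made up for illustration, and the remark about 8.0 and MyISAM reflects my understanding of the documentation rather than a test against this table.

```sql
-- Hypothetical throwaway index, used only to inspect how DESC is handled.
-- Servers older than MySQL 8.0 parse DESC here but ignore it.
-- (As far as I know, 8.0+ implements descending indexes for InnoDB only,
--  so this MyISAM table may reject the statement on those versions.)
CREATE INDEX listings_pub_desc ON listings (publishedon DESC);

-- The Collation column reports A (ascending) or D (descending)
-- for each indexed column.
SHOW INDEX FROM listings WHERE Key_name = 'listings_pub_desc';

-- Remove the experiment again.
DROP INDEX listings_pub_desc ON listings;
```

If Collation shows A, the DESC keyword had no effect and the index is a plain ascending one.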

In this case, your largest problem is the sheer size of each record. If the engine decides that using an index wouldn't really be faster, it won't use one.

You have a few options, and all are actually pretty decent and can probably help you see significant improvement.

A note on SQL

First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.

It usually helps me to think about indexing in three different buckets: the absolutely, the maybe, and the never. You certainly don't have anything in your indexing that belongs in the never bucket, but there are some I would consider "maybe" indexes.

absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.

maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN, as @Machavity recommends in his answer, you may find that by the time these columns are used to filter out rows, there aren't that many rows left anyway. An example of a column that would solidly be in this pile for me is the published column. Keep in mind that every index adds to the work your writes (INSERT, UPDATE, DELETE) need to do.

Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.

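
A composite index can also serve queries that filter only on its leading column, which is relevant to the case in the question where sometimes only source_id is used. A minimal sketch; the index name is hypothetical, and whether the optimizer actually picks the index depends on the data:

```sql
-- Hypothetical composite index with source_id as the leading column.
CREATE INDEX listings_source_cat ON listings (source_id, cat_id);

-- Can seek in the index: filters on the leading column only.
EXPLAIN SELECT id FROM listings WHERE source_id = 3;

-- Can seek in the index: filters on both columns.
EXPLAIN SELECT id FROM listings WHERE source_id = 3 AND cat_id = 2;

-- Cannot seek in this index: the leading column is missing
-- from the WHERE clause.
EXPLAIN SELECT id FROM listings WHERE cat_id = 2;
```

Comparing the key and rows columns of the three EXPLAIN results shows which queries benefit.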

Options, options, options...

There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.

  1. Split your SQL table into two or more separate tables.

This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed could easily be put into another table, actions:

+------------+-------------+------+-----+-------------------+----------------+
| Field      | Type        | Null | Key | Default           | Extra          |
+------------+-------------+------+-----+-------------------+----------------+
| id         | int(11)     | NO   | PRI | NULL              | auto_increment |
| listing_id | int(11)     | NO   |     | NULL              |                |
| action     | varchar(45) | NO   |     | NULL              |                |
| actiondate | datetime    | NO   |     | CURRENT_TIMESTAMP |                |
| user_id    | int(11)     | NO   |     | NULL              |                |
+------------+-------------+------+-----+-------------------+----------------+

The downside to this is that it does not let you ensure there is only one creation date per listing without a TRIGGER. The upside is that you don't have to sort as many columns with as many indexes when you're sorting by date. It also lets you sort not only by creation date, but by the date of any of your other actions.
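
As a rough sketch, the actions table could be created and back-filled as follows. The column names listing_id and actiondate are taken from the sample query in this answer; the engine, the unsigned type for listing_id, and the back-fill statement are assumptions, and DEFAULT CURRENT_TIMESTAMP on a DATETIME column needs MySQL 5.6.5 or later.

```sql
-- Sketch of the actions table; adjust names and types to your schema.
CREATE TABLE IF NOT EXISTS `actions` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `listing_id` int(10) unsigned NOT NULL,   -- refers to listings.id
  `action` varchar(45) NOT NULL,            -- e.g. 'created', 'published'
  `actiondate` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `user_id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- Back-fill one 'published' action per published listing.
-- FROM_UNIXTIME converts using the session time zone.
INSERT INTO actions (listing_id, action, actiondate, user_id)
SELECT id, 'published', FROM_UNIXTIME(publishedon), publishedby_id
  FROM listings
 WHERE published = 1;
```

The same INSERT ... SELECT pattern would apply to the created, edited, and deleted columns.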

Edit: As requested, here is a sample sorting query

SELECT * FROM listings 
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published') 
  AND (listings.published = 1) 
  AND (listings.cat_id in(1,2,3,4,5)) 
  AND (listings.source_id  in(1,2,3,4,5)) 
  -- actiondate is a DATETIME, so convert the Unix timestamps before comparing
  AND (actions.actiondate BETWEEN FROM_UNIXTIME(1441105258) AND FROM_UNIXTIME(1443614458))
ORDER BY listings.views DESC

Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!

If you put a composite key on actions.actiondate and actions.listing_id, this should help to increase speed.
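
The composite key mentioned here might look like the first statement in this sketch. The second statement is my own variation, not something the answer proposes: it puts the equality column (action) first, which is usually the better position for a column compared with =. Both index names are hypothetical.

```sql
-- Composite key as described: date first, then the listing reference.
CREATE INDEX actions_date_listing
    ON actions (actiondate, listing_id);

-- Variation (an assumption, worth testing against the first):
-- equality column first, then the range column, then the join column.
CREATE INDEX actions_action_date_listing
    ON actions (action, actiondate, listing_id);
```

Running EXPLAIN on the join query with each index in place shows which one the optimizer prefers.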

As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:

  2. Create a month field

I used this nifty tool to confirm what I thought I understood of your question: you are filtering by month here. Your example looks specifically between September 1st and September 30th, 2015, inclusive.
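
The same check can be done inside MySQL with FROM_UNIXTIME, without an external tool. The sketch below pins the session time zone to UTC so the result does not depend on server settings:

```sql
-- Interpret the timestamps in UTC for this session.
SET time_zone = '+00:00';

SELECT FROM_UNIXTIME(1441105258) AS range_start,
       FROM_UNIXTIME(1443614458) AS range_end;
-- range_start: 2015-09-01 11:00:58
-- range_end:   2015-09-30 12:00:58
```

Note that the range does not start at midnight on the 1st or end at midnight on the 30th, so it is not exactly a calendar month.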

So another option is to split your integer timestamp into month, day, and year fields. You can still keep your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN on even a simple query and you'll see for yourself.

That way, you can just index the month and year fields and do a query like this:

SELECT * FROM listings 
WHERE (publishedmonth = 9)
  AND (publishedyear = 2015) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC

Slap an EXPLAIN in front and you should see massive improvements.

Because you're planning on referring to a month and a year together, you may want to add a composite key on month and year, rather than a separate key on each, for added gains.
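
A minimal sketch of adding and populating the two columns used in the query above, plus the composite key. The column types, the defaults, and the index name are assumptions; new rows would also need these columns filled in by the application or a trigger.

```sql
-- Add the denormalized columns (types and defaults are assumptions).
ALTER TABLE listings
  ADD COLUMN publishedyear  smallint unsigned NOT NULL DEFAULT 0,
  ADD COLUMN publishedmonth tinyint unsigned  NOT NULL DEFAULT 0;

-- Populate them from the existing Unix timestamp.
-- FROM_UNIXTIME converts using the session time zone.
UPDATE listings
   SET publishedyear  = YEAR(FROM_UNIXTIME(publishedon)),
       publishedmonth = MONTH(FROM_UNIXTIME(publishedon));

-- One composite key instead of two single-column keys.
CREATE INDEX listings_pub_year_month
    ON listings (publishedyear, publishedmonth);
```

Both columns are compared with = in the query, so either column order works for the lookup.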

Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need, and, frankly, will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.

  3. Do your heavy sorting elsewhere

This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.

The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.

There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.

#2


12  

This query:

SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
      (published = 1) AND
      (cat_id in (1,2,3,4,5)) AND
      (source_id in (1,2,3,4,5));

is hard to optimize with indexes alone. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is that every condition except the one on published uses something other than =.

Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.

Suggestions . . . None are that great:

  • If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
  • Try various orders of columns after published in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.
  • Think about ways that you can structure the query to have more equality conditions in the WHERE clause or to return a smaller set of data.
  • (Not really recommended) Put an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.
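
To try different column orders after published, one approach is to build two candidate indexes and compare what EXPLAIN reports for each. This is only a sketch: the index names are hypothetical, and FORCE INDEX is used here purely to compare plans, not as something to keep in production queries.

```sql
-- Rough selectivity check: more distinct values usually means
-- a more selective column.
SELECT COUNT(DISTINCT cat_id), COUNT(DISTINCT source_id) FROM listings;

-- Candidate 1: range column right after published.
CREATE INDEX listings_p_on_cat_src
    ON listings (published, publishedon, cat_id, source_id);

-- Candidate 2: IN columns first, range column last.
CREATE INDEX listings_p_cat_src_on
    ON listings (published, cat_id, source_id, publishedon);

-- Compare the rows estimate and Extra column for each candidate.
EXPLAIN SELECT * FROM listings FORCE INDEX (listings_p_on_cat_src)
 WHERE published = 1
   AND publishedon BETWEEN 1441105258 AND 1443614458
   AND cat_id IN (1,2,3,4,5)
   AND source_id IN (1,2,3,4,5);

EXPLAIN SELECT * FROM listings FORCE INDEX (listings_p_cat_src_on)
 WHERE published = 1
   AND publishedon BETWEEN 1441105258 AND 1443614458
   AND cat_id IN (1,2,3,4,5)
   AND source_id IN (1,2,3,4,5);
```

Keep whichever index gives the lower row estimate and the better measured run time, then drop the other.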

The last one is not recommended because SQL (and MySQL) does not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like relying on undocumented side effects, which can change from version to version.

#3


2  

If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.

Something you should do is use EXPLAIN, which lets you look under the hood at how MySQL is pulling the data. It could further point to what is slowing your query down.

EXPLAIN SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC

#4


1  

The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.

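
For illustration, here is the same query with an explicit column list that leaves out the mediumtext columns. Which columns the page actually needs is an assumption on my part:

```sql
-- Only short, fixed-size or varchar columns are selected here;
-- the mediumtext columns (title, description, content, images,
-- videos) are left out of the sort.
SELECT id, cat_id, source_id, short_link, views, publishedon
  FROM listings
 WHERE published = 1
   AND publishedon BETWEEN 1441105258 AND 1443614458
   AND cat_id IN (1,2,3,4,5)
   AND source_id IN (1,2,3,4,5)
 ORDER BY views DESC;
```

If a text column such as title is needed, it can be fetched in a second step for just the rows that are displayed.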

You showed us a query with the following filter criteria:

  1. single-value equality on published.
  2. range matching on publishedon.
  3. set matching on cat_id.
  4. set matching on source_id.
  5. ordering on views.

Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.

CREATE INDEX listings_x_pub_date_cover ON listings( 
     published, publishedon, cat_id, source_id, views, id )

To satisfy your query, the MySQL engine will random-access the index at the appropriate value of published, and then at the beginning of the publishedon range. It will then scan through the index, filtering on the other two criteria. Finally, it sorts and uses the id value to look up each row that passes the filter. Give it a try.
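
To see whether the index is used the way this paragraph describes, EXPLAIN a query that touches only indexed columns. What the output will show on your data is an expectation, not something I have verified:

```sql
-- Every column referenced here is part of listings_x_pub_date_cover,
-- so the query can in principle be answered from the index alone.
EXPLAIN
SELECT id, views
  FROM listings
 WHERE published = 1
   AND publishedon BETWEEN 1441105258 AND 1443614458
   AND cat_id IN (1,2,3,4,5)
   AND source_id IN (1,2,3,4,5)
 ORDER BY views DESC;
```

In the result, key should name listings_x_pub_date_cover, and "Using index" in the Extra column means no table rows were read. "Using filesort" is likely to remain, because the range on publishedon prevents the index from delivering rows in views order.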

If that performance isn't good enough try this so-called deferred join operation.

SELECT a.*
  FROM listings a
  JOIN ( SELECT id, views
           FROM listings
          WHERE published = 1
            AND publishedon BETWEEN 1441105258
                                AND 1443614458
            AND cat_id IN (1,2,3,4,5)
            AND source_id IN (1,2,3,4,5)
          ORDER BY views DESC
       ) b ON a.id = b.id
 ORDER BY b.views DESC

This does the heavy lifting of ordering with just the id and views columns, without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have an ORDER BY ... LIMIT n pattern in your query, but you don't.
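
If the page only ever shows the top few listings, the ORDER BY ... LIMIT n variant would look roughly like this. The page size of 20 is a made-up value for illustration:

```sql
SELECT a.*
  FROM listings a
  JOIN ( SELECT id
           FROM listings
          WHERE published = 1
            AND publishedon BETWEEN 1441105258
                                AND 1443614458
            AND cat_id IN (1,2,3,4,5)
            AND source_id IN (1,2,3,4,5)
          ORDER BY views DESC
          LIMIT 20              -- hypothetical page size
       ) b ON a.id = b.id
 ORDER BY a.views DESC;
```

Here the inner query sorts only narrow index entries and keeps 20 ids, so the wide rows are read just 20 times.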

Finally, considering the size of these rows, you may get the best performance by running this inner query from your PHP program:

         SELECT id
           FROM listings
          WHERE published = 1
            AND publishedon BETWEEN 1441105258
                                AND 1443614458
            AND cat_id IN (1,2,3,4,5)
            AND source_id IN (1,2,3,4,5)
          ORDER BY views DESC

and then fetching the full rows of the table one-by-one using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16MiB) it's probably your best bet.

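
On the SQL side, the per-row fetch is a primary-key lookup that can be prepared once and executed for each id. The sketch below uses MySQL's server-side prepared statement syntax; the statement name and the id value 42 are hypothetical, and in practice the loop and the prepared statement would live in the PHP code.

```sql
-- Prepare the single-row lookup once.
PREPARE fetch_listing FROM 'SELECT * FROM listings WHERE id = ?';

-- Execute it once per id returned by the id-only query.
SET @listing_id = 42;          -- hypothetical id
EXECUTE fetch_listing USING @listing_id;

-- Release the statement when the loop is done.
DEALLOCATE PREPARE fetch_listing;
```

Each execution is a lookup on the PRIMARY KEY, so its cost does not depend on the size of the table's text columns in other rows.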

tl;dr. Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of columns you need instead. Try the deferred join query. If it's still not good enough try the nested query.
