MySQL/Python: querying a large database takes too long

Time: 2023-01-03 04:02:30

I have a database with over 30,000 tables and ~40-100 rows in each table. I want to retrieve a list of table names which contain a string under a specific column.

So for example:

I want to retrieve the names of all tables which contain 'foo'...

Database
    Table_1
        ID: 1, STR: bar
        ID: 2, STR: foo
        ID: 3, STR: bar
    Table_2
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: bar
    Table_3
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: foo

So in this case the function should return ['Table_1', 'Table_3']

So far I have this. It works fine but takes over 2 minutes to execute, which is far too long for the application I have in mind.

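# self.m() is presumably a thin wrapper that runs a query on self.db
results = []
self.m('SHOW TABLES')
result = self.db.store_result()
# maxrows=0 fetches every row; how=1 returns each row as a dict
tablelist = result.fetch_row(0, 1)
for table in tablelist:
    table_name = table['Tables_in_definitions']
    # one query, and so one round trip, per table
    # (str is the search string here; the name shadows the built-in)
    self.m("""SELECT `def` FROM `""" + table_name + """` WHERE `def` = '""" + str + """'""")
    result = self.db.store_result()
    r = result.fetch_row(1, 1)
    if len(r) > 0:
        results.append(table_name)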

I'm not smart enough to come up with a way to speed this up so if anyone has any suggestions it would be greatly appreciated, thanks!

1 solution

#1


If you are just testing for the existence of one row in each table where def = 'str', one easy thing to do (with no other changes) is to add a LIMIT 1 clause to the end of your query.

(If your query is performing a full table scan, MySQL can stop scanning as soon as the first matching row is found. If no row matches, the scan still has to run to the end of the table.)

This also avoids the overhead of preparing lots of rows and returning them to the client when they aren't needed.
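
To make the LIMIT 1 idea concrete, here is a minimal Python sketch. It assumes a DB-API cursor from a driver that uses %s placeholders (MySQLdb and PyMySQL both do); the helper names existence_query and table_contains are made up for illustration and are not part of the original code. It still costs one round trip per table, so it only trims the per-query work.

```python
def existence_query(table_name, column="def"):
    """Build a parameterized query that stops at the first matching row."""
    # Identifiers cannot be bound as parameters, so they are backtick-quoted;
    # a backtick inside an identifier is escaped by doubling it.
    quoted_table = "`" + table_name.replace("`", "``") + "`"
    quoted_column = "`" + column.replace("`", "``") + "`"
    # SELECT 1 ... LIMIT 1 returns at most one tiny row per table.
    return f"SELECT 1 FROM {quoted_table} WHERE {quoted_column} = %s LIMIT 1"


def table_contains(cursor, table_name, needle):
    """Return True if at least one row in table_name has def = needle."""
    # The search value is bound by the driver instead of being concatenated.
    cursor.execute(existence_query(table_name), (needle,))
    return cursor.fetchone() is not None
```

Binding the search value as a parameter also sidesteps the quoting problems of building the WHERE clause by string concatenation.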

Also, an index with def as the leading column (at least on your largest tables) will likely help performance if your query is searching large tables for "a needle in a haystack".


UPDATE:

I've re-read your question, and I see that you have 30,000 tables to check. That's 30,000 separate queries and 30,000 round trips to the database. (ACCCKKK.)

So my previous suggestion is pretty much useless. (That would be more appropriate with 40 tables each having 30,000 rows.)

Another approach would be to query a bunch of those tables at the same time. I'd be hesitant to even try more than a couple hundred tables at a time though, so I'd do it in batches.

SELECT DISTINCT 'Table1' AS table_name FROM Table1 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table2' FROM Table2 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table3' FROM Table3 WHERE def = 'str'

If def is unique in each table, or if it's nearly unique and you can handle duplicate table_name values being returned, you could get rid of the DISTINCT keyword.
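
On the Python side, the batching could look roughly like the sketch below. It is only a sketch under assumptions: the helper names (quote_identifier, build_batches, find_tables) are hypothetical, the batch size of 200 is just the "couple hundred" guess from above, the cursor is a DB-API cursor that takes %s placeholders and returns rows as tuples, and table names are assumed not to contain a % character.

```python
def quote_identifier(name):
    """Backtick-quote a table or column name, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_batches(table_names, needle, batch_size=200, column="def"):
    """Yield one (sql, params) pair per batch of tables."""
    col = quote_identifier(column)
    for start in range(0, len(table_names), batch_size):
        parts, params = [], []
        for name in table_names[start:start + batch_size]:
            # The table name is sent twice: once as a bound string literal
            # (the value returned) and once as a quoted identifier (FROM).
            parts.append(
                f"SELECT DISTINCT %s AS table_name "
                f"FROM {quote_identifier(name)} WHERE {col} = %s"
            )
            params.extend([name, needle])
        yield "\n UNION ALL\n".join(parts), tuple(params)


def find_tables(cursor, table_names, needle, batch_size=200):
    """Run each batch and collect the names of the tables that matched."""
    found = []
    for sql, params in build_batches(table_names, needle, batch_size):
        cursor.execute(sql, params)
        found.extend(row[0] for row in cursor.fetchall())
    return found
```

With 30,000 tables and batches of 200, that is 150 round trips instead of 30,000. The right batch size is something to measure rather than assume.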

You do need to ensure that every table in the list has a column named def. If a batch includes a table without that column, the whole batch fails. SHOW TABLES doesn't check column names, so I'd use a query like this to get the list of tables that have a column named def:

SELECT table_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND column_name = 'def'
 GROUP BY table_name
 ORDER BY table_name
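
From Python, that lookup could replace the SHOW TABLES call. A small sketch, assuming a DB-API cursor with %s placeholders and tuple rows; the function name tables_with_column is made up for illustration:

```python
COLUMN_LOOKUP_SQL = """
SELECT table_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND column_name = %s
 GROUP BY table_name
 ORDER BY table_name
"""


def tables_with_column(cursor, column="def"):
    """Return the names of tables in the current schema that have `column`."""
    cursor.execute(COLUMN_LOOKUP_SQL, (column,))
    # Rows are read by position, so the result does not depend on how the
    # server capitalizes the column header.
    return [row[0] for row in cursor.fetchall()]
```

This is a single round trip, and the list it returns is safe to feed into the batched UNION ALL queries because every table in it has the column.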
