How do I lock read/write on MySQL tables so that I can SELECT and then INSERT without other programs reading/writing to the database?

Time: 2022-02-11 07:14:18

I am running many instances of a webcrawler in parallel.


Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.


Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.


I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).


How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.



This code seems like a good solution (see the error below, however):


INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

but I keep getting the following mysql error:


You can't specify target table 'crawlLog' for update in FROM clause

Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:


INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
    ),
    now()
)

5 Answers

#1


40  

You can lock tables using the MySQL LOCK TABLES command like this:


LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;
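Applied to the question's schema, one detail matters: LOCK TABLES requires every table, and every alias, used between LOCK and UNLOCK to be listed in the lock statement. A sketch using the table names from the question (the extra alias lock is needed because the subquery re-reads crawlLog while the INSERT writes to it):

```sql
-- crawlLog is written, companies is read, and the subquery's alias
-- re-reads crawlLog, so all three entries must be locked.
LOCK TABLES crawlLog WRITE, crawlLog AS crawlLogAlias READ, companies READ;

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, NOW()
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog AS crawlLogAlias)
LIMIT 1;

UNLOCK TABLES;
```

Note that this blocks every other session from even reading crawlLog until UNLOCK TABLES runs, which is why the other answers steer away from table locks.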

See:


http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html


#2


4  

You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."


Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:


START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

Or something close to that.


[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:


INSERT INTO loggingtable (url)
SELECT u.url
FROM tablewithurls u
LEFT JOIN loggingtable l ON l.url = u.url
WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;

#3


3  

I wouldn't use locking, or transactions.


The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.


Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.


You would run the following query if crawler 1 wants to start crawling url 2:


INSERT INTO tbllogging (log_cra_id, log_url_id) 
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id = log_url_id
WHERE url_id=2 AND log_url_id IS NULL;

The next step is to check whether this record has been inserted.


SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1

If you get a result, then crawler 1 can crawl this url. If you don't get any results, it means another crawler inserted the same row first and is already crawling it.

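A small refinement on the check: instead of issuing the second SELECT, the session that ran the INSERT can ask MySQL how many rows that INSERT actually affected, saving a round trip. ROW_COUNT() must be read in the same session, immediately after the INSERT:

```sql
-- 1 means crawler 1 claimed url 2; 0 means another crawler got there first.
SELECT ROW_COUNT();
```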

#4


2  

Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.


BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE

# do whatever you have to do

COMMIT;

Please note that you will need an index on domainname (or whatever column you use in the WHERE-clause) for this to work, but this makes sense in general and I assume you will have that anyway.

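Mapped onto the question's tables, the whole claim might look like the sketch below (this assumes the tables are InnoDB; the @companyId user variable is an illustration in the style of answer #2, not something from the question):

```sql
BEGIN;

-- Row-lock one unclaimed company; a concurrent transaction running the
-- same SELECT ... FOR UPDATE will block until this one commits.
SELECT @companyId := id
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog)
LIMIT 1
FOR UPDATE;

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES (@companyId, NOW());

COMMIT;
```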

#5


2  

I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.


INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1
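One follow-up the claiming session usually needs: which company did it just claim? Assuming crawlLog has an AUTO_INCREMENT primary key named id (the question doesn't show the table definition, so this is an assumption), the session can read back its own row:

```sql
-- LAST_INSERT_ID() is per-session, so this is safe under concurrency.
SELECT companyId FROM crawlLog WHERE id = LAST_INSERT_ID();
```

If the INSERT affected zero rows (no unclaimed companies left), LAST_INSERT_ID() keeps its previous value, so check ROW_COUNT() first.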
