如何组合两行并计算MySQL中两个时间戳值之间的时差?

时间:2022-06-08 21:30:04

I have a situation that I'm sure is quite common and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution. I'm relatively new to MySQL (have been using MSSQL and PostgreSQL earlier) and every approach I can think of is blocked by some feature lacking in MySQL.

我有一种情况,我确信这是非常普遍的,我真的很困扰我,我无法弄清楚如何做或搜索什么来找到相关的示例/解决方案。我对MySQL比较陌生(之前一直在使用MSSQL和PostgreSQL),我能想到的每种方法都被MySQL缺乏的一些功能所阻挡。

I have a "log" table that simply lists many different events with their timestamp (stored as datetime type). There's lots of data and columns in the table not relevant to this problem, so lets say we have a simple table like this:

我有一个“日志”表,它只是列出了许多不同的事件及其时间戳(存储为日期时间类型)。表中有很多数据和列与此问题无关,所以我们假设我们有一个这样的简单表:

CREATE TABLE log (  
  id INT NOT NULL AUTO_INCREMENT,  
  name VARCHAR(16),  
  ts DATETIME NOT NULL,  
  eventtype VARCHAR(25),  
  PRIMARY KEY  (id)  
)

Let's say that some rows have an eventtype = 'start' and others have an eventtype = 'stop'. What I want to do is to somehow couple each "startrow" with each "stoprow" and find the time difference between the two (and then sum the durations per each name, but that's not where the problem lies). Each "start" event should have a corresponding "stop" event occuring at some stage later then the "start" event, but because of problems/bugs/crashed with the data collector it could be that some are missing. In that case I would like to disregard the event without a "partner". That means that given the data:

假设某些行有一个eventtype ='start'而其他行有一个eventtype ='stop'。我想要做的是以某种方式将每个“startrow”与每个“stoprow”结合起来并找到两者之间的时间差(然后将每个名称的持续时间相加,但这不是问题所在的位置)。每个“开始”事件应该在某个阶段发生相应的“停止”事件,然后是“开始”事件,但由于数据收集器出现问题/错误/崩溃,可能会丢失一些事件。在这种情况下,我想在没有“伙伴”的情况下忽视这一事件。这意味着给定数据:

foo, 2010-06-10 19:45, start  
foo, 2010-06-10 19:47, start  
foo, 2010-06-10 20:13, stop

..I would like to just disregard the 19:45 start event and not just get two result rows both using the 20:13 stop event as the stop time.

..我想忽略19:45开始事件而不是仅仅使用20:13停止事件作为停止时间得到两个结果行。

I've tried to join the table with itself in different ways, but the key problems for me seems to be to find a way to correctly identify the corresponding "stop" event to the "start" event for the given "name". The problem is exactly the same as you would have if you had table with employees stamping in and out of work and wanted to find out how much they actually were at work. I'm sure there must be well known solutions to this, but I can't seem to find them...

我试图以不同的方式加入表格,但对我来说,关键问题似乎是找到一种方法来正确识别给定“名称”的“开始”事件的相应“停止”事件。这个问题与您在员工进出工作台时的表格完全相同,并希望了解他们实际工作的程度。我敢肯定必须有一个众所周知的解决方案,但我似乎无法找到它们......

6 个解决方案

#1


6  

I believe this could be a simpler way to reach your goal:

我相信这可能是实现目标的更简单方法:

SELECT
    start_log.name,
    MAX(start_log.ts) AS start_time,
    end_log.ts AS end_time,
    TIMEDIFF(MAX(start_log.ts), end_log.ts)
FROM
    log AS start_log
INNER JOIN
    log AS end_log ON (
            start_log.name = end_log.name
        AND
            end_log.ts > start_log.ts)
WHERE start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
GROUP BY start_log.name

It should run considerably faster as it eliminates one subquery.

它应该运行得快得多,因为它消除了一个子查询。

#2


1  

If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:

如果您不介意创建临时表*,那么我认为以下内容应该可以正常工作。我用120,000条记录测试了它,整个过程在6秒内完成。它有1,048,576条记录,在不到66秒的时间内就完成了 - 而这是在一台带有128MB RAM的旧Pentium III上:

*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:

*在MySQL 5.0(可能还有其他版本)中,临时表不能是真正的MySQL临时表,因为在同一查询中不能多次引用TEMPORARY表。看这里:

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

Instead, just drop/create a normal table, as follows:

相反,只需删除/创建一个普通表,如下所示:

DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
    `id` INT NOT NULL,
    `row` INT NOT NULL,
    `name` VARCHAR(16),
    `ts` DATETIME NOT NULL,
    `eventtype` VARCHAR(25),
    INDEX `row` (`row` ASC),
    INDEX `eventtype` (`eventtype` ASC)
);

This table is used to store a sorted and numbered list of rows from the following SELECT query:

此表用于存储以下SELECT查询中的已排序和编号的行列表:

INSERT INTO `tmp_log` (
    `id`,
    `row`,
    `name`,
    `ts`,
    `eventtype`
)
SELECT
    `id`,
    @row:=@row+1,
    `name`,
    `ts`,
    `eventtype`
FROM log,
(SELECT @row:=0) row_count
ORDER BY `name`, `id`;

The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, just so long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row id of the stop event.

上面的SELECT查询按名称对行进行排序,然后是id(只要启动事件出现在stop事件之前,就可以使用timestamp而不是id)。每行也都有编号。通过这样做,匹配的事件对总是彼此相邻,并且开始事件的行号始终比stop事件的行id小1。

Now select the matching pairs from the list:

现在从列表中选择匹配对:

SELECT
    start_log.row AS start_row,
    stop_log.row AS stop_row,
    start_log.name AS name,
    start_log.eventtype AS start_event,
    start_log.ts AS start_time,
    stop_log.eventtype AS stop_event,
    stop_log.ts AS end_time,
    TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
    tmp_log AS start_log
INNER JOIN tmp_log AS stop_log
    ON start_log.row+1 = stop_log.row
    AND start_log.name = stop_log.name
    AND start_log.eventtype = 'start'
    AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;

Once you're done, it's probably a good idea to drop the temporary table:

一旦完成,删除临时表可能是个好主意:

DROP TABLE IF EXISTS `tmp_log`;row

UPDATE

UPDATE

You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.

您可以尝试以下想法,通过使用变量存储上一行的值来消除临时表和连接。它按名称对行进行排序,然后按时间戳排序,将所有值组合在一起,并按时间顺序排列每个组。我认为这应该确保所有相应的开始/停止事件彼此相邻。

SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
    SELECT
        id, ts, eventtype,
        (@name <> name) AS new_name,
        @start AS start,
        @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
        @stop  := IF(eventtype = 'stop',  ts, NULL) AS stop,
        @name  := name AS name
    FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;

I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.

我不知道它与Ivar Bonsaksen的方法相比如何,但它在我的盒子上运行得相当快。

Here's how I created the test data:

这是我创建测试数据的方式:

CREATE TABLE  `table1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(5),
    `ts` DATETIME,
    `eventtype` VARCHAR(5),
    PRIMARY KEY (`id`),
    INDEX `name` (`name`),
    INDEX `ts` (`ts`)
) ENGINE=InnoDB;

DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 1000000 DO
        INSERT INTO table1 (name, ts, eventtype) VALUES (
            CHAR(FLOOR(65 + RAND() * 26)),
            DATE_ADD(NOW(),
            INTERVAL FLOOR(RAND() * 365) DAY),
            IF(RAND() >= 0.5, 'start', 'stop')
        );
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;

CALL autofill();

#3


1  

Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do

你能改变数据收集器吗?如果是,则将group_id字段(带索引)添加到日志表中,并将start事件的id写入其中(在group_id中,start和end的id相同)。那你可以做

SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
FROM `log` S
    JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'end'
WHERE S.eventtype = 'start'

#4


1  

Try this.

尝试这个。

select start.name, start.ts start, end.ts end, timediff(end.ts, start.ts) duration from (
    select *, (
        select id from log L2 where L2.ts>L1.ts and L2.name=L1.name order by ts limit 1
    ) stop_id from log L1
) start join log end on end.id=start.stop_id
where start.eventtype='start' and end.eventtype='stop';

#5


0  

How about this:

这个怎么样:

SELECT start_log.ts AS start_time, end_log.ts AS end_time
FROM log AS start_log
INNER JOIN log AS end_log ON (start_log.name = end_log.name AND end_log.ts > start_log.ts)
WHERE NOT EXISTS (SELECT 1 FROM log WHERE log.ts > start_log.ts AND log.ts < end_log.ts)
 AND start_log.eventtype = 'start'
 AND end_log.eventtype = 'stop'

This will find each pair of rows (aliased as start_log and end_log) with no events in between, where the first is always a start and the last is always a stop. Since we disallow intermediate events, a start that's not immediately followed by a stop will naturally be excluded.

这将找到每对行(别名为start_log和end_log),其间没有任何事件,其中第一行始终是开始,最后一个始终是停止。由于我们禁止中间事件,因此自然会排除不会立即停止的开始。

#6


0  

I got it working by combining both your solutions, but the query isn't very effective and I'd think there would be a smarter way to omit those unwanted rows.

我通过结合你的解决方案来实现它,但是查询不是非常有效,我认为会有一种更聪明的方法来省略那些不需要的行。

What I've got now is:

我现在得到的是:

SELECT y.name, 
       y.start, 
       y.stop, 
       TIMEDIFF(y.stop, y.start) 
  FROM (SELECT l.name, 
               MAX(x.ts) AS start, 
               l.ts AS stop 
          FROM log l 
          JOIN (SELECT t.name, 
                       t.ts 
                  FROM log t 
                 WHERE t.eventtype = 'start') x ON x.name = l.name 
                       AND x.ts < l.ts 
         WHERE l.eventtype = 'stop' 
      GROUP BY l.name, l.ts) y 
WHERE NOT EXISTS (SELECT 1 
                    FROM log AS d 
                   WHERE d.ts > y.start AND d.ts < y.stop AND d.name = y.name 
                         AND d.eventtype = 'stop')

Limited to a given 'name' the query goes from about 0.5 seconds to about 14 seconds when I include the WHERE NOT EXISTS clause... The table will become quite large and I'm worried about how many hours this will take for all names in the end. I currently only have data for June 2010 in the table (10 days) and it's now at 109888 rows.

当我包含WHERE NOT EXISTS子句时,限制为给定的“名称”,查询从大约0.5秒到大约14秒......表格会变得非常大,我担心这个名字需要多少小时到底。我目前只有表中的2010年6月的数据(10天),现在是109888行。

#1


6  

I believe this could be a simpler way to reach your goal:

我相信这可能是实现目标的更简单方法:

SELECT
    start_log.name,
    MAX(start_log.ts) AS start_time,
    end_log.ts AS end_time,
    TIMEDIFF(MAX(start_log.ts), end_log.ts)
FROM
    log AS start_log
INNER JOIN
    log AS end_log ON (
            start_log.name = end_log.name
        AND
            end_log.ts > start_log.ts)
WHERE start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
GROUP BY start_log.name

It should run considerably faster as it eliminates one subquery.

它应该运行得快得多,因为它消除了一个子查询。

#2


1  

If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:

如果您不介意创建临时表*,那么我认为以下内容应该可以正常工作。我用120,000条记录测试了它,整个过程在6秒内完成。它有1,048,576条记录,在不到66秒的时间内就完成了 - 而这是在一台带有128MB RAM的旧Pentium III上:

*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:

*在MySQL 5.0(可能还有其他版本)中,临时表不能是真正的MySQL临时表,因为在同一查询中不能多次引用TEMPORARY表。看这里:

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

Instead, just drop/create a normal table, as follows:

相反,只需删除/创建一个普通表,如下所示:

DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
    `id` INT NOT NULL,
    `row` INT NOT NULL,
    `name` VARCHAR(16),
    `ts` DATETIME NOT NULL,
    `eventtype` VARCHAR(25),
    INDEX `row` (`row` ASC),
    INDEX `eventtype` (`eventtype` ASC)
);

This table is used to store a sorted and numbered list of rows from the following SELECT query:

此表用于存储以下SELECT查询中的已排序和编号的行列表:

INSERT INTO `tmp_log` (
    `id`,
    `row`,
    `name`,
    `ts`,
    `eventtype`
)
SELECT
    `id`,
    @row:=@row+1,
    `name`,
    `ts`,
    `eventtype`
FROM log,
(SELECT @row:=0) row_count
ORDER BY `name`, `id`;

The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, just so long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row id of the stop event.

上面的SELECT查询按名称对行进行排序,然后是id(只要启动事件出现在stop事件之前,就可以使用timestamp而不是id)。每行也都有编号。通过这样做,匹配的事件对总是彼此相邻,并且开始事件的行号始终比stop事件的行id小1。

Now select the matching pairs from the list:

现在从列表中选择匹配对:

SELECT
    start_log.row AS start_row,
    stop_log.row AS stop_row,
    start_log.name AS name,
    start_log.eventtype AS start_event,
    start_log.ts AS start_time,
    stop_log.eventtype AS stop_event,
    stop_log.ts AS end_time,
    TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
    tmp_log AS start_log
INNER JOIN tmp_log AS stop_log
    ON start_log.row+1 = stop_log.row
    AND start_log.name = stop_log.name
    AND start_log.eventtype = 'start'
    AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;

Once you're done, it's probably a good idea to drop the temporary table:

一旦完成,删除临时表可能是个好主意:

DROP TABLE IF EXISTS `tmp_log`;row

UPDATE

UPDATE

You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.

您可以尝试以下想法,通过使用变量存储上一行的值来消除临时表和连接。它按名称对行进行排序,然后按时间戳排序,将所有值组合在一起,并按时间顺序排列每个组。我认为这应该确保所有相应的开始/停止事件彼此相邻。

SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
    SELECT
        id, ts, eventtype,
        (@name <> name) AS new_name,
        @start AS start,
        @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
        @stop  := IF(eventtype = 'stop',  ts, NULL) AS stop,
        @name  := name AS name
    FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;

I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.

我不知道它与Ivar Bonsaksen的方法相比如何,但它在我的盒子上运行得相当快。

Here's how I created the test data:

这是我创建测试数据的方式:

CREATE TABLE  `table1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(5),
    `ts` DATETIME,
    `eventtype` VARCHAR(5),
    PRIMARY KEY (`id`),
    INDEX `name` (`name`),
    INDEX `ts` (`ts`)
) ENGINE=InnoDB;

DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 1000000 DO
        INSERT INTO table1 (name, ts, eventtype) VALUES (
            CHAR(FLOOR(65 + RAND() * 26)),
            DATE_ADD(NOW(),
            INTERVAL FLOOR(RAND() * 365) DAY),
            IF(RAND() >= 0.5, 'start', 'stop')
        );
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;

CALL autofill();

#3


1  

Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do

你能改变数据收集器吗?如果是,则将group_id字段(带索引)添加到日志表中,并将start事件的id写入其中(在group_id中,start和end的id相同)。那你可以做

SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
FROM `log` S
    JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'end'
WHERE S.eventtype = 'start'

#4


1  

Try this.

尝试这个。

select start.name, start.ts start, end.ts end, timediff(end.ts, start.ts) duration from (
    select *, (
        select id from log L2 where L2.ts>L1.ts and L2.name=L1.name order by ts limit 1
    ) stop_id from log L1
) start join log end on end.id=start.stop_id
where start.eventtype='start' and end.eventtype='stop';

#5


0  

How about this:

这个怎么样:

SELECT start_log.ts AS start_time, end_log.ts AS end_time
FROM log AS start_log
INNER JOIN log AS end_log ON (start_log.name = end_log.name AND end_log.ts > start_log.ts)
WHERE NOT EXISTS (SELECT 1 FROM log WHERE log.ts > start_log.ts AND log.ts < end_log.ts)
 AND start_log.eventtype = 'start'
 AND end_log.eventtype = 'stop'

This will find each pair of rows (aliased as start_log and end_log) with no events in between, where the first is always a start and the last is always a stop. Since we disallow intermediate events, a start that's not immediately followed by a stop will naturally be excluded.

这将找到每对行(别名为start_log和end_log),其间没有任何事件,其中第一行始终是开始,最后一个始终是停止。由于我们禁止中间事件,因此自然会排除不会立即停止的开始。

#6


0  

I got it working by combining both your solutions, but the query isn't very effective and I'd think there would be a smarter way to omit those unwanted rows.

我通过结合你的解决方案来实现它,但是查询不是非常有效,我认为会有一种更聪明的方法来省略那些不需要的行。

What I've got now is:

我现在得到的是:

SELECT y.name, 
       y.start, 
       y.stop, 
       TIMEDIFF(y.stop, y.start) 
  FROM (SELECT l.name, 
               MAX(x.ts) AS start, 
               l.ts AS stop 
          FROM log l 
          JOIN (SELECT t.name, 
                       t.ts 
                  FROM log t 
                 WHERE t.eventtype = 'start') x ON x.name = l.name 
                       AND x.ts < l.ts 
         WHERE l.eventtype = 'stop' 
      GROUP BY l.name, l.ts) y 
WHERE NOT EXISTS (SELECT 1 
                    FROM log AS d 
                   WHERE d.ts > y.start AND d.ts < y.stop AND d.name = y.name 
                         AND d.eventtype = 'stop')

Limited to a given 'name' the query goes from about 0.5 seconds to about 14 seconds when I include the WHERE NOT EXISTS clause... The table will become quite large and I'm worried about how many hours this will take for all names in the end. I currently only have data for June 2010 in the table (10 days) and it's now at 109888 rows.

当我包含WHERE NOT EXISTS子句时,限制为给定的“名称”,查询从大约0.5秒到大约14秒......表格会变得非常大,我担心这个名字需要多少小时到底。我目前只有表中的2010年6月的数据(10天),现在是109888行。