Correlating users with one another in a SQL query

Time: 2022-05-21 02:00:51

I am trying to correlate users with one another and assign a common ID for web site visitors.

I have rows (call it table a) with the columns a.UUID, a.seen_time, a.ip_address, a.user_id, a.subdomain, and I am trying to come up with an a.matched_id such that if a row's IP address was seen within +/- 4hrs of the previous row for that IP (i.e. continuously), a single matched_id is assigned to those rows.

Note that for my purposes, the same IP on 2 different subdomains is NOT necessarily the same match, unless the rows have the same user ID.

Here is the basic process I would follow in a regular programming language (however I need to construct SQL):

  • Get the necessary rows of table a
  • For each row, if any other row has a matching user_id (subdomain doesn't matter), assign them the same matched_id (all else being equal, let's use MIN(uuid))
  • Partition into subdomain sets.

    For each of those subdomain partitions:

    • Now partition into buckets of IP addresses where each row is < 4hrs from the seen_time before(/after) it (i.e. on a row-by-row basis)

      For each of those IP address partitions:

      • If any 1 item has a matched_id already, assign that to all. Otherwise, assign a new matched_id to all (using MIN(uuid)). Continue.
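
For reference, the procedure in the list above can be sketched in plain Python. This is only an illustration under some assumptions: the function name assign_matched_ids and the dict-based row layout are made up here, a bucket that already contains several different matched_ids simply takes the smallest one, and the sketch makes a single pass, so it does not merge groups transitively.

```python
from collections import defaultdict
from datetime import datetime, timedelta

GAP = timedelta(hours=4)  # rows closer together than this are chained


def assign_matched_ids(rows):
    """rows: list of dicts with uuid, seen_time, ip_address, user_id, subdomain.

    Sets a 'matched_id' key on every row in place and returns the list.
    (Hypothetical helper, not part of any library.)
    """
    # Step 1: rows sharing a user_id get MIN(uuid) for that user,
    # regardless of subdomain.
    min_uuid_by_user = {}
    for r in rows:
        user = r["user_id"]
        if user is not None:
            current = min_uuid_by_user.get(user)
            min_uuid_by_user[user] = r["uuid"] if current is None else min(current, r["uuid"])
    for r in rows:
        r["matched_id"] = min_uuid_by_user.get(r["user_id"])

    # Step 2: partition by subdomain, then by IP address.
    partitions = defaultdict(list)
    for r in rows:
        partitions[(r["subdomain"], r["ip_address"])].append(r)

    for part in partitions.values():
        part.sort(key=lambda r: r["seen_time"])
        # Step 3: start a new bucket whenever the gap to the previous
        # row (in time order) is 4 hours or more.
        buckets = [[part[0]]]
        for prev, cur in zip(part, part[1:]):
            if cur["seen_time"] - prev["seen_time"] < GAP:
                buckets[-1].append(cur)
            else:
                buckets.append([cur])
        # Step 4: reuse an existing matched_id if the bucket has one
        # (assumption: the smallest wins), otherwise use MIN(uuid).
        for bucket in buckets:
            known = [r["matched_id"] for r in bucket if r["matched_id"] is not None]
            new_id = min(known) if known else min(r["uuid"] for r in bucket)
            for r in bucket:
                r["matched_id"] = new_id
    return rows
```

Sorting each partition by seen_time first is what makes the 4-hour test "row by row": each row is only compared with the row immediately before it.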

I am using Amazon Redshift which is more or less queried the same as Postgres but with a few more limitations (if interested, see unsupported features and unsupported functions): Postgres/ANSI SQL answers accepted.

How can I construct this query in an efficient fashion?

What is the basic SQL process I must follow?

Thanks


-- UPDATE --

I have made the following progress:

  • I don't know how efficient it is
  • I used discovery_time instead of seen_time as referenced above, and the table name mydata instead of a, although it's sometimes aliased as a and b
  • It uses an MD5 instead of MIN(UUID) since I believe getting that info would require another query - anyway, it doesn't matter too much
  • Key problem: it does not measure the +/- 4 hrs 'from the last row'; instead it treats the 4 hrs as an absolute window

Code:

--UPDATE mydata m SET matched_id = NULL; --for testing

WITH cte1 AS (
    --start with the max discovery time and go down from there
    --select the matched id if one already exists
    SELECT m.ip, m.subdomain, MAX(m.discovery_time) AS max_discovery_time, 
        CASE WHEN MIN(m.user_id) IS NOT NULL THEN MD5(MIN(m.user_id)) 
        ELSE MIN(m.matched_id) END AS known_matched_id
    FROM mydata m
    GROUP BY m.ip, m.subdomain

    ), cte2 AS (

    SELECT m.uuid, CASE WHEN c.known_matched_id IS NOT NULL THEN c.known_matched_id 
        ELSE MD5(CONCAT(c.ip, c.subdomain, c.max_discovery_time)) END AS matched_id
    FROM mydata m 
    --IP on different subdomains are not necessarily the same match
    RIGHT OUTER JOIN cte1 c ON CONCAT(c.ip, c.subdomain) = CONCAT(m.ip, m.subdomain) 
    WHERE m.discovery_time >= (c.max_discovery_time - INTERVAL '4 hours')
    --Does not work 'row by row' instead in terms of absolutes - need to make this recursive somehow,
    --but Redshift does not support recursive CTEs or user-defined functions
)

UPDATE mydata m
SET matched_id = c.matched_id
FROM cte2 c
WHERE c.uuid = m.uuid;

--view result for an example IP
SELECT m.discovery_time, m.ip, m.matched_id, m.uuid 
FROM mydata m
WHERE m.ip = '12.34.56.78'
ORDER BY m.ip, m.discovery_time;

And in case you want to test, the following create script should do:

CREATE TABLE mydata
(
  ip character varying(255),
  subdomain character varying(255),
  matched_id character varying(255),
  user_id character varying(255),
  uuid character varying(255) NOT NULL,
  discovery_time timestamp without time zone,
  CONSTRAINT pk_mydata PRIMARY KEY (uuid)
);

-- should all get the same matched_id in result, except the 1st
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '222b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-14 00:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '333b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-16 22:22:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '379b641b-9782-11e3-9304-127b2ab15ea7', '2014-02-17 03:18:48');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, 'ac0f6416-977e-11e3-9304-127b2ab15ea7', '2014-02-17 02:53:25');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '11fb5991-9780-11e3-9304-127b2ab15ea7', '2014-02-17 03:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '849d8d61-9781-11e3-9304-127b2ab15ea7', '2014-02-17 03:13:48');


The expected output would then be for all those rows to be assigned the same matched_id, except for the first one (in the INSERT lines), since its time is far more than 4hrs from the next most recently seen time (nor does it have a user_id to match any others).
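
A common way to get row-by-row grouping without recursion is the "gaps and islands" pattern with window functions: flag each row whose gap to the previous row is too large, then take a running sum of those flags to number the runs. The sketch below is a hedged illustration, not a tested Redshift query: it runs on SQLite through Python purely so that it is self-contained, it uses a reduced version of mydata, it ignores the user_id step, and the names ISLANDS_SQL and matched_ids are made up. On Redshift or Postgres the date arithmetic would presumably be written as prev_time + INTERVAL '4 hours' instead of SQLite's datetime().

```python
import sqlite3

# Gaps and islands: find the previous time per (ip, subdomain), flag rows
# that start a new run, number the runs with a running sum of the flags,
# then label every run with MIN(uuid).
ISLANDS_SQL = """
WITH ordered AS (
    SELECT uuid, ip, subdomain, discovery_time,
           LAG(discovery_time) OVER (PARTITION BY ip, subdomain
                                     ORDER BY discovery_time) AS prev_time
    FROM mydata
), flagged AS (
    SELECT uuid, ip, subdomain, discovery_time,
           -- SQLite-specific date arithmetic; prev_time is NULL on the
           -- first row of a partition, which falls through to ELSE 1
           CASE WHEN discovery_time < datetime(prev_time, '+4 hours')
                THEN 0 ELSE 1 END AS starts_new_run
    FROM ordered
), numbered AS (
    SELECT uuid, ip, subdomain, discovery_time,
           SUM(starts_new_run) OVER (
               PARTITION BY ip, subdomain
               ORDER BY discovery_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS run_no
    FROM flagged
)
SELECT uuid,
       MIN(uuid) OVER (PARTITION BY ip, subdomain, run_no) AS matched_id
FROM numbered
ORDER BY discovery_time
"""


def matched_ids(rows):
    """rows: iterable of (uuid, ip, subdomain, discovery_time) tuples,
    with discovery_time as 'YYYY-MM-DD HH:MM:SS' text.
    Returns a dict mapping uuid -> matched_id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mydata (uuid TEXT PRIMARY KEY, ip TEXT, "
        "subdomain TEXT, discovery_time TEXT)"
    )
    con.executemany("INSERT INTO mydata VALUES (?, ?, ?, ?)", rows)
    result = dict(con.execute(ISLANDS_SQL).fetchall())
    con.close()
    return result
```

Because each row is compared only with its immediate predecessor, a chain of rows that are each under 4 hours apart ends up in one run even when the first and last rows are much further apart than 4 hours.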


-- UPDATE 2 --

  • Still not much luck with the continuous row-by-row results. This version does seem to work that way if run repeatedly, though
  • Interested in making it efficient
  • New columns min_time and max_time denote the min and max times in a 4hr set
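
One plausible reason a version like this only converges when run repeatedly: if each pass lets a row adopt an ID from rows within 4 hours of it, an ID can travel only one "hop" along a chain per pass. The Python sketch below illustrates that idea with made-up names (propagate_once, propagate_until_stable) and a dict-based row layout; it is a model of the behaviour, not a translation of the SQL in this update.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)


def propagate_once(rows):
    """One pass: every row takes the smallest label found among rows with
    the same ip and subdomain that lie within 4 hours of it (itself
    included). Labels are read from a snapshot taken before the pass.
    Returns True if any label changed."""
    snapshot = [
        (r["ip"], r["subdomain"], r["discovery_time"], r["matched_id"])
        for r in rows
    ]
    changed = False
    for r in rows:
        candidates = [
            label
            for ip, sub, seen, label in snapshot
            if ip == r["ip"]
            and sub == r["subdomain"]
            and abs(seen - r["discovery_time"]) <= WINDOW
        ]
        best = min(candidates)  # never empty: the row matches itself
        if best != r["matched_id"]:
            r["matched_id"] = best
            changed = True
    return changed


def propagate_until_stable(rows):
    """Repeat passes until nothing changes; returns the number of passes
    that changed at least one label."""
    for r in rows:
        if r.get("matched_id") is None:
            r["matched_id"] = r["uuid"]  # every row starts in its own group
    passes = 0
    while propagate_once(rows):
        passes += 1
    return passes
```

For a chain of rows spaced 3 hours apart, the smallest uuid reaches the far end only after several passes, which matches the "works if run repeatedly" observation and suggests the cost grows with chain length.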

Code:

-- Set user IDs that are the same 
UPDATE mydata AS m SET matched_id = matching.new_matched_id
FROM (
    SELECT a.user_id, MIN(a.uuid) AS new_matched_id FROM mydata a
    WHERE a.user_id IS NOT NULL
    GROUP BY a.user_id
) AS matching
WHERE m.matched_id IS NULL
AND m.user_id IS NOT NULL
AND matching.user_id = m.user_id;


-- Find rows +/- 4hrs of each other 
-- 1. Set min and max times for a 4hr set --
UPDATE mydata my SET min_time = matching.min_dist, max_time = matching.max_dist, matched_id = new_matched_id
FROM (
    -- mintime is approx
    SELECT a.uuid, MIN(b.matched_id) AS new_matched_id, max(COALESCE(b.min_time, b.discovery_time)) - interval '4 hour' AS min_dist, max(COALESCE(b.max_time, b.discovery_time)) + interval '4 hour' AS max_dist
    FROM mydata a
    JOIN mydata b
    ON (a.ip = b.ip AND a.subdomain = b.subdomain)
    GROUP BY a.uuid
    HAVING ABS(EXTRACT(EPOCH FROM max(COALESCE(a.min_time, b.discovery_time)) - a.discovery_time)/3600) <= 4
) matching
WHERE matching.uuid = my.uuid
AND min_time IS NULL;

-- 2. Set the matched id of all the +/- 4hr records --
UPDATE mydata m SET matched_id = new_matched_id, min_time = matching.min_time, max_time = matching.max_time
FROM (
    SELECT a.uuid, MAX(b.min_time) AS min_time, MAX(b.max_time) AS max_time, COALESCE(a.matched_id, MIN(b.uuid)) AS new_matched_id FROM mydata a
    INNER JOIN mydata b
    ON a.ip = b.ip AND a.subdomain = b.subdomain
    WHERE a.discovery_time >= b.min_time
    AND a.discovery_time <= b.max_time
    GROUP BY a.uuid
) matching
WHERE matching.uuid = m.uuid;

2 Solutions

#1


3  

I'm not sure I understand the question, but it seems from "whereby if the row IP address is +/- 4hrs of the last" that you need the "last" time for each IP address (or IP + UUID, not sure). You can get that from

select ip_address, max(seen_time) from a group by ip_address

You could make a virtual table out of that or use a correlated subquery, see below.

I'm not a Postgres user, but there's surely a function that measures hours. As a rough sketch,

select * from a as o
where exists (
    select 1 from a as i
    where i.ip_address = o.ip_address
    and   i.UUID = o.UUID
    group by i.ip_address, i.UUID
    -- compare elapsed time rather than hour-of-day values
    having max(i.seen_time) - o.seen_time < interval '4 hours'
)
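
One caveat on "a function that measures hours": what the 4-hour test needs is the elapsed time between two timestamps, not the difference between their hour-of-day fields. A small Python illustration (the function names are hypothetical, and the two timestamps are taken from the question's sample rows):

```python
from datetime import datetime


def hour_of_day_diff(later, earlier):
    # Difference between the clock-hour fields only; ignores the date.
    return later.hour - earlier.hour


def elapsed_hours(later, earlier):
    # Actual elapsed time between the two timestamps, in hours.
    return (later - earlier).total_seconds() / 3600


earlier = datetime(2014, 2, 16, 22, 22, 26)
later = datetime(2014, 2, 17, 2, 53, 25)

print(hour_of_day_diff(later, earlier))  # negative, since 2 < 22
print(elapsed_hours(later, earlier))     # a little over 4.5
```

In Postgres-style SQL the elapsed form would typically be an interval comparison or EXTRACT(EPOCH FROM ...) / 3600, as the question's own second update already does.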

HTH.

#2


2  

I suggest:

Add working columns to your table a: id_1, id_2, min_time, max_time

Update id_1 to be the min(uuid) for any records with the same user_id. Something like this:

 -- match any records with a user_id
 -- (the target table is not repeated in FROM, which in Postgres
 --  would add an unintended self-join)
 update a
 set id_1 = x.uuid
 from (
        select min(uuid) as uuid, user_id
        from a where user_id is not null group by user_id ) as x
 where a.user_id = x.user_id

Update columns min_time and max_time to be seen_time minus/plus 4 hours. You could do all this in the next query, but in case you're re-using these values later it'll be more efficient to calculate them only once.

update a 
set min_time = seen_time - interval '4 hour'
,   max_time = seen_time + interval '4 hour'

Join a onto itself, matching records by ip and subdomain, where a.seen_time is within 4 hours of the other record, e.g.:

update a 
set id_2 = other_uuid
from ( 

    -- join a onto all matching records by ip and subdomain
    -- where a.seen_time within 4 hours of the other record.
    select a.uuid, min(other.uuid) as other_uuid 
    from a 
    inner join a AS other
    on a.ip_address = other.ip_address
    and a.subdomain = other.subdomain
    and a.uuid <> other.uuid
    where a.seen_time > other.min_time
    and a.seen_time < other.max_time
    group by a.uuid
) AS matching 
where a.uuid = matching.uuid
-- no need to match ones already matched on userid
and id_1 is null

Now id_1 and id_2 combined are what you're looking for.
