MySQL update using a self-referencing query

Date: 2022-01-18 00:29:07

I have a table of surveys which contains (amongst others) the following columns:

survey_id  - unique id
user_id    - the id of the person the survey relates to
created    - datetime
ip_address - of the submission
ip_count   - the number of duplicates

Because the record set is large, it's impractical to run this query on the fly, so I am trying to create an UPDATE statement that will periodically store a "cached" result in ip_count.

The purpose of ip_count is to show how many other survey submissions have been received from the same ip_address for the same user_id within a 12-month period (+/- 6 months of the created date).

Using the following dataset, this is the expected result:

survey_id   user_id    created    ip_address     ip_count  #counted duplicates survey_id
  1            1      01-Jan-12   123.132.123       1      # 2
  2            1      01-Apr-12   123.132.123       2      # 1, 3
  3            2      01-Jul-12   123.132.123       0      # 
  4            1      01-Aug-12   123.132.123       3      # 2, 6
  6            1      01-Dec-12   123.132.123       1      # 4
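
To experiment with the sample rows above, a minimal fixture might look like the following. The post does not show the schema, so the column types, the `state` default, and the reading of the dates as 2012 ISO dates are all assumptions. The SELECT is only a read-only restatement of the +/- 6 month rule (the `state` filter is left out), not the cached update being asked for.

```sql
# Hypothetical schema: column types and defaults are assumptions.
CREATE TABLE surveys (
  survey_id  INT         NOT NULL PRIMARY KEY,
  user_id    INT         NOT NULL,
  created    DATETIME    NOT NULL,
  ip_address VARCHAR(45) NULL,
  state      TINYINT     NOT NULL DEFAULT 1,  # assumed: 1 = completed
  ip_count   INT         NOT NULL DEFAULT 0
);

INSERT INTO surveys (survey_id, user_id, created, ip_address) VALUES
  (1, 1, '2012-01-01', '123.132.123'),
  (2, 1, '2012-04-01', '123.132.123'),
  (3, 2, '2012-07-01', '123.132.123'),
  (4, 1, '2012-08-01', '123.132.123'),
  (6, 1, '2012-12-01', '123.132.123');

# Read-only check: count the other rows with the same user_id and ip_address
# whose created date lies within 6 months either side of this row's.
SELECT s.survey_id,
       (SELECT COUNT(*)
          FROM surveys AS d
         WHERE d.survey_id <> s.survey_id
           AND d.user_id    = s.user_id
           AND d.ip_address = s.ip_address
           AND d.created BETWEEN s.created - INTERVAL 6 MONTH
                             AND s.created + INTERVAL 6 MONTH) AS expected_ip_count
FROM surveys AS s
ORDER BY s.survey_id;
```

With these rows the rule gives 1, 2, 0, 2, 1 for surveys 1, 2, 3, 4 and 6. Survey 4 comes out as 2 (surveys 2 and 6), which agrees with the two ids listed beside it but not with the ip_count of 3 shown in that row.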

This is the closest solution I have come up with so far, but the query fails to take the date restriction into account and I am struggling to come up with an alternative method.

UPDATE surveys
JOIN(
  SELECT ip_address, created, user_id, COUNT(*) AS total
  FROM surveys  
  WHERE surveys.state IN (1, 3) # survey is marked as completed and confirmed
  GROUP BY ip_address, user_id
) AS ipCount 
  ON (
    ipCount.ip_address = surveys.ip_address
    AND ipCount.user_id = surveys.user_id
    AND ipCount.created BETWEEN (surveys.created - INTERVAL 6 MONTH) AND (surveys.created + INTERVAL 6 MONTH)
  )
SET surveys.ip_count = ipCount.total - 1 # minus 1 as this query will match on its own id.
WHERE surveys.ip_address IS NOT NULL # ignore surveys where we have no ip_address

Thank you for your help in advance :)

2 solutions

#1

I don't have your table with me, so it's hard for me to write SQL that definitely works, but I can take a shot at this and hopefully help you.

First I would take the Cartesian product of surveys against itself and filter out the rows I don't want:

SELECT s1.survey_id AS x, s2.survey_id AS y
FROM surveys s1, surveys s2
WHERE s1.survey_id != s2.survey_id
  AND s1.user_id = s2.user_id # same user, as the question requires
  AND s1.ip_address = s2.ip_address
  AND s2.created BETWEEN s1.created - INTERVAL 6 MONTH AND s1.created + INTERVAL 6 MONTH

The output of this should contain every pair of surveys that match (according to your rules) twice: once with each id in the first position and once with it in the second.

Then we can do a GROUP BY on this output to get a table that gives the correct ip_count for each survey_id:

SELECT x, COUNT(*) AS c
FROM (
  SELECT s1.survey_id AS x, s2.survey_id AS y
  FROM surveys s1, surveys s2
  WHERE s1.survey_id != s2.survey_id
    AND s1.user_id = s2.user_id
    AND s1.ip_address = s2.ip_address
    AND s2.created BETWEEN s1.created - INTERVAL 6 MONTH AND s1.created + INTERVAL 6 MONTH
) AS pairs # MySQL requires an alias on a derived table
GROUP BY x

So now we have a table mapping each survey_id to its correct ip_count. To update the original table, we need to join surveys against this result and copy the values over.

So that should look something like:

UPDATE surveys s INNER JOIN (ABOVE QUERY) n ON s.survey_id = n.x SET s.ip_count = n.c

There is some pseudocode in there, but I think the general idea should work.

I have never had to update a table based on the output of another query before. I tried to guess the right syntax from this question: How do I UPDATE from a SELECT in SQL Server?

Also, if I needed to do something like this for my own work, I wouldn't attempt it in a single query. That would be a pain to maintain and might have memory/performance issues. It would be best to have a script traverse the table row by row, updating a single row in a transaction before moving on to the next. Much slower, but simpler to understand and possibly lighter on your database.
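
As a rough illustration of the row-by-row idea, the statement a script would issue for each survey might look like the sketch below. The user variable `@survey_id` is hypothetical (the script would set it for each id in turn), the column names follow the question, and the `state` filter is omitted. Because COUNT(*) without GROUP BY always returns one row, a survey with no duplicates is set to 0 rather than skipped.

```sql
SET @survey_id := 4; # hypothetical: the script would loop over the ids

START TRANSACTION;

UPDATE surveys AS s
CROSS JOIN (
  # One row holding the number of matching surveys for the chosen id
  SELECT COUNT(*) AS c
  FROM surveys AS s1
  INNER JOIN surveys AS s2
     ON s2.survey_id <> s1.survey_id
    AND s2.user_id    = s1.user_id
    AND s2.ip_address = s1.ip_address
    AND s2.created BETWEEN s1.created - INTERVAL 6 MONTH
                       AND s1.created + INTERVAL 6 MONTH
  WHERE s1.survey_id = @survey_id
) AS n
SET s.ip_count = n.c
WHERE s.survey_id = @survey_id;

COMMIT;
```

The count is computed in an aggregated derived table rather than a subquery in the SET clause, since MySQL does not allow an UPDATE to select directly from its own target table in a subquery.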

#2

A few (very) minor tweaks to what is shown above. Thank you again!

UPDATE surveys AS s
INNER JOIN (
  SELECT x, count(*) c
  FROM (
    SELECT s1.id AS x, s2.id AS y
    FROM surveys AS s1, surveys AS s2
    WHERE s1.state IN (1, 3) # completed and verified
      AND s1.id != s2.id # don't pair a row with itself
      AND s1.ip_address != "" AND s1.ip_address IS NOT NULL # not interested in blank entries
      AND s1.ip_address = s2.ip_address
      AND (s2.created BETWEEN (s1.created - INTERVAL 6 MONTH) AND (s1.created + INTERVAL 6 MONTH))
      AND s1.user_id = s2.user_id # where completed for the same user
  ) AS ipCount
  GROUP BY x
) n on s.id = n.x
SET s.ip_count = n.c
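
One thing to watch with the INNER JOIN form: a survey that currently has no duplicates produces no row in the derived table, so it is not updated and keeps whatever ip_count it had before. A possible variant, sketched with the same column names as the query above and not run against the real table, uses LEFT JOIN with COALESCE so that those rows are reset to 0.

```sql
UPDATE surveys AS s
LEFT JOIN (
  SELECT s1.id AS x, COUNT(*) AS c
  FROM surveys AS s1
  INNER JOIN surveys AS s2
     ON s2.id <> s1.id
    AND s2.user_id    = s1.user_id
    AND s2.ip_address = s1.ip_address # also excludes NULL addresses
    AND s2.created BETWEEN s1.created - INTERVAL 6 MONTH
                       AND s1.created + INTERVAL 6 MONTH
  WHERE s1.state IN (1, 3) # completed and verified
    AND s1.ip_address <> ''
  GROUP BY s1.id
) AS n ON n.x = s.id
SET s.ip_count = COALESCE(n.c, 0); # no match in n means no duplicates
```

Note that this touches every row in surveys, including those with a blank ip_address or another state, which all end up with ip_count = 0.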
