Which layout for user data and a change log is the most efficient and consumes the least storage?

Time: 2022-09-23 17:44:08

My users can update their information, which is saved in a defined number of columns in a table, such as: user ( id INT, email VARCHAR, phone VARCHAR, address VARCHAR ), for example.

I have seen other implementations, like the one in WordPress, which stores this information for its users in a table called usermeta with the layout ( umeta_id INT, user_id INT, meta_key VARCHAR, meta_value VARCHAR ).

For the change log I want to implement, I am deciding between a solution like that and (what I think will be better) a layout like: userLog ( id INT, date TIMESTAMP, email VARCHAR, phone VARCHAR, address VARCHAR ).
That way I can keep a history of all the information any user had at a given date. Rows would record only the changes, with NULL in the unaltered columns.

First question: Is there any advantage to the usermeta kind of layout other than being able to create a new type of information just by inserting an appropriate meta_key?
I sometimes think this layout may not be appropriate if performance matters in my environment, since I would be using a VARCHAR for every kind of data I want to store.

Second question: Can storage and select/insert efficiency really differ between the two solutions I am considering?
Which solution should be more or less space-consuming, and more or less efficient for selects and inserts, than the other, and why?

3 Answers

#1


3  

Some thoughts, if not necessarily an answer:

Clearly a change log is a must-have for you, so the original structure with a single row per user is not a solution for you. So we're talking about the choice between:

  1. A single row per version of each user's entire information-set; or
  2. A single row per version of each item of a user's information

Solution 1 corresponds to your

userLog ( id INT, date TIMESTAMP, email VARCHAR, phone VARCHAR, address VARCHAR )

Solution 2 corresponds to the WordPress one:

umeta_id INT, user_id INT, meta_key VARCHAR, meta_value VARCHAR
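
To make the two candidates concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration under assumptions: the column types, the table name usermetaLog, and the DateChanged column added to the WordPress-style layout are hypothetical, and userID and DateChanged follow the names this answer uses in its query rather than the id and date of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Solution 1: one row per version of a user's whole information set
CREATE TABLE userLog (
    userID      INTEGER NOT NULL,
    DateChanged TEXT    NOT NULL,
    email       TEXT,
    phone       TEXT,
    address     TEXT
);
-- Solution 2: one row per version of a single item of information.
-- DateChanged is an assumed addition so the table can act as a change log.
CREATE TABLE usermetaLog (
    umeta_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    DateChanged TEXT    NOT NULL,
    meta_key    TEXT    NOT NULL,
    meta_value  TEXT
);
""")

# The same event (a user fills in all three fields) in both layouts.
conn.execute(
    "INSERT INTO userLog VALUES (1, '2022-01-01', 'a@example.com', '555-0100', '1 Main St')"
)
conn.executemany(
    "INSERT INTO usermetaLog (user_id, DateChanged, meta_key, meta_value) "
    "VALUES (1, '2022-01-01', ?, ?)",
    [("email", "a@example.com"), ("phone", "555-0100"), ("address", "1 Main St")],
)

rows_solution1 = conn.execute("SELECT COUNT(*) FROM userLog").fetchone()[0]
rows_solution2 = conn.execute("SELECT COUNT(*) FROM usermetaLog").fetchone()[0]
print(rows_solution1, rows_solution2)
```

One event becomes one row in the first layout and one row per field in the second, which is the trade-off the rest of this answer weighs.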

Your question 1: I can't see any advantage to Solution2 except that, if you subsequently decide you want to capture users' (for example) Website URL or (for example) favourite colour as well, you can do that by adding a meta_key. But you could equally easily do this under Solution1, by simply doing an

ALTER TABLE userlog ADD COLUMN WebSiteURL(etc)

That's not hard to do. Unless the DBAs in your shop are unusually Dobermann-like ( ;) ). Because you're holding a change-log, all existing users (at the time of the change) will now have a blank WebsiteURL column; but that's exactly what you want: you don't know their WebsiteURL, because the system didn't capture it before. Sure, the new column will have to be NULLABLE - but that may be unavoidable anyway, even with the "initial" data, unless the method you're using to capture user info insists on email, phone and address as required columns.
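
As a rough illustration of that point, the sketch below adds a nullable column to an existing log table in SQLite (via Python's sqlite3 module). The TEXT types and the sample values are assumptions; the exact ALTER TABLE syntax and its cost vary between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE userLog (userID INTEGER, DateChanged TEXT, "
    "email TEXT, phone TEXT, address TEXT)"
)
conn.execute(
    "INSERT INTO userLog VALUES (1, '2022-01-01', 'a@example.com', '555-0100', '1 Main St')"
)

# Add the new, nullable column; rows logged before this point get NULL.
conn.execute("ALTER TABLE userLog ADD COLUMN WebsiteURL TEXT")

# Versions logged from now on can carry the new value.
conn.execute(
    "INSERT INTO userLog VALUES "
    "(1, '2022-06-01', 'a@example.com', '555-0100', '1 Main St', 'https://example.com')"
)

history = conn.execute(
    "SELECT DateChanged, WebsiteURL FROM userLog ORDER BY DateChanged"
).fetchall()
print(history)
```

The older version keeps NULL in WebsiteURL, which matches what was actually known about the user at that time.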

To me, the disadvantages of the meta_key solution outweigh the advantages. The disadvantages are:

  • You have to develop a piece of pivot code to pivot user info for one user onto one row, and you must call this code everywhere you want user info on one row. In contrast, Solution1 only requires

    SELECT u.userID, [all user info] FROM userLog u INNER JOIN (SELECT userID, MAX(DateChanged) AS LatestDateChanged FROM userLog GROUP BY userID) a ON u.userID = a.userID AND u.DateChanged = a.LatestDateChanged

    which is far more efficient than a pivot. With an index on UserID,DateChanged, this'll run like the wind.

  • Unless you really want to hold meta_key values multiple times in the userinfo table (Email, Email, Email, Email, Email), you'd need an extra Meta_Key_Lookup table.
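
The "latest row per user" query from the first bullet can be tried out as follows. This is a sketch run against SQLite through Python's sqlite3 module; the sample rows, the index name and the ISO-formatted text dates are assumptions, and the query is written out with explicit table aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userLog (
    userID      INTEGER NOT NULL,
    DateChanged TEXT    NOT NULL,   -- ISO dates sort correctly as text
    email TEXT, phone TEXT, address TEXT
);
-- The index the answer suggests for this query
CREATE INDEX idx_userLog_user_date ON userLog (userID, DateChanged);

INSERT INTO userLog VALUES
    (1, '2022-01-01', 'a@example.com', '555-0100', '1 Main St'),
    (1, '2022-03-15', 'a@example.com', '555-0199', '1 Main St'),
    (2, '2022-02-10', 'b@example.com', '555-0200', '2 Oak Ave');
""")

# Join each user's rows to that user's most recent DateChanged.
latest = conn.execute("""
    SELECT u.userID, u.email, u.phone, u.address
    FROM userLog AS u
    INNER JOIN (
        SELECT userID, MAX(DateChanged) AS LatestDateChanged
        FROM userLog
        GROUP BY userID
    ) AS a
      ON u.userID = a.userID AND u.DateChanged = a.LatestDateChanged
    ORDER BY u.userID
""").fetchall()
print(latest)
```

Because every version holds the full row, the current state of a user is simply the newest row; no per-column reconstruction is needed.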

Second question: For ultimate space-efficiency, yes, the meta_key Solution2 is the best. Especially if you don't use VARCHAR metakeys, but metakey ID values, and have a separate meta_key lookup table (e.g. 1=Email, 2=Phone etc). But I don't think this is a conclusive argument for the meta_key Solution2, given the virtually-zero price of storage, and the difficulties involved in this solution.

(A note/thought: IMHO your idea of holding NULL values in your Solution1 where the value has not changed is a wrong road. The code to get the most recent email, then phone, then address (separately) for each user will be a nightmare: almost as hard to code and test, and for the server to run, as the pivot required by the other solution. And the reduction in storage is marginal. Just hold the entire row every time one thing changes. Unless you're just giving examples, and the real user info-set is 50 columns wide...)
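
To show what that reconstruction might look like, here is a sketch of the NULL-for-unchanged variant in SQLite via Python's sqlite3 module. The helper latest_non_null and the sample data are hypothetical; one correlated subquery per column is only one of several ways to write this.

```python
import sqlite3

COLUMNS = ("email", "phone", "address")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userLog (
    userID INTEGER NOT NULL, DateChanged TEXT NOT NULL,
    email TEXT, phone TEXT, address TEXT
);
-- Only the first row is complete; later rows hold NULL for unchanged fields.
INSERT INTO userLog VALUES
    (1, '2022-01-01', 'a@example.com',   '555-0100', '1 Main St'),
    (1, '2022-03-15', NULL,              '555-0199', NULL),
    (1, '2022-05-20', 'new@example.com', NULL,       NULL),
    (2, '2022-02-10', 'b@example.com',   '555-0200', '2 Oak Ave');
""")

def latest_non_null(column):
    # Build one correlated subquery; column names come from a fixed whitelist.
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column}")
    return (
        f"(SELECT h.{column} FROM userLog AS h "
        f"WHERE h.userID = u.userID AND h.{column} IS NOT NULL "
        f"ORDER BY h.DateChanged DESC LIMIT 1) AS {column}"
    )

# One subquery per column, each scanning that user's history separately.
sql = (
    "SELECT u.userID, "
    + ", ".join(latest_non_null(c) for c in COLUMNS)
    + " FROM (SELECT DISTINCT userID FROM userLog) AS u ORDER BY u.userID"
)
current = conn.execute(sql).fetchall()
print(current)
```

The query grows with every column, and a NULL can no longer tell "unchanged" apart from "the user cleared this field", which is another reason to store the full row each time.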

IMHO the storage issue is not decisive. So let's turn to SELECT/INSERT efficiency:

On this issue, I think Solution1 still wins. On INSERTs, Solution1 wins: only one row is inserted, even if the user changes every field in their info. On SELECTs, Solution1 wins again: you only need a view of the most recent info per user (code above), which is the kind of thing SQL is optimised for. In contrast, Solution2 would require a pivot: something SQL is not good at.
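
For comparison, this is roughly what the pivot for Solution2 could look like. It is a sketch in SQLite through Python's sqlite3 module, using conditional aggregation (MAX over CASE) as the pivot technique; the table name usermetaLog, its DateChanged column and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usermetaLog (
    user_id INTEGER NOT NULL, DateChanged TEXT NOT NULL,
    meta_key TEXT NOT NULL, meta_value TEXT
);
INSERT INTO usermetaLog VALUES
    (1, '2022-01-01', 'email',   'a@example.com'),
    (1, '2022-01-01', 'phone',   '555-0100'),
    (1, '2022-01-01', 'address', '1 Main St'),
    (1, '2022-03-15', 'phone',   '555-0199');
""")

# Step 1 (subquery a): find the newest version of every (user, key) pair.
# Step 2 (outer query): pivot those rows into one row per user.
current = conn.execute("""
    SELECT m.user_id,
           MAX(CASE WHEN m.meta_key = 'email'   THEN m.meta_value END) AS email,
           MAX(CASE WHEN m.meta_key = 'phone'   THEN m.meta_value END) AS phone,
           MAX(CASE WHEN m.meta_key = 'address' THEN m.meta_value END) AS address
    FROM usermetaLog AS m
    INNER JOIN (
        SELECT user_id, meta_key, MAX(DateChanged) AS LatestDateChanged
        FROM usermetaLog
        GROUP BY user_id, meta_key
    ) AS a
      ON m.user_id = a.user_id
     AND m.meta_key = a.meta_key
     AND m.DateChanged = a.LatestDateChanged
    GROUP BY m.user_id
""").fetchall()
print(current)
```

Every new meta_key needs another CASE line here, and a single phone change was logged as one small row, which is where the storage saving and the query cost of this layout both come from.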

#2


1  

I agree with @sebt about standard SQL solutions.

If you need a flexible solution in PostgreSQL, I would recommend the hstore type (PostgreSQL 9.1 docs). This type can store many key => value pairs in one column.
There are many ways to insert into, search and index such a column. The documentation is a good place to start.

#3


1  

The best choice largely depends on what you want to do, thus what queries you would be running (as with many things).

I don't quite understand the WordPress one (I get the fact that you store individual fields as rows, but I don't know where it fits in), so I'll just list all the options:

  1. Have user and history tables store individual fields per row
  2. Have only history table store individual fields per row
  3. Have only user table store individual fields per row
  4. Have neither store individual fields per row
  5. Have 1 combined table for both user and history
  6. Have 1 combined table for both, storing individual fields per row

(5) and (6) don't really seem like options in most cases, as I suspect you'd want to get the details for a user (or a bunch of users) more often than you would want to get the history (unless most of your queries get both at the same time).

(1) and (3) are not advised unless many of the details are not filled in (in which case a one-column-per-field table would end up very sparse).

(4) is for when users tend to change all their details at once, which probably doesn't happen often; I suspect people change just 1 or 2 fields at a time. So (2) is probably the better option, especially if the user table has many fields.
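
Option (2) could look roughly like the sketch below: a normal wide user table plus a history table with one row per changed field. It uses Python's sqlite3 module; the names userHistory, field, old_value and update_user are hypothetical, and storing the previous value (rather than the new one) in the history is just one possible design.

```python
import sqlite3

FIELDS = ("email", "phone", "address")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT, phone TEXT, address TEXT);
CREATE TABLE userHistory (
    user_id INTEGER NOT NULL, DateChanged TEXT NOT NULL,
    field TEXT NOT NULL, old_value TEXT
);
INSERT INTO user VALUES (1, 'a@example.com', '555-0100', '1 Main St');
""")

def update_user(conn, user_id, changed_at, **new_values):
    """Update the user row and log one history row per field that changed."""
    unknown = set(new_values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    row = conn.execute(
        "SELECT email, phone, address FROM user WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no user with id {user_id}")
    current = dict(zip(FIELDS, row))
    with conn:  # commit both statements together, or roll back on error
        for field, value in new_values.items():
            if current[field] == value:
                continue  # unchanged fields cost no history rows
            conn.execute(
                "INSERT INTO userHistory VALUES (?, ?, ?, ?)",
                (user_id, changed_at, field, current[field]),
            )
            # Field names are whitelisted above, so this interpolation is safe.
            conn.execute(f"UPDATE user SET {field} = ? WHERE id = ?", (value, user_id))

update_user(conn, 1, "2022-03-15", phone="555-0199", email="a@example.com")

history = conn.execute("SELECT field, old_value FROM userHistory").fetchall()
current_row = conn.execute("SELECT email, phone, address FROM user WHERE id = 1").fetchone()
print(history, current_row)
```

Reading the current details stays a plain single-row SELECT on user, while the history only grows by the fields that actually changed.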

Generally, storing individual fields per row trades performance for reduced storage space (assuming there are some empty fields; otherwise storing individual fields per row is hands-down worse). You basically determine which is best by looking at your requirements and the expected data. Note we're talking mainly about selecting here, which is generally the slow operation, unless you have some weird stuff going on, or tons of inserts at once. For history, reduced storage is generally preferred above performance, so (2).

Adding fields generally takes a bit of effort anyway, so just saying 'ALTER TABLE user ADD COLUMN col' is not really a big deal; it can even be automated. This would be another (small) reason to prefer (2) above (4).
