How would you represent a hashtable collection in a database schema?

Time: 2022-09-10 21:57:41

If you were trying to create a domain object in a database schema, and in your code said domain object has a hashtable/list member, like so:

public class SpaceQuadrant : PersistentObject
{

    public SpaceQuadrant()
    {
    }

    public virtual Dictionary<SpaceCoordinate, SpaceObject> Space
    {
        get;
        set;
    }
}

A Dictionary is just a hashtable mapping key objects to value objects. I've come up with multiple ways to do this, creating various join tables or loading techniques, but they all kind of suck in terms of getting the O(1) access time that you get from a hashtable.

How would you represent the SpaceQuadrant, SpaceCoordinate, and SpaceObject in a database schema? A simple schema code description would be nice, e.g.:

table SpaceQuadrant
{
    ID int not null primary key,
    EntryName varchar(255) not null,
    SpaceQuadrantJoinTableId int not null
                 foreign key references ...anothertable...
}

But any thoughts at all would be nice as well. Thanks for reading!

More Information:

Thanks for the great answers already. I've only skimmed them, and I want to take some time to think about each before I respond.

If you think there is a better way to define these classes, then by all means show me an example. Any language you're comfortable with is cool.

4 solutions

#1


1  

First, dedicated support for geo-located data exists in many databases - different algorithms can be used (a spatial version of a B-Tree exists for instance), and support for proximity searches probably will exist.

Since you have a different hash table for each SpaceQuadrant, you'd need something like (edited from S.Lott's post):

table Space {
    SpaceCoordinate,
    Quadrant Foreign Key SpaceQuadrant(ID),
    SpaceObject, -- whatever the object is (by ID)
    Primary Key(SpaceCoordinate, Quadrant)
}

This is a (SpaceCoordinate, Quadrant) -> SpaceObjectId dictionary.
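As a rough illustration of that dictionary-as-table idea, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (space_quadrant, space, x/y/z) and the put/get helpers are assumptions made for this example, not something from the original posts; the coordinate is spelled out as three integer columns.

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE space_quadrant (
    id         INTEGER PRIMARY KEY,
    entry_name TEXT NOT NULL
);
CREATE TABLE space (
    quadrant_id INTEGER NOT NULL REFERENCES space_quadrant(id),
    x           INTEGER NOT NULL,
    y           INTEGER NOT NULL,
    z           INTEGER NOT NULL,
    object_id   INTEGER NOT NULL,
    -- the composite key plays the role of the dictionary key
    PRIMARY KEY (quadrant_id, x, y, z)
);
""")


def put(quadrant_id, coord, object_id):
    """Roughly space[(quadrant, coord)] = object_id; an existing entry is replaced."""
    conn.execute(
        "INSERT OR REPLACE INTO space (quadrant_id, x, y, z, object_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (quadrant_id, *coord, object_id),
    )


def get(quadrant_id, coord):
    """Roughly space.get((quadrant, coord)); returns None when nothing is there."""
    row = conn.execute(
        "SELECT object_id FROM space "
        "WHERE quadrant_id = ? AND x = ? AND y = ? AND z = ?",
        (quadrant_id, *coord),
    ).fetchone()
    return row[0] if row else None


conn.execute("INSERT INTO space_quadrant (id, entry_name) VALUES (1, 'Alpha')")
put(1, (10, 20, 30), 42)
put(1, (10, 20, 30), 43)       # same key, so the earlier value is replaced
print(get(1, (10, 20, 30)))    # 43
print(get(1, (0, 0, 0)))       # None
```

The lookup goes through the primary-key index, so it is an index search rather than a table scan, which is the closest relational analogue to a hashtable lookup.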

=====

Now, about your O(1) performance concern: there are many reasons why it's the wrong thing to focus on.

In many databases you can use a hash index for memory-based tables, as somebody told you. But if you need persistent storage, you'd have to update two tables (the in-memory one and the persistent one) instead of one, unless there is built-in support for this. To find out whether that's worth it, you'd need to benchmark on the actual data (with actual data sizes).

Also, forcing a table into memory can have worse implications.

If something ever gets swapped, you're dead. Had you used a B-Tree (i.e. a normal disk-based index), its algorithms would have minimized the I/O needed. Otherwise, all DBMSs would use hash tables and rely on swapping instead of B-Trees. You can try to anticipate whether you'll fit in memory, but...

Moreover, B-Trees are not O(1), but they are O(log_512(N)) or something like that (I know that collapses to O(log N), but bear with me on this). You'd need (2^9)^4 = 2^36 (64Gi) keys for that logarithm to reach 4, and with that much data you'd need a big-iron server anyway for it to fit in memory. So it's almost O(1), and the constant factors are what actually matter.
Ever heard of algorithms with low asymptotic complexity but big constant factors, which beat the simple ones only at impractical data sizes?
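To make the arithmetic above concrete, here is a small sketch that counts how many levels an idealized tree needs for a given number of keys. The fan-out of 512 is the figure used above; real B-Tree fan-out depends on page and key sizes, so treat the function (a hypothetical helper named btree_levels) as an illustration only.

```python
def btree_levels(n_keys: int, fanout: int = 512) -> int:
    """Smallest number of levels h such that fanout**h >= n_keys.

    Idealized model: every node is full and holds `fanout` entries.
    Integer arithmetic is used to avoid floating-point log rounding.
    """
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


for n in (1_000, 1_000_000, 1_000_000_000, 2**36):
    print(f"{n:>14,} keys -> {btree_levels(n)} levels")
```

Under this model a thousand keys need 2 levels, a million need 3, and both a billion and 2^36 need 4, which is why the depth behaves like a small constant in practice.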

Finally, I think DB authors are smarter than you and me. Especially given the declarative nature of SQL, hand-optimizing it this way isn't going to pay off. If an index fits in memory, I guess they could choose to build and use a hashtable version of the disk index as needed, if it were worth it. Check your DBMS docs for that.
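Besides reading the docs, you can usually ask the database what it intends to do with a query. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, only because it is easy to run anywhere; the table layout is an assumption for the example, and other DBMSs have their own EXPLAIN variants with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE space (
        quadrant_id INTEGER NOT NULL,
        x           INTEGER NOT NULL,
        y           INTEGER NOT NULL,
        z           INTEGER NOT NULL,
        object_id   INTEGER NOT NULL,
        PRIMARY KEY (quadrant_id, x, y, z)
    )
""")

# Ask SQLite how it would execute an exact-key lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT object_id FROM space "
    "WHERE quadrant_id = ? AND x = ? AND y = ? AND z = ?",
    (1, 10, 20, 30),
).fetchall()

# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)
```

The exact wording varies between SQLite versions, but the plan should describe a search using the primary-key index rather than a full scan of the table.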

But the bottom line is that premature optimization is evil, especially when it's of this kind (weird optimizations we come up with on our own, as opposed to standard SQL optimizations) and with a declarative language.

#2


2  

Relations are not hash tables; they are sets.

I wouldn't organize the database using the coordinates as the key. What if an object changes location? Instead, I would probably treat coordinates as attributes of an object.

Also, I assume there is a fixed number of dimensions, for example, three. If so, then you can store these attributes of an object in fixed columns:

CREATE TABLE SpaceQuadrant (
  quadrant_id INT NOT NULL PRIMARY KEY,
  quadrant_name VARCHAR(20)
  -- other attributes
);

CREATE TABLE SpaceObject (
  object_id INT NOT NULL PRIMARY KEY,
  x NUMERIC(9,2) NOT NULL,
  y NUMERIC(9,2) NOT NULL,
  z NUMERIC(9,2) NOT NULL,
  object_name VARCHAR(20) NOT NULL,
  -- other attributes
  quadrant_id INT NOT NULL,
  FOREIGN KEY (quadrant_id) REFERENCES SpaceQuadrant(quadrant_id)
);

In your object-oriented class, it's not clear why your objects are in a dictionary. You mention accessing them in O(1) time, but why do you do that by coordinate?

If you're using that to optimize finding objects near a certain point (the player's spaceship, for instance), you could also have the SQL query that populates this SpaceQuadrant calculate every object's distance from that point and sort the results by distance.
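A minimal sketch of such a distance-sorted query, run through Python's sqlite3 module. The sample rows and the nearest helper are made up for illustration, and the column types are simplified to what SQLite offers. It sorts by squared distance, which gives the same order as the true distance without needing a square-root function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SpaceObject (
    object_id   INTEGER PRIMARY KEY,
    x           REAL NOT NULL,
    y           REAL NOT NULL,
    z           REAL NOT NULL,
    object_name TEXT NOT NULL,
    quadrant_id INTEGER NOT NULL
);
-- made-up sample data
INSERT INTO SpaceObject VALUES
    (1, 0, 0, 0, 'station',  1),
    (2, 5, 0, 0, 'asteroid', 1),
    (3, 1, 1, 1, 'probe',    1),
    (4, 2, 2, 2, 'freighter', 2);
""")


def nearest(quadrant_id, point, limit=10):
    """Objects in one quadrant, closest to `point` first."""
    px, py, pz = point
    return conn.execute(
        """
        SELECT object_name,
               (x - :px) * (x - :px)
             + (y - :py) * (y - :py)
             + (z - :pz) * (z - :pz) AS dist_sq
        FROM SpaceObject
        WHERE quadrant_id = :q
        ORDER BY dist_sq
        LIMIT :n
        """,
        {"px": px, "py": py, "pz": pz, "q": quadrant_id, "n": limit},
    ).fetchall()


for name, dist_sq in nearest(1, (1, 0, 0)):
    print(name, dist_sq)
```

Note that this computes the distance for every row in the quadrant, so it is fine for modest row counts; for large data sets a spatial index would be the thing to look into.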

I don't know enough about your program to know if these suggestions are relevant. But are they at least making you think of different ways of organizing the data?

#3


2  

In the simplest case, the dictionary has a key which would map to the primary key of a table - so that when you specify the values of the key, you can immediately find the matching data via a simple lookup.

In this case, you would need a table SpaceQuadrant with any general (single-valued) attributes that describe or characterize a space quadrant. The SpaceQuadrant table would have a primary key, possibly a generated ID, possibly a natural value. The hashtable would then consist of a table with the primary key value for cross-referencing the SpaceQuadrant, with the position (a SpaceCoordinate) and the attributes of the quadrant and coordinate.

Now, if you have an extensible DBMS, you can define a user-defined type for the SpaceCoordinate; failing that, you can use a trio of columns - x, y, z or r, theta, rho, for example - to represent the position (SpaceCoordinate).

In general terms, the structure I'm describing is quite similar to Bill Karwin's; the key difference (pun not intended until I reread the message) is that it is perfectly OK in my book to have the position as part of the primary key of the subordinate table, if you are sure that's the best way to organize it. You might also have an object ID column that is an alternative candidate key. Alternatively, if objects exist independently of the space quadrant they happen to be in at the moment (or can occupy multiple positions, because they aren't points but space stations or something), then you might put the SpaceObject in a separate table. What is best depends on information that we don't have available to us.

You should be aware of the limitations of using a SpaceCoordinate as part of the primary key:

  • no two objects can occupy the same position (that's called a collision in a hash table, as well as in 3D space),
  • if the position changes, then you have to update the key data, which is more expensive than an update of non-key data,
  • proximity lookups will be hard - exact lookups are easy enough.

The same is true of your dictionary in memory; if you change the coordinates, you have to remove the record from the old location and place it in the new location in the dictionary (or the language has to do that for you behind the scenes).
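The in-memory case can be sketched in a few lines of Python. The SpaceCoordinate class and the move helper below are hypothetical stand-ins for the classes in the question; the point is only that moving an object means removing the entry under the old key and adding it under the new one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen + eq makes instances hashable, so usable as dict keys
class SpaceCoordinate:
    x: int
    y: int
    z: int


def move(space: dict, old: SpaceCoordinate, new: SpaceCoordinate) -> None:
    """Re-key an entry: remove it from the old position, store it at the new one."""
    if new in space:
        # same restriction as a primary key on the position
        raise ValueError(f"{new} is already occupied")
    space[new] = space.pop(old)  # pop raises KeyError if nothing is at `old`


space = {SpaceCoordinate(1, 2, 3): "probe"}
move(space, SpaceCoordinate(1, 2, 3), SpaceCoordinate(4, 5, 6))
print(space)
```

If the coordinates were instead stored as attributes of the object, as in Bill Karwin's schema, a move would be a plain attribute update with no re-keying.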

#4


2  

A dictionary is a table. The hash is a question of what kind of index is used. Most RDBMS assume that tables are big and densely packed, making a hashed index not appropriate.

table SpaceQuadrant { 
    ID Primary Key,
    -- whatever other attributes are relevant
}

table Space {
    SpaceCoordinate Primary Key,
    Quadrant Foreign Key SpaceQuadrant(ID),
    SpaceObject -- whatever the object is
}

Your Space objects have FK references to the Quadrant in which they're located.

Depending on your RDBMS, you might be able to find a hash-based index that gets you the performance you're hoping for. For example, MySQL supports HASH indexes when using the HEAP storage engine (called MEMORY in current versions).
