Sparse data: efficient storage and retrieval in an RDBMS

Time: 2021-11-04 06:40:07

I have a table representing values of source file metrics across project revisions, like the following:

Revision  FileA  FileB  FileC  FileD  FileE  ...
1            45      3     12    123    124
2            45      3     12    123    124
3            45      3     12    123    124
4            48      3     12    123    124
5            48      3     12    123    124
6            48      3     12    123    124
7            48     15     12    123    124

(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)

There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that

  • avoids replication and uses space efficiently (the non-sparse representation currently requires 260 GB of data plus index for less than 10% of the data I want to store)
  • allows me to efficiently retrieve the values for a specific revision using an SQL query (without explicitly looping through revisions or files)
  • allows me to efficiently retrieve the revision for a specific metric value.

Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.

1 solution

#1


This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.

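-- SQL Server (T-SQL) syntax. On MySQL or PostgreSQL, drop the CLUSTERED keyword
-- and the GO batch separators.
-- Each row records that a file held one value for every revision in the
-- inclusive range start_revision_number..end_revision_number.
CREATE TABLE Revision_Files
(
    start_revision_number   INT NOT NULL,
    end_revision_number     INT NOT NULL,
    file_number             INT NOT NULL,
    value                   INT NOT NULL,
    CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
    CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)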
GO
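
To show how the dense (Revision, FileId, Value) rows from the question might be collapsed into rows for this table, here is a minimal Python sketch. The function name `collapse_to_ranges` and the file numbers (1 for FileA, 2 for FileB) are my own assumptions for illustration; it simply merges consecutive revisions that carry the same value.

```python
from itertools import groupby


def collapse_to_ranges(rows):
    """Collapse dense (revision, file_number, value) rows into
    (start_revision, end_revision, file_number, value) ranges.

    Hypothetical helper: consecutive revisions of one file that share
    a value are merged into a single inclusive range.
    """
    ranges = []
    # Sort by file, then revision, so each file's history is contiguous.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for file_number, history in groupby(ordered, key=lambda r: r[1]):
        start = end = current = None
        for revision, _, value in history:
            if start is not None and value == current and revision == end + 1:
                end = revision  # same value in the next revision: extend the range
            else:
                if start is not None:
                    ranges.append((start, end, file_number, current))
                start = end = revision
                current = value
        if start is not None:
            ranges.append((start, end, file_number, current))
    return ranges


# FileA (assumed file_number 1) and FileB (assumed file_number 2) from the question.
dense = [(rev, 1, 45 if rev <= 3 else 48) for rev in range(1, 8)]
dense += [(rev, 2, 3 if rev <= 6 else 15) for rev in range(1, 8)]
ranges = collapse_to_ranges(dense)
```

For the two sample files, the 14 dense rows collapse to 4 range rows: (1, 3, 1, 45), (4, 7, 1, 48), (1, 6, 2, 3) and (7, 7, 2, 15).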

To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.

SELECT
    REV.revision_number,
    RF.file_number,
    RF.value
FROM
    Revisions REV
INNER JOIN Revision_Files RF ON
    RF.start_revision_number <= REV.revision_number AND
    RF.end_revision_number >= REV.revision_number
WHERE
    REV.revision_number = @revision_number  -- the revision of interest
GO
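
Here is a rough way to try the outer-join variant without a SQL Server instance, using Python's built-in sqlite3 module. Everything outside the Revision_Files columns is an assumption on my part: the single-column Files table, the function name `values_at`, and a third file that only has a value from revision 6 onward. Since the revision is passed as a parameter, this sketch skips the join to Revisions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Files (file_number INT PRIMARY KEY);  -- assumed minimal Files table
CREATE TABLE Revision_Files (
    start_revision_number INT NOT NULL,
    end_revision_number   INT NOT NULL,
    file_number           INT NOT NULL,
    value                 INT NOT NULL,
    PRIMARY KEY (start_revision_number, file_number),
    CHECK (start_revision_number <= end_revision_number)
);
""")
conn.executemany("INSERT INTO Files VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO Revision_Files VALUES (?, ?, ?, ?)",
    [
        (1, 3, 1, 45), (4, 7, 1, 48),  # FileA from the question
        (1, 6, 2, 3), (7, 7, 2, 15),   # FileB from the question
        (6, 7, 3, 12),                 # hypothetical file with no value before revision 6
    ],
)


def values_at(conn, revision_number):
    """Return (file_number, value) for every file at one revision.

    The LEFT JOIN keeps files with no range covering the revision;
    their value comes back as None.
    """
    return conn.execute(
        """
        SELECT F.file_number, RF.value
        FROM Files F
        LEFT JOIN Revision_Files RF
            ON RF.file_number = F.file_number
           AND ? BETWEEN RF.start_revision_number AND RF.end_revision_number
        ORDER BY F.file_number
        """,
        (revision_number,),
    ).fetchall()
```

With this data, `values_at(conn, 5)` returns `[(1, 48), (2, 3), (3, None)]` and `values_at(conn, 7)` returns `[(1, 48), (2, 15), (3, 12)]`.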

Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:

SELECT
    REV.revision_number
FROM
    Revision_Files RF
INNER JOIN Revisions REV ON
    REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
    RF.file_number = @file_number AND
    RF.value = @value
GO
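
The same lookup can be tried with Python's sqlite3 module. This is only a sketch under assumptions: the one-column Revisions table, the function name `revisions_with_value`, and the index `IX_Revision_Files_file_value` are mine, not part of the schema above. An index on (file_number, value) might help this kind of lookup on a large table, but I have not measured it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Revisions (revision_number INT PRIMARY KEY);  -- assumed minimal Revisions table
CREATE TABLE Revision_Files (
    start_revision_number INT NOT NULL,
    end_revision_number   INT NOT NULL,
    file_number           INT NOT NULL,
    value                 INT NOT NULL,
    PRIMARY KEY (start_revision_number, file_number),
    CHECK (start_revision_number <= end_revision_number)
);
-- Hypothetical index to support lookups by file and value.
CREATE INDEX IX_Revision_Files_file_value ON Revision_Files (file_number, value);
""")
conn.executemany("INSERT INTO Revisions VALUES (?)", [(r,) for r in range(1, 8)])
conn.executemany(
    "INSERT INTO Revision_Files VALUES (?, ?, ?, ?)",
    [
        (1, 3, 1, 45), (4, 7, 1, 48),  # FileA from the question, as file 1
        (1, 6, 2, 3), (7, 7, 2, 15),   # FileB from the question, as file 2
    ],
)


def revisions_with_value(conn, file_number, value):
    """Return every revision in which the given file has the given value."""
    rows = conn.execute(
        """
        SELECT REV.revision_number
        FROM Revision_Files RF
        INNER JOIN Revisions REV
            ON REV.revision_number BETWEEN RF.start_revision_number
                                       AND RF.end_revision_number
        WHERE RF.file_number = ? AND RF.value = ?
        ORDER BY REV.revision_number
        """,
        (file_number, value),
    ).fetchall()
    return [row[0] for row in rows]
```

For the sample data, `revisions_with_value(conn, 1, 48)` returns `[4, 5, 6, 7]`, and a value that never occurs returns an empty list.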
