两个不同的表或只有一个与bool列?

时间:2022-09-21 22:52:30

We have two tables: OriginalDocument and ProcessedDocument. In the first one we put an original, not processed document. After it's validated and processed (converted to our XML format and parsed), it's put into ProcessedDocument table. Processed document can be valid or invalid. Which makes more sense: have two different tables for valid and invalid documents or just have one with 'Valid' column? Some of the columns (~5-7) are irrelevant for invalid document. Storing both invalid and valid documents would also make Document table filled with 'NULL' columns (if document is invalid, information like document number, receiver can be unknown). What else should we consider and weigh, when making this decision?

我们有两个表:OriginalDocument和ProcessedDocument。在第一个中,我们放置了原始的,未处理的文档。经过验证和处理(转换为我们的XML格式并进行解析)后,它将被放入ProcessedDocument表中。已处理的文档可能有效或无效。哪个更有意义:有两个不同的表用于有效和无效的文档,还是只有一个带有'有效'列?某些列(~5-7)与无效文档无关。存储无效和有效文档也会使文档表填充“NULL”列(如果文档无效,文档编号,接收者等信息可能是未知的)。在做出这个决定时,我们还应该考虑和权衡什么呢?

7 个解决方案

#1


6  

Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table.

无论文档是有效还是无效,它仍然是一个文档,因此它们使所有文档都在同一个表中。

However, if an invalid document is treated differently by your application to the point where it is almost forgotten (not queried, updated etc.) then split the tables. Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit.

但是,如果您的应用程序对无效文档的处理方式不同(即未查询,更新等),则将表格拆分。将两种类型的文档放在同一个表中只会减慢您的查询速度而不会产生直接的好处。

I have a document table where valid and invalid documents are kept together but only because the app re-presents the bad document to the user and asks them to fix it.

我有一个文档表,其中有效和无效的文档保存在一起,但仅仅是因为应用程序将错误的文档重新呈现给用户并要求他们修复它。

#2


5  

To me it sounds that it would make sense to have a bit column, as all documents have actually been processed, it is just that some have been determined to be invalid. And depending on the number of columns if you only have 5 or so out of say 10-15 columns that don't apply, there is no need to manage two structures for the same data.

对我来说听起来有一个位列是有意义的,因为所有文件都已经被处理过,只是有些已被确定为无效。根据列数,如果您只有5个左右的10-15列不适用,则无需为同一数据管理两个结构。

Now, another thing you could look at is do you need to regularly get information on both valid and invalid documents at the same time? if so, then you really do want it in one table.

现在,您可以看到的另一件事是您是否需要同时定期获取有效和无效文档的信息?如果是这样,那么你真的希望它在一个表中。

If you don't ever need to query them together, or if a document is "invalid" you don't need it again except for history, then it could make sense to move it to its own table.

如果您不需要一起查询它们,或者文档是“无效”,除了历史记录之外不再需要它,那么将它移动到自己的表中是有意义的。

#3


3  

Wow, so much bad advice and design myths in a single question it's hard to know where to start.

哇,在一个问题上有如此多的糟糕建议和设计神话,很难知道从哪里开始。

Is this a VLDB? are you talking 100's of TB, 100's of GB, 1-10 GB?

这是VLDB吗?你在说100的TB,100的GB,1-10 GB?

Is this an untra-high performance DB? Do you need to squeeze out microseconds?

这是一个非高性能的数据库吗?你需要挤出微秒吗?

Most advice tends to lean toward those extremes where you might break a few basic rules for the sake of performance.

大多数建议倾向于倾向于那些为了表现而可能违反一些基本规则的极端情况。

An earlier poster said,

早先的一张海报说,

"Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table."

“无论文件是有效还是无效,它仍然是一份文件,因此它们在同一张桌子中都具有初步意义。”

He was on the right track there. And for that matter, whether it's processed or unprocessed it's also a document. I strongly question the first table split.

他在那里正确的轨道。就此而言,无论是处理还是未处理,它都是一份文件。我强烈质疑第一个分表。

He then says,

然后他说,

"Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit."

“将两种类型的文档放在同一个表中只会减慢你的查询速度而不会立即受益。”

I have no idea what that advice is based upon. If your RDBMS supports indexes, more data will have a very marginal additional cost at certain sizes of your index because your b-tree gets one level deeper. If you take his statement at face value, you should limit your table to n rows each and keep making new ones because "more data in your table = slower queries." I have no idea why people persist in this notion. If you have queries that REQUIRE full table scans for one type or the other, let's talk partitioning, not a new table. It takes about an extra 10 milliseconds to find a row in a billion row table than it does in a million row table because an index will probably only be one blevel deeper between the two.

我不知道这个建议是基于什么的。如果您的RDBMS支持索引,则在索引的某些大小处,更多数据将具有非常小的额外成本,因为您的b树更深一层。如果你以表面值表达他的陈述,你应该将你的表限制为每行n行,并继续制作新的表,因为“你表中的更多数据=较慢的查询”。我不知道为什么人们坚持这个观念。如果您有查询要求对一种类型或另一种类型进行全表扫描的查询,那么让我们讨论分区,而不是新表。在十亿行表中找到一行比在一百万行表中花费大约10毫秒,因为索引可能只是两者之间的深度。

Another poster said,

另一张海报说,

"5-7 columns that do not apply to invalid documents NOT NULL so valid documents are required to have them. In my opinion, with that many columns empty for invalid documents, it justifies a different table."

“5-7列不适用于无效文档NOT NULL,因此需要有效文档才能拥有它们。在我看来,由于许多列为无效文档而空,所以它证明了另一个表格的合理性。”

I wish people would explain there reasons. HOW does it justify it? On what basis would you make that decision. Is 4 too many? Why not? But 5 is too many? Maybe he assumes you're using an ancient RDBMS with fixed field lengths. I can't tell. If you put the nullable columns at the end of the row, you'll pay no cost for them. In the middle, a few extra bytes. If that's a HUGE deal, if you're really scrambling to make this multi-TB table a wee smaller... we'll talk about vertical partitioning... not a whole new table. Since you'll be extending the length of n% of rows, you'll want to carefully choose your PCTFREE, or how ever your database does that. Other than that, there's little downside of the nullable columns.

我希望人们能解释原因。它如何证明它的合理性?在什么基础上你会做出这个决定。 4太多了吗?为什么不?但是5太多了?也许他假设你正在使用一个固定场长的古老RDBMS。我说不出来。如果您将可以为空的列放在行的末尾,则不会为它们支付任何费用。在中间,几个额外的字节。如果这是一个巨大的交易,如果你真的争先恐后地使这个多TB桌子变得更小......我们将讨论垂直分区......而不是一个全新的表格。由于您将延长n%行的长度,因此您需要仔细选择PCTFREE,或者您的数据库如何做到这一点。除此之外,可空列的缺点很少。

So let's talk about all the downsides of three tables.

那么让我们来谈谈三个表的所有缺点。

I'm going to assume your table looks like this;

我假设你的桌子看起来像这样;

A surrogate PK column with a unique index.
A candidate key column with a unique index.
a few foreign keys to 'lookup' tables.
Several data fields.
the 5-7 nullable columns that are filled if a document becomes invalid.

The first issue is that you'll have 3 PK's across all the tables to make sure that the key is unique... but there's no cross table object to guarantee uniqueness in all three combined. Unless you're painstaking in your approach to the code that moves data from one table to the next, you could have the same document twice or more. Once in each table. If you have a single table for Original, processed, and invalid, then there's no way you'll ever have that happen.

第一个问题是你将在所有表中有3个PK,以确保密钥是唯一的......但是没有交叉表对象来保证所有三个组合的唯一性。除非您在处理将数据从一个表移动到另一个表的代码的过程中苦心经营,否则您可以使用相同的文档两次或更多次。一旦进入每张桌子。如果您有一个原始,已处理和无效的表,那么您将无法实现这一目标。

With three tables, all of your constraints are going to be validated over and over. When you do your insert into the Original table, the PK is validated, the AK is validate, the FKs are validated, the other columns are validate. Room is made in all of the indexes for these new enteries, and perhaps causing block splits. Now you process the file and delete the entry from the Original table, all of those indexes suffer deletes, leaving empty space behind. Your insert into the next table, suffers all of that cost of your first insert again. Your indexes are acted upon, maybe causing block splits, your PK, AK and FK's are all validated again. Lather rinse repeat for invalid table.

使用三个表,您的所有约束都将被反复验证。当你插入原始表时,PK被验证,AK被验证,FK被验证,其他列被验证。在这些新的肠道的所有索引中制作房间,并且可能导致块分裂。现在您处理文件并从原始表中删除条目,所有这些索引都会删除,留下空白区域。您插入下一个表格会再次承担第一次插入的所有费用。你的索引被采取行动,可能导致块分裂,你的PK,AK和FK都会再次被验证。泡沫重复冲洗无效表。

Now, what happens to your data model if you adopt this paradigm when you discover that the business needs a 4th state? You're going to add a 4th document table for those in the unsubmitted state, or sent state. After all, the new sent state has 5-7 columns unneeded by the other states.

现在,如果您在发现业务需要第四个状态时采用此范例,那么您的数据模型会发生什么?您将为处于未提交状态或已发送状态的人添加第4个文档表。毕竟,新发送的状态有5-7列其他状态不需要。

And there are lots of queries which become hoorible to write and run with multiple tables, with a single table they are clear, concise and fast... size of a table will really only affect Full Table Scans, which we try to avoid for tables like these.

并且有很多查询在使用多个表编写和运行时变得很可靠,单个表格清晰,简洁,快速...表格的大小实际上只影响全表扫描,我们尽量避免表格像这些。

I've seen systems like these. One major operational query is, "Where is my document?"

我见过像这样的系统。一个主要的操作查询是“我的文档在哪里?”

You've got to search 3 tables to find its state. What most people do next is build a UNION ALL view of all three tables to facilitate the myriad of questions like that. IF the other poster thinks your queries slow down with other data in your table, see how they really slow down when you do a UNION ALL to accomplish the same thing. 1 index of blevel 3 as opposed to 3 indexes of blevel 2.

你必须搜索3个表才能找到它的状态。大多数人接下来要做的是构建所有三个表的UNION ALL视图,以便解决诸如此类的无数问题。如果另一张海报认为您的查询速度会随着表中的其他数据而变慢,那么当您执行UNION ALL以完成相同操作时,请查看它们是如何真正减速的。 1个窜气3的指数,而不是膛2的3个指标。

EXAMPLE/EDIT

I work in a trading company. We execute trades with counterparties. For accounting and legal reasons our company is defined as several companies. Well call them Trading, Holding, JointVenture. Our counterparties we'll call. JonesCo, SmithBarely, GoldSax.

我在一家贸易公司工作。我们与交易对手进行交易。出于会计和法律原因,我们公司被定义为几家公司。好吧叫他们Trading,Holding,JointVenture。我们会打电话给我们的交易对手。 JonesCo,SmithBarely,GoldSax。

So if I consider that the internal companies have a unique set of columns and the counterparties have a unique set of columns. You'd say that proper normalization would force them into two tables. So let's do that.

因此,如果我认为内部公司有一组独特的列,而对手方有一组独特的列。你会说适当的规范化会迫使它们分成两个表。因此,让我们做到这一点。

INT_CO_T 1 Trading 2 Holding 3 JointVenture

INT_CO_T 1交易2持有3 JointVENTure

CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax

CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax

Now I need a trade table where I map the transaction between our company(ies) and counterparties

现在我需要一个交易表,我在该交易表中映射我们公司和交易对手之间的交易

TRADE_T (Int_co_T.ID, Ctr_pty_T.ID, other trade columns)

TRADE_T(Int_co_T.ID,Ctr_pty_T.ID,其他交易栏目)

Great.

大。

Whoops, Business says that the JointVenture will execute trades with Trading. BTW, This is a very common scenario, this happens all the time. Trading house would call these Book-to-Book trades.

哎呀,Business表示JointVenture将通过交易执行交易。 BTW,这是一种非常常见的情况,这种情况一直都在发生。交易所会称这些Book-to-Book交易。

Now I'm left with two choices. (Three really) but.

现在我有两个选择。 (三个真的)但是。

1 is that I could do something very silly and place JointVenture and Trading into the Counterparty table so that my mapping table will still work. This leads to nightmare queries which I'm sure those involved in this conversation will recognize. Or I can build a separate Mapping table.. and that too leads to some unions if I want to see all of the trades from a given company.

1是我可以做一些非常愚蠢的事情并将JointVenture和Trading放入Counterparty表中,这样我的映射表仍然可以工作。这导致了噩梦般的查询,我确信参与此对话的人都会认出来。或者我可以构建一个单独的映射表..如果我想查看给定公司的所有交易,这也会导致一些联盟。

The third and better way is to build a single table for both counterparties and internal companies, called Trading_entities or something. Now I need one mapping table to show either internal or external trades. I can easily see net position and net exposure with one query, two tables. etc.

第三种也是更好的方法是为交易对手和内部公司建立一个单独的表,称为Trading_entities或其他东西。现在我需要一个映射表来显示内部或外部交易。我可以通过一个查询,两个表轻松查看净头寸和净敞口。等等

If you're really hung up on the nullable fields then vertically partition that table and use three tables. But the main table will have a list and most importantly a single key for either subtype of trading participant.

如果你真的挂在可空字段上,那么垂直分区该表并使用三个表。但主表将有一个列表,最重要的是交易参与者的任何子类型的单个键。

#4


2  

Do try to make distinction between logical and physical modelling.

尝试区分逻辑和物理建模。

Even if the difference between the two entities is only seven properties, they are logically a different thing in those seven items. At the same time they are a same thing in other properties.

即使两个实体之间的差异只有七个属性,它们在这七个项目中在逻辑上是不同的。与此同时,它们在其他属性中也是一样的。

The way to logically represent that is this have one-to-one-or-zero relationship between the two tables, and to have one table store all the common properties (superclass) and in the other (subclass) you would only store the ID from the superclass.

逻辑上表示的方式是,这两个表之间存在一对一或零关系,并且一个表存储所有公共属性(超类),而另一个(子类)只存储ID来自超类。

In terms of performance this is not so bad:

在性能方面,这并不是那么糟糕:

  • when you don't care about what type of document you work with you will query the superclass table (gain)
  • 当你不关心你使用什么类型的文档时,你将查询超类表(增益)
  • when you know you want only specific properties found in the subclass table you will work only with that table (this might be real gain)
  • 当你知道你只想要在子类表中找到的特定属性时,你只能使用那个表(这可能是真正的好处)
  • you will pay a price only when you need to join the two tables (joins have a price compared to denormalized structures such as storing everything in a single table)
  • 只有在需要加入两个表时才会付出代价(连接的价格与非规范化结构相比,例如将所有内容存储在一个表中)
  • you will also pay a price when inserting subclass records, because you will be inserting into two tables (this might be very low and/or justified)
  • 你也会在插入子类记录时付出代价,因为你将插入两个表(这可能是非常低和/或合理的)

Depending on the processes you are modelling, the frequency of these queries and other things (such as security for both entities, ownership, difference in integrity rules) you might decide to store this information in one table in the database or in two (either can be much faster in border-line cases and two table solution can also be denormalized a bit; for example you could still store information in a main table about the type of the document to avoid the join if that kind of query is all you care).

根据您正在建模的流程,这些查询的频率和其他内容(例如实体的安全性,所有权,完整性规则的差异),您可能决定将此信息存储在数据库中的一个表中或两个中(或者可以在边界行情况下要快得多,并且两个表解决方案也可以进行非规范化处理;例如,您仍然可以在主表中存储有关文档类型的信息,以避免在您关心的那种查询时进行连接。

Or maybe your implementation decisions might be driven by your choice of application framework and for that reason you might really prefer working with single table or the other way around (for example automatic creation of data entry forms in frameworks such as django-admin).

或者,您的实现决策可能是由您选择的应用程序框架驱动的,因此您可能更喜欢使用单表或其他方式(例如在django-admin等框架中自动创建数据输入表单)。

Whatever you do, realize the difference between the logical and physical design. In your logical design normalize everything - it will pay off. In physical implementation make different scenarios and - test, test, test with your own data. Never confuse the order of the two (logical-conceptual and physical-practical modelling).

无论你做什么,都要意识到逻辑和物理设计之间的区别。在你的逻辑设计中规范化一切 - 它会得到回报。在物理实现中,制作不同的场景,并使用您自己的数据进行测试,测试和测试。永远不要混淆两者的顺序(逻辑概念和物理实际建模)。

#5


1  

What shape are your queries? Do you frequently wish to deal with a group (all?) documents, irrespective of whether they're valid? Or does every query only every concern valid (or invalid) documents.

你的查询有什么形状?您是否经常希望处理一组(所有?)文档,无论它们是否有效?或者每个查询只检查每个有效(或无效)的文档。

Or do you wish to deal with groups (irrespective of validity), but wish to frequently perform additional work with valid documents. That may point to a base table and an additional table containing the valid document columns?

或者您希望与团体打交道(不论其有效性),但希望经常使用有效文件进行额外的工作。这可能指向一个基表和一个包含有效文档列的附加表?

#6


1  

Think of OriginalDocuments as of intermediate table. It can change as you input formats change. And it will contain fields which are not valid for imported ("processed") documents, like import date or import error description. And you can clean this table periodically.

将OriginalDocuments视为中间表。它可以随着输入格式的变化而改变。它将包含对导入(“已处理”)文档无效的字段,如导入日期或导入错误描述。你可以定期清理这张桌子。

In contrast to OriginalDocument, ProcessedDocument table will contain only documents and fields valid for your system, with all of the check constraints, indexes and associated business logic. It's structure will change as your system's internal logic changes.

与OriginalDocument相比,ProcessedDocument表将仅包含对您的系统有效的文档和字段,以及所有检查约束,索引和关联的业务逻辑。它的结构将随着系统内部逻辑的变化而变化。

#7


1  

Another thing you might want to take into consideration is the lifecycle and use cases of the rows. If the invalid documents are purged regularly, it might help to have them in separate tables. If the attributes of invalid documents stay limited, but valid documents are getting new columns, that would be a factor in favor of separate tables, too. As the entities are more and more different in behavior and usage, there are more indications that separate tables are merited.

您可能需要考虑的另一件事是行的生命周期和用例。如果定期清除无效文档,则将它们放在单独的表中可能会有所帮助。如果无效文档的属性保持有限,但有效文档正在获得新列,那么这也是支持单独表的一个因素。由于实体在行为和使用方面越来越不同,因此有更多迹象表明单独的表是值得的。

#1


6  

Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table.

无论文档是有效还是无效,它仍然是一个文档,因此它们使所有文档都在同一个表中。

However, if an invalid document is treated differently by your application to the point where it is almost forgotten (not queried, updated etc.) then split the tables. Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit.

但是,如果您的应用程序对无效文档的处理方式不同(即未查询,更新等),则将表格拆分。将两种类型的文档放在同一个表中只会减慢您的查询速度而不会产生直接的好处。

I have a document table where valid and invalid documents are kept together but only because the app re-presents the bad document to the user and asks them to fix it.

我有一个文档表,其中有效和无效的文档保存在一起,但仅仅是因为应用程序将错误的文档重新呈现给用户并要求他们修复它。

#2


5  

To me it sounds that it would make sense to have a bit column, as all documents have actually been processed, it is just that some have been determined to be invalid. And depending on the number of columns if you only have 5 or so out of say 10-15 columns that don't apply, there is no need to manage two structures for the same data.

对我来说听起来有一个位列是有意义的,因为所有文件都已经被处理过,只是有些已被确定为无效。根据列数,如果您只有5个左右的10-15列不适用,则无需为同一数据管理两个结构。

Now, another thing you could look at is do you need to regularly get information on both valid and invalid documents at the same time? if so, then you really do want it in one table.

现在,您可以看到的另一件事是您是否需要同时定期获取有效和无效文档的信息?如果是这样,那么你真的希望它在一个表中。

If you don't ever need to query them together, or if a document is "invalid" you don't need it again except for history, then it could make sense to move it to its own table.

如果您不需要一起查询它们,或者文档是“无效”,除了历史记录之外不再需要它,那么将它移动到自己的表中是有意义的。

#3


3  

Wow, so much bad advice and design myths in a single question it's hard to know where to start.

哇,在一个问题上有如此多的糟糕建议和设计神话,很难知道从哪里开始。

Is this a VLDB? are you talking 100's of TB, 100's of GB, 1-10 GB?

这是VLDB吗?你在说100的TB,100的GB,1-10 GB?

Is this an untra-high performance DB? Do you need to squeeze out microseconds?

这是一个非高性能的数据库吗?你需要挤出微秒吗?

Most advice tends to lean toward those extremes where you might break a few basic rules for the sake of performance.

大多数建议倾向于倾向于那些为了表现而可能违反一些基本规则的极端情况。

An earlier poster said,

早先的一张海报说,

"Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table."

“无论文件是有效还是无效,它仍然是一份文件,因此它们在同一张桌子中都具有初步意义。”

He was on the right track there. And for that matter, whether it's processed or unprocessed it's also a document. I strongly question the first table split.

他在那里正确的轨道。就此而言,无论是处理还是未处理,它都是一份文件。我强烈质疑第一个分表。

He then says,

然后他说,

"Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit."

“将两种类型的文档放在同一个表中只会减慢你的查询速度而不会立即受益。”

I have no idea what that advice is based upon. If your RDBMS supports indexes, more data will have a very marginal additional cost at certain sizes of your index because your b-tree gets one level deeper. If you take his statement at face value, you should limit your table to n rows each and keep making new ones because "more data in your table = slower queries." I have no idea why people persist in this notion. If you have queries that REQUIRE full table scans for one type or the other, let's talk partitioning, not a new table. It takes about an extra 10 milliseconds to find a row in a billion row table than it does in a million row table because an index will probably only be one blevel deeper between the two.

我不知道这个建议是基于什么的。如果您的RDBMS支持索引,则在索引的某些大小处,更多数据将具有非常小的额外成本,因为您的b树更深一层。如果你以表面值表达他的陈述,你应该将你的表限制为每行n行,并继续制作新的表,因为“你表中的更多数据=较慢的查询”。我不知道为什么人们坚持这个观念。如果您有查询要求对一种类型或另一种类型进行全表扫描的查询,那么让我们讨论分区,而不是新表。在十亿行表中找到一行比在一百万行表中花费大约10毫秒,因为索引可能只是两者之间的深度。

Another poster said,

另一张海报说,

"5-7 columns that do not apply to invalid documents NOT NULL so valid documents are required to have them. In my opinion, with that many columns empty for invalid documents, it justifies a different table."

“5-7列不适用于无效文档NOT NULL,因此需要有效文档才能拥有它们。在我看来,由于许多列为无效文档而空,所以它证明了另一个表格的合理性。”

I wish people would explain there reasons. HOW does it justify it? On what basis would you make that decision. Is 4 too many? Why not? But 5 is too many? Maybe he assumes you're using an ancient RDBMS with fixed field lengths. I can't tell. If you put the nullable columns at the end of the row, you'll pay no cost for them. In the middle, a few extra bytes. If that's a HUGE deal, if you're really scrambling to make this multi-TB table a wee smaller... we'll talk about vertical partitioning... not a whole new table. Since you'll be extending the length of n% of rows, you'll want to carefully choose your PCTFREE, or how ever your database does that. Other than that, there's little downside of the nullable columns.

我希望人们能解释原因。它如何证明它的合理性?在什么基础上你会做出这个决定。 4太多了吗?为什么不?但是5太多了?也许他假设你正在使用一个固定场长的古老RDBMS。我说不出来。如果您将可以为空的列放在行的末尾,则不会为它们支付任何费用。在中间,几个额外的字节。如果这是一个巨大的交易,如果你真的争先恐后地使这个多TB桌子变得更小......我们将讨论垂直分区......而不是一个全新的表格。由于您将延长n%行的长度,因此您需要仔细选择PCTFREE,或者您的数据库如何做到这一点。除此之外,可空列的缺点很少。

So let's talk about all the downsides of three tables.

那么让我们来谈谈三个表的所有缺点。

I'm going to assume your table looks like this;

我假设你的桌子看起来像这样;

A surrogate PK column with a unique index.
A candidate key column with a unique index.
a few foreign keys to 'lookup' tables.
Several data fields.
the 5-7 nullable columns that are filled if a document becomes invalid.

The first issue is that you'll have 3 PK's across all the tables to make sure that the key is unique... but there's no cross table object to guarantee uniqueness in all three combined. Unless you're painstaking in your approach to the code that moves data from one table to the next, you could have the same document twice or more. Once in each table. If you have a single table for Original, processed, and invalid, then there's no way you'll ever have that happen.

第一个问题是你将在所有表中有3个PK,以确保密钥是唯一的......但是没有交叉表对象来保证所有三个组合的唯一性。除非您在处理将数据从一个表移动到另一个表的代码的过程中苦心经营,否则您可以使用相同的文档两次或更多次。一旦进入每张桌子。如果您有一个原始,已处理和无效的表,那么您将无法实现这一目标。

With three tables, all of your constraints are going to be validated over and over. When you do your insert into the Original table, the PK is validated, the AK is validate, the FKs are validated, the other columns are validate. Room is made in all of the indexes for these new enteries, and perhaps causing block splits. Now you process the file and delete the entry from the Original table, all of those indexes suffer deletes, leaving empty space behind. Your insert into the next table, suffers all of that cost of your first insert again. Your indexes are acted upon, maybe causing block splits, your PK, AK and FK's are all validated again. Lather rinse repeat for invalid table.

使用三个表,您的所有约束都将被反复验证。当你插入原始表时,PK被验证,AK被验证,FK被验证,其他列被验证。在这些新的肠道的所有索引中制作房间,并且可能导致块分裂。现在您处理文件并从原始表中删除条目,所有这些索引都会删除,留下空白区域。您插入下一个表格会再次承担第一次插入的所有费用。你的索引被采取行动,可能导致块分裂,你的PK,AK和FK都会再次被验证。泡沫重复冲洗无效表。

Now, what happens to your data model if you adopt this paradigm when you discover that the business needs a 4th state? You're going to add a 4th document table for those in the unsubmitted state, or sent state. After all, the new sent state has 5-7 columns unneeded by the other states.

现在,如果您在发现业务需要第四个状态时采用此范例,那么您的数据模型会发生什么?您将为处于未提交状态或已发送状态的人添加第4个文档表。毕竟,新发送的状态有5-7列其他状态不需要。

And there are lots of queries which become hoorible to write and run with multiple tables, with a single table they are clear, concise and fast... size of a table will really only affect Full Table Scans, which we try to avoid for tables like these.

并且有很多查询在使用多个表编写和运行时变得很可靠,单个表格清晰,简洁,快速...表格的大小实际上只影响全表扫描,我们尽量避免表格像这些。

I've seen systems like these. One major operational query is, "Where is my document?"

我见过像这样的系统。一个主要的操作查询是“我的文档在哪里?”

You've got to search 3 tables to find its state. What most people do next is build a UNION ALL view of all three tables to facilitate the myriad of questions like that. IF the other poster thinks your queries slow down with other data in your table, see how they really slow down when you do a UNION ALL to accomplish the same thing. 1 index of blevel 3 as opposed to 3 indexes of blevel 2.

你必须搜索3个表才能找到它的状态。大多数人接下来要做的是构建所有三个表的UNION ALL视图,以便解决诸如此类的无数问题。如果另一张海报认为您的查询速度会随着表中的其他数据而变慢,那么当您执行UNION ALL以完成相同操作时,请查看它们是如何真正减速的。 1个窜气3的指数,而不是膛2的3个指标。

EXAMPLE/EDIT

I work in a trading company. We execute trades with counterparties. For accounting and legal reasons our company is defined as several companies. Well call them Trading, Holding, JointVenture. Our counterparties we'll call. JonesCo, SmithBarely, GoldSax.

我在一家贸易公司工作。我们与交易对手进行交易。出于会计和法律原因,我们公司被定义为几家公司。好吧叫他们Trading,Holding,JointVenture。我们会打电话给我们的交易对手。 JonesCo,SmithBarely,GoldSax。

So if I consider that the internal companies have a unique set of columns and the counterparties have a unique set of columns. You'd say that proper normalization would force them into two tables. So let's do that.

因此,如果我认为内部公司有一组独特的列,而对手方有一组独特的列。你会说适当的规范化会迫使它们分成两个表。因此,让我们做到这一点。

INT_CO_T 1 Trading 2 Holding 3 JointVenture

INT_CO_T 1交易2持有3 JointVENTure

CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax

CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax

Now I need a trade table where I map the transaction between our company(ies) and counterparties

现在我需要一个交易表,我在该交易表中映射我们公司和交易对手之间的交易

TRADE_T (Int_co_T.ID, Ctr_pty_T.ID, other trade columns)

TRADE_T(Int_co_T.ID,Ctr_pty_T.ID,其他交易栏目)

Great.

大。

Whoops, Business says that the JointVenture will execute trades with Trading. BTW, This is a very common scenario, this happens all the time. Trading house would call these Book-to-Book trades.

哎呀,Business表示JointVenture将通过交易执行交易。 BTW,这是一种非常常见的情况,这种情况一直都在发生。交易所会称这些Book-to-Book交易。

Now I'm left with two choices. (Three really) but.

现在我有两个选择。 (三个真的)但是。

1 is that I could do something very silly and place JointVenture and Trading into the Counterparty table so that my mapping table will still work. This leads to nightmare queries which I'm sure those involved in this conversation will recognize. Or I can build a separate Mapping table.. and that too leads to some unions if I want to see all of the trades from a given company.

1是我可以做一些非常愚蠢的事情并将JointVenture和Trading放入Counterparty表中,这样我的映射表仍然可以工作。这导致了噩梦般的查询,我确信参与此对话的人都会认出来。或者我可以构建一个单独的映射表..如果我想查看给定公司的所有交易,这也会导致一些联盟。

The third and better way is to build a single table for both counterparties and internal companies, called Trading_entities or something. Now I need one mapping table to show either internal or external trades. I can easily see net position and net exposure with one query, two tables. etc.

第三种也是更好的方法是为交易对手和内部公司建立一个单独的表,称为Trading_entities或其他东西。现在我需要一个映射表来显示内部或外部交易。我可以通过一个查询,两个表轻松查看净头寸和净敞口。等等

If you're really hung up on the nullable fields then vertically partition that table and use three tables. But the main table will have a list and most importantly a single key for either subtype of trading participant.

如果你真的挂在可空字段上,那么垂直分区该表并使用三个表。但主表将有一个列表,最重要的是交易参与者的任何子类型的单个键。

#4


2  

Do try to make distinction between logical and physical modelling.

尝试区分逻辑和物理建模。

Even if the difference between the two entities is only seven properties, they are logically a different thing in those seven items. At the same time they are a same thing in other properties.

即使两个实体之间的差异只有七个属性,它们在这七个项目中在逻辑上是不同的。与此同时,它们在其他属性中也是一样的。

The way to logically represent that is this have one-to-one-or-zero relationship between the two tables, and to have one table store all the common properties (superclass) and in the other (subclass) you would only store the ID from the superclass.

逻辑上表示的方式是,这两个表之间存在一对一或零关系,并且一个表存储所有公共属性(超类),而另一个(子类)只存储ID来自超类。

In terms of performance this is not so bad:

在性能方面,这并不是那么糟糕:

  • when you don't care about what type of document you work with you will query the superclass table (gain)
  • 当你不关心你使用什么类型的文档时,你将查询超类表(增益)
  • when you know you want only specific properties found in the subclass table you will work only with that table (this might be real gain)
  • 当你知道你只想要在子类表中找到的特定属性时,你只能使用那个表(这可能是真正的好处)
  • you will pay a price only when you need to join the two tables (joins have a price compared to denormalized structures such as storing everything in a single table)
  • 只有在需要加入两个表时才会付出代价(连接的价格与非规范化结构相比,例如将所有内容存储在一个表中)
  • you will also pay a price when inserting subclass records, because you will be inserting into two tables (this might be very low and/or justified)
  • 你也会在插入子类记录时付出代价,因为你将插入两个表(这可能是非常低和/或合理的)

Depending on the processes you are modelling, the frequency of these queries and other things (such as security for both entities, ownership, difference in integrity rules) you might decide to store this information in one table in the database or in two (either can be much faster in border-line cases and two table solution can also be denormalized a bit; for example you could still store information in a main table about the type of the document to avoid the join if that kind of query is all you care).

根据您正在建模的流程,这些查询的频率和其他内容(例如实体的安全性,所有权,完整性规则的差异),您可能决定将此信息存储在数据库中的一个表中或两个中(或者可以在边界行情况下要快得多,并且两个表解决方案也可以进行非规范化处理;例如,您仍然可以在主表中存储有关文档类型的信息,以避免在您关心的那种查询时进行连接。

Or maybe your implementation decisions might be driven by your choice of application framework and for that reason you might really prefer working with single table or the other way around (for example automatic creation of data entry forms in frameworks such as django-admin).

或者,您的实现决策可能是由您选择的应用程序框架驱动的,因此您可能更喜欢使用单表或其他方式(例如在django-admin等框架中自动创建数据输入表单)。

Whatever you do, realize the difference between the logical and physical design. In your logical design normalize everything - it will pay off. In physical implementation make different scenarios and - test, test, test with your own data. Never confuse the order of the two (logical-conceptual and physical-practical modelling).

无论你做什么,都要意识到逻辑和物理设计之间的区别。在你的逻辑设计中规范化一切 - 它会得到回报。在物理实现中,制作不同的场景,并使用您自己的数据进行测试,测试和测试。永远不要混淆两者的顺序(逻辑概念和物理实际建模)。

#5


1  

What shape are your queries? Do you frequently wish to deal with a group (all?) documents, irrespective of whether they're valid? Or does every query only every concern valid (or invalid) documents.

你的查询有什么形状?您是否经常希望处理一组(所有?)文档,无论它们是否有效?或者每个查询只检查每个有效(或无效)的文档。

Or do you wish to deal with groups (irrespective of validity), but wish to frequently perform additional work with valid documents. That may point to a base table and an additional table containing the valid document columns?

或者您希望与团体打交道(不论其有效性),但希望经常使用有效文件进行额外的工作。这可能指向一个基表和一个包含有效文档列的附加表?

#6


1  

Think of OriginalDocuments as of intermediate table. It can change as you input formats change. And it will contain fields which are not valid for imported ("processed") documents, like import date or import error description. And you can clean this table periodically.

将OriginalDocuments视为中间表。它可以随着输入格式的变化而改变。它将包含对导入(“已处理”)文档无效的字段,如导入日期或导入错误描述。你可以定期清理这张桌子。

In contrast to OriginalDocument, ProcessedDocument table will contain only documents and fields valid for your system, with all of the check constraints, indexes and associated business logic. It's structure will change as your system's internal logic changes.

与OriginalDocument相比,ProcessedDocument表将仅包含对您的系统有效的文档和字段,以及所有检查约束,索引和关联的业务逻辑。它的结构将随着系统内部逻辑的变化而变化。

#7


1  

Another thing you might want to take into consideration is the lifecycle and use cases of the rows. If the invalid documents are purged regularly, it might help to have them in separate tables. If the attributes of invalid documents stay limited, but valid documents are getting new columns, that would be a factor in favor of separate tables, too. As the entities are more and more different in behavior and usage, there are more indications that separate tables are merited.

您可能需要考虑的另一件事是行的生命周期和用例。如果定期清除无效文档,则将它们放在单独的表中可能会有所帮助。如果无效文档的属性保持有限,但有效文档正在获得新列,那么这也是支持单独表的一个因素。由于实体在行为和使用方面越来越不同,因此有更多迹象表明单独的表是值得的。