Modeling atomic facts in a relational database

I want to record what various sources say about a historical figure. For example:

The website Wikipedia says that Susan B. Anthony was born February 15, 1820 and her favorite color was blue

The book Century of Struggle says that Susan B. Anthony was born on February 12, 1820 and her favorite color was red

The book History of Woman's Suffrage says that Susan B. Anthony was born on February 15, 1820 and her favorite color was red and she was the second cousin of Abraham Lincoln

I also want researchers to be able to express their confidence, for instance as a percentage, in the individual claims that these sources make. For example:

User A is 90% confident that Susan B. Anthony was born on February 15, 1820; 75% confident that her favorite color was blue, and 30% confident that she was second cousins with Abraham Lincoln

User B is 30% confident that Susan B. Anthony was born on February 12, 1820; 60% confident that her favorite color was blue, and 10% confident that she was second cousins with Abraham Lincoln

I then want each user to have a view of Susan B. Anthony that shows the birthday, favorite color, and relationships that the user thinks are most likely to be true.

I also want to use a relational database as the datastore. The only way I can think of to do this is to create a separate table for every type of atomic fact that I want users to be able to express their confidence in. For this example there would be eight tables in total, three of them for the three separate kinds of atomic fact.

Source(id)
Person(id)

Claim(id, source, FOREIGN KEY(source) REFERENCES Source(id))
Alleged_birth_date(claim_id, person, birth_date, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id))
Alleged_favorite_color(claim_id, person, color, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id))
Alleged_kinship(claim_id, person, relationship_type, kin, FOREIGN KEY(claim_id) REFERENCES Claim(id), FOREIGN KEY(person) REFERENCES Person(id))

User(id)
Confidence_in_claim(user, claim, confidence, FOREIGN KEY(user) REFERENCES User(id), FOREIGN KEY(claim) REFERENCES Claim(id))

This feels like it gets very complicated very quickly, as I actually want to record many types of atomic facts. Are there better ways to do this?

This is, I think, the same issue that Martin Fowler calls Contradictory Observations.

4 answers

#1

RDF is great for this. It's usually described as a format for metadata, but in fact it's a graph model of 'assertions' expressed as triples.

The whole 'semantic web' idea is to publish lots of facts in RDF, with search engines acting as inference engines that traverse the unified graph to find relationships.

There are also mechanisms for referring to a triple, so you can say something about an assertion: its origin (who says this?), when it was asserted (when did they say it?), how much you believe it to be true, and so on.
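
To make "saying something about an assertion" concrete, here is a minimal plain-Python sketch. It is not an RDF library and does not use RDF syntax; the names `Triple`, `asserted_by`, `confidence`, and `most_believed` are made up for illustration, and the data comes from the question.

```python
from collections import namedtuple

# One assertion: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

born_15 = Triple("Susan B. Anthony", "born_on", "1820-02-15")
born_12 = Triple("Susan B. Anthony", "born_on", "1820-02-12")
cousin = Triple("Susan B. Anthony", "second_cousin_of", "Abraham Lincoln")

# Statements about assertions: which sources made each one.
asserted_by = {
    born_15: ["Wikipedia", "History of Woman's Suffrage"],
    born_12: ["Century of Struggle"],
    cousin: ["History of Woman's Suffrage"],
}

# Statements about assertions: how strongly each user believes them.
confidence = {
    ("User A", born_15): 0.90,
    ("User A", cousin): 0.30,
    ("User B", born_12): 0.30,
    ("User B", cousin): 0.10,
}

def most_believed(user, subject, predicate):
    """Return the triple for (subject, predicate) that the user rates highest, or None."""
    rated = [
        (score, triple)
        for (who, triple), score in confidence.items()
        if who == user and triple.subject == subject and triple.predicate == predicate
    ]
    if not rated:
        return None
    return max(rated, key=lambda pair: pair[0])[1]
```

Because the assertions themselves are ordinary values, adding a new kind of fact only means using a new predicate, not a new table or structure.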

As a big example, the whole OpenCyc 'commonsense knowledge base' is queryable in RDF

#2

You should try a Star Schema model, centered around a "Fact" table and several "Dimension" tables. This is a well-explored model, and there are many database optimizations for it.

claim_fact(source_id, person_id, user_id, details_id, weight)

Source_dimension(id, name)

Person_dimension(id, name)

User_dimension(id, name)

details_dimension(id, name NOT NULL, color NULLABLE, kinship NULLABLE, birthday NULLABLE)

Every claim would have a source, person, user, and details. NAME values for details would be values such as "kinship", "birthday".
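
As a rough sketch of how these tables fit together, the following uses Python's built-in `sqlite3` as a stand-in for a real warehouse DBMS. The column types, the 0.0 to 1.0 scale for `weight`, and the helper `weights_for` are assumptions for illustration; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_dimension (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person_dimension (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user_dimension   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE details_dimension (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,          -- kind of fact: 'birthday', 'color', 'kinship'
    color TEXT, kinship TEXT, birthday TEXT
);
CREATE TABLE claim_fact (
    source_id  INTEGER REFERENCES source_dimension(id),
    person_id  INTEGER REFERENCES person_dimension(id),
    user_id    INTEGER REFERENCES user_dimension(id),
    details_id INTEGER REFERENCES details_dimension(id),
    weight     REAL              -- user's confidence (assumed scale 0.0 to 1.0)
);
INSERT INTO source_dimension VALUES (1, 'Wikipedia'), (2, 'Century of Struggle');
INSERT INTO person_dimension VALUES (1, 'Susan B. Anthony');
INSERT INTO user_dimension VALUES (1, 'User A'), (2, 'User B');
INSERT INTO details_dimension VALUES
    (1, 'birthday', NULL, NULL, '1820-02-15'),
    (2, 'birthday', NULL, NULL, '1820-02-12'),
    (3, 'color', 'blue', NULL, NULL);
INSERT INTO claim_fact VALUES
    (1, 1, 1, 1, 0.9), (1, 1, 1, 3, 0.75),
    (2, 1, 2, 2, 0.3), (1, 1, 2, 3, 0.6);
""")

def weights_for(user_name):
    """List (fact kind, value, weight) rows for one user, highest weight first."""
    return conn.execute("""
        SELECT d.name, COALESCE(d.birthday, d.color, d.kinship), f.weight
        FROM claim_fact f
        JOIN user_dimension u ON u.id = f.user_id
        JOIN details_dimension d ON d.id = f.details_id
        WHERE u.name = ?
        ORDER BY f.weight DESC
    """, (user_name,)).fetchall()
```

Calling `weights_for("User A")` returns that user's claims ordered by confidence, which is the raw material for a per-user "most likely" view.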

Keep in mind that this is an OLAP schema (rather than an OLTP structure), and as such it is not fully normalized. The benefits outweigh any problems you may run into due to redundancy, since queries against star schemas are highly optimized by DBMSs configured for data warehousing.

RECOMMENDED READING: The Data Warehouse Toolkit (Kimball, et al.)

#3

I think what you want to use is a "property bag". Instead of modeling each individual type of fact that you want to describe, you have one table that contains an ID, a "key" (in this case, the kind of alleged information, such as "kinship"), and a "value" (in this case, the alleged value, such as "Abraham Lincoln"). You then have a second table that ties your claimants to that table, along with the level of confidence they have in that information. That table would simply hold the ID of the source, the ID of the property, and the confidence that the source has in the information. That way, a source can have a lot of information or only a little, different sources can have different levels of confidence in a given attribute, and there is no limit on how many types of information you can store.
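
A minimal sketch of the two tables, using Python's built-in `sqlite3`. The table and column names (`property`, `confidence`, `claimant`) and the helper `most_likely` are assumptions for illustration, and the claimants here are the question's users rather than its sources; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (
    id     INTEGER PRIMARY KEY,
    person TEXT NOT NULL,
    name   TEXT NOT NULL,   -- the "key", e.g. 'birth_date'
    value  TEXT NOT NULL    -- the alleged value
);
CREATE TABLE confidence (
    claimant    TEXT NOT NULL,
    property_id INTEGER NOT NULL REFERENCES property(id),
    confidence  REAL NOT NULL,
    PRIMARY KEY (claimant, property_id)
);
""")
conn.executemany("INSERT INTO property VALUES (?, ?, ?, ?)", [
    (1, "Susan B. Anthony", "birth_date", "1820-02-15"),
    (2, "Susan B. Anthony", "birth_date", "1820-02-12"),
    (3, "Susan B. Anthony", "favorite_color", "blue"),
    (4, "Susan B. Anthony", "second_cousin_of", "Abraham Lincoln"),
])
conn.executemany("INSERT INTO confidence VALUES (?, ?, ?)", [
    ("User A", 1, 0.90), ("User A", 3, 0.75), ("User A", 4, 0.30),
    ("User B", 2, 0.30), ("User B", 3, 0.60), ("User B", 4, 0.10),
])

def most_likely(claimant, person):
    """Map each property name to the value this claimant rates highest."""
    rows = conn.execute("""
        SELECT p.name, p.value
        FROM confidence c
        JOIN property p ON p.id = c.property_id
        WHERE c.claimant = ? AND p.person = ?
        ORDER BY p.name, c.confidence DESC
    """, (claimant, person)).fetchall()
    best = {}
    for name, value in rows:
        best.setdefault(name, value)  # first row per name has the highest confidence
    return best
```

A new type of fact is just a new value in the `name` column, so no schema change is needed. The trade-off is that every value is stored as text, so type checking moves into the application.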

It's a pretty standard solution for situations such as yours where you have large amounts of optional information that you want to cross-reference.

#4

This feels like it gets very complicated very quickly

You're not kidding. Have a look at the work on ontology and knowledge representation.
