实现一个数据库——如何开始

时间:2022-06-29 21:22:00

I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt with coding a tangible software from ground up.

我试着学习编程已经有一段时间了。我学过Java和Python,对它们的语法很熟悉。最近,我想利用我在从头开始编写有形软件时学到的知识。

I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.

我想实现一个数据库引擎,一种NoSQL数据库。我已经编写了一个小文档,在编写它的整个冒险过程中都遵循某种规范。但我只知道一堆关键词。我不知道从哪里开始。

Can someone help me find out how to gather the knowledge I need for this kind of work and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or start from a wrong point, because implementing a complete database engine is (seeming to be) a truly complicated task.

有人能帮我找到我需要的知识吗?我需要做这样的工作,为了学习什么?我已经搜索了文档,但是我感觉我最终会发现不相关的/错误的内容,或者从错误的地方开始,因为实现一个完整的数据库引擎(似乎)是一项非常复杂的任务。

I wan't to express that I'd prefer theses and whitepapers and (e)books to codes of other projects, because I've asked a question of kind in which people usually get answered in the form of "read project - x' source code". I'm not at the level of comfortably reading and understanding source code.

我不想说比起其他项目的代码,我更喜欢论文、白皮书和(e)书籍,因为我问了一个问题,人们通常会以“阅读项目- x的源代码”的形式得到回答。我还没有阅读和理解源代码的能力。

2 个解决方案

#1


8  

First, you may have a look that the answers for How to write a simple database engine. While it focus on a SQL engine, there is still a lot of good material in the answers.

首先,您可以看看如何编写一个简单的数据库引擎的答案。虽然它关注的是SQL引擎,但是答案中仍然有很多好的材料。

Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.

否则,一个好的项目教程就是B-Tree数据库类的实现。示例代码在c++中,但是对所做的事情和原因的描述可能是您希望看到的。

Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.

此外,还有在MSDN上设计和实现结构化存储(数据库引擎)。这里有很多信息可以帮助你完成学习计划。

#2


4  

Because the accepted answer only offers (good) links to other resources, I'd thought I share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code. It's pretty small. You should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this and since writing it I learned a lot more about it and see I have been doing some things wrong. It can help you get started though.

因为公认的答案只提供(好的)到其他资源的链接,所以我想分享我编写webdb的经验,这是一个小型的浏览器实验数据库。我也邀请您阅读源代码。这是很小的。你应该能够读懂它,并对它在几个小时内做的事情有一个基本的了解。警告:我在这方面很在行,自从写了这篇文章后,我学到了很多,我发现我做错了一些事情。它可以帮助你开始。

The basics: BTree

I started out with adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node, then all items with key < K in a node in the left subtree and all items with key > K in a right subtree. You can use an array to store the data items if you want to support non unique keys.

我开始调整AVL树以适应我的需要。AVL树是一种自平衡二叉搜索树。在节点中存储密钥K和相关数据(如果有),然后在左子树的节点中存储密钥< K的所有项,在右子树中存储密钥>k的所有项。如果希望支持非唯一键,可以使用数组来存储数据项。

This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.

这棵树将为您提供基本的信息:创建、更新、删除和一种按键快速获取项的方法,或者所有按键< x、键在x和y之间的项。它可以作为我们表的索引。

A schema

As a next step I wrote code that lets the client code define a schema. Methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL sort-of has a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as primary key and which fields will be searched on frequently and need an index.

作为下一步,我编写了允许客户端代码定义模式的代码。诸如createTable()等方法。模式通常与SQL相关联,但即使是非SQL排序也有一个模式;它们通常要求您标记ID字段和您想要搜索的任何其他字段。您可以根据自己的需要设置模式,但您通常希望至少建模哪些列作为主键,以及哪些字段将被频繁地搜索,并需要一个索引。

Creating a data structure to store a table

I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).

我决定使用我在第一步中创建的树来存储我的项目。这些是简单的JS对象。定义了包含PK的字段后,我可以使用该字段的值作为键将项目插入到树中。这使我可以通过ID(范围)快速查找。

Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)

接下来,我为每个需要索引的列添加了另一棵树。在这些树中,我没有保存完整的记录,只有钥匙。因此,为了获取客户的姓,我首先使用姓氏上的索引获取ID,然后使用主键索引获取实际的记录。我之所以不存储实际对象(引用)是因为它使设置操作更简单(参见下一个步骤)

Querying

Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.

现在我们有了一个包含PK和搜索字段索引的表,我们可以实现查询。我并没有深入到这一步,因为它很快就会变得复杂起来,但是您可以使用一些基本的功能获得一些不错的功能。WebDB不实现连接;所有查询仅在一个表上操作。但是一旦您理解了这一点,您就会看到一个非常清晰的路径(虽然很长而且很曲折)来处理连接和其他复杂的事情。

In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:

在WebDB中,若要让所有的客户都具有firstName = 'John'和city = 'New York'(假设这两个字段是两个搜索字段),您可以这样写:

var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
})

To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.

为了解决这个问题,我们首先进行两个查找:我们获得所有名为“John”的客户id的集合X,以及来自纽约的所有客户id的集合Y。然后我们在这两个集合上执行一个交集,以获取所有名为“John”和来自纽约的客户id。然后运行生成的id集,获取每个id的实际记录并将其添加到结果数组中。

Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.

使用集合运算符如联合和交集,我们可以执行和或搜索。我只和实现。

Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

执行连接(我认为)需要在内存中创建临时表,然后在查询运行时使用已连接的结果填充它们,然后将查询条件应用到临时表。我从来没有到达那里。我接下来尝试了一些同步逻辑,但这太过雄心勃勃了,从那以后就走下坡路了。

#1


8  

First, you may have a look that the answers for How to write a simple database engine. While it focus on a SQL engine, there is still a lot of good material in the answers.

首先,您可以看看如何编写一个简单的数据库引擎的答案。虽然它关注的是SQL引擎,但是答案中仍然有很多好的材料。

Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.

否则,一个好的项目教程就是B-Tree数据库类的实现。示例代码在c++中,但是对所做的事情和原因的描述可能是您希望看到的。

Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.

此外,还有在MSDN上设计和实现结构化存储(数据库引擎)。这里有很多信息可以帮助你完成学习计划。

#2


4  

Because the accepted answer only offers (good) links to other resources, I'd thought I share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code. It's pretty small. You should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this and since writing it I learned a lot more about it and see I have been doing some things wrong. It can help you get started though.

因为公认的答案只提供(好的)到其他资源的链接,所以我想分享我编写webdb的经验,这是一个小型的浏览器实验数据库。我也邀请您阅读源代码。这是很小的。你应该能够读懂它,并对它在几个小时内做的事情有一个基本的了解。警告:我在这方面很在行,自从写了这篇文章后,我学到了很多,我发现我做错了一些事情。它可以帮助你开始。

The basics: BTree

I started out with adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node, then all items with key < K in a node in the left subtree and all items with key > K in a right subtree. You can use an array to store the data items if you want to support non unique keys.

我开始调整AVL树以适应我的需要。AVL树是一种自平衡二叉搜索树。在节点中存储密钥K和相关数据(如果有),然后在左子树的节点中存储密钥< K的所有项,在右子树中存储密钥>k的所有项。如果希望支持非唯一键,可以使用数组来存储数据项。

This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.

这棵树将为您提供基本的信息:创建、更新、删除和一种按键快速获取项的方法,或者所有按键< x、键在x和y之间的项。它可以作为我们表的索引。

A schema

As a next step I wrote code that lets the client code define a schema. Methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL sort-of has a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as primary key and which fields will be searched on frequently and need an index.

作为下一步,我编写了允许客户端代码定义模式的代码。诸如createTable()等方法。模式通常与SQL相关联,但即使是非SQL排序也有一个模式;它们通常要求您标记ID字段和您想要搜索的任何其他字段。您可以根据自己的需要设置模式,但您通常希望至少建模哪些列作为主键,以及哪些字段将被频繁地搜索,并需要一个索引。

Creating a data structure to store a table

I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).

我决定使用我在第一步中创建的树来存储我的项目。这些是简单的JS对象。定义了包含PK的字段后,我可以使用该字段的值作为键将项目插入到树中。这使我可以通过ID(范围)快速查找。

Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)

接下来,我为每个需要索引的列添加了另一棵树。在这些树中,我没有保存完整的记录,只有钥匙。因此,为了获取客户的姓,我首先使用姓氏上的索引获取ID,然后使用主键索引获取实际的记录。我之所以不存储实际对象(引用)是因为它使设置操作更简单(参见下一个步骤)

Querying

Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.

现在我们有了一个包含PK和搜索字段索引的表,我们可以实现查询。我并没有深入到这一步,因为它很快就会变得复杂起来,但是您可以使用一些基本的功能获得一些不错的功能。WebDB不实现连接;所有查询仅在一个表上操作。但是一旦您理解了这一点,您就会看到一个非常清晰的路径(虽然很长而且很曲折)来处理连接和其他复杂的事情。

In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:

在WebDB中,若要让所有的客户都具有firstName = 'John'和city = 'New York'(假设这两个字段是两个搜索字段),您可以这样写:

var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
})

To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.

为了解决这个问题,我们首先进行两个查找:我们获得所有名为“John”的客户id的集合X,以及来自纽约的所有客户id的集合Y。然后我们在这两个集合上执行一个交集,以获取所有名为“John”和来自纽约的客户id。然后运行生成的id集,获取每个id的实际记录并将其添加到结果数组中。

Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.

使用集合运算符如联合和交集,我们可以执行和或搜索。我只和实现。

Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

执行连接(我认为)需要在内存中创建临时表,然后在查询运行时使用已连接的结果填充它们,然后将查询条件应用到临时表。我从来没有到达那里。我接下来尝试了一些同步逻辑,但这太过雄心勃勃了,从那以后就走下坡路了。