Storing email messages in a database

What sort of database schema would you use to store email messages in a database, with as much header information as is practical?

Assume that they have been fed into a script from the MTA and parsed into the relevant headers/body/attachments.

Would you store the message body whole in the database table, or split any MIME-parts apart? What about attachments?

9 Answers

#1

You may want to check the architecture and the DB schema of "Archiveopteryx".

#2

It depends on what you're going to be doing with it. If you need to search frequently against certain parts of it, you'll want to break it up in a way that makes sense for your use case. If it's just for something like storing e-mail for Sarbanes-Oxley compliance, you'd probably be okay storing the whole thing (headers, parts, etc.) as one big text field.

#3

Suggestion: create a well-defined table for storing e-mail, with a column for each relevant part of a message: sender, header, subject, body. That makes things much simpler later if you want to query by, for example, the subject field. In the same table you can define a field that holds the path of an attachment and store the attached file on the file system, rather than storing it in a blob field.
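
A minimal sketch of that layout, assuming SQLite through Python's standard `sqlite3` module. The table name `email`, the column `attachment_path`, the helper `save_email`, and the temporary directory standing in for a real attachment store are all made up for illustration; note that a single table like this only has room for one attachment per message.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical attachment directory; a temp dir keeps the sketch self-contained.
attachment_dir = Path(tempfile.mkdtemp())

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE email (
    id              INTEGER PRIMARY KEY,
    sender          TEXT,
    header          TEXT,
    subject         TEXT,
    body            TEXT,
    attachment_path TEXT    -- path on the file system, NULL if no attachment
)""")

def save_email(sender, header, subject, body,
               attachment_name=None, attachment_data=None):
    """Write the attachment to disk and store only its path in the table."""
    path = None
    if attachment_data is not None:
        # Keep only the final path component so a crafted file name
        # cannot escape the attachment directory.
        path = attachment_dir / Path(attachment_name).name
        path.write_bytes(attachment_data)
    cur = conn.execute(
        "INSERT INTO email (sender, header, subject, body, attachment_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (sender, header, subject, body, str(path) if path else None),
    )
    return cur.lastrowid

save_email("alice@example.com", "From: alice@example.com\r\n", "Report",
           "See attached.", "report.pdf", b"%PDF-1.4 placeholder")

# Querying by subject is a plain column lookup.
stored_path = conn.execute(
    "SELECT attachment_path FROM email WHERE subject = ?", ("Report",)
).fetchone()[0]
```

Two attachments with the same file name would overwrite each other here, so a real store would need unique names on disk.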

#4

You may want to use a schema where the message body and attachment records can be shared between multiple recipients on the message. It's not uncommon to see email servers where fully 50% of the disk storage is used by duplicate emails.

A simple hash of the body/attachment would be enough to see if that record was already in the database. However, you would still need to keep separate headers.
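
As a rough illustration of the hash check, assuming SQLite and SHA-256 (the answer does not name a hash function, and the table names `body` and `message` and the helper `store_message` are invented for this sketch):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE body (
    sha256  TEXT PRIMARY KEY,    -- hex digest of the content
    content BLOB NOT NULL
);
CREATE TABLE message (
    id          INTEGER PRIMARY KEY,
    recipient   TEXT NOT NULL,
    headers     TEXT NOT NULL,   -- headers are still kept per recipient
    body_sha256 TEXT NOT NULL REFERENCES body(sha256)
);
""")

def store_message(recipient, headers, body):
    """Store one delivery; the body row is shared if it already exists."""
    digest = hashlib.sha256(body).hexdigest()
    # INSERT OR IGNORE leaves an existing row with the same digest untouched.
    conn.execute(
        "INSERT OR IGNORE INTO body (sha256, content) VALUES (?, ?)",
        (digest, body),
    )
    cur = conn.execute(
        "INSERT INTO message (recipient, headers, body_sha256) VALUES (?, ?, ?)",
        (recipient, headers, digest),
    )
    return cur.lastrowid

body = b"Quarterly report attached.\r\n"
store_message("alice@example.com", "To: alice@example.com\r\n", body)
store_message("bob@example.com", "To: bob@example.com\r\n", body)

message_count = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
body_count = conn.execute("SELECT COUNT(*) FROM body").fetchone()[0]
```

After the two calls there are two `message` rows but only one `body` row. The same pattern would apply to attachments.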

#5

It all depends on what you want to do with the data, but in general I would want to store all data and also make sure that the semantics interpreted by the MUA are preserved in the db. For example:

- All headers that are parsed should have their own column.
- One column should contain the complete headers.
- The attachments (including body, multipart) should be in a separate table with a many-to-one relationship to the email table.
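
One possible shape for this, sketched with SQLite. Which headers get their own column is an assumption here (`message_id`, `sender`, `subject`, `sent_at`), and the table names `email` and `part` are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (
    id          INTEGER PRIMARY KEY,
    message_id  TEXT,            -- parsed headers, one column each
    sender      TEXT,
    subject     TEXT,
    sent_at     TEXT,
    raw_headers TEXT NOT NULL    -- the complete header block as received
);
CREATE TABLE part (
    id           INTEGER PRIMARY KEY,
    email_id     INTEGER NOT NULL REFERENCES email(id),  -- many parts, one email
    position     INTEGER NOT NULL,   -- order of the part within the message
    content_type TEXT NOT NULL,
    filename     TEXT,               -- NULL for body parts
    content      BLOB NOT NULL
);
""")

raw_headers = (
    "From: alice@example.com\r\n"
    "Subject: Lunch\r\n"
    "Message-ID: <1@example.com>\r\n"
)
cur = conn.execute(
    "INSERT INTO email (message_id, sender, subject, sent_at, raw_headers) "
    "VALUES (?, ?, ?, ?, ?)",
    ("<1@example.com>", "alice@example.com", "Lunch",
     "2024-06-01T12:00:00Z", raw_headers),
)
email_id = cur.lastrowid
conn.executemany(
    "INSERT INTO part (email_id, position, content_type, filename, content) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (email_id, 0, "text/plain", None, b"Noon at the usual place?"),
        (email_id, 1, "image/png", "map.png", b"placeholder bytes"),
    ],
)

# Parsed columns make lookups simple; parts are joined in when needed.
rows = conn.execute(
    "SELECT e.subject, COUNT(p.id) FROM email e "
    "JOIN part p ON p.email_id = e.id "
    "WHERE e.subject = ? GROUP BY e.id",
    ("Lunch",),
).fetchall()
```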

#6

You'll probably want to at least store attachments separately to optimize storage. It's astonishing to see the size and quantity of attachments (videos, etc.) that most users unhesitatingly attach to emails.

In the case of outgoing emails you may have multiple emails sending the same attachment. It's far more efficient to store a single copy of the attachment that is referenced by all emails that share it.

Another reason for storing attachments separately is that it gives you some archiving options later on. Should storage space become an issue, you can always go back and delete large attachments older than a given date in order to compact the database.
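
A sketch of what such a cleanup could look like with SQLite. The `attachment` table, its columns, and the helper `purge_attachments` are hypothetical, and clearing the content while keeping the metadata row (rather than deleting the row outright) is a design choice made for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE attachment (
    id          INTEGER PRIMARY KEY,
    filename    TEXT NOT NULL,
    size_bytes  INTEGER NOT NULL,
    received_at TEXT NOT NULL,   -- ISO 8601 date, which sorts correctly as text
    content     BLOB             -- NULL once the attachment has been purged
)""")
conn.executemany(
    "INSERT INTO attachment (filename, size_bytes, received_at, content) "
    "VALUES (?, ?, ?, ?)",
    [
        # Sizes are illustrative; the content is placeholder bytes.
        ("holiday.mp4", 50_000_000, "2015-03-01", b"..."),
        ("notes.txt", 2_000, "2015-03-01", b"..."),
        ("demo.mp4", 80_000_000, "2024-06-01", b"..."),
    ],
)

def purge_attachments(conn, min_size, before):
    """Drop the content of attachments of at least min_size bytes received before a date."""
    cur = conn.execute(
        "UPDATE attachment SET content = NULL "
        "WHERE content IS NOT NULL AND size_bytes >= ? AND received_at < ?",
        (min_size, before),
    )
    return cur.rowcount

purged = purge_attachments(conn, 10_000_000, "2020-01-01")
remaining = [
    row[0] for row in conn.execute(
        "SELECT filename FROM attachment WHERE content IS NOT NULL ORDER BY id"
    )
]
```

Only the old, large `holiday.mp4` loses its content. With a file-backed SQLite database the freed pages are only returned to the file system after a `VACUUM`.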

#7

An important step in database schema design is to figure out what types of entity you want to model. For this application the entities might be:

- Messages
- E-mail addresses
- Conversation threads (perhaps, if you want to do efficient threading)
- Attachments (perhaps, as suggested in other answers)
- ...

Once you know the entities, you can identify relationships between entities, which can be represented by tables:

- Messages have a many-to-many relationship to messages (In-Reply-To and References headers).
- Messages have a many-to-many relationship to e-mail addresses (From, To, Cc, etc. headers).
- Messages have a many-to-one relationship with threads.
- Messages have a many-to-many relationship with attachments.
- ...
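
The entities and relationships listed above could be sketched in SQLite roughly as follows. All table and column names are invented for illustration; each many-to-many relationship gets its own link table, and the many-to-one relationship to threads is a plain foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE thread     (id INTEGER PRIMARY KEY);
CREATE TABLE message    (
    id        INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL REFERENCES thread(id),   -- many-to-one
    subject   TEXT
);
CREATE TABLE address    (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
CREATE TABLE attachment (id INTEGER PRIMARY KEY, filename TEXT, content BLOB);

-- Link tables for the many-to-many relationships.
CREATE TABLE message_reference (
    message_id    INTEGER NOT NULL REFERENCES message(id),
    referenced_id INTEGER NOT NULL REFERENCES message(id),
    PRIMARY KEY (message_id, referenced_id)
);
CREATE TABLE message_address (
    message_id INTEGER NOT NULL REFERENCES message(id),
    address_id INTEGER NOT NULL REFERENCES address(id),
    role       TEXT NOT NULL,    -- 'from', 'to', 'cc', ...
    PRIMARY KEY (message_id, address_id, role)
);
CREATE TABLE message_attachment (
    message_id    INTEGER NOT NULL REFERENCES message(id),
    attachment_id INTEGER NOT NULL REFERENCES attachment(id),
    PRIMARY KEY (message_id, attachment_id)
);
""")

conn.execute("INSERT INTO thread (id) VALUES (1)")
conn.executemany(
    "INSERT INTO message (id, thread_id, subject) VALUES (?, ?, ?)",
    [(1, 1, "Lunch?"), (2, 1, "Re: Lunch?")],
)
conn.executemany(
    "INSERT INTO address (id, email) VALUES (?, ?)",
    [(1, "alice@example.com"), (2, "bob@example.com")],
)
conn.executemany(
    "INSERT INTO message_address (message_id, address_id, role) VALUES (?, ?, ?)",
    [(1, 1, "from"), (1, 2, "to"), (2, 2, "from"), (2, 1, "to")],
)
conn.execute("INSERT INTO message_reference VALUES (2, 1)")  # message 2 replies to 1

# Example query: subjects of all messages addressed to bob.
received_by_bob = [
    row[0] for row in conn.execute(
        "SELECT m.subject FROM message m "
        "JOIN message_address ma ON ma.message_id = m.id "
        "JOIN address a ON a.id = ma.address_id "
        "WHERE a.email = ? AND ma.role = 'to' ORDER BY m.id",
        ("bob@example.com",),
    )
]
```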

#8

If it is already split up, and you can be sure that the routine that splits the data is sound, then I would make the table as granular as possible. You can always put the message back together in your middle tier. If space is not an issue, you could also store it twice: once split up into the relevant fields, and once in another field that holds the whole thing as one blob, in case putting it back together is hard.

#9

Parsing an email is not trivial, so consider storing the email as a blob and then parsing it into whatever pieces you need afterwards.
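
For instance, a minimal sketch using SQLite plus the `email` package from Python's standard library. The table name `raw_message` and the sample message are made up; the point is only that the stored bytes stay untouched and can be re-parsed at any time:

```python
import sqlite3
from email import message_from_bytes, policy

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_message (id INTEGER PRIMARY KEY, raw BLOB NOT NULL)")

# A made-up message, exactly as it might arrive from the MTA.
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch?\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Noon at the usual place?\r\n"
)
conn.execute("INSERT INTO raw_message (raw) VALUES (?)", (raw,))

# Later: read the blob back and parse out whatever pieces are needed.
stored = conn.execute("SELECT raw FROM raw_message WHERE id = 1").fetchone()[0]
msg = message_from_bytes(stored, policy=policy.default)
subject = str(msg["Subject"])
body_text = msg.get_body(preferencelist=("plain",)).get_content()
```

If the parsing rules change later, the original bytes are still there to be parsed again.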

/Allan
