What is the best way to associate a file with a piece of data?

Time: 2021-11-03 17:00:23

I have an application that creates records in a table (rocket science, I know). Users want to associate files (.doc, .xls, .pdf, etc.) with a single record in the table.

  • Should I store the contents of the file(s) in the database? Wouldn't this bloat the database?

  • Should I store the file(s) on a file server, and store the path(s) in the database?

What is the best way to do this?

8 solutions

#1


10  

I think you've accurately captured the two most popular approaches to solving this problem. There are pros and cons to each:

Store the Files in the DB

Most RDBMSs support storing BLOBs (binary file data such as .doc, .xls, etc.) in a database, so you're not breaking new ground here.

Pros

  • Simplifies backup of the data: back up the DB and you have all the files.
  • The linkage between the metadata (the other columns ABOUT the files) and the file itself is solid and built into the DB, so it's a one-stop shop for getting data about your files.

Cons

  • Backups can quickly blossom into a HUGE nightmare, as you're storing all of that binary data with your database. You could alleviate some of the headaches by keeping the files in a separate DB.
  • Without the DB or an interface to the DB, there's no easy way to get to the file content to modify or update it.
  • In general, it's harder to code and coordinate the upload and storage of data to a DB than to the filesystem.
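To make the in-database option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `attachments`, its columns, and the helper names are assumptions made up for illustration; a real RDBMS will have its own BLOB type and size limits.

```python
import sqlite3
from typing import Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per attached file, linked to a record id.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS attachments (
            id            INTEGER PRIMARY KEY,
            record_id     INTEGER NOT NULL,
            original_name TEXT    NOT NULL,
            content       BLOB    NOT NULL
        )
        """
    )


def attach_file(conn: sqlite3.Connection, record_id: int,
                original_name: str, data: bytes) -> int:
    # Metadata and file bytes are written together in a single INSERT.
    cur = conn.execute(
        "INSERT INTO attachments (record_id, original_name, content) "
        "VALUES (?, ?, ?)",
        (record_id, original_name, data),
    )
    conn.commit()
    return cur.lastrowid


def read_file(conn: sqlite3.Connection, attachment_id: int) -> Tuple[str, bytes]:
    row = conn.execute(
        "SELECT original_name, content FROM attachments WHERE id = ?",
        (attachment_id,),
    ).fetchone()
    if row is None:
        raise KeyError(attachment_id)
    return row[0], bytes(row[1])
```

Note that every read goes through the database connection, which is the "no easy way to get to the file content without the DB" trade-off listed above.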

Store the Files on the FileSystem

This approach is pretty simple: you store the files themselves in the filesystem, and your database stores a reference to each file's location (as well as all of the metadata about the file). One helpful hint here is to standardize your naming scheme for the files on disk: don't use the file name that the user gives you; generate one of your own and store theirs in the DB.
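The naming hint above might look roughly like this in Python. This is only a sketch: the `files` table, its columns, and the choice of a random UUID for the on-disk name are assumptions for illustration, not something the answer specifies.

```python
import sqlite3
import uuid
from pathlib import Path

# Hypothetical table: maps a generated on-disk name to the user's original name.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id            INTEGER PRIMARY KEY,
    record_id     INTEGER NOT NULL,
    stored_name   TEXT    NOT NULL UNIQUE,
    original_name TEXT    NOT NULL
)
"""


def save_upload(conn: sqlite3.Connection, storage_root: Path, record_id: int,
                original_name: str, data: bytes) -> str:
    # Only the extension is taken from the user's name; the rest is generated.
    suffix = Path(original_name).suffix.lower()
    stored_name = uuid.uuid4().hex + suffix

    storage_root.mkdir(parents=True, exist_ok=True)
    (storage_root / stored_name).write_bytes(data)

    # The user's original name is kept as metadata in the database.
    conn.execute(
        "INSERT INTO files (record_id, stored_name, original_name) "
        "VALUES (?, ?, ?)",
        (record_id, stored_name, original_name),
    )
    conn.commit()
    return stored_name
```

Because the file is written before the row is inserted, a crash between the two steps leaves an orphaned file rather than a database row pointing at nothing.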

Pros

  • Keeps your file data cleanly separated from the database.
  • Easy to maintain the files themselves: if you need to swap out or update a file, you do so in the file system itself. You can just as easily do it from the application via a new upload.

Cons

  • If you're not careful, your database records about the files can get out of sync with the files themselves.
  • Security can be an issue (again, if you're careless), depending on where you store the files and whether or not that filesystem is available to the public (via the web, I'm assuming here).

At the end of the day, we chose to go the filesystem route. It was easier to implement quickly, easy on the backup, and pretty secure once we locked down any holes and streamed the file out (instead of just serving it directly from the filesystem). It's been operational in pretty much the same format for about 6 years in two different government applications.
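As a rough illustration of "streaming the file out" from application code, the sketch below reads a stored file in chunks from a directory that is not publicly served. The function name, the `storage_root` parameter, and the path check are assumptions, not the setup the answer actually used; the authorization check that would precede this call is left out.

```python
from pathlib import Path
from typing import Iterator


def stream_file(storage_root: Path, stored_name: str,
                chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    # This is a generator, so the checks below run when iteration starts.
    root = storage_root.resolve()
    path = (root / stored_name).resolve()

    # Refuse any name that resolves outside the storage root (e.g. "../x").
    if root not in path.parents:
        raise PermissionError(f"{stored_name!r} is outside the storage root")

    with path.open("rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A web framework's streaming response can usually consume an iterator like this directly, so the whole file never has to sit in memory.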

J

#2


4  

How well you can store binaries, or BLOBs, in a database will be highly dependent on the DBMS you are using.

If you store binaries on the file system, you need to consider what happens in the case of a file name collision, where you try to store two different files with the same name, and whether this is a valid operation or not. So, along with the reference to where the file lives on the file system, you may also need to store the original file name.

Also, if you are storing a large number of files, be aware of the possible performance hit of storing all your files in one folder. (You didn't specify your operating system, but you might want to look at this question for NTFS, or this reference for ext3.)

We had a system that had to store several thousand files on a file system where we were concerned about the number of files in any one folder (it may have been FAT32, I think).

Our system would take a new file to be added and generate an MD5 checksum for it (in hex). It would use the first two characters as the name of the first folder, the next two characters as the name of a sub-folder of the first folder, and the next two as the name of a sub-folder of the second folder.

That way, we ended up with a three-level set of folders, and the files were reasonably well scattered, so no one folder filled up too much.
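The three-level folder scheme described above can be sketched in a few lines of Python. The function name is made up for illustration, and hashing the file's contents (rather than, say, its name) is an assumption about what "an MD5 checksum for it" means. MD5 is used here only to spread files across folders, not for security.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_dir(data: bytes) -> PurePosixPath:
    # Hex digest is 32 characters; the first six pick the three folder levels.
    digest = hashlib.md5(data).hexdigest()
    return PurePosixPath(digest[0:2], digest[2:4], digest[4:6])


# The MD5 of b"hello" is 5d41402abc4b2a76b9719d911017c592.
print(sharded_dir(b"hello"))  # prints 5d/41/40
```

With two hex characters per level, each folder has at most 256 sub-folders, which keeps any single directory listing small.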

If we still had a file name collision after that, we would just add "_n" to the file name (before the extension), where n was an incrementing number, until we got a name that didn't exist (and even then, I think we did atomic file creation, just to be sure).
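One way to combine the "_n" suffix with atomic creation is Python's exclusive-create mode `"x"`, which fails if the name already exists instead of overwriting it. This is a sketch of the idea, not the original system's code; the function name is hypothetical.

```python
import os
from pathlib import Path


def create_unique(directory: Path, file_name: str, data: bytes) -> Path:
    stem, ext = os.path.splitext(file_name)
    n = 0
    while True:
        # First try the plain name, then name_1.ext, name_2.ext, ...
        name = file_name if n == 0 else f"{stem}_{n}{ext}"
        candidate = directory / name
        try:
            # Mode "x" creates the file only if it does not exist yet, so
            # the existence check and the creation happen as a single step.
            with open(candidate, "xb") as fh:
                fh.write(data)
        except FileExistsError:
            n += 1
        else:
            return candidate
```

Checking `candidate.exists()` first and then opening with `"wb"` would leave a window in which two concurrent uploads could pick the same name; the exclusive mode closes that window.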

Of course, you then need tools to occasionally compare the database records to the file system, flagging any missing files and cleaning up any orphaned ones whose database record no longer exists.
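Such a comparison tool could start out as small as the sketch below. It assumes the database stores paths relative to a storage root, using forward slashes; the function name and that convention are assumptions for illustration.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple


def reconcile(storage_root: Path,
              db_paths: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    # Every regular file under the root, as a relative forward-slash path.
    on_disk = {
        p.relative_to(storage_root).as_posix()
        for p in storage_root.rglob("*")
        if p.is_file()
    }
    in_db = set(db_paths)

    missing = in_db - on_disk    # referenced by the DB, absent on disk
    orphaned = on_disk - in_db   # present on disk, no DB record
    return missing, orphaned
```

The caller would feed `db_paths` from a query over the file table, then report the `missing` set and decide whether to delete or quarantine the `orphaned` files.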

#3


2  

You should only store files in the database if you're reasonably sure that the sizes of those files aren't going to get out of hand.

I use our database to store small banner images, whose size I always know in advance. Your database will store a pointer to the data inside the row and then plunk the data itself somewhere else, so it doesn't necessarily impact speed.

If there are too many unknowns, though, using the filesystem is the safer route.

#4


2  

The best solution would be to put the documents in the database. This simplifies all the linking, backing-up and restoring issues - but it might not address the basic 'we just want to point to documents on our file server' mindset the users may have.

It all depends (in the end) on actual user requirements.

But my recommendation would be to put it all together in the database so you retain control of the files. Leaving them in the file system leaves them open to being deleted, moved, ACL'd or any one of hundreds of other changes that could render your links to them pointless or even damaging.

Database bloat is only an issue if you haven't sized for it. Do some tests and see what effects it has. 100GB of files on a disk is probably just as big as the same files in a database.

#5


2  

Use the database for data and the filesystem for files. Simply store the file path in the database.

In addition, your webserver can probably serve files more efficiently than your application code can (which would have to stream the file from the DB back to the client).

#6


2  

Store the paths in the database. This keeps your database from bloating, and also allows you to back up the external files separately. You can also relocate them more easily; just move them to a new location and then UPDATE the database.
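The "move, then UPDATE" relocation might look something like the following sketch with sqlite3 and shutil. The `files` table with `id` and `path` columns, and the function name, are assumptions made up for illustration.

```python
import shutil
import sqlite3
from pathlib import Path


def relocate_files(conn: sqlite3.Connection,
                   old_root: Path, new_root: Path) -> int:
    rows = conn.execute("SELECT id, path FROM files").fetchall()
    moved = 0
    for file_id, path in rows:
        src = Path(path)
        try:
            relative = src.relative_to(old_root)
        except ValueError:
            continue  # this row does not live under old_root; leave it alone

        dst = new_root / relative
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))

        # Update the stored path right after each move, so an interruption
        # leaves at most one file whose row still points at the old place.
        conn.execute("UPDATE files SET path = ? WHERE id = ?",
                     (str(dst), file_id))
        conn.commit()
        moved += 1
    return moved
```

Storing paths relative to a configurable root instead of absolute paths would make this even simpler, since a relocation would then only change the configuration.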

One additional thing to keep in mind: in order to use most of the file types you mentioned when the contents are stored in the database, you'll end up having to:

  • Query the database to get the file contents in a blob
  • Write the blob data to a disk file
  • Launch an application to open/edit/whatever the file you just created
  • Read the file back in from disk to a blob
  • Update the database with the new content

All that as opposed to:

  • Read the file path from the DB
  • Launch the app to open/edit/whatever the file

I prefer the second set of steps, myself.

#7


1  

I would try to store it all in the database, though I haven't done it myself. If you don't, there is a small risk that the file names in the database get out of sync with the files on disk, and then you have a big problem.

#8


0  

And now for the completely off-the-wall suggestion - you could consider storing the binaries as attachments in a CouchDB document database. This would avoid the file name collision issues, as you would use a generated UID as each document ID (which is what you would store in your RDBMS), and the actual attachment's file name is kept with the document.

If you are building a web-based system, then the fact that CouchDB uses REST over HTTP could also be leveraged. And there are also the replication facilities, which could prove useful.

Of course, CouchDB is still in incubation, although there are some who are already using it 'in the wild'.
