Amazon S3: storing a large number of files (millions of files and several TB of data)

Time: 2022-10-29 23:04:45

I'll have to store millions of files (many TB in the future) in S3. Are there any limitations? I'm not asking about price :) but about architectural limitations (for example, "don't store it this way, another way would be better/faster"). My files are in a hierarchy:

/{country}/{number}/{code}/docs

and I checked that I can keep them that way (to access them easily through REST). Of course I know S3 stores them differently internally, but that is not important to me. So, are there any limitations or pitfalls?

2 answers

#1


S3 has no limits on the number or total size of files that you would hit. The files are not really in folders; each object's key is just a string that looks like a path. Choose a "folder" structure that is easy for you to keep track of and organize.

You do NOT want to list "folder" contents in S3 to find things. S3 is slow at directory listings because there are no real directories: a listing is a scan over the keys that share a prefix, returned at most 1,000 keys per request.
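
To make the "not really directories" point concrete, here is a toy in-memory model of a bucket, written as a sketch only. It is not the S3 API: the `list_prefix` function and the sample keys are made up for illustration, and the only assumption carried over from S3 is that keys are plain strings kept in sorted order, with "folders" derived from a prefix and a delimiter.

```python
from bisect import bisect_left


def list_prefix(sorted_keys, prefix, delimiter="/"):
    """Emulate a prefix + delimiter listing over a flat, sorted list of keys.

    Returns (keys, common_prefixes): the keys directly under the prefix, and
    the "sub-folders", which exist only as substrings of longer keys.
    """
    keys, common_prefixes = [], []
    # Jump to the first key >= prefix, then scan while keys still match.
    start = bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        rest = key[len(prefix):]
        if delimiter in rest:
            folder = prefix + rest.split(delimiter, 1)[0] + delimiter
            # Sorted order keeps keys of one "folder" adjacent, so comparing
            # with the last entry is enough to avoid repeats.
            if not common_prefixes or common_prefixes[-1] != folder:
                common_prefixes.append(folder)
        else:
            keys.append(key)
    return keys, common_prefixes


# Hypothetical keys following the {country}/{number}/{code}/docs layout.
bucket = sorted([
    "us/1001/A1/docs/contract.pdf",
    "us/1001/A1/docs/invoice.pdf",
    "us/1001/B7/docs/report.pdf",
    "us/1002/C3/docs/scan.png",
    "de/2001/X9/docs/brief.pdf",
])

files, folders = list_prefix(bucket, "us/1001/")
print(files)    # []
print(folders)  # ['us/1001/A1/', 'us/1001/B7/']
```

Every key under the prefix has to be visited just to discover two "folders", which is why a direct lookup by full key is preferable to browsing.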

You should either store the whole path /{country}/{number}/{code}/docs in a database, or make the logic that builds it so repeatable that you can be confident the file will be at that location.
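
One way to get "repeatable logic" is a single function that every reader and writer uses to build the key. The sketch below is an illustration under assumptions that are not in the question: `doc_key` is a made-up name, the country is taken to be a two-letter code, the number an integer, the code alphanumeric, and a file name is appended under `docs/`. The leading slash is dropped because an object key does not need one.

```python
import re


def doc_key(country: str, number: int, code: str, filename: str) -> str:
    """Build the object key {country}/{number}/{code}/docs/{filename}.

    Inputs are normalized and validated so that the same logical document
    always maps to exactly the same key string.
    """
    country = country.strip().lower()
    code = code.strip().upper()
    if not re.fullmatch(r"[a-z]{2}", country):
        raise ValueError(f"expected a two-letter country code, got {country!r}")
    if number < 0:
        raise ValueError("number must be non-negative")
    if not re.fullmatch(r"[A-Z0-9]+", code):
        raise ValueError(f"expected an alphanumeric code, got {code!r}")
    if not filename or "/" in filename:
        raise ValueError("filename must be non-empty and contain no '/'")
    return f"{country}/{number}/{code}/docs/{filename}"


print(doc_key(" US", 1001, "a1", "contract.pdf"))
# us/1001/A1/docs/contract.pdf
```

With a function like this, a GET for a known document goes straight to its key and never needs a listing. If the inputs cannot be normalized reliably, storing the final key in a database is the safer of the two options.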

James Brady gave an excellent and very detailed answer on how S3 treats file storage in a question here: https://*.com/a/394505/4179009

#2


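AWS S3 definitely does have request-rate limits: about 100 requests/sec when keys share a similar path prefix (AWS has since raised this; current documentation lists at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix). See the official docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html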

On the other hand, a hierarchical approach makes the logic complicated. The trade-off depends on your requirements. One good option is to put a short key of at least 4 characters (a primary ID or a hash) at the front of the object key. If you have a limited number of countries, try using multiple buckets with the country code in the bucket name; this also lets you choose a specific physical location for each bucket if required.
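
Both suggestions can be sketched in a few lines. This is only an illustration: the function names, the `example-docs` bucket base name, and the choice of SHA-256 are assumptions, not anything S3 requires. The hash is computed from the logical path, so the prefix can always be recomputed and does not need to be stored anywhere.

```python
import hashlib


def hashed_key(number: int, code: str, filename: str, prefix_len: int = 4) -> str:
    """Put a short, deterministic hash in front of the hierarchical key."""
    logical = f"{number}/{code}/docs/{filename}"
    digest = hashlib.sha256(logical.encode("utf-8")).hexdigest()
    # The first characters of the hex digest spread keys across many prefixes.
    return f"{digest[:prefix_len]}/{logical}"


def bucket_for(country: str, base: str = "example-docs") -> str:
    """One bucket per country; the base name is a hypothetical placeholder."""
    return f"{base}-{country.strip().lower()}"


bucket = bucket_for("US")
key = hashed_key(1001, "A1", "contract.pdf")
print(bucket)  # example-docs-us
print(key)     # 4 hex characters, then /1001/A1/docs/contract.pdf
```

The cost is the one the answer mentions: keys no longer sort by country or number, so browsing by hierarchy becomes harder, and whether the prefix is worth it depends on your request rate.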
