How to query heterogeneous JSON data in S3?

Date: 2022-02-28 23:05:06

We have an Amazon S3 bucket that contains around a million JSON files, each one around 500KB compressed. These files are put there by AWS Kinesis Firehose, and a new one is written every 5 minutes. These files all describe similar events and so are logically all the same, and are all valid JSON, but have different structures/hierarchies. Also their format & line endings are inconsistent: some objects are on a single line, some on many lines, and sometimes the end of one object is on the same line as the start of another object (i.e., }{).

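To illustrate, here is roughly how the individual objects can still be pulled apart with an incremental decoder despite the inconsistent spacing and the back-to-back `}{` layout. This is only a rough sketch; `firehose-sample.json.gz` is a hypothetical local copy of one delivered file, and gzip compression is assumed.

```python
import gzip
import json

def iter_json_objects(raw_text):
    """Yield each top-level JSON object from a blob that may contain several
    objects separated by arbitrary whitespace, or by nothing at all (`}{`)."""
    decoder = json.JSONDecoder()
    pos, length = 0, len(raw_text)
    while pos < length:
        if raw_text[pos].isspace():
            pos += 1          # skip newlines/spaces between objects
            continue
        obj, pos = decoder.raw_decode(raw_text, pos)
        yield obj

# Hypothetical local copy of one gzip-compressed Firehose delivery file.
with gzip.open("firehose-sample.json.gz", "rt", encoding="utf-8") as f:
    for event in iter_json_objects(f.read()):
        print(event.keys())   # structures differ, so only the keys are printed
```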

We need to parse/query/shred these objects and then import the results into our on-premise data warehouse SQL Server database.

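As a rough sketch of what the shredding step could look like, a flattener such as `pandas.json_normalize` can discover the columns from the data itself rather than requiring a schema up front; events that lack a key simply produce NULLs. The field names below are made up for illustration only.

```python
import pandas as pd

# Hypothetical events with different shapes, standing in for the real files.
events = [
    {"id": 1, "ts": "2018-05-01T00:00:00Z", "user": {"name": "a"}},
    {"id": 2, "user": {"name": "b", "region": "eu"}, "extra": {"flag": True}},
]

# json_normalize flattens nested objects into underscore-joined column names;
# events that are missing a key simply get NaN/NULL in that column.
df = pd.json_normalize(events, sep="_")
print(sorted(df.columns))
# ['extra_flag', 'id', 'ts', 'user_name', 'user_region']

# From here the frame could be bulk-loaded into SQL Server, e.g. with
# df.to_sql(...) over a pyodbc/SQLAlchemy connection (not shown here).
```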

Amazon Athena can't deal with the inconsistent spacing/structure. I thought of creating a Lambda function that would clean up the spacing, but that still leaves the problem of different structures. Since the files are laid down by Kinesis, which forces you to put the files in folders nested by year, month, day, and hour, we would have to create thousands of partitions every year (one per hour works out to roughly 24 × 365 ≈ 8,760). The limit on the number of partitions in Athena is not well documented, but research suggests we would quickly exhaust it if we create one per hour.


I've looked at pumping the data into Redshift first and then pulling it down. Amazon Redshift external tables can deal with the spacing issues, but can't deal with nested JSON, which almost all these files have. COPY commands can deal with nested JSON, but require us to know the JSON structure beforehand, and don't allow us to access the filename, which we would need for a complete import (it's the only way we can get the date). In general, Redshift has the same problem as Athena: the inconsistent structure makes it difficult to define a schema.


I've looked into using tools like AWS Glue, but they just move data, and they can't move data into our on-premise server, so we have to find some sort of intermediary, which increases cost, latency, and maintenance overhead.


I've tried cutting out the middleman and using ZappySys' S3 JSON SSIS task to pull the files directly and aggregate them in an SSIS package, but it can't deal with the spacing issues or the inconsistent structure.


I can't be the first person to face this problem, but I just keep spinning my wheels.


1 Answer

#1

I would probably suggest two types of solutions:


  1. I believe MongoDB/DynamoDB/Cassandra are good at processing heterogeneous JSON structures. I am not sure about the inconsistency in your JSON, but as long as it is valid JSON, it should be ingestable into one of the above databases. Please provide a sample JSON if possible. These tools do have their own advantages and disadvantages, and data modelling for these NoSQL stores is entirely different from traditional SQL.

  2. I am not sure why your Lambda is not able to do the cleanup. I believe you would have tried to invoke a Lambda when an S3 PUT happens in the bucket; that should be able to clean up the JSON unless there are complex processes involved (see the sketch after this list for roughly what that could look like).
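A rough sketch of what such a cleanup Lambda could look like, assuming the delivered files are gzip-compressed as described and that the normalized copies can be written back under a separate `clean/` prefix (the prefix and the exact event wiring are assumptions, not part of the question):

```python
import gzip
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
decoder = json.JSONDecoder()

def handler(event, context):
    """Hypothetical cleanup Lambda: triggered by an S3 PUT, it rewrites the raw
    Firehose object as newline-delimited JSON under a 'clean/' prefix so that
    downstream tools (Athena, Spectrum, SSIS) see one object per line."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        text = gzip.decompress(body).decode("utf-8")

        # Walk the blob and pull out each top-level object, regardless of how
        # the newlines fall or whether objects are glued together as `}{`.
        objects, pos = [], 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            obj, pos = decoder.raw_decode(text, pos)
            objects.append(obj)

        ndjson = "\n".join(json.dumps(o) for o in objects)
        s3.put_object(Bucket=bucket, Key="clean/" + key,
                      Body=gzip.compress(ndjson.encode("utf-8")))
```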

Unless the JSON is in a proper format, no tool will be able to process it perfectly. More than Athena or Spectrum, I believe MongoDB/DynamoDB/Cassandra would be the right fit for this use case.


It would be great if you could share the limitations you faced when you created a lot of partitions.

