What are the pros and cons of storing JSON as text vs. blob in Cassandra?

Date: 2022-10-17 17:05:29

One problem with blob for me is, in java, ByteBuffer (which is mapped to blob in cassandra) is not Serializable hence does not work well with EJBs.


Considering the JSON is fairly large, what would be the better type for storing it in Cassandra? Is it text or blob?


Does the size of the JSON matter when deciding between blob and text?


If it were any other database, like Oracle, it would be common to use blob/clob. But in Cassandra, where each cell can hold as much as 2GB, does it matter?


Please consider this question as a choice between text vs. blob for this case, rather than resorting to suggestions about whether to use a single column for the JSON.


4 Answers

#1


13  

I don't think there's any benefit to storing the literal JSON data as a BLOB in Cassandra. At best your storage costs are identical, and in general the APIs for working with BLOB types are less convenient than those for working with strings/text.


For instance, if you're using their Java API then in order to store the data as a BLOB using a parameterized PreparedStatement you first need to load it all into a ByteBuffer, for instance by packing your JSON data into an InputStream.

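To make that concrete, here's a minimal sketch of the packing step. Only the String-to-ByteBuffer conversion is the point; the prepared-statement binding shown in the comments is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BlobPacking {
    // Extra step needed for a blob column: the driver binds ByteBuffer, not String.
    static ByteBuffer pack(String json) {
        return ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8));
    }

    static String unpack(ByteBuffer blob) {
        // duplicate() so reading does not disturb the caller's buffer position
        return StandardCharsets.UTF_8.decode(blob.duplicate()).toString();
    }

    public static void main(String[] args) {
        String json = "{\"id\": 42, \"name\": \"example\"}";
        ByteBuffer blob = pack(json);
        // With a text column you would simply bind the String itself, e.g.
        //   session.execute(stmt.bind(json));   // hypothetical prepared statement
        // With a blob column you must bind the ByteBuffer instead:
        //   session.execute(stmt.bind(blob));
        System.out.println(unpack(blob).equals(json));
    }
}
```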

Unless you're dealing with very large JSON snippets that force you to stream your data anyways, that's a fair bit of extra work to get access to the BLOB type. And what would you gain from it? Essentially nothing.


However, I think there's some merit in asking 'Should I store JSON as text, or gzip it and store the compressed data as a BLOB?'.


And the answer to that comes down to how you've configured Cassandra and your table. In particular, as long as you're using Cassandra version 1.1 or later your tables have compression enabled by default. That may be adequate, particularly if your JSON data is fairly uniform across each row.

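For reference, per-table compression can be tuned in CQL. A sketch, assuming a hypothetical table name (the option keys differ across Cassandra versions; older releases used 'sstable_compression' rather than 'class'):

```sql
-- Compression is enabled by default since Cassandra 1.1,
-- but the codec and chunk size can be adjusted per table:
ALTER TABLE my_keyspace.events
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 64};
```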

However, Cassandra's built-in compression is applied table-wide, rather than to individual rows. So you may get a better compression ratio by manually compressing your JSON data before storage, writing the compressed bytes into a ByteBuffer, and then shipping the data into Cassandra as a BLOB.

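A sketch of that manual-compression path, using gzip from the Java standard library. Only the compress/decompress round trip is shown; the actual Cassandra binding is hinted at in a comment, since the statement and session names there would be hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBlob {
    // Compress the JSON text and wrap the bytes for binding to a blob column.
    static ByteBuffer compress(String json) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return ByteBuffer.wrap(out.toByteArray());
    }

    static String decompress(ByteBuffer blob) throws IOException {
        byte[] bytes = new byte[blob.remaining()];
        blob.duplicate().get(bytes);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"user\": \"alice\", \"tags\": [\"a\", \"a\", \"a\", \"a\"]}";
        ByteBuffer blob = compress(json);
        // The ByteBuffer can now be bound to a blob column, e.g.
        //   session.execute(insertStmt.bind(rowId, blob));   // hypothetical
        System.out.println(decompress(blob).equals(json));
    }
}
```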

So it essentially comes down to a tradeoff in terms of storage space vs. programming convenience vs. CPU usage. I would decide the matter as follows:


  1. Is minimizing the amount of storage consumed your biggest concern?
    • If yes, compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #2.
  2. Is Cassandra's built-in compression available and enabled for your table?
    • If no (and if you can't enable the compression), compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #3.
  3. Is the data you'll be storing relatively uniform across each row?
    • Probably for JSON data the answer is 'yes', in which case you should store the data as text and let Cassandra handle the compression;
    • Otherwise, proceed to #4.
  4. Do you want efficiency, or convenience?
    • Efficiency: compress the JSON data and store the compressed bytes as a BLOB.
    • Convenience: compress the JSON data, base64-encode the compressed data, and then store the base64-encoded data as text.
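The 'convenience' option from step 4 can be sketched with plain JDK classes. Only the encode/decode round trip is shown; the actual text-column insert is assumed:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBase64Text {
    // gzip the JSON, then base64-encode so the result can live in a text column.
    static String encode(String json) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    static String decode(String stored) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(stored);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"user\": \"alice\", \"tags\": [\"x\", \"x\", \"x\", \"x\"]}";
        String stored = encode(json);   // bind this String to a text column
        System.out.println(decode(stored).equals(json));
    }
}
```

Note the base64 step costs roughly 33% size overhead on top of the compressed bytes; that's the price of staying in a text column.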

#2


0  

Since the data is not binary, there is really little reason to use a Binary Large OBject. Sure, you can do it, but why? Text is easier for humans to read, and there isn't really a speed/size difference.


Even in other databases you can often store JSON as text. E.g. even MySQL has text fields that can handle quite a bit of text (LONGTEXT = 4 GB). Yes, Oracle is behind, but hopefully it will also get a reasonably long text field someday.


But why do you want to store a whole JSON object as text? The JSON should really be normalized and stored as multiple fields in the DB.


#3


0  

I would definitely say that text would be better than a blob for storing JSON. JSON is ultimately text, so this type makes sense; in addition, there may be extra overhead for blobs, as some drivers seem to require that they be converted to hex before inserting them. Also, blobs show up as base64-encoded strings when using cqlsh, so you wouldn't be able to easily check what JSON was actually stored if you needed to for testing purposes. I'm not sure exactly how blobs are stored on disk, but I'd imagine it's very similar to how text is.


With that said, storing large entries can cause problems and is not recommended. It can cause issues with sharding and consume a lot of memory. Although the FAQ refers to files over 64MB, in my experience even files averaging a few megabytes each can cause performance issues once you start storing a large number of them. If you expect the JSON to be megabytes in size, it would be better, if possible, to use an object store and keep references to it in Cassandra instead.

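As a rough sketch of that reference pattern, with an in-memory Map standing in for a real object store (everything here is illustrative; a real setup would use an S3-style client for the payload and an actual Cassandra insert for the key):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class ObjectStoreRef {
    // Stand-in for a real object store (S3, GCS, ...); the Map is purely illustrative.
    static final Map<String, byte[]> objectStore = new HashMap<>();

    // Upload the large JSON payload; return the small key to store in Cassandra.
    static String put(byte[] jsonBytes) {
        String key = UUID.randomUUID().toString();
        objectStore.put(key, jsonBytes);
        // In Cassandra, store only the key next to the row's other columns, e.g.
        //   session.execute(stmt.bind(rowId, key));   // hypothetical prepared statement
        return key;
    }

    static byte[] get(String key) {
        return objectStore.get(key);
    }

    public static void main(String[] args) {
        byte[] payload = "{\"big\": \"document\"}".getBytes();
        String key = put(payload);
        System.out.println(get(key) == payload);
    }
}
```

This keeps each Cassandra cell tiny (just a key) regardless of how large the JSON grows.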

#4


-1  

In the upcoming 2.2 release there is also native support in Cassandra for JSON. http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-2-json-support

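For illustration, the 2.2 JSON support lets you write and read whole rows as JSON directly in CQL (the table and columns here are hypothetical):

```sql
-- Assuming a table: CREATE TABLE users (id int PRIMARY KEY, name text);
INSERT INTO users JSON '{"id": 1, "name": "alice"}';
SELECT JSON * FROM users WHERE id = 1;
```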
