Does varchar result in degraded performance due to data fragmentation?

Time: 2022-03-09 16:06:59

How are varchar columns handled internally by a database engine? For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from? It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?

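To make the scenario concrete, here is a rough sketch of the pattern being asked about (table and column names are purely illustrative; REPLICATE is SQL Server syntax, other engines would use something like RPAD):

```sql
-- Illustrative table: the varchar column starts out NULL.
CREATE TABLE customer (
    id    int PRIMARY KEY,
    notes varchar(100)
);

INSERT INTO customer (id, notes) VALUES (1, NULL);

-- Later update: the row now has to find room for 80 characters it never
-- reserved up front. Where does that space come from?
UPDATE customer SET notes = REPLICATE('x', 80) WHERE id = 1;
```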

6 solutions

#1


4  

The data structures used inside a database engine are far more complex than you are giving them credit for! Yes, there are fragmentation issues, and updating a varchar to a larger value can cause a performance hit; however, it's difficult to explain or understand the implications of those issues without a fuller understanding of the data structures involved.

For MS SQL Server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx).

In terms of the performance implications of fixed vs. variable storage types, there are a number of points to consider (a rough T-SQL sketch follows the list):

  • Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
  • Using variable length columns requires special offset values, and maintaining these values adds a slight overhead; however, this extra overhead is generally negligible.
  • Another potential cost is that of increasing the size of a column value when the page containing that row is nearly full.
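
As a rough T-SQL illustration of the first point (all object names here are invented for the example), you can load the same short values into a char(100) table and a varchar(100) table and then compare how many pages each needs, and how full those pages are, using the sys.dm_db_index_physical_stats DMV:

```sql
-- Two hypothetical tables holding identical data, fixed-width vs. variable-width.
CREATE TABLE dbo.fixed_width    (id int IDENTITY PRIMARY KEY, val char(100));
CREATE TABLE dbo.variable_width (id int IDENTITY PRIMARY KEY, val varchar(100));

-- Insert 100,000 short values; char(100) pads every one of them to 100 bytes.
INSERT INTO dbo.fixed_width (val)
SELECT TOP (100000) 'short value' FROM sys.all_objects a CROSS JOIN sys.all_objects b;

INSERT INTO dbo.variable_width (val)
SELECT TOP (100000) 'short value' FROM sys.all_objects a CROSS JOIN sys.all_objects b;

-- Compare pages used and page density; fewer pages generally means fewer reads.
SELECT OBJECT_NAME(object_id) AS table_name,
       page_count,
       avg_page_space_used_in_percent,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
WHERE object_id IN (OBJECT_ID('dbo.fixed_width'), OBJECT_ID('dbo.variable_width'))
  AND index_level = 0;
```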

As you can see, the situation is rather complex. Generally speaking, however, you can trust the database engine to be pretty good at dealing with variable data types, and they should be the data type of choice when the length of the data held in a column may vary significantly.

At this point I'm also going to recommend the excellent book "Microsoft SQL Server 2008 Internals" for some more insight into how complex things like this really get!

#2


7  

You make a lot of assumptions in your question that aren't necessarily true.

The type of a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. If that's not stated, you don't know how it is stored, and the DBMS is free to change the storage mechanism from release to release.

In fact, some databases store CHAR fields internally as VARCHAR, while others decide how to store the column based on its declared size. Some databases store VARCHAR with the other columns, some store it with BLOB data, and some implement other storage schemes. Some databases always rewrite the entire row when a column is updated; others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.

The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people try to out-think the database, generally in advance of detecting any performance problem.

#3


3  

The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. On the other hand, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues, such as blank-padded comparisons, which mean that, in Oracle, using the CHAR datatype is almost never a good idea.

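For instance, here is a rough Oracle sketch (the table name is invented) of how you might check whether a table has picked up chained rows after this kind of update pattern; note that CHAIN_CNT is only populated by ANALYZE, not by DBMS_STATS:

```sql
-- Gather the statistics that populate CHAIN_CNT for the hypothetical table.
ANALYZE TABLE customer_notes COMPUTE STATISTICS;

-- A non-zero CHAIN_CNT means some rows no longer fit in their original block
-- and have been chained or migrated elsewhere.
SELECT table_name, num_rows, chain_cnt, pct_free
FROM   user_tables
WHERE  table_name = 'CUSTOMER_NOTES';
```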

#4


2  

Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.

As you suggested, it would be interesting to see what happens if you insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.

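A minimal SQLite sketch of that benchmark, assuming a made-up table name (SQLite ignores the declared length, so this mainly exercises the insert-then-grow update pattern). hex(randomblob(50)) yields 100 reasonably random characters, and .timer on in the sqlite3 shell times each statement:

```sql
-- Hypothetical benchmark table.
CREATE TABLE bench (id INTEGER PRIMARY KEY, val VARCHAR(100));

-- Step 1: insert a large number of rows (scale the count up as needed)
-- holding only an empty string.
WITH RECURSIVE seq(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 1000000
)
INSERT INTO bench (id, val)
SELECT n, '' FROM seq;

-- Step 2: update every row to a random 100-character value, forcing each row to grow.
UPDATE bench SET val = hex(randomblob(50));

-- Repeat with "val CHAR(100)" or on another engine and compare the timings.
```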

If you try this with SQLite and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on in them, would be worse than SQLite.

#5


2  

This is going to be completely database specific.

I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (the PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If a row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense.

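As a rough sketch of what that looks like in Oracle DDL (table and column names are invented for the example):

```sql
-- Reserve 25% of each block for row growth caused by later updates.
CREATE TABLE customer_notes (
    id    NUMBER PRIMARY KEY,
    notes VARCHAR2(100)
) PCTFREE 25;

-- For an insert-only table whose rows never grow, no free space needs to be reserved
-- (changing PCTFREE only affects how blocks are filled from now on).
ALTER TABLE audit_log PCTFREE 0;
```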

#6


1  

In SQL Server, varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB, and on the same extent if it is < 64KB). Only the large data types such as TEXT, NTEXT, IMAGE, VARCHAR(MAX), NVARCHAR(MAX), XML and VARBINARY(MAX) are stored separately.

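A small T-SQL sketch (the table name is hypothetical) that makes this visible: the regular varchar value stays in the IN_ROW_DATA allocation unit, while a large varchar(MAX) value is pushed out to LOB_DATA pages:

```sql
-- Hypothetical table mixing an in-row varchar with a MAX type.
CREATE TABLE dbo.storage_demo (
    id        int IDENTITY PRIMARY KEY,
    short_val varchar(100),
    long_val  varchar(max)
);

-- REPLICATE needs a varchar(max) input to produce more than 8000 characters.
INSERT INTO dbo.storage_demo (short_val, long_val)
VALUES ('eighty or so characters of ordinary data',
        REPLICATE(CAST('x' AS varchar(max)), 100000));

-- Expect IN_ROW_DATA pages for id and short_val, and LOB_DATA pages for long_val.
SELECT alloc_unit_type_desc, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.storage_demo'), NULL, NULL, 'DETAILED');
```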
