My work involves wrangling large batches of data, and most of what I receive these days comes as SQL dumps, so the first step is always to import the data into a database for later analysis.
Anyone who has done this knows that with MySQL's default parameters the import is slow, especially when there is a lot of data. I regularly deal with tens of millions, sometimes hundreds of millions, of rows. With the defaults, such an import easily takes tens of hours. Note that I said tens of hours, not a few hours.
To deal with slow imports, the following parameters can be tuned.
innodb_flush_log_at_trx_commit
The official manual explains this variable as follows:
Controls the balance between strict ACID compliance for commit operations and higher performance that is possible when commit-related I/O operations are rearranged and done in batches. You can achieve better performance by changing the default value but then you can lose up to a second of transactions in a crash.

The default value of 1 is required for full ACID compliance. With this value, the contents of the InnoDB log buffer are written out to the log file at each transaction commit and the log file is flushed to disk.

With a value of 0, the contents of the InnoDB log buffer are written to the log file approximately once per second and the log file is flushed to disk. No writes from the log buffer to the log file are performed at transaction commit. Once-per-second flushing is not guaranteed to happen every second due to process scheduling issues. Because the flush to disk operation only occurs approximately once per second, you can lose up to a second of transactions with any mysqld process crash.

With a value of 2, the contents of the InnoDB log buffer are written to the log file after each transaction commit and the log file is flushed to disk approximately once per second. Once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues. Because the flush to disk operation only occurs approximately once per second, you can lose up to a second of transactions in an operating system crash or a power outage.

InnoDB log flushing frequency is controlled by innodb_flush_log_at_timeout, which allows you to set log flushing frequency to N seconds (where N is 1 ... 2700, with a default value of 1). However, any mysqld process crash can erase up to N seconds of transactions.

DDL changes and other internal InnoDB activities flush the InnoDB log independent of the innodb_flush_log_at_trx_commit setting. InnoDB crash recovery works regardless of the innodb_flush_log_at_trx_commit setting.
Transactions are either applied entirely or erased entirely. For durability and consistency in a replication setup that uses InnoDB with transactions: If binary logging is enabled, set sync_binlog=1. Always set innodb_flush_log_at_trx_commit=1.

Caution: Many operating systems and some disk hardware fool the flush-to-disk operation. They may tell mysqld that the flush has taken place, even though it has not. In this case, the durability of transactions is not guaranteed even with the setting 1, and in the worst case, a power outage can corrupt InnoDB data. Using a battery-backed disk cache in the SCSI disk controller or in the disk itself speeds up file flushes, and makes the operation safer. You can also try to disable the caching of disk writes in hardware caches.
In short:

- 1 (the default): the slowest, but the safest; every transaction commit writes the log and flushes it to disk.
- 0: the fastest; the log is flushed to disk roughly once per second, with no guarantee, and commits do not trigger a log write at all. This is quite unsafe: if mysqld crashes, the last second of transactions is gone.
- 2: a compromise; every commit writes to the log, but the flush to disk still happens only about once per second, with no guarantee. Even if mysqld crashes, as long as the operating system keeps running the data will still make it to disk.
The manual also notes that on some disk systems even an explicit flush does not guarantee the data was actually written. Certain so-called "storage" drives acknowledge a write as complete the moment the request arrives and only then actually perform it, marketing this as high availability; many surveillance or NAS-branded drives behave this way, and since the logic lives in the hardware, software cannot work around it. I have personally copied files to a (mechanical) hard drive, had the machine die, and found after reboot that less than half of the data was still there. It turned out the data had only reached the drive's cache, never the platters.
sync_binlog
Controls how often the MySQL server synchronizes the binary log to disk.

sync_binlog=0: Disables synchronization of the binary log to disk by the MySQL server. Instead, the MySQL server relies on the operating system to flush the binary log to disk from time to time as it does for any other file. This setting provides the best performance, but in the event of a power failure or operating system crash, it is possible that the server has committed transactions that have not been synchronized to the binary log.

sync_binlog=1: Enables synchronization of the binary log to disk before transactions are committed. This is the safest setting but can have a negative impact on performance due to the increased number of disk writes. In the event of a power failure or operating system crash, transactions that are missing from the binary log are only in a prepared state. This permits the automatic recovery routine to roll back the transactions, which guarantees that no transaction is lost from the binary log.

sync_binlog=N, where N is a value other than 0 or 1: The binary log is synchronized to disk after N binary log commit groups have been collected. In the event of a power failure or operating system crash, it is possible that the server has committed transactions that have not been flushed to the binary log. This setting can have a negative impact on performance due to the increased number of disk writes. A higher value improves performance, but with an increased risk of data loss.
Roughly translated, this is similar to the previous parameter, except that it controls when the binlog (the binary log of data changes, not the data files themselves) is synchronized to disk. In the official documentation the two parameters are meant to be used together.
- 0: never force a binlog sync; MySQL just writes, and when the data actually reaches disk is left to the operating system.
- 1: force a sync on every single transaction commit. Clearly the safest option, but, as you would expect, performance suffers, limited by the write speed of the storage medium.
- N: force a sync after every N binary log commit groups. A larger value buys some performance, but in a crash the transactions of up to N commit groups can be lost.
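Before changing anything, it is worth checking what the server is currently using; a minimal check (the values you see will vary by version and distribution):

```sql
-- Inspect the current durability settings before touching them
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'sync_binlog';

-- If binary logging is disabled (log_bin = OFF), sync_binlog has no effect
SHOW VARIABLES LIKE 'log_bin';
```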
The officially recommended values for these two parameters are:

- sync_binlog=1
- innodb_flush_log_at_trx_commit=1
In other words, most systems today may use all sorts of data stores, but out of reliability concerns the most critical business data is almost universally kept in an RDBMS, sacrificing some performance precisely to guarantee data safety. Of course, with the right settings MySQL can still deliver very high read and write performance.
Both parameters can be set in the configuration file, but since we only need them temporarily, and my local MySQL runs in Docker where fiddling with config files is a hassle, it is easier to set them directly from the mysql command line:
mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 0;
mysql> SET GLOBAL sync_binlog = 0;
With just these two settings, the import speed already improves dramatically. If you cannot be bothered with anything else, these two are enough.
Be aware, though, that both settings seriously compromise data safety. Never apply them to a production database; you do so at your own risk.
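Once the import finishes, remember to restore the recommended safe values; a sketch (note that SET GLOBAL only affects the running instance and does not survive a restart anyway):

```sql
-- Restore the officially recommended durability settings after the import
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;
```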
Other parameters you can tune
innodb_autoextend_increment: the increment (in MB) by which an auto-extending tablespace grows
innodb_log_buffer_size: size of the buffer for the InnoDB redo log
innodb_log_file_size: size of each InnoDB redo log file (the redo log, not the binlog)
bulk_insert_buffer_size: size of the buffer used for bulk inserts (applies to MyISAM tables)
All of these variables are documented in the official MySQL manual. Their defaults differ across versions, and in recent MySQL releases they are already quite large, in the tens of MB, so they rarely become the bottleneck. I seldom touch them myself.
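If you do want to experiment with these, most can be changed at runtime; a sketch with illustrative values only, not recommendations (tune to your own hardware and version):

```sql
-- Hypothetical sizes for an import-heavy workload; defaults are often fine.
-- innodb_log_buffer_size is dynamic as of MySQL 8.0.
SET GLOBAL innodb_log_buffer_size = 64 * 1024 * 1024;   -- 64MB
SET GLOBAL bulk_insert_buffer_size = 64 * 1024 * 1024;  -- MyISAM bulk inserts
SET GLOBAL innodb_autoextend_increment = 128;           -- MB per tablespace extension

-- innodb_log_file_size is not dynamic; it must be set in the
-- configuration file before the server starts.
```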
Parameters to use when importing the SQL file:
net_buffer_length
Each client thread is associated with a connection buffer and result buffer. Both begin with a size given by net_buffer_length but are dynamically enlarged up to max_allowed_packet bytes as needed. The result buffer shrinks to net_buffer_length after each SQL statement. This variable should not normally be changed, but if you have very little memory, you can set it to the expected length of statements sent by clients. If statements exceed this length, the connection buffer is automatically enlarged. The maximum value to which net_buffer_length can be set is 1MB.
max_allowed_packet
The maximum size of one packet or any generated/intermediate string, or any parameter sent by the mysql_stmt_send_long_data() C API function. The default is 4MB. The packet message buffer is initialized to net_buffer_length bytes, but can grow up to max_allowed_packet bytes when needed. This value by default is small, to catch large (possibly incorrect) packets. You must increase this value if you are using large BLOB columns or long strings. It should be as big as the largest BLOB you want to use. The protocol limit for max_allowed_packet is 1GB. The value should be a multiple of 1024; nonmultiples are rounded down to the nearest multiple.

When you change the message buffer size by changing the value of the max_allowed_packet variable, you should also change the buffer size on the client side if your client program permits it. The default max_allowed_packet value built in to the client library is 1GB, but individual client programs might override this. For example, mysql and mysqldump have defaults of 16MB and 24MB, respectively. They also enable you to change the client-side value by setting max_allowed_packet on the command line or in an option file. The session value of this variable is read only. The client can receive up to as many bytes as the session value. However, the server will not send to the client more bytes than the current global max_allowed_packet value. (The global value could be less than the session value if the global value is changed after the client connects.)

Note: check the server-side settings first; the client-side value must not exceed the server's.
mysql> SHOW VARIABLES LIKE 'max_allowed_packet';
mysql> SHOW VARIABLES LIKE 'net_buffer_length';

In fact, in the mariadb Docker image I use, both values are already set very large, and the official docs also say the mysql client's defaults are big enough. Still, in my tests, specifying them explicitly made the import a bit faster; I have no idea why.
mysql -h127.0.0.1 -uroot -proot123 data_base_name --max_allowed_packet=16777216 --net_buffer_length=16384 < your_sql_script.sql

That said, even though it was much faster, the import still took a few hours to finish, but that is a big improvement over the days it took before.
Other things to note
The above assumes you already have the dump in hand. If you have any control over how the data is produced, consider doing some of the work at export time.
For example, for maximum compatibility dumps are often written as one INSERT statement per row. When exporting, you can instead enable the extended-insert syntax, so that each statement inserts multiple rows. Don't worry that your hundreds of MB or several GB of data will all end up in a single INSERT; the export tool groups the rows so that a single statement stays under some limit (1MB or whatever your tool's manual documents).
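The difference is easy to see side by side; a sketch of the two dump styles (table and values hypothetical):

```sql
-- One statement per row: maximally compatible, but slow to import
INSERT INTO t (id, name) VALUES (1, 'a');
INSERT INTO t (id, name) VALUES (2, 'b');

-- Extended insert: one statement, many rows; much faster to parse and commit
INSERT INTO t (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
```

With mysqldump, --extended-insert is on by default, and --net-buffer-length bounds how large each generated statement may grow.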
INSERT also has a DELAYED keyword, which tells the statement not to wait for a result: the insert is performed in the background by MySQL, so the statement returns very quickly. Its safety is hard to guarantee, though, so I don't recommend it. (Note that INSERT DELAYED is deprecated as of MySQL 5.6 and no longer supported from 5.7 onward, where the server treats it as a normal INSERT.)
As an aside, for convenience I later moved the data into mongo instead. The data takes up quite a bit more space, but with the same single-threaded process, plus quite a lot of extra data processing in between, the whole thing finished within an hour.
mongo is widely acknowledged to be fast and pleasant to use, but its data-safety guarantees are genuinely weaker. I have personally seen mongo lose data for no apparent reason, on a system with only a few million rows.
(* ̄︿ ̄)
References
/doc/refman/8.0/en/