MSSQL到MySQL迁移 - 使用UCS-2代理对的char编码问题，如何从MSSQL数据库中删除这些？

I have been tasked with migrating a Microsoft SQL Server 2005 database to MySQL 5.6 (these are both database servers runnig locally) and would really appreciate some help.

我的任务是将Microsoft SQL Server 2005数据库迁移到MySQL 5.6(这些都是本地运行的数据库服务器),非常感谢一些帮助。

-MSSQL source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.

-MSSQL源数据库有latin1归类(所以ISO 8859-1字符集对吗?)但是没有任何char / varchar字段(任何字符串字段是nvarchar / nchar)所以所有这些数据都应该使用UCS-2字符组。

-MySQL target database wants the character set UTF-8

-MySQL目标数据库想要字符集UTF-8

I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the MSSQL database.

我决定在最新版本的MySQL工作台中使用数据库迁移工具包。起初它工作正常,并按预期迁移一切。但是在遇到MSSQL数据库中的UCS-2代理对字符时,我完全被绊倒了。

The migration toolkit copytable program did not provide a very useful error message: "Error during charset conversion of wstring: No error". It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).

迁移工具包copytable程序没有提供非常有用的错误消息:“在wstring的charset转换期间出错:没有错误”。它也没有提供有关引起问题的数据的任何字段/行信息,并且会在100行的块内失败。因此,在最后一次成功插入后搜索了100行后,我发现问题似乎是由一个nvarchar字段中的两个UCS-2字符引起的。它们在UCS-2字符集中列为代理对。它们特别是字符DBC0和DC83(我通过查看字段的二进制数据并将字节对(little endian)与正在成功迁移的数据进行比较得到了这一点)。

When this surrogate pair was removed from the MSSQL database the row was migrated successfully to MySQL.

当从MSSQL数据库中删除此代理对时,该行已成功迁移到MySQL。

Here is the problem:

这是问题所在:

I have tried to search for these characters in a test MSSQL table (this chartest table is just various test strings an nvarchar field) to prepare a replacement script and keep getting strange results... I must be doing something incorrectly.

我试图在测试MSSQL表中搜索这些字符(这个chartest表只是各种测试字符串和nvarchar字段)来准备替换脚本并不断得到奇怪的结果......我必须做错误的事情。

Searching for

SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)

Will return any surrogate pair character (whether or not it uses DC83), but obviously, only if it is the only character (or part of the pair) in that field. This isn't a big deal since I would like to remove any instance of these anyway (I dont like to remove data like this but I think we can afford it).

将返回任何代理对字符(无论它是否使用DC83),但显然,只有它是该字段中唯一的字符(或对的一部分)。这不是什么大问题,因为我想删除这些的任何实例(我不想删除这样的数据,但我认为我们可以负担得起)。

Searching for

SELECT * FROM chartest WHERE text LIKE '%' + (NCHAR(0xdc83))+ '%'

Will return every row! Regardless of whether it even has a unicode character present in the field let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?

将返回每一行!无论它是否在场中都有unicode字符,更不用说DC83字符了。有没有更好的方法来查找和替换这些字符?或者其他我应该尝试的东西?

I have also tried setting the target databse, table, and field character set to UCS-2 but it seems as though it does not make a difference.

我也尝试将目标数据库,表格和字段字符集设置为UCS-2,但似乎没有什么区别。

I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time...

我还要提一下,这次迁移是使用实时数据(~50GB数据库!),而其中一个提供它的网站是脱机的,因此任何解决方案都需要快速运行...

I would appreciate any suggestions very much! Please let me know if there is any information I have left out.

非常感谢任何建议!如果我遗漏了任何信息,请告诉我。

4 个解决方案

#1

I had this error, and now I have discovered the source of the problem. I had a hard time finding out, so maybe this will be useful to someone, even though I realize, my problem and workaround may not be spot on matching op's original trouble.

我有这个错误,现在我发现了问题的根源。我很难找到,所以也许这对某人有用,即使我意识到,我的问题和解决方法可能不适合匹配op的原始麻烦。

I am migrating data from MSSQL to MySQL, and the content being migrated is html-content from Sitecore CMS (target CMS is Drupal, btw).

我正在将数据从MSSQL迁移到MySQL,正在迁移的内容是来自Sitecore CMS的html内容(目标CMS是Drupal,顺便说一句)。

I've found, that I get this error when converting the database and hitting records, that contain Instagram-embeds. Instagram-embeds work in the way, that the embedded post data is copied to the embed code (instead of being loaded async., et.c. - even the image is included as base64-css...), and the young people nowadays tend to put a lot of emoji's in their image-descriptions (using their iPhones with Emoji keyboard). Emoji's are represented by 4-byte encoded characters, but MySQL utf8 only allows for 3-byte encoded unicode characters.

我发现,在转换数据库并点击包含Instagram嵌入的记录时,我收到此错误。 Instagram嵌入工作的方式,嵌入的帖子数据被复制到嵌入代码(而不是被加载async。,等等 - 甚至图像被包含为base64-css ......),以及年轻人现在往往会在他们的图像描述中使用很多表情符号(使用带有表情符号键盘的iPhone)。表情符号由4字节编码字符表示,但MySQL utf8仅允许3字节编码的unicode字符。

My initial error from running wbcopytables.exe (which is the non-GUI way of doing Migration Wizard in MySQL Workbench) was the

我运行wbcopytables.exe(这是在MySQL Workbench中执行迁移向导的非GUI方式)的初始错误是

Error during charset conversion of wstring: No error

在wstring的字符集转换期间出错:没有错误

but upgrading MySQL Workbench to recent version (from 5.something to 6.x) makes the error a bit more descriptive, hinting table and column (alas, not row):

但是将MySQL Workbench升级到最新版本(从5.something升级到6.x)会使错误更具描述性,提示表和列(唉,不是行):

ERROR: Could not successfully convert UCS-2 string to UTF-8 in table [MyDatabase].[dbo].[MyTable] (column MyColumn). Original string: ...

错误:无法在表[MyDatabase]中将UCS-2字符串成功转换为UTF-8。[dbo]。[MyTable](列MyColumn)。原始字符串:...

Anyway - a solution *could* be to use utf8mb4 which would allow for the emoji's. Read more here.

无论如何 - 解决方案*可以*使用utf8mb4,这将允许表情符号。在这里阅读更多。

But it looks like, it's a bad idea to do this in e.g. my case with Drupal.

但看起来,在例如这样做是不错的主意。我和Drupal的情况。

So - the solution I ended up with was simply to strip these characters in my migrate-script. There is no point in keeping these for users of the site in question, since they are being displayed as rectangles on the webpage anyway. Since you can't search-and-replace with regex in SQL Server, I processed the data using a DAL and c# .NET, and I found the help here (thanks a ton, Jon Skeet) - turns out there is a regex-pattern for matching one half of a surrogate pair in UTF-16. See below (and use the pattern in another language if needed).

所以 - 我最终得到的解决方案就是在我的migrate-script中删除这些字符。没有必要为相关网站的用户保留这些信息,因为它们无论如何都会在网页上显示为矩形。由于您无法在SQL Server中使用正则表达式进行搜索和替换,因此我使用DAL和c#.NET处理数据,并且我在此处找到了帮助(非常感谢Jon Skeet) - 原来有一个正则表达式 - 用于匹配UTF-16中代理对的一半的模式。请参阅下文(如果需要,请使用其他语言的模式)。

var noUcs2SurrogatePairsString = Regex.Replace(stringWithUcs2SurrogatePairs, @"\p{Cs}", string.Empty);

#2

I had a very similar problem today, and I found that it was caused by empty strings, replaced them with NULLs or a value representing no data and the migration worked fine.

我今天遇到了一个非常类似的问题,我发现它是由空字符串引起的,用NULL替换它们或者代表没有数据的值,迁移工作正常。

#3

I solved just editing the "import data script.cmd" where it reads columns "As NVARCHAR" by replacing those with "VARCHAR" only.

我解决了只是编辑“import data script.cmd”的问题,它只读取了“作为NVARCHAR”列,只替换了“VARCHAR”。

Note: My table columns was VARCHAR type already, so... for some stupid reason the migration script improperly cast it to UNICODE (NVARCHAR) type.

注意:我的表列已经是VARCHAR类型,因此......出于某些愚蠢的原因,迁移脚本不正确地将其转换为UNICODE(NVARCHAR)类型。

#4

This issue has now been resolved. I used user Remus Rusanu's suggestion here for finding the rows with these surrogate pair characters using CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:

此问题现已解决。我在这里使用用户Remus Rusanu的建议,使用CHARINDEX查找带有这些代理对字符的行,并决定使用SUBSTRING排除麻烦的字符,如下所示:

UPDATE test
SET a = SUBSTRING(a,  1,   (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)

#1