Migrating from MySQL to Postgres with unicode defects

Date: 2022-05-16 22:34:40

I have been tasked with migrating a database from MySQL to Postgres. However, this database has unicode characters in places that are marked as latin1. So, I'm worried about running something like

mysqldump --compatible=postgresql --default-character-set=utf8 -r databasename.mysql -u root databasename

because I'm not sure if we will run into strangeness from double-encoding the unicode characters, that is, encoding text that is already UTF-8 as UTF-8 a second time.

Does anyone know if I can do this and maintain the unicode as is? Is there an idempotent unicode encoding algorithm that I should pipe this through?

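To make the worry concrete, here is a minimal Python sketch (illustrative only, not part of the original question) of the failure mode being described: UTF-8 bytes sitting in a latin1-declared column are read back one character per byte and then re-encoded as UTF-8.

    # Sketch of the double-encoding failure mode (illustrative assumption:
    # the app stored UTF-8 bytes, but the column is declared latin1).
    original = "café"                        # what the application meant to store
    utf8_bytes = original.encode("utf-8")    # b'caf\xc3\xa9' -- the bytes on disk

    # A latin1-aware reader interprets those bytes one character per byte...
    misread = utf8_bytes.decode("latin-1")   # 'cafÃ©' -- classic mojibake

    # ...and re-encoding that as UTF-8 double-encodes the accented character.
    double_encoded = misread.encode("utf-8")
    print(double_encoded)                    # b'caf\xc3\x83\xc2\xa9'

    # Encoding is not idempotent: each extra latin1 -> UTF-8 round trip
    # mangles the text further instead of leaving it unchanged.

Note that the round trip is reversible as long as it happened exactly once, which is what the clean-up sketch under solution #2 below relies on.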

2 Solutions

#1


1  

You can use my tool - it supports unicode symbols: https://github.com/mihailShumilov/mysql2postgresql

#2


0  

As Craig said above, you are in for some fun. Not only are there no good tools for addressing this, but the problems are likely beyond any one-size-fits-all tool. The problem is that when you are crossing encodings, how to resolve the data is beyond the information available in the data itself and requires understanding how you got there in the first place.

Now I am assuming your client apps currently handle the mangled data reasonably well. If not, you have a lot more work on your hands. If they do handle it, the best way forward is to retrieve the data, convert it into UTF-8 after processing it the way the client apps do, and then update it. This may take a while, and you should go through all potentially affected data. That is really the limit of what can be automated.

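For what it is worth, here is a minimal sketch of that retrieve-convert-update loop, under the assumption (mine, not the answerer's) that the mangling is a single latin1-as-UTF-8 round trip. The table name, column names, and connection settings are placeholders, and mysql-connector-python is just one client library this could be written with; how the repaired text should be written back, and whether the column's declared charset should change, depends on how the apps read it, as described above.

    # Sketch only: reverse ONE accidental latin1 -> UTF-8 round trip per value.
    # Table/column names and connection settings below are assumed placeholders.
    import mysql.connector  # pip install mysql-connector-python

    def unmangle(text):
        """Undo a single latin1-as-UTF-8 round trip; None means 'leave it alone'."""
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return None             # not mangled this way; do not touch it
        return repaired if repaired != text else None

    # The connection charset should match how the client apps read the data,
    # per the point above about processing it the way the client apps do.
    conn = mysql.connector.connect(user="root", database="databasename",
                                   charset="utf8mb4")
    read_cur = conn.cursor()
    write_cur = conn.cursor()

    read_cur.execute("SELECT id, body FROM some_table")   # placeholder names
    for row_id, body in read_cur.fetchall():
        fixed = unmangle(body)
        if fixed is not None:
            write_cur.execute("UPDATE some_table SET body = %s WHERE id = %s",
                              (fixed, row_id))
    conn.commit()
    conn.close()

Rows whose text is already clean, or mangled in some other way, are deliberately skipped; as the answer says, anything beyond this one mechanical case needs human judgment.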

If that is not sufficient, you are going to have to either manually correct (ick) or live with the bad data, but in that case, chances are the client apps are already living with bad data.
