How to remove characters not supported by MySQL's utf8 character set?

Date: 2023-01-06 11:44:20

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters that take four bytes in UTF-8, such as "𝜀", which are only supported by MySQL's utf8mb4 character set.

For example,

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰

should become

C = -2.4‰ ± 0.3‰; H = -57‰

I want to load a data file into a MySQL table that has CHARSET=utf8.

1 Answer

#1


MySQL's utf8mb4 encoding is what the world calls UTF-8.

MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).
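
To make the distinction concrete, here is a minimal sketch using the core Encode module. It shows that a BMP character such as U+2030 (PER MILLE SIGN) encodes to three bytes in UTF-8, while U+1D700 (MATHEMATICAL ITALIC SMALL EPSILON) needs four, which is what puts it outside MySQL's utf8. The two code points are only illustrative choices.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2030 lies inside the BMP; U+1D700 lies outside it.
for my $cp (0x2030, 0x1D700) {
    my $bytes = encode('UTF-8', chr($cp));   # encode one character to UTF-8 bytes
    printf "U+%04X -> %d bytes\n", $cp, length($bytes);
}
# Prints:
# U+2030 -> 3 bytes
# U+1D700 -> 4 bytes
```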

Reference

So, the following will match the unsupported characters in question:

/[^\N{U+0000}-\N{U+FFFF}]/
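
Before changing any data, it can help to see which characters the pattern actually catches. The sketch below is one way to do that; `unsupported_code_points` is a hypothetical helper name, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of all characters above U+FFFF.
sub unsupported_code_points {
    my ($text) = @_;
    # In list context, /g returns every captured character.
    my @chars = $text =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
    return map { sprintf('U+%04X', ord($_)) } @chars;
}

my @found = unsupported_code_points("\x{1D700}C = -2.4\x{2030}");
print "@found\n";   # prints "U+1D700"
```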

Here are three different techniques you can use to clean your input:

1: Remove unsupported characters:

s/[^\N{U+0000}-\N{U+FFFF}]//g;

2: Replace unsupported characters with U+FFFD:

s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;

3: Replace unsupported characters using a translation map:

my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;

For example,

use utf8;                              # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)';   # Terminal and files use UTF-8.

use strict;
use warnings;
use 5.010;               # say, //
use charnames ':full';   # Not needed in 5.16+

my %translations = (
   "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
   # ...
);

$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;

s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;

Output:

𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
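
Since the goal in the question is to load a data file, the first technique (plain removal) could be applied to a whole file along these lines. This is only a sketch: `strip_non_bmp` is a hypothetical helper, the input and output paths are taken from the command line, and the file is assumed to be encoded as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every character above U+FFFF from a string.
sub strip_non_bmp {
    my ($text) = @_;
    $text =~ s/[^\N{U+0000}-\N{U+FFFF}]//g;
    return $text;
}

# Assumed usage: perl clean.pl input.txt output.txt
my ($in_path, $out_path) = @ARGV;
if (defined $in_path && defined $out_path) {
    # Decode on read and encode on write, so the regex sees characters, not bytes.
    open(my $in,  '<:encoding(UTF-8)', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "Cannot write $out_path: $!";
    print {$out} strip_non_bmp($_) while <$in>;
    close($out) or die "Cannot close $out_path: $!";
}
```

The cleaned output file can then be loaded into the utf8 table. Swapping the substitution inside `strip_non_bmp` for the replacement-character or translation-map variant gives the other two behaviours.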
