How to detect whether utf8_decode or utf8_encode has to be applied to a string?

Date: 2023-01-04 19:52:38

I have a feed taken from 3rd party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.

How can I detect which of the two, if either, has to be applied to a given string?

UPDATE

Actually, the content is returned as UTF-8, but some parts inside it are not UTF-8.

5 answers

#1


52  

I can't say I can rely on mb_detect_encoding(). Had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // this is utf-8
}
else 
{
   // definitely not utf-8
}
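
A minimal sketch of the same check wrapped in a helper, in case it is needed in several places. The function name is_valid_utf8 is hypothetical. It relies on preg_match() returning false (rather than 0) when the subject is not valid UTF-8 under the u modifier.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
function is_valid_utf8(string $string): bool
{
    // With the "u" modifier, PCRE validates the subject as UTF-8 first.
    // The empty pattern always matches (returns 1); a subject that is not
    // valid UTF-8 makes preg_match() return false instead.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("Chr\xC3\xA9tiens")); // UTF-8 bytes for "é": bool(true)
var_dump(is_valid_utf8("Chr\xE9tiens"));     // lone ISO-8859-1 byte for "é": bool(false)
```

Keep in mind that pure ASCII also passes, and so does any ISO-8859-1 text whose bytes happen to form valid UTF-8 sequences, so a positive result means "well-formed UTF-8", not "certainly meant as UTF-8".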

#2


4  

You can use

  • mb_detect_encoding — Detect character encoding

The charset might also be available in the HTTP Response Headers or in the Response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://*.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
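
To actually use the charset from the headers, something like the following sketch could work. The helper name charset_from_headers is hypothetical. It assumes an array of raw header lines in the shape of $http_response_header shown above, and the regular expression is a simplification that does not cover every legal Content-Type syntax.

```php
<?php
// Hypothetical helper: returns the charset named in a Content-Type header,
// or null when none of the header lines declares one.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $header) {
        // Case-insensitive match; the charset value may be quoted or bare.
        if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/i', $header, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];
var_dump(charset_from_headers($headers)); // string(5) "UTF-8"
```

The declared charset is only what the server claims. As the question's update shows, the body can still contain parts in another encoding.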

#3


3  

function str_to_utf8 ($str) {
    $decoded = utf8_decode($str);
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    return $decoded;
}

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
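
A variation on the same idea, as a sketch only: utf8_decode() is deprecated as of PHP 8.2, so this version uses mb_convert_encoding() from the mbstring extension instead. The name undo_double_utf8 is hypothetical, and it assumes the extra layer of encoding was ISO-8859-1 to UTF-8 (what utf8_encode() does). The round-trip comparison is an added safeguard so that characters outside ISO-8859-1 are not silently replaced by "?".

```php
<?php
// Hypothetical helper: removes one layer of double UTF-8 encoding, if present.
function undo_double_utf8(string $str): string
{
    // UTF-8 -> ISO-8859-1, the conversion utf8_decode() performs.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

    // Accept the decoded form only if it is itself valid UTF-8 and
    // re-encoding it gives back exactly the input (lossless round trip).
    if (mb_check_encoding($decoded, 'UTF-8')
        && mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $str) {
        return $decoded;
    }
    return $str;
}

// "é" encoded twice (C3 83 C2 A9) is reduced to a single encoding (C3 A9).
var_dump(undo_double_utf8("Chr\xC3\x83\xC2\xA9tiens") === "Chr\xC3\xA9tiens"); // bool(true)
// A correctly encoded string is returned unchanged.
var_dump(undo_double_utf8("Chr\xC3\xA9tiens") === "Chr\xC3\xA9tiens");         // bool(true)
```

This is still a heuristic: a correctly encoded string that legitimately contains a sequence such as "Ã©" would be "repaired" as well.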

#4


0  

The feed (I guess you mean some kind of XML-based feed) should have an attribute in its header telling you what the encoding is. If it does not, you are out of luck, as you have no reliable means of identifying the encoding.
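
For an XML feed, that attribute is the encoding pseudo-attribute of the XML declaration. A rough sketch of reading it follows. The helper name xml_declared_encoding and the sample feed are made up for illustration, and the regular expression assumes the declaration sits at the very start of the document in an ASCII-compatible encoding.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration,
// or null when the document has no declaration or no encoding attribute.
function xml_declared_encoding(string $xml): ?string
{
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss version="2.0"></rss>';
var_dump(xml_declared_encoding($feed)); // string(10) "ISO-8859-1"
```

When the function returns null, an XML parser would treat the document as UTF-8 (or UTF-16 if a byte order mark says so), which may or may not match what the feed really contains.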

#5


0  

Encoding autodetection is not bulletproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
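
As a sketch of how mb_check_encoding() could help with the "applied twice" problem from the question: convert only when the string is not already valid UTF-8, so that calling the function repeatedly is harmless. The name ensure_utf8 is hypothetical, and the ISO-8859-1 fallback is an assumption about what the non-UTF-8 data is. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: returns $str as UTF-8, converting only when needed.
function ensure_utf8(string $str, string $fallback = 'ISO-8859-1'): string
{
    // Already well-formed UTF-8: leave it alone, so nothing is encoded twice.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Otherwise assume the fallback encoding and convert once.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

$latin1 = "Chr\xE9tiens";          // "é" as a single ISO-8859-1 byte
$once   = ensure_utf8($latin1);
$twice  = ensure_utf8($once);

var_dump($once === "Chr\xC3\xA9tiens"); // bool(true)
var_dump($once === $twice);             // bool(true): the second call changes nothing
```

This works on whole strings. For a feed where only some parts are not UTF-8, it would have to be applied to each field or item separately.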
