How to Truncate Chinese Strings in PHP

Date: 2022-10-04 06:34:13

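PHP's built-in substr() truncates by byte count, so it works for English letters and digits but can split a multi-byte character in half when the text mixes in Chinese. Converting between GB2312 and UTF-8 in PHP needs iconv support, which on PHP 4 means loading php_iconv.dll (the file ships with PHP 4). PHP 5 has iconv built in, which is more convenient. Whether you convert UTF-8 to GB2312 or GB2312 to UTF-8, the iconv functions in PHP 4.3.1 and later are easy to use; you only need to write a UTF-8 to Unicode conversion function yourself. The Chinese-aware truncation function mb_substr() was introduced in PHP 4.0.6 and handles characters in different encodings on its own, provided the mbstring extension is enabled, so some newer PHP frameworks already rely on mb_substr().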
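
As a rough illustration of the difference described above, the sketch below cuts the same mixed string with substr() and mb_substr() and then round-trips it through iconv(). It assumes the mbstring and iconv extensions are loaded and that the source file is saved as UTF-8; the sample string is made up for the example.

```php
<?php
// Assumption: mbstring and iconv are enabled, and this file is UTF-8 encoded.
$text = '中文abc'; // each Chinese character here takes 3 bytes in UTF-8

// substr() counts bytes, so 4 bytes ends inside the second character.
$byBytes = substr($text, 0, 4);
var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false): broken sequence

// mb_substr() counts characters in the encoding it is told to use.
$byChars = mb_substr($text, 0, 3, 'UTF-8');
echo $byChars, "\n"; // 中文a

// Round trip between UTF-8 and GB2312 with iconv().
$gb = iconv('UTF-8', 'GB2312', $text);
echo strlen($text), ' ', strlen($gb), "\n"; // 9 7
$back = iconv('GB2312', 'UTF-8', $gb);
var_dump($back === $text); // bool(true)
```

The byte counts differ because GB2312 stores each of these Chinese characters in 2 bytes while UTF-8 uses 3.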


Summary of helper functions:

Function 1:

function cutstr($string, $length, $dot = ' ...')
{
    // Truncate a string to $length display units (GBK or UTF-8).
    // ASCII characters count as 1 unit, multi-byte characters as 2.
    $charset = 'utf-8';

    // Boundary case: the string is already short enough.
    if (strlen($string) <= $length) {
        return $string;
    }

    // Decode common HTML entities so each counts as a single character.
    $string = str_replace(array('&amp;', '&quot;', '&lt;', '&gt;'), array('&', '"', '<', '>'), $string);

    $strcut = '';
    if (strtolower($charset) == 'utf-8') {
        // $n: byte offset, $tn: byte length of the last character, $noc: units counted so far.
        $n = $tn = $noc = 0;
        while ($n < strlen($string)) {
            $t = ord($string[$n]);
            if ($t == 9 || $t == 10 || (32 <= $t && $t <= 126)) {
                $tn = 1; $n++; $noc++;
            } elseif (194 <= $t && $t <= 223) {
                $tn = 2; $n += 2; $noc += 2;
            } elseif (224 <= $t && $t <= 239) {
                $tn = 3; $n += 3; $noc += 2;
            } elseif (240 <= $t && $t <= 247) {
                $tn = 4; $n += 4; $noc += 2;
            } elseif (248 <= $t && $t <= 251) {
                $tn = 5; $n += 5; $noc += 2;
            } elseif ($t == 252 || $t == 253) {
                $tn = 6; $n += 6; $noc += 2;
            } else {
                $n++;
            }

            if ($noc >= $length) {
                break;
            }
        }
        // The last character overshot the limit, so drop it.
        if ($noc > $length) {
            $n -= $tn;
        }

        $strcut = substr($string, 0, $n);
    } else {
        // GBK: a byte above 127 starts a two-byte character.
        for ($i = 0; $i < $length; $i++) {
            $strcut .= ord($string[$i]) > 127 ? $string[$i].$string[++$i] : $string[$i];
        }
    }

    // Re-encode the HTML entities before returning.
    $strcut = str_replace(array('&', '"', '<', '>'), array('&amp;', '&quot;', '&lt;', '&gt;'), $strcut);

    return $strcut.$dot;
}
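
The UTF-8 branch of cutstr() works by reading the lead byte of each character to learn how many bytes follow. A minimal sketch of just that step is shown below. The helper names utf8_seq_len and utf8_chars are hypothetical, the type declarations assume PHP 7 or later, and unlike cutstr() the sketch stops at 4-byte sequences (lead bytes up to 0xF4), which is the limit in current UTF-8.

```php
<?php
// Hypothetical helper (not part of cutstr): byte length of the UTF-8
// sequence that starts with lead byte $byte, or 1 for stray/invalid bytes.
function utf8_seq_len(int $byte): int
{
    if ($byte < 0x80) return 1;                   // ASCII
    if ($byte >= 0xC2 && $byte <= 0xDF) return 2; // e.g. Latin supplements, Greek
    if ($byte >= 0xE0 && $byte <= 0xEF) return 3; // most CJK characters
    if ($byte >= 0xF0 && $byte <= 0xF4) return 4; // supplementary planes
    return 1; // continuation or invalid byte: step over it
}

// Hypothetical helper: split a UTF-8 string into characters by walking lead bytes.
function utf8_chars(string $s): array
{
    $chars = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $n = utf8_seq_len(ord($s[$i]));
        $chars[] = substr($s, $i, $n);
        $i += $n;
    }
    return $chars;
}

print_r(utf8_chars('a中b')); // three elements: 'a', '中', 'b'
```

Once the string is split this way, truncating on a character boundary is a matter of stopping the walk at the desired count, which is what cutstr() does with its $n and $noc counters.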

Function 2:


function len($string, $sublen = 80, $etc = '...', $break_words = false, $middle = false)
{
    // Note: $break_words and $middle are accepted but not used by this function.
    $start = 0;
    $code = "UTF-8";
    if ($code == 'UTF-8') {
        // Match one UTF-8 character at a time (1 to 4 bytes).
        $pa = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf]/";
        preg_match_all($pa, $string, $t_string);
        // Append the suffix only when characters were actually cut off.
        if (count($t_string[0]) - $start > $sublen) {
            return join('', array_slice($t_string[0], $start, $sublen)).$etc;
        }
        return join('', array_slice($t_string[0], $start, $sublen));
    } else {
        // Double-byte encodings such as GBK: offsets are measured in bytes.
        $start = $start * 2;
        $sublen = $sublen * 2;
        $strlen = strlen($string);
        $tmpstr = '';
        for ($i = 0; $i < $strlen; $i++) {
            if ($i >= $start && $i < ($start + $sublen)) {
                // GBK lead bytes start at 0x81, so test for > 0x80.
                if (ord(substr($string, $i, 1)) > 0x80) {
                    $tmpstr .= substr($string, $i, 2);
                } else {
                    $tmpstr .= substr($string, $i, 1);
                }
            }
            // Skip the trail byte of a double-byte character.
            if (ord(substr($string, $i, 1)) > 0x80) $i++;
        }
        if (strlen($tmpstr) < $strlen) $tmpstr .= $etc;
        return $tmpstr;
    }
}
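
For UTF-8 input, the hand-written byte pattern in Function 2 can also be expressed with PCRE's /u modifier, which makes the regex engine treat the subject as UTF-8. The following is only a sketch of that alternative, not a replacement the original author proposed; the helper name utf8_head is hypothetical and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: return the first $count characters of a UTF-8 string,
// appending $etc only when something was actually cut off.
function utf8_head(string $s, int $count, string $etc = '...'): string
{
    // With the /u modifier, an empty pattern splits between code points.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }
    if (count($chars) <= $count) {
        return $s;
    }
    return implode('', array_slice($chars, 0, $count)) . $etc;
}

echo utf8_head('PHP截取中文字符串', 5), "\n"; // PHP截取...
echo utf8_head('短文本', 5), "\n";            // 短文本
```

The split-then-slice structure is the same as in len(); only the way characters are recognized differs.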

Function 3 (compatible with mb_substr):

/**
+----------------------------------------------------------
* String truncation with support for Chinese and other encodings
+----------------------------------------------------------
* @static
* @access public
+----------------------------------------------------------
* @param string $str the string to truncate
* @param int $start start position (in characters)
* @param int $length number of characters to keep
* @param string $charset character encoding
* @param bool $suffix whether to append '...' to the result
+----------------------------------------------------------
* @return string
+----------------------------------------------------------
*/
function msubstr($str, $start, $length, $charset="utf-8", $suffix=true)
{
    if (function_exists("mb_substr")) {
        // Preferred path: the mbstring extension.
        $slice = mb_substr($str, $start, $length, $charset);
    } elseif (function_exists('iconv_substr')) {
        // Second choice: the iconv extension.
        $slice = iconv_substr($str, $start, $length, $charset);
        if (false === $slice) {
            $slice = '';
        }
    } else {
        // Fallback: match one character at a time with a per-encoding byte pattern.
        $re['utf-8'] = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
        $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
        $re['gbk'] = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
        $re['big5'] = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
        // The pattern keys are lowercase, so normalize the charset name first.
        preg_match_all($re[strtolower($charset)], $str, $match);
        $slice = join("", array_slice($match[0], $start, $length));
    }
    // Note: the suffix is appended even when nothing was cut off.
    return $suffix ? $slice.'...' : $slice;
}
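
Because msubstr() appends '...' whenever $suffix is true, a string shorter than $length still gets the ellipsis. One possible way around that, sketched here under the assumption that mbstring is available and $start is non-negative, is to compare character counts before adding the suffix. The name msubstr_if_cut is hypothetical, and here $suffix is the text to append rather than a boolean.

```php
<?php
// Hypothetical wrapper, assuming the mbstring extension and a non-negative $start.
function msubstr_if_cut(string $str, int $start, int $length, string $charset = 'UTF-8', string $suffix = '...'): string
{
    $slice = mb_substr($str, $start, $length, $charset);
    // Characters left over after the slice mean the text was truncated.
    $truncated = mb_strlen($str, $charset) > $start + mb_strlen($slice, $charset);
    return $truncated ? $slice . $suffix : $slice;
}

echo msubstr_if_cut('中文字符串截取', 0, 4), "\n"; // 中文字符...
echo msubstr_if_cut('中文', 0, 4), "\n";           // 中文
```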



For more on this topic, see this colleague's collected blog posts: http://www.cnblogs.com/qiantuwuliang/archive/2009/07/16/1525139.html