Remove all special characters from a string [duplicate]

Date: 2022-08-22 12:59:49

Possible Duplicate:
Regular Expression Sanitize (PHP)

I am facing an issue with URLs, I want to be able to convert titles that could contain anything and have them stripped of all special characters so they only have letters and numbers and of course I would like to replace spaces with hyphens.

How would this be done? I've heard a lot about regular expressions (regex) being used...

3 answers

#1


503  

Easy peasy:

function clean($string) {
   $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.

   return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}

Usage:

echo clean('a|"bc!@£de^&$f g');

Will output: abcdef-g

Edit:

Hey, just a quick question, how can I prevent multiple hyphens from being next to each other? and have them replaced with just 1?

function clean($string) {
   $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
   $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.

   return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}
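
One case the function above leaves open is a title that starts or ends with spaces or punctuation, which leaves a hyphen at the edge of the result. A minimal sketch, assuming you also want those trimmed (the name clean_trimmed is hypothetical, not part of the answer):

```php
<?php
// Hypothetical variant of clean(): same three steps, plus trimming edge hyphens.
function clean_trimmed(string $string): string {
    $string = str_replace(' ', '-', $string);                // spaces -> hyphens
    $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // drop everything else
    $string = preg_replace('/-+/', '-', $string);            // collapse hyphen runs
    return trim($string, '-');                               // no hyphen at either end
}

echo clean_trimmed('  Hello,   World!  '), "\n"; // Hello-World
```

Whether to trim is a matter of taste; it mostly matters when titles come from user input with stray whitespace.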

#2


83  

Update

The solution below has a "SEO friendlier" version:

function hyphenize($string) {
    $dict = array(
        "I'm"      => "I am",
        "thier"    => "their",
        // Add your own replacements here
    );
    return strtolower(
        preg_replace(
          array( '#[\\s-]+#', '#[^A-Za-z0-9\. -]+#' ),
          array( '-', '' ),
          // the full cleanString() can be downloaded from http://www.unexpectedit.com/php/php-clean-string-of-utf8-chars-convert-to-similar-ascii-char
          cleanString(
              str_replace( // preg_replace can be used to support more complicated replacements
                  array_keys($dict),
                  array_values($dict),
                  urldecode($string)
              )
          )
        )
    );
}

function cleanString($text) {
    $utf8 = array(
        '/[áàâãªä]/u'   =>   'a',
        '/[ÁÀÂÃÄ]/u'    =>   'A',
        '/[ÍÌÎÏ]/u'     =>   'I',
        '/[íìîï]/u'     =>   'i',
        '/[éèêë]/u'     =>   'e',
        '/[ÉÈÊË]/u'     =>   'E',
        '/[óòôõºö]/u'   =>   'o',
        '/[ÓÒÔÕÖ]/u'    =>   'O',
        '/[úùûü]/u'     =>   'u',
        '/[ÚÙÛÜ]/u'     =>   'U',
        '/ç/'           =>   'c',
        '/Ç/'           =>   'C',
        '/ñ/'           =>   'n',
        '/Ñ/'           =>   'N',
        '/–/'           =>   '-', // en dash (U+2013) to ASCII hyphen
        '/[’‘‹›‚]/u'    =>   ' ', // typographic single quotes
        '/[“”«»„]/u'    =>   ' ', // typographic double quotes
        '/\x{00A0}/u'   =>   ' ', // non-breaking space (U+00A0, decimal 160)
    );
    return preg_replace(array_keys($utf8), array_values($utf8), $text);
}

The rationale for the above functions (which I find quite inefficient - the one below is better) is that a service that shall not be named apparently ran spelling checks and keyword recognition on the URLs.

After wasting a lot of time on what I took for a customer's paranoia, I found out they were not imagining things after all: their SEO experts [I am definitely not one] reported that, say, converting "Viaggi Economy Perù" to viaggi-economy-peru "behaved better" than viaggi-economy-per (the previous "cleaning" removed UTF8 characters; Bogotà became bogot, Medellìn became medelln and so on).

There were also some common misspellings that seemed to influence the results, and the only explanation that made sense to me is that our URLs were being unpacked, the words singled out, and used to drive God knows what ranking algorithms. And those algorithms apparently had been fed with UTF8-cleaned strings, so that "Perù" became "Peru" instead of "Per". "Per" did not match and was penalized for it.

In order to both preserve accented UTF8 characters (as their ASCII equivalents) and replace some misspellings, the faster function below evolved into the more accurate (?) function above. $dict needs to be hand-tailored, of course.
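
To make the "Peru" versus "Per" difference concrete, here is a minimal sketch that compares stripping alone with transliterating first. The function names and the tiny character map are hypothetical and deliberately incomplete, and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper: map a few accented letters to ASCII, then strip the rest.
function slug_with_map(string $title): string {
    $map = ['ù' => 'u', 'à' => 'a', 'ì' => 'i', 'é' => 'e']; // illustrative only
    $ascii = strtr($title, $map);                            // transliterate first
    $ascii = preg_replace('/[^A-Za-z0-9 -]/', '', $ascii);   // then strip leftovers
    return strtolower(preg_replace('/[ -]+/', '-', trim($ascii)));
}

// Stripping without transliteration, for comparison.
function slug_strip_only(string $title): string {
    $ascii = preg_replace('/[^A-Za-z0-9 -]/', '', $title);   // accented bytes vanish
    return strtolower(preg_replace('/[ -]+/', '-', trim($ascii)));
}

echo slug_with_map('Viaggi Economy Perù'), "\n";   // viaggi-economy-peru
echo slug_strip_only('Viaggi Economy Perù'), "\n"; // viaggi-economy-per
```

A real map would need to cover every character you expect in your titles, which is what cleanString() attempts.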

Previous answer

A simple approach:

// Remove all characters except A-Z, a-z, 0-9, dots, hyphens and spaces
// Note that the hyphen must go last not to be confused with a range (A-Z)
// and the dot is escaped with \ (harmless, though not required inside a character class)
$str = preg_replace('/[^A-Za-z0-9\. -]/', '', $str);

// Replace sequences of spaces with hyphen
$str = preg_replace('/  */', '-', $str);

// The above means "a space, followed by a space repeated zero or more times"
// (should be equivalent to / +/)

// You may also want to try this alternative:
$str = preg_replace('/\\s+/', '-', $str);

// where \s+ means "one or more whitespace characters" (whitespace also covers
// tabs and newlines, not just the space character) just to be sure and include everything
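
The difference between a space and whitespace is easiest to see on a string that contains a tab and a newline. A small sketch (the variable names are made up for illustration):

```php
<?php
$input = "one  two\tthree\nfour";

// ' +' only matches runs of the space character, so the tab and newline stay.
$spacesOnly = preg_replace('/ +/', '-', $input);

// '\s+' matches runs of any whitespace: spaces, tabs and newlines.
$anyWhitespace = preg_replace('/\s+/', '-', $input);

var_dump($spacesOnly === "one-two\tthree\nfour"); // bool(true)
echo $anyWhitespace, "\n";                         // one-two-three-four
```

Note that in the snippet above, the first preg_replace already deletes tabs and newlines (they are not in the allowed set), so \s+ only makes a difference if it runs before that step.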

Note that you might have to first urldecode() the URL, since %20 and + both are actually spaces - I mean, if you have "Never%20gonna%20give%20you%20up" you want it to become Never-gonna-give-you-up, not Never20gonna20give20you20up . You might not need it, but I thought I'd mention the possibility.
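
A quick sketch of what goes wrong without decoding, assuming the input really is URL-encoded (the sample string mixes %20 and + on purpose):

```php
<?php
$raw = 'Never%20gonna+give%20you%20up';

// Without decoding, '%' and '+' are stripped and "20" stays glued to the words.
$undecoded = preg_replace('/[^A-Za-z0-9 -]/', '', $raw);

// urldecode() turns both %20 and + into spaces before the hyphen step.
$decoded = preg_replace('/ +/', '-', urldecode($raw));

echo $undecoded, "\n"; // Never20gonnagive20you20up
echo $decoded, "\n";   // Never-gonna-give-you-up
```

If the title is plain text rather than an encoded URL, skip urldecode(), since it would also turn a literal + into a space.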

So the finished function along with test cases:

function hyphenize($string) {
    return
    // strtolower(            // optional: uncomment to lowercase the result
          preg_replace(
            array('#[\\s-]+#', '#[^A-Za-z0-9\. -]+#'),
            array('-', ''),
        //     cleanString(   // optional: uncomment to transliterate accents
              urldecode($string)
        //     )
        )
    // )
    ;
}

print implode("\n", array_map(
    function($s) {
            return $s . ' becomes ' . hyphenize($s);
    },
    array(
    'Never%20gonna%20give%20you%20up',
    "I'm not the man I was",
    "'Légeresse', dit sa majesté",
    )));


Never%20gonna%20give%20you%20up    becomes  never-gonna-give-you-up
I'm not the man I was              becomes  im-not-the-man-i-was
'Légeresse', dit sa majesté        becomes  legeresse-dit-sa-majeste

To handle UTF-8 I used a cleanString implementation found here. It could be simplified and wrapped inside the function here for performance.

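The function above can also convert the result to lowercase, but that's a matter of taste, so the code to do so has been commented out (as has the cleanString() call). The sample output shown above corresponds to running with both enabled.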

#3


32  

Here, check out this function:

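function seo_friendly_url($string){
    // Remove the literal sequence [', '] (note: the array holds a single string)
    $string = str_replace(array('[\', \']'), '', $string);
    // Remove anything enclosed in square brackets (U = ungreedy)
    $string = preg_replace('/\[.*\]/U', '', $string);
    // Replace HTML entities already present (named or numeric) with a hyphen
    $string = preg_replace('/&(amp;)?#?[a-z0-9]+;/i', '-', $string);
    // Encode accented characters as entities, e.g. é becomes &eacute;
    $string = htmlentities($string, ENT_COMPAT, 'utf-8');
    // Keep only the base letter of each such entity (&eacute; becomes e)
    $string = preg_replace('/&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);/i', '\\1', $string );
    // Turn every remaining non-alphanumeric into a hyphen, then collapse runs
    $string = preg_replace(array('/[^a-z0-9]/i', '/[-]+/') , '-', $string);
    // Trim hyphens at the edges and lowercase the result
    return strtolower(trim($string, '-'));
}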
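
The least obvious part of this function is how it gets rid of accents: it goes through HTML entities. A minimal sketch of just that step, assuming UTF-8 input (the variable names are made up, and the entity list is shortened from the one above):

```php
<?php
$word = 'majesté';

// htmlentities() encodes the accented letter as a named entity.
$entities = htmlentities($word, ENT_COMPAT, 'UTF-8');

// The regex keeps the first letter of the entity name, which is the base letter.
$base = preg_replace(
    '/&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);/i',
    '$1',
    $entities
);

echo $entities, "\n"; // majest&eacute;
echo $base, "\n";     // majeste
```

This only works for characters that have a named entity of the form letter plus suffix; anything else falls through to the later non-alphanumeric replacement and becomes a hyphen.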
