How do I convert Word smart quotes and em dashes in a string?

Time: 2022-05-11 22:49:23

I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€

What function should I call on the input string to convert smart quotes to regular quotes and em dashes to regular dashes?

I am working in PHP.

Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html

Some notes on my environment:

The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (update: by explicitly setting the meta content-type).

On those pages, the smart quotes and em dashes appear as a diamond with a question mark.

Solution:

Thanks again for the responses. The solution was twofold:

  1. Make sure the database and HTML files are explicitly set to use UTF-8 encoding.
  2. Use htmlspecialchars() instead of htmlentities().
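
As a rough illustration of the htmlspecialchars() point, here is a minimal sketch of escaping user text for a UTF-8 page. The helper name escapeForHtml and the sample string are hypothetical, and the flags shown are one reasonable choice, not necessarily what the original poster used.

```php
<?php
// Hypothetical helper: escape text for output in a UTF-8 HTML page.
// htmlspecialchars() only rewrites &, <, >, " and (with ENT_QUOTES) ',
// so UTF-8 smart quotes and em dashes pass through unchanged.
function escapeForHtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Curly quotes and an em dash around characters that do need escaping.
$sample = "\u{201C}Tom & Jerry\u{201D} \u{2014} <b>";
echo escapeForHtml($sample), PHP_EOL;
```

Because the smart quotes are left alone, they display correctly as long as the page itself is served as UTF-8.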

13 solutions

#1


15  

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

#2


9  

The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.

The content of the HTML can be in UTF-8, yes, but are you also explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8? Try returning a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML pages:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

That way, the content type of the data submitted to PHP will also be the same.

I had a similar issue and adding the <meta> tag worked for me.
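
For the header variant, a minimal sketch in PHP might look like the following. The helper name utf8ContentTypeHeader is made up for illustration; the only real requirement is that header() runs before any output is sent.

```php
<?php
// Hypothetical helper that builds the header line, kept separate so it is easy to check.
function utf8ContentTypeHeader(): string
{
    return 'Content-Type: text/html; charset=utf-8';
}

// header() only works before any output has been sent to the browser.
if (!headers_sent()) {
    header(utf8ContentTypeHeader());
}
```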

#3


4  

It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.

Here is some info on migrating your database to another character encoding, at least for a MySQL database.

#4


2  

This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.

What we do is force the text through iconv:

// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);

The //IGNORE flag means that anything that can't be translated will be thrown away.

If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
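
Since smart quotes live in Windows-1252 rather than ISO-8859-1 (as another answer here points out), a variant of the call above could name Windows-1252 as the source encoding. This is only a sketch that assumes the input really is Windows-1252 bytes; the helper name cp1252ToUtf8 is hypothetical, and the charset names iconv accepts depend on the platform's iconv library.

```php
<?php
// Hypothetical helper: treat $bytes as Windows-1252 and return UTF-8.
// Returns false if iconv cannot perform the conversion.
function cp1252ToUtf8(string $bytes)
{
    return iconv('CP1252', 'UTF-8', $bytes);
}

// 0x93 / 0x94 are the curly double quotes and 0x97 the em dash in Windows-1252.
$converted = cp1252ToUtf8("\x93Hi\x94 \x97 ok");
echo $converted, PHP_EOL;
```

With the correct source encoding nothing needs to be discarded, so //IGNORE is left out here.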

#5


1  

We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your php file is saved in the right encoding format, etc.
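
A minimal sketch of that approach with str_replace(), assuming the text has already arrived as valid UTF-8. The helper name straightenPunctuation and the choice of replacements are illustrative, not exhaustive.

```php
<?php
// Hypothetical helper: replace common UTF-8 "smart" punctuation with ASCII.
// Assumes $text is already valid UTF-8; the map below is only a starting point.
function straightenPunctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];
    // str_replace() works on bytes, which is safe here because each key
    // is a complete UTF-8 sequence.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo straightenPunctuation("\u{201C}It\u{2019}s nice\u{201D} \u{2014} ok"), PHP_EOL;
```

Writing the characters as \u{...} escapes sidesteps the question of which encoding the PHP file itself is saved in.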

#6


1  

In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"

#7


1  

You could try mb_convert_encoding from ISO-8859-1 to UTF-8.

$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, str_replace or preg_replace them yourself.

#8


1  

You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).

In practice this means running a query SET NAMES 'utf8';

http://www.phpwact.org/php/i18n/utf-8/mysql

Also, smart quotes are part of the Windows-1252 character set, not ISO-8859-1 (Latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.

#9


1  

The problem is with the MySQL charset. I fixed my issues with this line of code:

mysql_set_charset('utf8',$link); 
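
The mysql_* functions were removed in PHP 7, so on current PHP the same idea would go through mysqli. The sketch below is an assumption about how that might look, not the answerer's code: the helper name connectUtf8 and its parameters are hypothetical placeholders, and 'utf8mb4' is used as MySQL's full UTF-8 charset.

```php
<?php
// Hypothetical connection helper; host, user, password and database name
// are placeholders supplied by the caller.
function connectUtf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    // Ask the server to exchange data with this client as UTF-8.
    // 'utf8mb4' is MySQL's 4-byte UTF-8; plain 'utf8' is the 3-byte variant.
    if (!$link->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }
    return $link;
}
```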

#10


1  

You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.

#11


1  

If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...

You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():

$trans_tbl = false;

function htmlEncode($text) {

  global $trans_tbl;

  // create translation table once
  if(!$trans_tbl) {
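    // start with the default set of conversions and add more.
    // The chr() keys below are Windows-1252 byte values, so request the
    // matching single-byte table explicitly (newer PHP versions default to UTF-8).

    $trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'cp1252');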

    $trans_tbl[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
    $trans_tbl[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
    $trans_tbl[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
    $trans_tbl[chr(133)] = '&hellip;';    // Horizontal Ellipsis
    $trans_tbl[chr(134)] = '&dagger;';    // Dagger
    $trans_tbl[chr(135)] = '&Dagger;';    // Double Dagger
    $trans_tbl[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
    $trans_tbl[chr(137)] = '&permil;';    // Per Mille Sign
    $trans_tbl[chr(138)] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans_tbl[chr(139)] = '&lsaquo;';    // Single Left-Pointing Angle Quotation Mark
    $trans_tbl[chr(140)] = '&OElig;';    // Latin Capital Ligature OE

    // smart single/ double quotes (from MS)
    $trans_tbl[chr(145)] = '&lsquo;'; 
    $trans_tbl[chr(146)] = '&rsquo;'; 
    $trans_tbl[chr(147)] = '&ldquo;'; 
    $trans_tbl[chr(148)] = '&rdquo;'; 

    $trans_tbl[chr(149)] = '&bull;';    // Bullet
    $trans_tbl[chr(150)] = '&ndash;';    // En Dash
    $trans_tbl[chr(151)] = '&mdash;';    // Em Dash
    $trans_tbl[chr(152)] = '&tilde;';    // Small Tilde
    $trans_tbl[chr(153)] = '&trade;';    // Trade Mark Sign
    $trans_tbl[chr(154)] = '&scaron;';    // Latin Small Letter S With Caron
    $trans_tbl[chr(155)] = '&rsaquo;';    // Single Right-Pointing Angle Quotation Mark
    $trans_tbl[chr(156)] = '&oelig;';    // Latin Small Ligature OE
    $trans_tbl[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis

    ksort($trans_tbl);
  }

  // escape HTML      
  return strtr($text, $trans_tbl); 
}

#12


0  

This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "&#8220;"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
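
One way to find out what PHP sees is to dump the raw bytes of the submitted string. The helper below is a hypothetical sketch (the name inspectBytes is made up); it reports the bytes as hex and whether they form valid UTF-8.

```php
<?php
// Hypothetical helper: report the raw bytes of $input and whether they are valid UTF-8.
function inspectBytes(string $input): array
{
    return [
        'hex'        => bin2hex($input),
        // preg_match() with the /u modifier returns false for invalid UTF-8 subjects.
        'valid_utf8' => preg_match('//u', $input) === 1,
    ];
}

print_r(inspectBytes("\u{201C}")); // left double quote encoded as UTF-8
print_r(inspectBytes("\x93"));     // the same character as a single Windows-1252 byte
```

A three-byte sequence such as e2809c means the quote arrived as UTF-8, while a lone 93 byte points to Windows-1252 input.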

The better solution would probably be to make the data path UTF-8 from end to end, as other answers suggest.

#13


0  

Actually, the problem is not happening in PHP but in JavaScript. It is caused by copy/paste from Word, so you need to solve it in JavaScript before you pass the text to PHP. Please see this answer: https://*.com/a/6219023/1857295.
