Encoding issues with a UTF-8 CSV file when opening in Excel and TextEdit

Date: 2023-01-05 18:51:33

I recently added a CSV-download button that takes data from the database (Postgres) as an array from the server (Ruby on Rails), and turns it into a CSV file on the client side (JavaScript, HTML5). I'm currently testing the CSV file and I am coming across some encoding issues.

When I view the CSV file via 'less', the file appears fine. But when I open the file in Excel or TextEdit, I start seeing weird characters like

—, â€, “

appear in the text. Basically, I see the characters that are described here: http://digwp.com/2011/07/clean-up-weird-characters-in-database/

I read that this sort of issue can arise when the database encoding is set to the wrong one. BUT, the database that I am using is set to use UTF8 encoding. And when I debug through the JS code that creates the CSV file, the text appears normal. (Though that could simply be a capability of Chrome that less lacks.)

I'm feeling frustrated because the only thing I am learning from my online search is that there could be many reasons why the encoding is not working. I'm not sure which part is at fault (so excuse me as I initially tag numerous things), and nothing I have tried has shed new light on my problem.

For reference, here's the JavaScript snippet that creates the CSV file!

$(document).ready(function() {
var csvData = <%= raw to_csv(@view_scope, clicks_post).as_json %>;
var csvContent = "data:text/csv;charset=utf-8,";
csvData.forEach(function(infoArray, index){
  var dataString = infoArray.join(",");
  csvContent += dataString+ "\n";
}); 
var encodedUri = encodeURI(csvContent);
var button = $('<a>');
button.text('Download CSV');
button.addClass("button right");
button.attr('href', encodedUri);
button.attr('target','_blank');
button.attr('download','<%=title%>_25_posts.csv');
$("#<%=title%>_download_action").append(button);
});
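Incidentally (not the encoding issue being asked about, but worth knowing with data-URL snippets like the above): `encodeURI` leaves URI-reserved characters such as `#` unescaped, so a `#` inside the CSV data silently truncates the data URL at that point. A sketch of the difference:

```javascript
// encodeURI leaves '#' (and other URI-reserved characters) unescaped, so a
// '#' inside the CSV payload truncates a data: URL there. encodeURIComponent
// escapes it, so percent-escape only the payload with that instead.
var payload = 'id,note\n1,see #42';

var naive = 'data:text/csv;charset=utf-8,' + encodeURI(payload);
var safe  = 'data:text/csv;charset=utf-8,' + encodeURIComponent(payload);

console.log(naive.indexOf('#') > -1); // true: raw '#' survives and cuts the URL short
console.log(safe.indexOf('#') > -1);  // false: '#' became %23
```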

6 Answers

#1


36  

As @jlarson updated with the information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least for 2011 and back, rather poor support for reading Unicode formats when importing files.

Support for UTF-8 seems to be close to non-existent; I have read a few comments saying it works, whilst the majority say it does not. Unfortunately I do not have any Mac to test on. So again: the files themselves should be OK as UTF-8, but the import halts the process.

I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16, little- and big-endian, with/without BOM, etc.

The code should probably be refactored but should be OK for testing. It might work better than UTF-8. Of course this also usually means bigger data transfers, as any glyph is two or four bytes.

You can find a fiddle here:

Unicode export sample Fiddle

Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL having UTF-8, UTF-16 big/little endian, and +/- BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution even if it works.


Typical usage:

// Initiate
encoder = new DataEnc({
    mime   : 'text/csv',
    charset: 'UTF-16BE',
    bom    : true
});

// Convert data to percent escaped text
encoder.enc(data);

// Get result
var result = encoder.pay();

There are two result properties of the object:

1.) encoder.lead

This is the mime-type, charset, etc. for the data URL. It is built from the options passed to the initializer; one can also say .config({ ... new conf ... }).intro() to re-build it.

data:[<MIME-type>][;charset=<encoding>][;base64]

You can specify base64, but there is no base64 conversion (at least not this far).

2.) encoder.buf

This is a string with the percent escaped data.

The .pay() function simply returns 1.) and 2.) as one.


Main code:


function DataEnc(a) {
    this.config(a);
    this.intro();
}
/*
* http://www.iana.org/assignments/character-sets/character-sets.xhtml
* */
DataEnc._enctype = {
        u8    : ['u8', 'utf8'],
        // RFC-2781, Big endian should be presumed if none given
        u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
        u16le : ['u16le', 'utf16le', 'ucs2le']
};
DataEnc._BOM = {
        'none'     : '',
        'UTF-8'    : '%ef%bb%bf', // Discouraged
        'UTF-16BE' : '%fe%ff',
        'UTF-16LE' : '%ff%fe'
};
DataEnc.prototype = {
    // Basic setup
    config : function(a) {
        var opt = {
            charset: 'u8',
            mime   : 'text/csv',
            base64 : 0,
            bom    : 0
        };
        a = a || {};
        this.charset = typeof a.charset !== 'undefined' ?
                        a.charset : opt.charset;
        this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
        this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
        this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;

        this.enc = this.utf8;
        this.buf = '';
        this.lead = '';
        return this;
    },
    // Create lead based on config
    // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
    intro : function() {
        var
            g = [],
            c = this.charset || '',
            b = 'none'
        ;
        if (this.mime && this.mime !== '')
            g.push(this.mime);
        if (c !== '') {
            c = c.replace(/[-\s]/g, '').toLowerCase();
            if (DataEnc._enctype.u8.indexOf(c) > -1) {
                c = 'UTF-8';
                if (this.bom)
                    b = c;
                this.enc = this.utf8;
            } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                c = 'UTF-16BE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16be;
            } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                c = 'UTF-16LE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16le;
            } else {
                if (c === 'copy')
                    c = '';
                this.enc = this.copy;
            }
        }
        if (c !== '')
            g.push('charset=' + c);
        if (this.base64)
            g.push('base64');
        this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
        return this;
    },
    // Deliver
    pay : function() {
        return this.lead + this.buf;
    },
    // UTF-16BE
    utf16be : function(t) { // U+0500 => %05%00
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            } else {
                buf.push('00');
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            }
        }
        this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-16LE
    utf16le : function(t) { // U+0500 => %00%05
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
            } else {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push('00');
            }
        }
        this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-8
    utf8 : function(t) {
        this.buf += encodeURIComponent(t);
        return this;
    },
    // Direct copy
    copy : function(t) {
        this.buf += t;
        return this;
    }
};

Previous answer:


I do not have any setup to replicate yours, but if your case is the same as @jlarson then the resulting file should be correct.

This answer became somewhat long (fun topic, you say?), but it discusses various aspects around the question: what is (likely) happening, and how to actually check what is going on in various ways.

TL;DR:

The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force the application to read the file as UTF-8 by using its import function or other means.


PS: The UniSearcher is a nice tool to have available on this journey.

The long way around

The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.

As an example if we take the script of jlarson it takes the data Array:

data = [['name', 'city', 'state'],
        ['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']]

This one is merged into the string:

 name,city,state<newline>
 \u0500\u05E1\u0E01\u1054,seattle,washington<newline>

which translates by Unicode to:

 name,city,state<newline>
 Ԁסกၔ,seattle,washington<newline>

As UTF-8 uses ASCII as its base (bytes with the highest bit not set are the same as in ASCII), the only special sequence in the test data is "Ԁסกၔ", which in turn is:

Code-point  Glyph      UTF-8
----------------------------
    U+0500    Ԁ        d4 80
    U+05E1    ס        d7 a1
    U+0E01    ก     e0 b8 81
    U+1054    ၔ     e1 81 94

Looking at the hex-dump of the downloaded file:

0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a  name,city,state.
0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474  ..........,seatt
0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a       le,washington.

On second line we find d480 d7a1 e0b8 81e1 8194 which match up with the above:

0000010: d480  d7a1  e0b8 81  e1 8194 2c73 6561 7474  ..........,seatt
         |   | |   | |     |  |     |  | |  | |  | |
         +-+-+ +-+-+ +--+--+  +--+--+  | |  | |  | |
           |     |      |        |     | |  | |  | |
           Ԁ     ס      ก        ၔ     , s  e a  t t

None of the other characters are mangled either.

Do similar tests if you want. The result should be similar.


By sample provided —, â€, “

We can also have a look at the sample provided in the question. It is reasonable to assume that the text is represented in Excel / TextEdit by code-page 1252.

To quote Wikipedia on Windows-1252:

Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as "ansinew".

Retrieving the original bytes

To translate it back into it's original form we can look at the code page layout, from which we get:

Character:   <â>  <€>  <”>  <,>  < >  <â>  <€>  < >  <,>  < >  <â>  <€>  <œ>
U.Hex    :    e2 20ac 201d   2c   20   e2 20ac   9d   2c   20   e2 20ac  153
T.Hex    :    e2   80   94   2c   20   e2   80   9d*  2c   20   e2   80   9c
  • U is short for Unicode
  • T is short for Translated

For example:

â => Unicode 0xe2   => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80

Special cases like 9d do not have a corresponding code-point in CP-1252; these we simply copy directly.

Note: if one looks at the mangled string by copying the text to a file and doing a hex-dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:

set fenc=utf-16
" Or
set fenc=ucs-2

Bytes to UTF-8

We then combine the result, the T.Hex line, into UTF-8. In UTF-8 sequences the bytes are represented by a leading byte telling us how many subsequent bytes make up the glyph. For example, if a byte has the binary value 110x xxxx we know that this byte and the next represent one code-point; a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set, so any byte matching 0xxx xxxx is standalone; a total of one byte.

0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash)  —
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq)  “

Conclusion: the original UTF-8 string was:

—, ”, “

Mangling it back

We can also do the reverse. The original string as bytes:

UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c

Corresponding values in CP-1252:

e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...

and so on, result:

—, â€, “

Importing to MS Excel

In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.

  • Method one:

Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.

As an example save the file as "testfile", with no extension. Then in Excel open the file, confirm that we actually want to open this file, and voilà we get served with the encoding option. Select UTF-8, and file should be correctly read.

  • Method two:

Use import data instead of open file. Something like:

Data -> Import External Data -> Import Data

Select encoding and proceed.

Check that Excel and selected font actually supports the glyph

We can also test the font support for the Unicode characters by using the, sometimes, friendlier clipboard. For example, copy text from this page into Excel:

If support for the code points exist, the text should render fine.


Linux

On Linux, which is primarily UTF-8 in userland, this should not be an issue. Libre Office Calc, Vim, etc. all show the files correctly rendered.


Why it works (or should)

The spec on encodeURI states (also read sec-15.1.3):

The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.

We can simply test this in our console, for example by saying:

>> encodeURI('Ԁסกၔ,seattle,washington')
<< "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"

As we can see, the escape sequences are equal to the ones in the hex dump above:

%D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
 d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)

or, testing a 4-byte code:

>> encodeURI('\uDB84\uDC01')
<< "%F3%B1%80%81"

If this does not comply

If none of this applies, it could help if you added:

  1. Sample of expected input vs mangled output (copy paste).
  2. Sample hex-dump of original data vs result file.

#2


5  

I ran into exactly this yesterday. I was developing a button that exports the contents of an HTML table as a CSV download. The functionality of the button itself is almost identical to yours – on click I read the text from the table and create a data URI with the CSV content.

When I tried to open the resulting file in Excel it was clear that the "£" symbol was getting read incorrectly. The 2-byte UTF-8 representation was being processed as ASCII, resulting in an unwanted garbage character. Some Googling indicated this was a known issue with Excel.

I tried adding the byte order mark at the start of the string – Excel just interpreted it as ASCII data. I then tried various things to convert the UTF-8 string to ASCII (such as csvData.replace('\u00a3', '\xa3')), but I found that any time the data is coerced to a JavaScript string it will become UTF-8 again. The trick is to convert it to binary and then Base64 encode it without converting back to a string along the way.

I already had CryptoJS in my app (used for HMAC authentication against a REST API) and I was able to use it to create an ASCII encoded byte sequence from the original string, then Base64 encode it and create a data URI. This worked, and the resulting file, when opened in Excel, does not display any unwanted characters.

The essential bit of code that does the conversion is:

var csvHeader = 'data:text/csv;charset=iso-8859-1;base64,'
var encodedCsv =  CryptoJS.enc.Latin1.parse(csvData).toString(CryptoJS.enc.Base64)
var dataURI = csvHeader + encodedCsv

Where csvData is your CSV string.

There are probably ways to do the same thing without CryptoJS if you don't want to bring in that library but this at least shows it is possible.

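For reference, here is a minimal sketch of the same idea without CryptoJS, using the built-in btoa (an assumption of this sketch: the data only contains characters up to U+00FF, since btoa throws for anything above that):

```javascript
// btoa() treats each UTF-16 code unit of a string as one byte, which is
// exactly a Latin-1 (ISO-8859-1) encoding; it throws for chars > U+00FF.
// (Fallback via Buffer so the sketch also runs under older Node versions.)
var toB64 = typeof btoa === 'function'
  ? btoa
  : function (s) { return Buffer.from(s, 'latin1').toString('base64'); };

function csvToLatin1DataUri(csvData) {
  return 'data:text/csv;charset=iso-8859-1;base64,' + toB64(csvData);
}

console.log(csvToLatin1DataUri('price\n£9.99'));
```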
#3


3  

Excel likes Unicode in UTF-16 LE with BOM encoding. Output the correct BOM (FF FE), then convert all your data from UTF-8 to UTF-16 LE.

Windows uses UTF-16 LE internally, so some applications work better with UTF-16 than with UTF-8.

I haven't tried to do that in JS, but there are various scripts on the web to convert UTF-8 to UTF-16. Conversion between UTF variations is pretty easy and takes just a dozen lines.

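As a sketch of what that conversion can look like in JavaScript (JS strings are already sequences of UTF-16 code units, so for the most part it is just byte shuffling):

```javascript
// Encode a JS string as UTF-16LE with a BOM (FF FE), the layout Excel likes.
// Each charCodeAt() value is emitted low byte first; surrogate pairs fall
// out correctly as-is, since they are stored as two code units.
function toUtf16leWithBom(s) {
  var bytes = [0xff, 0xfe];              // BOM for little endian
  for (var i = 0; i < s.length; ++i) {
    var c = s.charCodeAt(i);
    bytes.push(c & 0xff);                // low byte first
    bytes.push((c >> 8) & 0xff);         // then high byte
  }
  return new Uint8Array(bytes);
}

// In a browser one would then typically hand this to a Blob:
//   var blob = new Blob([toUtf16leWithBom(csv)], { type: 'text/csv' });
//   link.href = URL.createObjectURL(blob);
```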
#4


2  

I was having a similar issue with data that was pulled into JavaScript from a SharePoint list. It turned out to be something called a "Zero Width Space" character, and it was being displayed as †when it was brought into Excel. Apparently, SharePoint inserts these sometimes when a user hits 'backspace'.

I replaced them with this quickfix:

var mystring = myString.replace(/\u200B/g,'');

It looks like you may have other hidden characters in there. I found the codepoint for the zero-width character in mine by looking at the output string in the Chrome inspector. The inspector couldn't render the character, so it replaced it with a red dot. When you hover your mouse over that red dot, it gives you the codepoint (e.g. \u200B), and you can just sub in the various codepoints for the invisible characters and remove them that way.

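To surface such characters without the inspector, one can simply list every code point outside printable ASCII (a quick diagnostic sketch; the threshold is an assumption that the data is otherwise plain ASCII):

```javascript
// List the code points of all characters outside printable ASCII, so
// invisible characters like U+200B become easy to spot and then strip.
function hiddenCodePoints(s) {
  var found = [];
  for (var i = 0; i < s.length; ++i) {
    var c = s.charCodeAt(i);
    if (c < 0x20 || c > 0x7e) {
      found.push('U+' + ('0000' + c.toString(16).toUpperCase()).slice(-4));
    }
  }
  return found;
}

console.log(hiddenCodePoints('foo\u200Bbar')); // ['U+200B']
```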
#5


0  

It could be a problem in your server encoding.

You could try (assuming locale english US) if you are running Linux:

sudo locale-gen en_US en_US.UTF-8
sudo dpkg-reconfigure locales

#6


0  

button.href = 'data:' + mimeType + ';charset=UTF-8,%ef%bb%bf' + encodedUri;

this should do the trick

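Plugged into a download snippet like the one in the question, that looks something like this (a sketch; %EF%BB%BF are the percent-escaped bytes of the UTF-8 BOM, and the row data is made up for illustration):

```javascript
// Prepend the UTF-8 BOM (EF BB BF) to the percent-escaped payload so Excel
// detects the file as UTF-8 instead of falling back to ANSI / CP-1252.
var rows = [['name', 'city'], ['Ԁסกၔ', 'seattle']];
var payload = rows.map(function (r) { return r.join(','); }).join('\n');
var href = 'data:text/csv;charset=utf-8,%EF%BB%BF' + encodeURIComponent(payload);

console.log(href.slice(0, 40));
```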
#1


36  

As @jlarson updated with information that Mac was the biggest culprit we might get some further. Office for Mac has, at least 2011 and back, rather poor support for reading Unicode formats when importing files.

正如@jlarson更新的信息,Mac是最大的罪魁祸首,我们可能会得到更多。至少在2011年和2011年,Office for Mac在导入文件时对读取Unicode格式的支持都很差。

Support for UTF-8 seems to be close to non-existent, have read a tiny few comments about it working, whilst the majority say it does not. Unfortunately I do not have any Mac to test on. So again: The files themselves should be OK as UTF-8, but the import halts the process.

对UTF-8的支持似乎几乎不存在,已经阅读了一些关于它工作的评论,而大多数人说它不工作。不幸的是,我没有任何Mac要测试。同样:文件本身应该可以作为UTF-8,但是导入会中止进程。

Wrote up a quick test in Javascript for exporting percent escaped UTF-16 little and big endian, with- / without BOM etc.

编写了一个Javascript快速测试,用于导出% UTF-16小和大的endian,带有- /没有BOM等等。

Code should probably be refactored but should be OK for testing. It might work better then UTF-8. Of course this also usually means bigger data transfers as any glyph is two or four bytes.

代码应该被重构,但是应该可以进行测试。它可能比UTF-8更好。当然,这通常意味着更大的数据传输,因为任何字形都是2或4个字节。

You can find a fiddle here:

你可以在这里找到小提琴

Unicode export sample Fiddle

Unicode出口样品小提琴

Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to data URL having UTF-8, UTF-16 big/little endian and +/- BOM. There is one option in the fiddle to replace commas with tabs, – but believe that would be rather hackish and fragile solution if it works.

注意,它不以任何特定的方式处理CSV。它主要用于纯转换为具有UTF-8、UTF-16 big/little endian和+/- BOM的数据URL。在古提琴中有一个选项可以用制表符代替逗号,但相信这是一个相当陈腐和脆弱的解决方案,如果它有效的话。


Typically use like:

通常使用:

// Initiate
encoder = new DataEnc({
    mime   : 'text/csv',
    charset: 'UTF-16BE',
    bom    : true
});

// Convert data to percent escaped text
encoder.enc(data);

// Get result
var result = encoder.pay();

There is two result properties of the object:

该对象有两个结果属性:

1.) encoder.lead

1)encoder.lead

This is the mime-type, charset etc. for data URL. Built from options passed to initializer, or one can also say .config({ ... new conf ...}).intro() to re-build.

这是mime类型、字符集等数据URL。从传递给初始化器的选项中构建,或者也可以说.config({…新的conf…).intro()来重新构建。

data:[<MIME-type>][;charset=<encoding>][;base64]

You can specify base64, but there is no base64 conversion (at least not this far).

您可以指定base64,但是没有base64转换(至少目前还没有)。

2.) encoder.buf

2)encoder.buf

This is a string with the percent escaped data.

这是一个带有转义数据百分比的字符串。

The .pay() function simply return 1.) and 2.) as one.

函数的作用是:返回1.)和2.)。


Main code:


function DataEnc(a) {
    this.config(a);
    this.intro();
}
/*
* http://www.iana.org/assignments/character-sets/character-sets.xhtml
* */
DataEnc._enctype = {
        u8    : ['u8', 'utf8'],
        // RFC-2781, Big endian should be presumed if none given
        u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
        u16le : ['u16le', 'utf16le', 'ucs2le']
};
DataEnc._BOM = {
        'none'     : '',
        'UTF-8'    : '%ef%bb%bf', // Discouraged
        'UTF-16BE' : '%fe%ff',
        'UTF-16LE' : '%ff%fe'
};
DataEnc.prototype = {
    // Basic setup
    config : function(a) {
        var opt = {
            charset: 'u8',
            mime   : 'text/csv',
            base64 : 0,
            bom    : 0
        };
        a = a || {};
        this.charset = typeof a.charset !== 'undefined' ?
                        a.charset : opt.charset;
        this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
        this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
        this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;

        this.enc = this.utf8;
        this.buf = '';
        this.lead = '';
        return this;
    },
    // Create lead based on config
    // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
    intro : function() {
        var
            g = [],
            c = this.charset || '',
            b = 'none'
        ;
        if (this.mime && this.mime !== '')
            g.push(this.mime);
        if (c !== '') {
            c = c.replace(/[-\s]/g, '').toLowerCase();
            if (DataEnc._enctype.u8.indexOf(c) > -1) {
                c = 'UTF-8';
                if (this.bom)
                    b = c;
                this.enc = this.utf8;
            } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                c = 'UTF-16BE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16be;
            } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                c = 'UTF-16LE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16le;
            } else {
                if (c === 'copy')
                    c = '';
                this.enc = this.copy;
            }
        }
        if (c !== '')
            g.push('charset=' + c);
        if (this.base64)
            g.push('base64');
        this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
        return this;
    },
    // Deliver
    pay : function() {
        return this.lead + this.buf;
    },
    // UTF-16BE
    utf16be : function(t) { // U+0500 => %05%00
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            } else {
                buf.push('00');
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            }
        }
        this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-16LE
    utf16le : function(t) { // U+0500 => %00%05
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
            } else {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push('00');
            }
        }
        this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-8
    utf8 : function(t) {
        this.buf += encodeURIComponent(t);
        return this;
    },
    // Direct copy
    copy : function(t) {
        this.buf += t;
        return this;
    }
};

Previous answer:


I do not have any setup to replicate yours, but if your case is the same as @jlarson then the resulting file should be correct.

我没有任何设置来复制您的,但是如果您的案例与@jlarson相同,那么结果文件应该是正确的。

This answer became somewhat long, (fun topic you say?), but discuss various aspects around the question, what is (likely) happening, and how to actually check what is going on in various ways.

这个答案有点长(你说的很有趣),但是围绕这个问题讨论不同的方面,什么(可能)正在发生,以及如何以不同的方式检查正在发生的事情。

TL;DR:

The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force application to read file as UTF-8 by using import or other means.

文本可能被导入为ISO-8859-1、Windows-1252或类似的格式,而不是UTF-8。强制应用程序使用导入或其他方法将文件读为UTF-8。


PS: The UniSearcher is a nice tool to have available on this journey.

在这次旅行中,UniSearcher是一个很好的工具。

The long way around

The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.

要百分百确定我们要查看的内容,“最简单”的方法是在结果上使用十六进制编辑器。或者使用hexdump、xxd或类似的命令行来查看文件。在这种情况下,字节序列应该是来自脚本的UTF-8。

As an example if we take the script of jlarson it takes the data Array:

以jlarson的脚本为例它采用了数据数组:

data = ['name', 'city', 'state'],
       ['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']

This one is merged into the string:

这个合并到字符串中:

 name,city,state<newline>
 \u0500\u05E1\u0E01\u1054,seattle,washington<newline>

which translates by Unicode to:

由Unicode译成:

 name,city,state<newline>
 Ԁסกၔ,seattle,washington<newline>

As UTF-8 uses ASCII as base (bytes with highest bit not set are the same as in ASCII) the only special sequence in the test data is "Ԁסกၔ" which in turn, is:

utf - 8使用ASCII作为基地(与不设置了最高位字节是相同的如ASCII)唯一的特殊序列的测试数据是“Ԁסกၔ”反过来,是:

Code-point  Glyph      UTF-8
----------------------------
    U+0500    Ԁ        d4 80
    U+05E1    ס        d7 a1
    U+0E01    ก     e0 b8 81
    U+1054    ၔ     e1 81 94

Looking at the hex-dump of the downloaded file:

查看下载文件的十六进制转储:

0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a  name,city,state.
0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474  ..........,seatt
0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a       le,washington.

On second line we find d480 d7a1 e0b8 81e1 8194 which match up with the above:

在第二行,我们发现d480 d7a1 e0b8 81e1 8194与上面的匹配:

0000010: d480  d7a1  e0b8 81  e1 8194 2c73 6561 7474  ..........,seatt
         |   | |   | |     |  |     |  | |  | |  | |
         +-+-+ +-+-+ +--+--+  +--+--+  | |  | |  | |
           |     |      |        |     | |  | |  | |
           Ԁ     ס      ก        ၔ     , s  e a  t t

None of the other characters is mangled either.

其他的角色也没有一个被打乱。

Do similar tests if you want. The result should be the similar.

如果你想做类似的测试。结果应该是相似的。


By sample provided —, â€, “

We can also have a look at the sample provided in the question. It is likely to assume that the text is represented in Excel / TextEdit by code-page 1252.

我们也可以看看问题中提供的示例。它可能假设文本在Excel / TextEdit中由代码页1252表示。

To quote Wikipedia on Windows-1252:

Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as "ansinew".

Retrieving the original bytes

To translate it back into its original form we can look at the code page layout, from which we get:

Character:   <â>  <€>  <”>  <,>  < >  <â>  <€>  < >  <,>  < >  <â>  <€>  <œ>
U.Hex    :    e2 20ac 201d   2c   20   e2 20ac   9d   2c   20   e2 20ac  153
T.Hex    :    e2   80   94   2c   20   e2   80   9d*  2c   20   e2   80   9c
  • U is short for Unicode
  • T is short for Translated

For example:

â => Unicode 0xe2   => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80

Special cases like 9d do not have a corresponding code-point in CP-1252; these we simply copy directly.

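As a sketch, this reverse mapping can be automated in Node. The lookup table below covers only the glyphs in this sample, and the helper name is made up:

```javascript
// Map each CP-1252 glyph back to its original byte, then re-decode the
// byte sequence as UTF-8. Code points below 0x100 (like 0x9d) are copied
// directly, matching the "special cases" note above.
const cp1252ToByte = { '\u20ac': 0x80, '\u201d': 0x94, '\u0153': 0x9c };

function unmangle(s) {
  const bytes = Array.from(s, ch => {
    const cp = ch.codePointAt(0);
    return cp < 0x100 ? cp : cp1252ToByte[ch];
  });
  return Buffer.from(bytes).toString('utf8'); // Node; in browsers: TextDecoder
}

console.log(unmangle('\u00e2\u20ac\u201d, \u00e2\u20ac\u009d, \u00e2\u20ac\u0153'));
// → —, ”, “
```
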
Note: If you look at a mangled string by copying the text to a file and doing a hex-dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:

set fenc=utf-16
" or
set fenc=ucs-2

Bytes to UTF-8

We then combine the result, the T.Hex line, into UTF-8. In UTF-8 sequences the bytes are represented by a leading byte telling us how many subsequent bytes make the glyph. For example if a byte has the binary value 110x xxxx we know that this byte and the next represent one code-point. A total of two. 1110 xxxx tells us it is three and so on. ASCII values does not have the high bit set, as such any byte matching 0xxx xxxx is a standalone. A total of one byte.

0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash)  —
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq)  “

Conclusion; The original UTF-8 string was:

—, ”, “

Mangling it back

We can also do the reverse. The original string as bytes:

UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c

Corresponding values in cp-1252:

e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...

and so on, result:

—, â€, “
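
This round trip can be reproduced in a few lines of Node (assuming a Node build with full ICU, the default, so that TextDecoder understands the 'windows-1252' label):

```javascript
// Encode the original string as UTF-8 bytes, then mis-decode those bytes
// as windows-1252 to reproduce the mangling
const bytes = Buffer.from('\u2014, \u201d, \u201c', 'utf8'); // e2 80 94 2c 20 ...
const mangled = new TextDecoder('windows-1252').decode(bytes);
console.log(mangled); // —, â€, “  (0x9d stays as the invisible U+009D)
```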

Importing to MS Excel

In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.

  • Method one:

Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.

As an example save the file as "testfile", with no extension. Then in Excel open the file, confirm that we actually want to open this file, and voilà we get served with the encoding option. Select UTF-8, and file should be correctly read.

  • Method two:

Use import data instead of open file. Something like:

Data -> Import External Data -> Import Data

Select encoding and proceed.

Check that Excel and selected font actually supports the glyph

We can also test whether the selected font supports the Unicode characters by using the sometimes friendlier clipboard. For example, copy text from this page into Excel:

If support for the code points exist, the text should render fine.


Linux

On Linux, where userland is primarily UTF-8, this should not be an issue: Libre Office Calc, Vim, etc. show the files correctly rendered.


Why it works (or should)

encodeURI, as the spec states (also read sec-15.1.3):

The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.

We can simply test this in our console by, for example saying:

>> encodeURI('Ԁסกၔ,seattle,washington')
<< "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"

As we can see, the escape sequences are equal to the ones in the hex-dump above:

%D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
 d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)

or, testing a 4-byte code:

>> encodeURI('\u{f1001}')
<< "%F3%B1%80%81"
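
A round-trip check can be done in the same console, since decodeURIComponent reverses the escapes:

```javascript
// The percent-escapes produced by encodeURI decode back to the original string
const original = '\u0500\u05E1\u0E01\u1054,seattle,washington';
const encoded = encodeURI(original);
console.log(encoded);
// %D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington
console.log(decodeURIComponent(encoded) === original); // true
```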

If this does not apply

If none of this applies, it could help if you add:

  1. Sample of expected input vs mangled output (copy-paste).
  2. Sample hex-dump of original data vs result file.

#2


5  

I ran into exactly this yesterday. I was developing a button that exports the contents of an HTML table as a CSV download. The functionality of the button itself is almost identical to yours – on click I read the text from the table and create a data URI with the CSV content.

When I tried to open the resulting file in Excel it was clear that the "£" symbol was getting read incorrectly. The 2 byte UTF-8 representation was being processed as ASCII resulting in an unwanted garbage character. Some Googling indicated this was a known issue with Excel.

I tried adding the byte order mark at the start of the string – Excel just interpreted it as ASCII data. I then tried various things to convert the UTF-8 string to ASCII (such as csvData.replace('\u00a3', '\xa3')) but I found that any time the data is coerced to a JavaScript string it will become UTF-8 again. The trick is to convert it to binary and then Base64 encode it without converting back to a string along the way.

I already had CryptoJS in my app (used for HMAC authentication against a REST API) and I was able to use that to create an ASCII encoded byte sequence from the original string then Base64 encode it and create a data URI. This worked and the resulting file when opened in Excel does not display any unwanted characters.

The essential bit of code that does the conversion is:

var csvHeader = 'data:text/csv;charset=iso-8859-1;base64,'
var encodedCsv =  CryptoJS.enc.Latin1.parse(csvData).toString(CryptoJS.enc.Base64)
var dataURI = csvHeader + encodedCsv

Where csvData is your CSV string.

There are probably ways to do the same thing without CryptoJS if you don't want to bring in that library, but this at least shows it is possible.

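For example, here is a library-free sketch of the same idea (sample data assumed). The global btoa() treats each character code as one byte, so this only works while every character is at most U+00FF, i.e. representable in ISO-8859-1; it throws otherwise:

```javascript
// Base64-encode Latin-1 data without CryptoJS; "£" is U+00A3, which fits
var csvData = 'price\n\u00a33.50';
var csvHeader = 'data:text/csv;charset=iso-8859-1;base64,';
var dataURI = csvHeader + btoa(csvData);
console.log(dataURI);
```
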
#3


3  

Excel likes Unicode in UTF-16 LE with BOM encoding. Output the correct BOM (FF FE), then convert all your data from UTF-8 to UTF-16 LE.

Windows uses UTF-16 LE internally, so some applications work better with UTF-16 than with UTF-8.

I haven't tried to do that in JS, but there are various scripts on the web to convert UTF-8 to UTF-16. Conversion between UTF variations is pretty easy and takes just a dozen lines.

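One possible sketch in Node (the helper name is made up): since JS strings are already sequences of UTF-16 code units, no real conversion is needed on this side, just the BOM plus a little-endian write.

```javascript
// Build a UTF-16 LE byte stream with BOM from a JS string
function toUtf16LeWithBom(s) {
  const buf = Buffer.alloc(2 + s.length * 2);
  buf[0] = 0xFF; // BOM, little-endian: FF FE
  buf[1] = 0xFE;
  buf.write(s, 2, 'utf16le');
  return buf;
}

console.log(toUtf16LeWithBom('A').toString('hex')); // fffe4100
```
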
#4


2  

I was having a similar issue with data that was pulled into Javascript from a Sharepoint list. It turned out to be something called a "Zero Width Space" character and it was being displayed as †when it was brought into Excel. Apparently, Sharepoint inserts these sometimes when a user hits 'backspace'.

I replaced them with this quickfix:

myString = myString.replace(/\u200B/g, '');

It looks like you may have other hidden characters in there. I found the codepoint for the zero-width character in mine by looking at the output string in the Chrome inspector. The inspector couldn't render the character, so it replaced it with a red dot. When you hover your mouse over that red dot, it gives you the codepoint (e.g. \u200B), and you can just sub in the various codepoints for the invisible characters and remove them that way.

#5


0  

It could be a problem in your server encoding.

You could try (assuming locale english US) if you are running Linux:

sudo locale-gen en_US en_US.UTF-8
sudo dpkg-reconfigure locales

#6


0  

button.href = 'data:' + mimeType + ';charset=UTF-8,%ef%bb%bf' + encodedUri;

this should do the trick: %ef%bb%bf is the percent-encoded UTF-8 byte order mark (BOM), which hints to Excel that the data is UTF-8.

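
As a minimal end-to-end sketch of this BOM approach (the row data below is assumed for illustration):

```javascript
// Build a CSV string, percent-encode it, and prepend the UTF-8 BOM escape
var rows = [['name', 'city', 'state'],
            ['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']];
var csv = rows.map(function (r) { return r.join(','); }).join('\n');
var href = 'data:text/csv;charset=UTF-8,%ef%bb%bf' + encodeURIComponent(csv);
console.log(href);
```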