How to find whether a given string contains Unicode characters (especially double-byte characters)

Time: 2022-06-11 02:28:31

To be more precise, I need to know whether (and, if possible, how) I can find out whether a given string contains double-byte characters. Basically, I need to open a pop-up that displays a given text, which may contain double-byte characters such as Chinese or Japanese. In that case, the window needs a different size than it would for English or ASCII text. Does anyone have a clue?

6 solutions

#1


26  

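JavaScript strings are sequences of 16-bit code units (historically described as UCS-2, in practice UTF-16). Characters in the Basic Multilingual Plane take one code unit each; anything beyond it is stored as a surrogate pair of two code units.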

But that's not really germane to your question. One solution might be to loop through the string and examine the character codes at each position:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

This might not be as fast as you would like.
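
To see what the 255 threshold actually catches, here is a small check that runs the function above on a few sample strings. The samples are my own, not from the question. Note that it flags anything outside Latin-1, including narrow characters such as the euro sign, so it is only a rough signal that a wider window is needed.

```javascript
function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt(i) > 255) { return true; }
    }
    return false;
}

// Sample inputs chosen for illustration only.
var samples = [
    "hello",        // plain ASCII
    "caf\u00e9",    // e-acute is U+00E9 (233), still within Latin-1
    "\u20ac5",      // euro sign U+20AC: narrow glyph, but above 255
    "\u4e2d\u56fd", // two Chinese characters
    "\ud840\udc00"  // U+20000 stored as a surrogate pair (two code units)
];

var results = samples.map(function (s) { return isDoubleByte(s); });
console.log(results); // [ false, false, true, true, true ]
```

The last sample shows that characters outside the BMP are still detected, because each half of the surrogate pair is itself a code unit above 255.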

#2


26  

I used mikesamuel's answer for this one. However, I noticed (perhaps because of how the form displayed it) that there should be only one backslash before the u, i.e. \u and not \\u, for this to work correctly.

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

Works for me :)
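
One plausible explanation for the \u versus \\u confusion (an assumption on my part, the original answer may have had a different cause) is the difference between a regex literal and a regex built from a string. The sketch below shows the three forms side by side; the variable names are only for illustration.

```javascript
// Regex literal: a single backslash introduces the \uXXXX escape.
var literal = /[^\u0000-\u00ff]/;

// Built from a string: the backslash itself must be escaped in the string.
var fromString = new RegExp("[^\\u0000-\\u00ff]");

// Doubled backslash inside a literal: "\\" now means a literal backslash,
// so the class becomes an unrelated set of characters.
var doubled = /[^\\u0000-\\u00ff]/;

console.log(literal.test("hello"), literal.test("\u4e2d"));       // false true
console.log(fromString.test("hello"), fromString.test("\u4e2d")); // false true
console.log(doubled.test("hello"));                               // true (a false positive)
```

So \\u is correct only when the pattern is written as a string; in a literal it silently produces a different character class.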

#3


7  

I have benchmarked the two functions in the top answers and thought I would share the results. Here is the test code I used:

const text1 = `The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. 中文維基百科的副標題是「海納百川,有容乃大」,這是中国的清朝政治家林则徐(1785年-1850年)於1839年為`;

const regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsNonLatinCodepoints(s) {
    return regex.test(s);
}

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

function benchmark(fn, str) {
    let startTime = new Date();
    for (let i = 0; i < 10000000; i++) {
        fn(str);
    }   
    let endTime = new Date();

    return endTime.getTime() - startTime.getTime();
}

console.info('isDoubleByte => ' + benchmark(isDoubleByte, text1));
console.info('containsNonLatinCodepoints => ' + benchmark(containsNonLatinCodepoints, text1));

When running this I got:

isDoubleByte => 2421
containsNonLatinCodepoints => 868

So for this particular string the regex solution is about 3 times faster.

However, note that for a string whose first character is above U+00FF, isDoubleByte() returns right away and so is much faster than the regex (which still pays the overhead of invoking the regular expression engine).

For instance, for the string 中国 I got these results:

isDoubleByte => 51
containsNonLatinCodepoints => 288

To get the best of both worlds, it's probably better to combine the two:

var regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsDoubleByte(str) {
    if (!str.length) return false;
    if (str.charCodeAt(0) > 255) return true;
    return regex.test(str);
}

In that case, if the first character is Chinese (which is likely if the whole text is Chinese), the function will be fast and return right away. If not, it will run the regex, which is still faster than checking each character individually.
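
Since the combined version is meant to be a drop-in replacement, it is worth confirming that all three functions return the same answer. The sketch below does that on a handful of sample strings of my own choosing; it checks agreement only and says nothing about speed.

```javascript
var regex = /[^\u0000-\u00ff]/;

function containsNonLatinCodepoints(s) {
    return regex.test(s);
}

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt(i) > 255) { return true; }
    }
    return false;
}

function containsDoubleByte(str) {
    if (!str.length) return false;
    if (str.charCodeAt(0) > 255) return true;
    return regex.test(str);
}

// Sample inputs chosen for illustration only.
var samples = ["", "plain ASCII", "\u4e2d\u56fd", "Wikipedia \u4e2d\u6587", "caf\u00e9"];

// Use the explicit loop as the reference and compare the other two against it.
var allAgree = samples.every(function (s) {
    var expected = isDoubleByte(s);
    return containsNonLatinCodepoints(s) === expected &&
           containsDoubleByte(s) === expected;
});

console.log(allAgree); // true
```

The shared regex has no g flag, so repeated calls to test() do not carry state between strings.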

#4


6  

Actually, all of the characters are Unicode, at least from the JavaScript engine's perspective.

Unfortunately, the mere presence of characters in a particular Unicode range won't be enough to determine that you need more space. Many characters with Unicode code points well above the ASCII range take up roughly the same amount of space as ASCII characters. Typographic quotes, characters with diacritics, certain punctuation symbols, and various currency symbols are outside the low ASCII range and are allocated in quite disparate places on the Unicode Basic Multilingual Plane.

Generally, projects that I've worked on elect to provide extra space for all languages, or sometimes use JavaScript to determine whether a window with auto-scrollbar CSS attributes actually has content tall enough to trigger a scrollbar.

If detecting the presence of, or a count of, CJK characters is adequate to decide that you need a bit of extra space, you could construct a regex using the following ranges: [\u3300-\u9fff\uf900-\ufaff], and use that to count the characters that match. (This is rather coarse: it misses all the non-BMP cases, probably excludes some other relevant ranges, and most likely includes some irrelevant characters, but it's a starting point.)
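
A minimal sketch of that counting idea follows. The ranges are the ones given above; countCjk and estimateColumns are hypothetical helper names, and the rule of treating each CJK character as two columns is my own assumption, not something a browser guarantees.

```javascript
// Ranges taken from the answer above; coarse and BMP-only.
var cjk = /[\u3300-\u9fff\uf900-\ufaff]/g;

function countCjk(str) {
    var matches = str.match(cjk); // null when nothing matches
    return matches ? matches.length : 0;
}

// Hypothetical heuristic: one column per code unit, plus one extra
// column for every character that matched the CJK ranges.
function estimateColumns(str) {
    return str.length + countCjk(str);
}

console.log(countCjk("Wikipedia \u4e2d\u6587"));        // 2
console.log(estimateColumns("Wikipedia \u4e2d\u6587")); // 14
console.log(countCjk("\u20ac5"));                       // 0 (euro sign is not counted)
console.log(countCjk("\u3042"));                        // 0 (hiragana falls below U+3300)
```

The last line illustrates the caveat about excluded ranges: Japanese kana are not matched, so the character class would need extending before relying on it for Japanese text.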

Again, you're only going to be able to manage a rough heuristic without something along the lines of a full text rendering engine, because what you really want is something like GDI's MeasureString (or any other text rendering engine's equivalent). It's been a while since I've done so, but I think the closest HTML/DOM equivalent is setting a width on a div and requesting the height (cut and paste reuse, so apologies if this contains errors):

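// "test" is the id of a div whose width has already been set
var o = document.getElementById("test");

// Computed height as a string, e.g. a pixel value
var height = document.defaultView.getComputedStyle(o, "").getPropertyValue("height");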

#5


0  

Why not let the window resize itself based on the runtime height/width?

Run something like this in your pop-up:

window.resizeTo(document.body.clientWidth, document.body.clientHeight);

#6


0  

Here is a benchmark test: http://jsben.ch/NKjKd

This is much faster:

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

than this:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}
