I have a form which lets users input text snippets. How can I figure out the language of the entered text?
Specifically these languages for now:
Arabic: هذه هي بعض النصوص العربية
Chinese: 这是一些阿拉伯文字
Japanese: これは、いくつかのアラビア語のテキストです
[Edit] The detection also has to work on text that is retrieved via an API (no browser involved).
5 solutions
#1
8
You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.
If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.
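For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" section, which holds Chinese characters. Japanese kanji live in that same section, so treat text as Japanese when it also contains hiragana or katakana. Here are a few ranges to get you started: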
Arabic (0600–06FF)
Japanese
- Hiragana (3040–309F)
- Katakana (30A0–30FF)
- Kanbun (3190–319F)
Chinese
- CJK Unified Ideographs (4E00–9FFF)
(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
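The range check described above could be sketched in Python roughly as follows. This is only a minimal illustration, not a complete detector: the names `SCRIPT_RANGES`, `script_of`, and `guess_language` are made up for this example, only the ranges listed above are covered, and the rule that ideographs count as Japanese whenever kana are present is an assumption added here.

```python
from collections import Counter

# Inclusive code point ranges, copied from the list above.
SCRIPT_RANGES = {
    "arabic": [(0x0600, 0x06FF)],
    "japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3190, 0x319F)],
    "chinese": [(0x4E00, 0x9FFF)],
}


def script_of(char):
    """Return the script name whose range contains char, or None."""
    code_point = ord(char)
    for script, ranges in SCRIPT_RANGES.items():
        for low, high in ranges:
            if low <= code_point <= high:
                return script
    return None


def guess_language(text):
    """Guess the language by counting which ranges the characters fall into."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return None  # nothing matched (e.g. Latin text or punctuation only)
    # Assumption: Japanese text mixes kana with ideographs from the CJK
    # section, so if any kana were seen, count the ideographs as Japanese.
    if counts["japanese"]:
        counts["japanese"] += counts.pop("chinese", 0)
    return counts.most_common(1)[0][0]
```

Because this works on the text itself, it does not need a browser, so it should also fit text retrieved via an API. Short snippets made up only of kanji would still be reported as Chinese.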
#2
2
You could use the Google Ajax API for detecting the language of a snippet of text.
#3
2
Presumably you want to guess the user's language so that you can display responses in the proper language. What about examining the browser's settings for preferred languages? You can obtain them from the HTTP Accept-Language header. See section 14.4 of RFC 2616 (HTTP/1.1).
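If you go the header route, reading Accept-Language might look something like the sketch below. The function name `parse_accept_language` is hypothetical, and the handling of malformed weights is an assumption made for this example. It only helps when a browser actually sends the request, so it will not cover text that arrives via an API.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language value, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    # Assumption: treat a malformed weight as "not acceptable".
                    quality = 0.0
        if quality > 0:  # q=0 means the language is not acceptable
            # Sort by weight (descending), then by position in the header.
            prefs.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(prefs)]
```

For instance, `parse_accept_language("ja, zh-CN;q=0.8, ar;q=0.9")` puts `ja` first, then `ar`, then `zh-CN`.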
#4
0
I'm exploring the same thing, for server-side. Thus far I have found https://code.google.com/p/language-detection/. Hope this helps someone.
#5
0
You could use https://detectlanguage.com/, which is a web service built around CLD2.