Counting the number of words in a string (not only Latin-script languages)

Time: 2021-08-03 20:06:08

If I'm not mistaken, Chinese (like some other languages) does not use the space character ' ' as a word delimiter.


So what would be a good word-counting algorithm that works across languages?


1 solution

#1


3  

The technique I've seen used a lot is to simply count the number of characters and divide it by the average number of characters per word in Chinese. A figure that is often used for this is 1.5.


If your Chinese text has 1500 characters, it's approximately 1000 words long.
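As a rough illustration of this heuristic, here is a minimal Python sketch. The function name `estimate_word_count`, the choice to treat only the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as "Chinese characters", and the handling of non-Chinese text by counting runs of word characters are all my own assumptions, not part of the technique as described above.

```python
import re

# Assumption: a "Chinese character" is a code point in the basic CJK Unified
# Ideographs block (U+4E00..U+9FFF). Extension blocks are ignored here.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

# The average characters-per-word figure quoted in the answer.
AVG_CHARS_PER_WORD = 1.5


def estimate_word_count(text, chars_per_word=AVG_CHARS_PER_WORD):
    """Estimate the word count of text that may mix Chinese and other scripts."""
    # Count ideographs and convert them to an approximate number of words.
    cjk_chars = len(CJK_CHAR.findall(text))
    cjk_words = round(cjk_chars / chars_per_word)

    # Blank out the ideographs, then count the remaining runs of word
    # characters (for example space-delimited English words).
    rest = CJK_CHAR.sub(" ", text)
    other_words = len(re.findall(r"\w+", rest))

    return cjk_words + other_words


print(estimate_word_count("中" * 1500))   # 1500 characters / 1.5
print(estimate_word_count("hello world"))
```

The result is only an estimate: the 1.5 ratio is an average, so short or unusual texts can be off by a noticeable margin.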


I am not aware of a more accurate way of counting words, short of interpreting the text itself. That would mean actually understanding the context in which the characters are used, since a Chinese character can sometimes be a word by itself, but can also be a component of a compound word.
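To make the ambiguity concrete, the sketch below uses greedy forward maximum matching against a word list. This is a different and much simpler technique than the contextual interpretation described above: it does not understand context and will mis-segment ambiguous text. The tiny `LEXICON` and the function name `segment` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy lexicon; a usable one would hold many thousands of entries.
LEXICON = {"中国", "中国人", "人民", "喜欢", "北京"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Split text into words by greedy forward maximum matching."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always
        # accepted as a fallback, so the loop always advances.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# The character 人 stands alone as a word in the first call, but is absorbed
# into the compound 中国人 in the second.
print(segment("人"))
print(segment("我是中国人"))
print(len(segment("我是中国人")))
```

Note that this sketch treats every unmatched character, including punctuation, as a one-character word, so punctuation would need to be filtered out before counting.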

