Counting the number of words in an NSString

Date: 2022-09-13 12:03:42

I'm trying to implement a word count function for my app that uses UITextView.

In English there is a space between words, so counting the number of words in an English sentence is easy. The problem arises with Chinese and Japanese, because there is usually no space anywhere in the sentence.

I checked three different iPad text editors that have a word count feature and compared them with MS Word.

For example, here's a series of Japanese characters meaning "the world's idea": 世界 (the world) の ('s) アイデア (idea)

世界のアイデア

1) Pages for iPad and MS Word count each character as one word, so the string counts as 7 words.

2) The iPad text editor P*** counts the entire string as one word; it just uses spaces to separate words.

3) The iPad text editor i*** counts it as three words; I believe it uses CFStringTokenizer with kCFStringTokenizerUnitWord, because I get the same result.

From my research on the Internet, Pages' and MS Word's word counting seems to be correct, because each Chinese character has a meaning.

I couldn't find any class that counts words the way Pages or MS Word does, and it would be very hard to implement one from scratch, because besides Japanese and Chinese the iPad supports many other languages.

Still, I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option.

Is there a way to count the words in an NSString the way Pages and MS Word do?

Thank you

7 Answers

#1


3  

I recommend sticking with CFStringTokenizer. Because it's a platform feature, it will be improved with platform upgrades, and many people at Apple work hard to reflect real cultural differences that are hard for regular developers to know about.

This is hard because it isn't essentially a programming problem; it's a human cultural and linguistic problem. You need a language specialist for each culture. For Japanese, you need a Japanese culture specialist. However, I don't think Japanese people seriously need a word count feature, because from what I've heard, the concept of a word itself is not so important in Japanese culture. You should define the concept of a word first.

And I can't understand why you want to force the concept of a word count onto a character count. Take the Kanji word you gave as an example: this is like counting "universe" as two words by splitting it into "uni" + "verse" by meaning. It isn't even logical. Splitting a word by its meaning is sometimes completely wrong and useless by the definition of a word, because the definition of a word itself differs between cultures. In my language, Korean, a word is just a formal unit, not a unit of meaning. The idea that each word matches one meaning holds only in Roman-character cultures.

Just offer another feature, such as character counting, for users in East Asia if you think it's needed. Counting the characters in a Unicode string is easy with the -[NSString length] method.
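
One caveat: -[NSString length] returns the number of UTF-16 code units, not user-perceived characters, so characters outside the Basic Multilingual Plane count as 2. If that matters, a minimal sketch (characterCount is a hypothetical helper name) would enumerate composed character sequences instead:

```objc
#import <Foundation/Foundation.h>

// Sketch: -length counts UTF-16 code units ("世界のアイデア" is 7, but a
// single emoji such as 👍 counts as 2). Enumerating composed character
// sequences counts user-perceived characters instead.
static NSUInteger characterCount(NSString *s) {
    __block NSUInteger count = 0;
    [s enumerateSubstringsInRange:NSMakeRange(0, [s length])
                          options:NSStringEnumerationByComposedCharacterSequences
                       usingBlock:^(NSString *substring, NSRange substringRange,
                                    NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}
```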

I'm a Korean speaker (so maybe outside your case :) and in many cases we count characters instead of words. In fact, I have never seen anyone count words in my whole life. I laughed at the word counting feature in MS Word because I guessed nobody would use it. (Now I know it's important in Roman-character cultures.) I have used the word counting feature only once, to check that it really works :) I believe it's similar in Chinese and Japanese. Maybe Japanese users use word counting because their basic alphabet is similar to Roman characters, which have no concept of composition. However, they also use Kanji heavily, which is a completely compositional, character-centric system.

Even if you make a word counting feature that works well for those languages (whose speakers don't feel any need to split sentences into smaller formal units!), it's hard to imagine anyone using it. And without a linguistic specialist, the feature won't be correct.

#2


2  

This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One approach I know, derived from attempting to solve anagrams, is this:

At the start of the string you begin with one character. Is it a word? It could be a word like "A", but it could also be part of a word like "AN" or "ANALOG". So the decision about what counts as a word has to be made considering the whole string. You would consider the next characters to see whether you can make another word starting with the first character after the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG", you will soon find that no more words can be found. When you start finding words in the dictionary (see below), you know you are making the right choices about where to break the words. When you stop finding words, you know you have made a wrong choice and need to backtrack.

A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06, SOWPODS, or another Scrabble word list, as these contain many obscure words. You need a lot of memory to do this, because if you check the words against a simple array containing all the possible words, your program will run incredibly slowly. If you parse your word list, persist it as a plist, and recreate an NSDictionary from it, your checking will be quick enough, but it will require more space on disk and in memory. One of these big Scrabble word lists can expand to about 10 MB, with the actual words as keys and a simple NSNumber as a placeholder value; you don't care what the value is, just that the key exists in the dictionary, which tells you the word is recognised as valid.

If you maintain an array as you count, you get to call [array count] in a triumphant manner as you add the last word containing the last characters, but you also have an easy way of backtracking. If at some point you stop finding valid words, you can pop the lastObject off the array, put it back at the start of the remaining string, and start looking for alternative words. If that fails to get you back on the right track, pop another word.

I would proceed by experimentation, looking ahead for three potential words as you parse the string: when you have identified three potential words, take the first away, store it in the array, and look for another word. If you find this too slow and you get OK results considering only two words ahead, drop it to two. If you find you run into too many dead ends with your word-division strategy, increase the number of words you look ahead.
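
The dictionary-plus-backtracking idea above can be sketched recursively. This is only an illustration with hypothetical names (segment, wordSet); the set stands in for a real word list such as TWL06:

```objc
#import <Foundation/Foundation.h>

// Sketch of the dictionary-driven backtracking described above.
// `wordSet` is a stand-in for a real word list such as TWL06; `words`
// collects the chosen segmentation, so [words count] is the word count.
static BOOL segment(NSString *s, NSUInteger start,
                    NSSet *wordSet, NSMutableArray *words) {
    if (start == [s length]) return YES;       // whole string consumed
    // Try longer candidates first, so "ANALOG" wins over "AN" + dead end.
    for (NSUInteger len = [s length] - start; len >= 1; len--) {
        NSString *candidate = [s substringWithRange:NSMakeRange(start, len)];
        if ([wordSet containsObject:candidate]) {
            [words addObject:candidate];
            if (segment(s, start + len, wordSet, words)) return YES;
            [words removeLastObject];          // wrong choice: backtrack
        }
    }
    return NO;                                 // no segmentation from here
}
```

For example, with a set containing "A", "AN", and "ANALOG", segmenting @"ANALOG" settles on one word, whereas a shortest-first strategy would have to backtrack out of "A" and "AN".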

Another way would be to employ natural-language rules. For example, "A" followed by "NALOG" might look OK because a consonant follows "A", but "A" followed by "ARDVARK" would be ruled out, because a word beginning with a vowel should follow "AN", not "A". This can get as complicated as you like to make it. I don't know whether this gets simpler in Japanese, but there are certainly common verb endings like "masu".

(Edit: I've started a bounty. I'd like to know the very best way to do this, if mine isn't it.)

#3


1  

If you are using iOS 4, you can do something like:

__block NSUInteger count = 0;
NSRange range = NSMakeRange(0, [string length]);  // enumerate the whole string
[string enumerateSubstringsInRange:range
                           options:NSStringEnumerationByWords
                        usingBlock:^(NSString *word,
                                     NSRange wordRange,
                                     NSRange enclosingRange,
                                     BOOL *stop)
    {
        count++;
    }
];

More information is in the NSString class reference.

There is also WWDC 2010 session 110, on advanced text handling, which explains this around minute 10 or so.

#4


0  

I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.

That's right; you have to iterate through the text and simply count the number of word tokens encountered along the way.
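
For reference, a tokenizer-based count along these lines might look like the following sketch (wordCount is a hypothetical helper; the locale argument lets the tokenizer apply language-specific segmentation, which is what should reproduce the three-word result described in the question):

```objc
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Sketch: count word tokens with CFStringTokenizer, honoring the
// current locale so CJK text is segmented sensibly.
static NSUInteger wordCount(NSString *s) {
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                (__bridge CFStringRef)s,
                                CFRangeMake(0, [s length]),
                                kCFStringTokenizerUnitWord,
                                locale);
    NSUInteger count = 0;
    // Advance until no tokens remain; each advance yields one word token.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer)
               != kCFStringTokenizerTokenNone) {
        count++;
    }
    CFRelease(tokenizer);
    CFRelease(locale);
    return count;
}
```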

#5


0  

I'm not a native Chinese/Japanese speaker, but here are my 2 cents.

Each Chinese character does have a meaning, but the concept of a word is a combination of letters/characters representing an idea, isn't it?

In that sense, there are probably 3 words in "sekai no aidia" (or 2 if you don't count particles like no/ga/de/wa, etc.). Same as English: "world's idea" is two words, while "idea of world" is 3 (and let's forget about the required "the", hehe).

That said, in my opinion counting words is not as useful in non-Roman languages, similar to what Eonil mentioned. It's probably better to count the number of characters for those languages. Check with native Chinese/Japanese speakers and see what they think.

If I were to do it, I would tokenize the string on spaces and particles (at least for Japanese and Korean) and count the tokens. I'm not sure about Chinese.

#6


0  

With Japanese you can create a grammar parser, and I think it is the same with Chinese. However, that is easier said than done, because natural language tends to have many exceptions; still, it is not impossible.

Please note it won't really be efficient, since you have to parse each sentence before you can count the words.

I would also recommend using a parser generator rather than building a parser yourself, so that at least you can concentrate on writing the grammar instead of creating the parser. It's not efficient, but it should get the job done.

Also have a fallback algorithm in case your grammar doesn't parse the input correctly (perhaps the input really didn't make sense to begin with); you can use the length of the string to make things easier on yourself.

If you build it, there could also be a market opportunity for you to use it as a natural-language domain-specific language for Japanese/Chinese business rules.

#7


-1  

Just use the length method:

[@"世界のアイデア" length];  // is 7

That being said, as a Japanese speaker, I think 3 is the right answer.
