What is the best algorithm for finding the frequency of word sequences in a stream?

I am dealing with an incoming stream of text, for example: USA, UK, China, Russia, USA, UK, China, France, Germany.

I need to break it up into sequences of 3 words (or, more generally, n words) and find which sequence has the highest frequency. In the example above, the sequence USA, UK, China occurs twice, so it has the highest frequency.

In addition, I need to index the frequencies of all sequences. I have tried using a C++ STL map to solve part of the problem, but I do not find the solution elegant. The reason is that, to uniquely index every 3-word sequence over m unique words with an STL map, I compute the key as follows:

i * m * m + j * m + k

where i, j, and k are the integers mapped to each word.

The problem with this solution is that, in a continuous stream of text, we do not know the total number of unique words, m, in advance. Can anyone suggest a better algorithm?

3 solutions

#1

I think you would be better off using some sort of map or hash table keyed by triples, because then you store only the triples that actually occur, whereas an array reserves space for all possible triples. If you see n words, they might all be different, in which case you store about n triples, but an array for all triples of n different words would be of size n^3.
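
The map-of-triples idea can be sketched roughly as follows. This is only an illustration, assuming C++11 and n >= 1; the class name NGramCounter and its member functions are hypothetical and not part of the answer. It keeps a sliding window of the last n words and counts each window in a std::map, so only sequences that actually occur take up space and the number of distinct words never needs to be known.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts every run of n consecutive words in a stream.
// Assumes n >= 1.
class NGramCounter {
public:
    explicit NGramCounter(std::size_t n) : n_(n) {}

    // Feed one word from the stream. Once n words have been seen, each call
    // records one more n-word sequence.
    void add(const std::string& word) {
        window_.push_back(word);
        if (window_.size() > n_) window_.pop_front();
        if (window_.size() == n_) {
            std::vector<std::string> key(window_.begin(), window_.end());
            ++counts_[key];  // a missing key is created with count 0 first
        }
    }

    // Number of times the given sequence has been seen so far.
    unsigned count(const std::vector<std::string>& seq) const {
        auto it = counts_.find(seq);
        return it == counts_.end() ? 0u : it->second;
    }

    // Most frequent sequence so far (an empty vector if nothing is recorded).
    std::vector<std::string> mostFrequent() const {
        std::vector<std::string> best;
        unsigned bestCount = 0;
        for (const auto& entry : counts_) {
            if (entry.second > bestCount) {
                bestCount = entry.second;
                best = entry.first;
            }
        }
        return best;
    }

private:
    std::size_t n_;
    std::deque<std::string> window_;
    std::map<std::vector<std::string>, unsigned> counts_;
};
```

Finding the most frequent sequence here is a linear scan over the map; if that query is frequent, tracking the current maximum inside add would avoid the scan.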

As a curiosity, there are bijective maps from pairs of non-negative integers to non-negative integers. One such map is (a, b) -> (a+b)(a+b+1)/2 + b, which maps (0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2), ... to 0, 1, 2, 3, 4, 5, ... Think of it as writing the pairs out in a square grid and then numbering them along the diagonals. You could apply it twice to map a triple of numbers to a single number: (a, b, c) -> ((a, b), c), that is, first pair a with b and then pair the result with c. However, it isn't really very practical.
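
For illustration, the pairing formula above can be written as a small sketch. The function names cantorPair and cantorTriple are hypothetical, and 64-bit unsigned arithmetic is an assumption; the result grows roughly with the fourth power of the inputs when nested, so it overflows for fairly small word codes, which is one reason the approach is impractical.

```cpp
#include <cstdint>

// Pairing formula from the text: (a, b) -> (a + b)(a + b + 1) / 2 + b.
// No overflow checking is done; inputs are assumed to be small.
std::uint64_t cantorPair(std::uint64_t a, std::uint64_t b) {
    std::uint64_t s = a + b;
    return s * (s + 1) / 2 + b;  // s * (s + 1) is always even
}

// Apply the pairing twice: (a, b, c) -> ((a, b), c).
std::uint64_t cantorTriple(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    return cantorPair(cantorPair(a, b), c);
}
```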

#2

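Another option is to use a std::string as the key of the map. Each key could be the concatenation of the 3 words, joined with a separator that cannot occur inside a word so that different triples never produce the same key. This way, each triple is identified uniquely with no need to know m.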

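Note that std::string already provides operator<, so it works as a std::map key with the default comparison. You only need to implement an ordering for two strings and pass it as the third template parameter in the declaration of the map if you want a non-default order, as discussed in this thread: std::string as a key in std::map using a compare operator.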
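
A minimal sketch of the string-key idea, assuming C++11 and a separator character that never appears inside a word. The names kSeparator, makeKey, tripleCounts, and countTriple are hypothetical. It relies on the default ordering of std::string, so no extra comparator is passed to the map here.

```cpp
#include <map>
#include <string>

// Hypothetical separator (ASCII unit separator); assumed never to occur
// inside a word, so "ab"+"c" and "a"+"bc" cannot collide.
const char kSeparator = '\x1f';

// Build one map key from three words.
std::string makeKey(const std::string& w0, const std::string& w1,
                    const std::string& w2) {
    return w0 + kSeparator + w1 + kSeparator + w2;
}

// Frequency of each 3-word sequence, keyed by the joined string.
std::map<std::string, unsigned> tripleCounts;

// Record one occurrence of the sequence (w0, w1, w2).
void countTriple(const std::string& w0, const std::string& w1,
                 const std::string& w2) {
    ++tripleCounts[makeKey(w0, w1, w2)];  // missing keys start at 0
}
```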

Hope it helps!

#3

#include <map>
#include <vector>
using namespace std;

// Maps each coded 3-word sequence to the number of times it has been seen.
map<vector<unsigned int>, unsigned int> sequenceFrequency;

void MapSequenceFrequency(unsigned int key0, unsigned int key1, unsigned int key2)
{
    // Build the key from the three word codes (requires C++11).
    // Vectors compare lexicographically, so they can be used as map keys.
    vector<unsigned int> codedWord = {key0, key1, key2};

    // operator[] creates a missing entry with the value 0, so the first
    // occurrence of a sequence is counted as 1.
    ++sequenceFrequency[codedWord];
}
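
The function above takes integer codes for the three words. One possible way to produce them without knowing the number of unique words in advance is to hand out codes as new words arrive. The sketch below is an assumption about how that could be done, not part of the answer; wordCodes and CodeForWord are hypothetical names.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical table of the codes assigned so far, in order of first appearance.
std::unordered_map<std::string, unsigned int> wordCodes;

// Returns the code for a word, assigning the next unused code (0, 1, 2, ...)
// the first time the word is seen.
unsigned int CodeForWord(const std::string& word)
{
    // emplace inserts only if the word is absent; the candidate code is the
    // table size before the insertion.
    auto result = wordCodes.emplace(word, static_cast<unsigned int>(wordCodes.size()));
    return result.first->second;
}
```

Each incoming word would be passed through CodeForWord, and the codes of the last three words would then be used as the three keys.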
