What is a fast algorithm for finding duplicate elements in a collection and grouping them?

Time: 2022-05-06 16:47:49

Say you have a collection of elements. How can you pick out the ones that have duplicates and put them into groups, with the least amount of comparison? Preferably in C++, but the algorithm matters more than the language. For example, given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract {E2,E2}, {E3,E3}, {E4,E4,E4}. What data structure and algorithm would you choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap.

Updates

To make things clearer, as suggested: there's one constraint. The elements must be compared by themselves to be certain they are duplicates.

So hashes do not apply, because they effectively shift the comparison from heavy elements (e.g. chunks of data) to light ones (integers). That reduces some comparisons, but does not do away with them, and in the end, once we are inside one collision bucket, we are back to our original problem.

Pretend you have a bunch of potential duplicate files of GBs each, and they bear the same hash value under every hash algorithm human beings know. Now you have to spot the real duplicates.

No, it can't be a real-life problem (even MD5 is enough to generate a unique hash for real-life files). But just pretend, so that we can focus on finding the data structure + algorithm that involves the least amount of comparison.


What I am doing is to

  1. represent the input in an STL std::list data structure (because (1) its element deletion is cheaper than, say, a vector's, and (2) its insertion is cheaper, requiring no sort),

  2. pop one element out and compare it with the rest; if a duplicate is found, it is pulled out of the list. Once the end of the list is reached, one group of duplicates has been found, if any,

  3. repeat the above 2 steps until the list is empty.

It needs N-1 comparisons in the best case, but (N-1) + (N-2) + ... + 1 = N(N-1)/2 in the worst case.

What are the better alternatives?


My code, using the method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>, list< pair<std::string, paths_type::const_iterator> >
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrunken list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            typename list<T>::iterator it     = l.begin(); 
            typename list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool result = false; // placeholder; the real, time-consuming comparison sets this
        /* time-consuming operations here */
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 



Thanks for the answers below; however, they seem to have been misled by my example into thinking this is about integers. In fact the elements are type-agnostic (not necessarily integers, strings, or any other PODs), and the equality predicates are self-defined, that is, the comparison can be very heavy.

So maybe my question should be: which data structure + algorithm involves fewer comparisons?

Using a pre-sorted container like std::multiset or std::multimap is not better, according to my test, since:

  1. the sorting done while inserting already performs the comparisons,

  2. the subsequent adjacent-find performs comparisons again,

  3. these data structures prefer less-than operations to equality operations: they perform two less-than comparisons per equivalence test (a < b and b < a), as sketched below.

I do not see how it can save comparisons.
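
For reference, this is roughly what one equivalence test inside a comparison-based ordered container looks like when only operator< is available (a minimal sketch, not taken from any particular library implementation):

// Equivalence under a strict weak ordering: neither a < b nor b < a.
// This is why one equality check through std::multiset/std::multimap
// costs two operator< calls instead of a single operator==.
template<class T>
bool equivalent(const T& a, const T& b)
{
    return !(a < b) && !(b < a);
}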


One more thing that's ignored by some answers below: I need to differentiate the duplicate groups from one another, not just keep them in the container.


Conclusion

After all the discussion, there seem to be three ways:

  1. my original naive method, as explained above

  2. start with a linear container like std::vector, sort it, and then locate the equal ranges

  3. start with an associative container like std::map<Type, vector<Type> >, picking out the duplicates during the setup of the container, as suggested by Charles Bailey

I've coded a sample to test all the methods as posted below.

The number of duplicates and how they are distributed may influence the best choice.

  • Method 1 is best when the duplicates fall heavily at the front, and worst when they are at the end; sorting does not change the distribution, only its direction.
  • Method 3 has the most consistent average performance.
  • Method 2 is never the best choice

Thanks to all who participated in the discussion.

One output, with 20 sample items, from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map<Type, vector<Type>>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const typename Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            itr_range_begin = itr; // the equal range begins where adjacent_find stopped
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort(); // list::sort(), since std::sort needs random-access iterators
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        // drop the groups that hold only a single element: they are not duplicates
        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.size() == 1)
                itr = local_map.erase(itr); // erase returns the next valid iterator
            else
                ++itr;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef multiset<Type> SET; // multiset keeps duplicates, matching the description
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            itr_range_begin = itr; // the equal range begins where adjacent_find stopped
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(N);
    List(true).run(N);
    List(false).run(N);
    Map().run(N);
    Multiset().run(N);
}

11 Answers

#1


3  

You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively fewer comparisons in insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first representative element into the mapped deque storage, as it makes the subsequent iteration through the group logically simple; but if this duplication proves a problem, then it would be easy to perform the push_back only if (!ins_pair.second).

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}
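
For illustration, the variant mentioned above might look like this (a sketch; InsertExtraOnly is a hypothetical name): only the extra occurrences go into the deque, so the map key itself stands for the first occurrence.

// Variant: store only the *extra* occurrences in the deque.
void InsertExtraOnly(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    if (!ins_pair.second)                    // key already present, so t is a duplicate
        ins_pair.first->second.push_back(t);
}

With this variant, an entry whose deque is non-empty marks a duplicate group, and the group size is the deque size plus one.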

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group) the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there are some assumed definitions) over the same test data set as I used before, and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent element. This can be simply achieved in my Iterate function by adding an if (i->second.size() > 1) clause.

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

#2


13  

Yes, you can do much better.

  1. Sort them (O(n) for simple integers, O(n*log n) in general), then duplicates are guaranteed to be adjacent, making finding them a quick O(n) pass

  2. Use a hash table, also O(n). For each item, check if it's already in the hash table; if so, it's a duplicate; if not, put it in the hash table (see the sketch after this list).

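A minimal sketch of option 2 (assuming the element type is hashable, which the question's update rules out for its heavy-element case; find_duplicates is a hypothetical helper name):

#include <unordered_set>
#include <vector>

// Collects the elements that occur more than once, in input order.
// Average O(n): each insert/lookup is amortized O(1) with a decent hash.
template<class T>
std::vector<T> find_duplicates(const std::vector<T>& input)
{
    std::unordered_set<T> seen;
    std::vector<T> duplicates;
    for (typename std::vector<T>::const_iterator it = input.begin(); it != input.end(); ++it)
    {
        if (!seen.insert(*it).second)   // insert failed: *it was already present
            duplicates.push_back(*it);
    }
    return duplicates;
}
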
edit

The method you're using seems to do O(N^2) comparisons:

for i = 0; i < length; ++i           // runs length times
    for j = i+1; j < length; ++j     // runs length-i-1 times
        compare

So for length 5 you do 4+3+2+1=10 compares; for 6 you do 15, etc. (N^2)/2 - N/2 to be exact. N*log(N) is smaller, for any reasonably high value of N.

How big is N in your case?

http://img188.imageshack.us/img188/7315/foou.png

As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.

#3


7  

1. Sort the array, O(n log n) at worst: mergesort/heapsort/binary tree sort, etc.

2. Compare neighbors and pull the matches out, O(n).

#4


5  

Keep a hash-table-based structure from value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!-) use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you do the value->count mapping; one more pass over the hash table (<= O(N), clearly) identifies values with a count > 1 and emits them appropriately. Overall O(N), which is not the case for your suggestion.
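
A sketch of those two passes, using the standardized name std::unordered_map in place of the pre-standard hash_map (report_duplicates is a hypothetical helper; it assumes the element type is hashable and streamable):

#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

// Pass 1: build the value -> count mapping in O(N) average time.
// Pass 2: walk the table and emit every value whose count exceeds 1.
template<class T>
void report_duplicates(const std::vector<T>& input)
{
    std::unordered_map<T, std::size_t> counts;
    for (std::size_t i = 0; i < input.size(); ++i)
        ++counts[input[i]];

    typedef typename std::unordered_map<T, std::size_t>::const_iterator ITR;
    for (ITR it = counts.begin(); it != counts.end(); ++it)
        if (it->second > 1)
            std::cout << it->first << " occurs " << it->second << " times\n";
}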

#5


1  

Have you tried sorting? For example using an algorithm like quick sort? If the performance is good enough for you then that would be an easy approach.

#6


1  

If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A+1 elements, and create B-A+1 containers.

In the very specific case (list of plain integers), I propose that you merely count them, since you cannot distinguish different integers, anyway:

int counter[MAXIMUM_VALUE] = {0}; // assumes all values lie in [0, MAXIMUM_VALUE)
for(int i = 0; i < countOfNumbers; i++)
    counter[number[i]]++;

If they are distinguishable, create an array of lists, and add them to the respective list.

If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.

#7


0  

The simplest is probably to just sort the list, and then to iterate over it looking for dups.

If you know something about the data, more efficient algorithms are possible.

For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:

bool[] once = new bool[MAXIMUM_VALUE];
bool[] many = new bool[MAXIMUM_VALUE];
for (int i = 0; i < list.Length; i++)
    if (once[ value[i] ])
        many[ value[i] ] = true;
    else
        once[ value[i] ] = true;

Now, many[] records which values were seen more than once.

#8


0  

Most people mentioning hash/unordered-map solutions are assuming O(1) insertion and query time, but it could be O(N) in the worst case. Additionally, you incur the cost of hashing the objects.

Personally, I would insert the objects into a binary tree (O(log n) insertion for each), and keep a counter at each node. This would yield O(n log n) construction time and an O(n) traversal to identify all duplicates.
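
A sketch of that idea with std::map standing in for the balanced binary tree (count_with_tree is a hypothetical name; only operator< on the element type is required, matching the question's constraint):

#include <cstddef>
#include <map>
#include <vector>

// Each insertion costs O(log n) comparisons, so construction is O(n log n);
// the final traversal is O(n) and visits every value seen more than once.
template<class T>
void count_with_tree(const std::vector<T>& input)
{
    std::map<T, std::size_t> tree;                 // node value -> occurrence count
    for (std::size_t i = 0; i < input.size(); ++i)
        ++tree[input[i]];

    for (typename std::map<T, std::size_t>::const_iterator it = tree.begin();
         it != tree.end(); ++it)
    {
        if (it->second > 1)
        {
            // it->first is a duplicated value occurring it->second times
        }
    }
}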

#9


0  

If I've understood the question correctly, then this is the simplest solution I can think of:

std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

std::sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = std::equal_range(cur, input.end(), *cur); // find all elements equal to the current one, O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

Total running time: O(n log n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n)).

Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.

This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).

Of course, a number of smaller constant optimizations are possible. output.reserve(input.size()) on the output vector would be a good idea, to avoid reallocation. input.end() is taken much more often than necessary too, and could be easily cached.

Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would have been faster. In any case, the initial sorting dominates the cost though.
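
Under the same setup as the code above, the linear-scan replacement for equal_range would be just (a sketch):

// Advance while elements remain equal to *cur: O(group size) '==' tests
// instead of equal_range's O(lg n) '<' tests per group.
iter group_end = cur;
while (group_end != input.end() && *group_end == *cur)
    ++group_end;
output.push_back(std::make_pair(cur, group_end));
cur = group_end;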

#10


0  

Just to note that I had the same problem during normalization of a triple store that I am working with. I implemented method 3, summarized by Charles Bailey, in Common Lisp, using the hash-table feature from Allegro Common Lisp.

The function "agent-equal?" is used to test when two agents in the TS are the same. The function "merge-nodes" merges the nodes on each cluster. In the code below, the "..." was used to remove the not so important parts.

(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))

#11


0  

Since C++11, hash tables are provided by the standard library as std::unordered_map. So an O(N) solution is to put your values into an unordered_map<T, vector<T> >.
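
A sketch of that approach (group_by_value is a hypothetical name; it assumes std::hash<T> and operator== are available for T):

#include <cstddef>
#include <unordered_map>
#include <vector>

// Groups all occurrences by value in O(N) average time; entries whose
// vector holds more than one element are the duplicate groups.
template<class T>
std::unordered_map<T, std::vector<T> > group_by_value(const std::vector<T>& input)
{
    std::unordered_map<T, std::vector<T> > groups;
    for (std::size_t i = 0; i < input.size(); ++i)
        groups[input[i]].push_back(input[i]);
    return groups;
}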
