Fastest way to check whether a list contains a unique string

Time: 2021-07-23 05:05:52

Basically, I have about 1,000,000 strings, and for each request I have to check whether a String belongs to the list or not.

I'm worried about performance, so what's the best method? ArrayList? Hash?

10 Answers

#1


88  

Your best bet is to use a HashSet and check if a string exists in the set via the contains() method. HashSets are built for fast access via the use of Object methods hashCode() and equals(). The Javadoc for HashSet states:

This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.

HashSet stores objects in hash buckets, which is to say that the value returned by the hashCode method determines which bucket an object is stored in. This way, the number of equality checks the HashSet has to perform via the equals() method is reduced to just the other objects in the same hash bucket.

To use HashSets and HashMaps effectively, you must conform to the equals and hashCode contract outlined in the javadoc. In the case of java.lang.String these methods have already been implemented to do this.

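As a minimal sketch of the approach described above: load the strings into a HashSet once, then answer each request with contains(). The class name StringMembership and the method isKnown are made up for illustration; the pre-sizing is optional and only avoids rehashing while the set is filled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: wraps a HashSet built once from the known strings.
class StringMembership {
    private final Set<String> known;

    StringMembership(List<String> strings) {
        // Initial capacity chosen so the set stays below the default 0.75 load factor.
        this.known = new HashSet<>((int) (strings.size() / 0.75f) + 1);
        this.known.addAll(strings);
    }

    // Expected O(1): hashCode() selects the bucket, equals() confirms the match.
    boolean isKnown(String candidate) {
        return known.contains(candidate);
    }

    public static void main(String[] args) {
        StringMembership m = new StringMembership(Arrays.asList("alpha", "beta", "gamma"));
        System.out.println(m.isKnown("beta"));
        System.out.println(m.isKnown("delta"));
    }
}
```

Building the set costs time and memory up front, so this pays off when the same collection is queried many times, as in the question.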

#2


11  

In general, a HashSet will give you better performance, since it does not have to look through each element and compare, like an ArrayList does, but typically compares at most a few elements, where the hashcodes are equal.

However, for 1M strings, the performance of a HashSet may still not be optimal. A lot of cache misses will slow down searching the set. If all strings are equally likely, then this is unavoidable. However, if some strings are requested more often than others, then you can place the common strings into a small HashSet and check that first, before checking the larger set. The small HashSet should be sized to fit in cache (e.g. a few hundred K at most). Hits to the small HashSet will then be very fast, while hits to the larger HashSet proceed at a speed limited by main-memory access.

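The two-level idea could be sketched roughly as follows. TwoTierLookup and its field names are hypothetical, and the sketch assumes the caller already knows which strings are "hot"; whether it actually beats a single HashSet depends on the request distribution and should be measured.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical two-level lookup: a small "hot" set is checked before the full set.
class TwoTierLookup {
    private final Set<String> hot; // small set of frequently requested strings
    private final Set<String> all; // full set of all known strings

    // hotStrings is assumed to be a subset of allStrings chosen by the caller.
    TwoTierLookup(Collection<String> hotStrings, Collection<String> allStrings) {
        this.hot = new HashSet<>(hotStrings);
        this.all = new HashSet<>(allStrings);
    }

    boolean contains(String s) {
        // The second lookup only runs when the small set misses.
        return hot.contains(s) || all.contains(s);
    }
}
```

Note that a miss now costs two lookups instead of one, so this only helps when hits on the small set are common.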

#3


7  

Before going further, please consider this: Why are you worried about performance? How often is this check called?

As for possible solutions:

  • If the list is already sorted, then you can use java.util.Collections.binarySearch, which offers the same performance characteristics as a java.util.TreeSet (O(log n) lookups).

  • Otherwise you can use a java.util.HashSet, which has a lookup performance characteristic of O(1). Note that calculating the hash code for a string that doesn't have one calculated yet is an O(m) operation with m=string.length(). Also keep in mind that hashtables only work well while they stay below a given load factor, i.e. hashtables will use more memory than plain lists. The default load factor used by HashSet is .75, meaning that internally a HashSet for 1e6 objects needs an array with at least 1.3e6 entries (the backing HashMap rounds the table size up to a power of two, so in practice about 2.1e6).

  • If the HashSet does not work for you (e.g. because there are lots of hash collisions, because memory is tight or because there are lots of insertions), then consider using a Trie. Lookup in a Trie has a worst-case complexity of O(m) where m=string.length(). A Trie also has some extra benefits that might be useful for you: e.g., it can give you the closest fit for a search string. But keep in mind that the best code is no code, so only roll your own Trie implementation if the benefits outweigh the costs.

  • Consider using a database if you want more complex queries, e.g. match for a substring or a regular expression.

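For the sorted-list option in the first bullet, a minimal sketch might look like this. SortedListLookup is a hypothetical name; the list is copied and sorted once in the constructor, so it also works if the input is not sorted yet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: membership test via binary search on a sorted copy.
class SortedListLookup {
    private final List<String> sorted;

    SortedListLookup(List<String> strings) {
        this.sorted = new ArrayList<>(strings);
        Collections.sort(this.sorted); // one-time O(n log n) cost
    }

    boolean contains(String s) {
        // binarySearch returns a non-negative index only when the key is present.
        return Collections.binarySearch(sorted, s) >= 0;
    }
}
```

Compared with a HashSet this uses less memory (a single array of references), at the price of O(log n) string comparisons per lookup.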

#4


5  

I'd use a Set; in most cases HashSet is fine.

#5


2  

With such a huge number of Strings, I immediately think of a Trie. It works better with a more limited set of characters (such as letters) and/or when many of the strings share a common prefix.

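To make the Trie suggestion concrete, here is a deliberately simple sketch. SimpleTrie is a hypothetical class; it uses a HashMap per node for brevity, which is not memory-efficient, so a production Trie would likely use a more compact child representation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie supporting only add and exact-match contains.
class SimpleTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    void add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Walks one node per character, so the cost is O(m) with m = s.length().
    boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }
}
```

Strings with a shared prefix reuse the same nodes, which is where the benefit mentioned above comes from.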

#6


2  

Having run the exercise, here are my results.

private static final int TEST_CYCLES = 4000;
private static final long RAND_ELEMENT_COUNT = 1000000L;
private static final int RAND_STR_LEN = 20;
// Mean time
/*
Array list: 18.55425
Array list not contains: 17.113
Hash set: 5.0E-4
Hash set not contains: 7.5E-4
*/

I believe the numbers speak for themselves. The lookup time of the hash set is way, way faster (by a factor of more than 20,000 in this run).

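The benchmark code itself was not posted, so the following is only a rough sketch of how such a comparison could be set up, not the code that produced the numbers above. ContainsTiming, randomStrings and meanMillis are hypothetical names, and the sketch reports milliseconds per lookup (the unit of the original figures is not stated). A naive timing loop like this ignores JIT warm-up; a harness such as JMH would give more trustworthy results.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical timing sketch: ArrayList.contains versus HashSet.contains.
class ContainsTiming {
    // Builds `count` random lowercase strings of length `len` (fixed seed for repeatability).
    static List<String> randomStrings(int count, int len, long seed) {
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Mean time per contains() call in milliseconds over `cycles` lookups.
    static double meanMillis(Collection<String> c, List<String> probes, int cycles) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < cycles; i++) {
            if (c.contains(probes.get(i % probes.size()))) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        if (hits < 0) {
            throw new IllegalStateException(); // never true; keeps `hits` in use
        }
        return elapsed / 1e6 / cycles;
    }

    public static void main(String[] args) {
        List<String> list = randomStrings(1_000_000, 20, 42L);
        Set<String> set = new HashSet<>(list);

        // Probe with strings picked at random positions so list scans vary in length.
        Random rnd = new Random(7L);
        List<String> probes = new ArrayList<>();
        for (int i = 0; i < 4000; i++) {
            probes.add(list.get(rnd.nextInt(list.size())));
        }

        System.out.println("Array list: " + meanMillis(list, probes, 4000));
        System.out.println("Hash set:   " + meanMillis(set, probes, 4000));
    }
}
```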

#7


1  

If you have such a large number of strings, your best option may be to use a database. Look into MySQL.

#8


0  

This applies not only to String: you can use a Set in any case where you need unique items.

If the items are primitive wrappers (or other standard types such as String), you don't need to care. But if the item type is your own class, you must override two methods:

  1. hashCode()
  2. equals()
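
As an illustration of overriding both methods, here is a small sketch. The class Tag and its fields are invented for this example; the important point is that equals() and hashCode() are derived from the same fields, so two equal objects always land in the same bucket.

```java
import java.util.Objects;

// Hypothetical value class that is safe to store in a HashSet.
final class Tag {
    private final String namespace;
    private final String name;

    Tag(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Tag)) {
            return false;
        }
        Tag other = (Tag) o;
        return Objects.equals(namespace, other.namespace)
                && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        // Must use exactly the fields compared in equals().
        return Objects.hash(namespace, name);
    }
}
```

With this in place, a `HashSet<Tag>` treats `new Tag("a", "b")` and a second `new Tag("a", "b")` as the same element; without the overrides, the default identity-based methods would treat them as different.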

#9


0  

Sometimes you want to check if an object is in the list/set and at the same time you want the list/set to be ordered. If you are looking to also retrieve objects easily without using an enumeration or iterator, you may consider using both an ArrayList<String> and HashMap<String, Integer>. The list is backed by the map.

Example from some work I recently did:

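import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeKey<K> implements Serializable, Cloneable {
    private static final long serialVersionUID = -634779076519943311L;

    private NodeKey<K> parent;
    // Children in insertion order, so they can be retrieved by index.
    private List<K> children = new ArrayList<K>();
    // Maps each child to its index in `children` for fast membership checks.
    private Map<K, Integer> childrenToListMap = new HashMap<K, Integer>();

    public NodeKey() {}

    public NodeKey(Collection<? extends K> c) {
        List<K> childHierarchy = new ArrayList<K>(c);
        K childLevel0 = childHierarchy.remove(0);

        // Only add the child if it is not already present.
        if (!childrenToListMap.containsKey(childLevel0)) {
            children.add(childLevel0);
            childrenToListMap.put(childLevel0, children.size() - 1);
        }

        // ... (remainder omitted in the original post)
    }
}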

In this case, parameter K would be a String for you. The map (childrenToListMap) stores the Strings inserted into the list (children) as keys, and the map values are their index positions in the list.

The reason for the list and the map is so that you can retrieve indexed values of the list, without having to do an iteration over a HashSet<String>.

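Stripped down to Strings, the list-plus-map combination could be sketched like this. IndexedStringSet is a hypothetical class written for this example, not part of the NodeKey code above; it supports only appending, because removing an element would shift the stored indices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical append-only set that also allows retrieval by index.
class IndexedStringSet {
    private final List<String> items = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    // Adds s if absent; returns true when the collection changed.
    boolean add(String s) {
        if (positions.containsKey(s)) {
            return false;
        }
        items.add(s);
        positions.put(s, items.size() - 1);
        return true;
    }

    boolean contains(String s) {
        return positions.containsKey(s);
    }

    String get(int index) {
        return items.get(index);
    }

    // Position in insertion order, or -1 if s is absent.
    int indexOf(String s) {
        Integer i = positions.get(s);
        return i == null ? -1 : i;
    }
}
```

If only insertion-order iteration is needed and not access by index, java.util.LinkedHashSet already provides that without a second structure.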

#10


0  

Perhaps this isn't required for your case, but I think it's useful to know that there is a space-efficient probabilistic data structure for membership tests:

https://en.wikipedia.org/wiki/Bloom_filter

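For a feel of how a Bloom filter behaves, here is a toy sketch built on java.util.BitSet. SimpleBloomFilter, its parameters and the way the second hash is derived are all assumptions made for this example and are not tuned; a real application would more likely use an existing library implementation.

```java
import java.util.BitSet;

// Hypothetical toy Bloom filter: no false negatives, but false positives are possible.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per string

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit index from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) | 1; // cheap second hash, an untuned assumption
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String s) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(s, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String s) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(s, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because a positive answer is only "possibly present", a Bloom filter is typically placed in front of an exact structure (such as a HashSet or a database) to skip the expensive lookup for strings that are definitely absent.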
