Removing duplicates from an array of strings in C

Time: 2023-01-13 19:29:49

I have an array of strings in C, each about 3000 characters long. I want to hash them for faster lookups and would prefer perfect hashing. The problem is that a perfect hash function has to be built from a set of unique strings, whereas my data set inevitably contains duplicates.

So I need a very fast way to remove duplicates from an array of strings in C. Please suggest the fastest way to do this.

3 solutions

#1


1  

My first thought, without researching, was to compute a basic hash for each string and compare the complete strings only when their hashes match. This should speed up the algorithm slightly, at a small cost in simplicity. There is probably a better solution than this, but it should help in a pinch.
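
A minimal sketch of that idea, under stated assumptions: the hash is FNV-1a (my choice, not something the answer specifies), and `fnv1a` and `dedup_prefilter` are hypothetical names. Each string is hashed once, and `strcmp` runs only when two 64-bit hashes are equal. The scan is still quadratic in the number of strings, but most comparisons become a single integer test instead of a walk over 3000 characters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, 64-bit: a simple non-cryptographic hash (an assumed choice). */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/*
 * Hypothetical helper: compacts arr in place, keeping the first occurrence
 * of each string, and returns the new count. If the hash buffer cannot be
 * allocated, the array is left untouched and n is returned.
 */
size_t dedup_prefilter(char *arr[], size_t n)
{
    assert(n == 0 || arr != NULL);
    uint64_t *hashes = malloc(n * sizeof *hashes);
    if (n == 0 || hashes == NULL) {
        free(hashes);
        return n;
    }

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t h = fnv1a(arr[i]);
        size_t j;
        for (j = 0; j < kept; j++) {
            /* strcmp runs only when the cheap hash comparison matches. */
            if (hashes[j] == h && strcmp(arr[j], arr[i]) == 0)
                break;
        }
        if (j == kept) {                    /* not seen before: keep it */
            hashes[kept] = h;
            arr[kept] = arr[i];
            kept++;
        }
    }
    free(hashes);
    return kept;
}
```

Only pointers are moved, so the caller still owns the strings, including any dropped duplicates.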

#2


1  

These data structures can help:

array

Add each item to an array and qsort the result. Then output each string unless it is a duplicate of the previous one. This is the equivalent of Unix sort | uniq.
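
The sort-then-skip approach could look roughly like this. `cmp_str` and `dedup_sorted` are hypothetical names, and the sketch compacts the array in place instead of printing, which is an assumption about what the caller wants.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: the elements are char * values, so each argument
 * is a pointer to a char *. */
static int cmp_str(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/*
 * Hypothetical helper: sorts arr, then compacts it so that each distinct
 * string appears once. Returns the new count. The original order is lost.
 */
size_t dedup_sorted(char *arr[], size_t n)
{
    assert(n == 0 || arr != NULL);
    if (n < 2)
        return n;

    qsort(arr, n, sizeof arr[0], cmp_str);

    size_t kept = 1;                        /* arr[0] is always kept */
    for (size_t i = 1; i < n; i++) {
        /* After sorting, duplicates are adjacent, so comparing with the
         * last kept string is enough. */
        if (strcmp(arr[kept - 1], arr[i]) != 0)
            arr[kept++] = arr[i];
    }
    return kept;
}
```

This needs no extra memory beyond what qsort uses internally, at the price of reordering the array.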

binary tree

Hold the strings in a binary tree (see Wikipedia: binary tree). As each string arrives, search the tree for it, and add it only if it is not already there.
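
A rough sketch of the tree variant, assuming a plain unbalanced binary search tree ordered by `strcmp`. `struct node`, `bst_insert`, `bst_free`, and `dedup_bst` are hypothetical names. Search and insert are merged into one walk, so each string is compared against one root-to-leaf path only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct node {
    const char *key;                /* borrowed pointer, not a copy */
    struct node *left, *right;
};

/*
 * Inserts key if absent. Returns 1 if inserted, 0 if an equal string was
 * already present, -1 if allocation failed. The tree is not balanced, so
 * already-sorted input degrades each insert to O(n).
 */
static int bst_insert(struct node **root, const char *key)
{
    while (*root != NULL) {
        int c = strcmp(key, (*root)->key);
        if (c == 0)
            return 0;
        root = (c < 0) ? &(*root)->left : &(*root)->right;
    }
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->left = n->right = NULL;
    *root = n;
    return 1;
}

static void bst_free(struct node *t)
{
    if (t == NULL)
        return;
    bst_free(t->left);
    bst_free(t->right);
    free(t);
}

/* Compacts arr in place, keeping first occurrences; returns the new count. */
size_t dedup_bst(char *arr[], size_t n)
{
    assert(n == 0 || arr != NULL);
    struct node *root = NULL;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        /* Keep the string unless the tree already holds an equal one.
         * On allocation failure (-1) the string is kept too, so nothing
         * is lost, though later duplicates of it may survive. */
        if (bst_insert(&root, arr[i]) != 0)
            arr[kept++] = arr[i];
    }
    bst_free(root);
    return kept;
}
```

A self-balancing tree would avoid the worst case on sorted input. That is left out here to keep the sketch short.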

hash table

Keep a hash table keyed on a hash of each string. On a collision, compare the strings with strcmp, and do not add duplicates.
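
One possible shape for the hash table variant, as a sketch only: open addressing with linear probing, a power-of-two capacity of at least twice the input size, and FNV-1a as the hash. All of these are assumptions, as are the names `str_hash` and `dedup_hashed`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint64_t str_hash(const char *s)     /* FNV-1a, an assumed choice */
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Compacts arr in place, keeping first occurrences; returns the new count.
 * If the table cannot be allocated, the array is left untouched. */
size_t dedup_hashed(char *arr[], size_t n)
{
    assert(n == 0 || arr != NULL);
    if (n < 2)
        return n;

    /* Power-of-two capacity >= 2n keeps the load factor at or below 0.5. */
    size_t cap = 16;
    while (cap < 2 * n)
        cap *= 2;
    const char **slots = malloc(cap * sizeof *slots);
    if (slots == NULL)
        return n;
    for (size_t i = 0; i < cap; i++)
        slots[i] = NULL;                    /* NULL marks an empty slot */

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        size_t pos = (size_t)(str_hash(arr[i]) & (cap - 1));
        /* Linear probing: walk until an equal string or an empty slot. */
        while (slots[pos] != NULL && strcmp(slots[pos], arr[i]) != 0)
            pos = (pos + 1) & (cap - 1);
        if (slots[pos] == NULL) {           /* first occurrence */
            slots[pos] = arr[i];
            arr[kept++] = arr[i];
        }
    }
    free(slots);
    return kept;
}
```

With a reasonable hash, each string is hashed once and compared with strcmp only a handful of times, so the expected cost is linear in the total input size, and the original order of first occurrences is preserved.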

trie

See Wikipedia: trie. A trie stores each common prefix only once, so inserting a string that is already present adds nothing. Duplicates are 'lost' automatically.
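
A byte-wise trie could be sketched as below. `struct trie`, `trie_new`, `trie_add`, `trie_free`, and `dedup_trie` are hypothetical names, and the 256-way child array is an assumed layout chosen for simplicity. It is memory-hungry: each node holds 256 pointers (about 2 KB on a typical 64-bit platform), so 3000-character strings with few shared prefixes will cost far more memory here than in the other approaches.

```c
#include <assert.h>
#include <stdlib.h>

struct trie {
    struct trie *child[256];    /* one slot per byte value */
    int is_end;                 /* nonzero if a stored string ends here */
};

static struct trie *trie_new(void)
{
    struct trie *t = malloc(sizeof *t);
    if (t != NULL) {
        for (int i = 0; i < 256; i++)
            t->child[i] = NULL;
        t->is_end = 0;
    }
    return t;
}

/* Returns 1 if s was newly added, 0 if it was already present, -1 on
 * allocation failure (a partial path may remain, which is harmless). */
static int trie_add(struct trie *root, const char *s)
{
    struct trie *t = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (t->child[c] == NULL) {
            t->child[c] = trie_new();
            if (t->child[c] == NULL)
                return -1;
        }
        t = t->child[c];
    }
    if (t->is_end)
        return 0;
    t->is_end = 1;
    return 1;
}

static void trie_free(struct trie *t)
{
    if (t == NULL)
        return;
    for (int i = 0; i < 256; i++)
        trie_free(t->child[i]);
    free(t);
}

/* Compacts arr in place, keeping first occurrences; returns the new count. */
size_t dedup_trie(char *arr[], size_t n)
{
    assert(n == 0 || arr != NULL);
    struct trie *root = trie_new();
    if (root == NULL)
        return n;                       /* array left untouched */
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        /* 0 means an identical string was inserted earlier. */
        if (trie_add(root, arr[i]) != 0)
            arr[kept++] = arr[i];
    }
    trie_free(root);
    return kept;
}
```

The `is_end` flag is what distinguishes a stored string from a mere prefix of another one, for example "car" versus "cart".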

#3


0  

#include <string.h>

/**
 * Removes duplicate strings from the array and shifts items left.
 * Returns the number of items in the modified array.
 *
 * Only pointers are moved; duplicate strings are not freed.
 * Performs O(n^2) string comparisons.
 *
 * Parameters:
 * n_items   - number of items in the array.
 * arr       - an array of strings with possible duplicates.
 */
int remove_dups(int n_items, char *arr[])
{
    int i, j, k;

    for (i = 0; i < n_items; i++)
    {
        /* k is the next free slot to the right of arr[i]. */
        for (j = i + 1, k = j; j < n_items; j++)
        {
            /* Keep arr[j] only if it differs from arr[i]. */
            if (strcmp(arr[i], arr[j]) != 0)
            {
                arr[k] = arr[j];
                k++;
            }
        }
        /* j - k duplicates of arr[i] were dropped in this pass. */
        n_items -= j - k;
    }
    return n_items;
}
