How do I remove duplicate strings from an array in C?

Date: 2021-10-23 21:41:49

I have an array of strings in C and an integer indicating how many strings are in the array.

char *strarray[MAX];  
int strcount;

In this array, the highest index (where 10 is higher than 0) holds the most recently added item and the lowest index holds the oldest item. The order of items within the array matters.

I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

For example:

strarray[0] = "Line 1"; 
strarray[1] = "Line 2"; 
strarray[2] = "Line 3"; 
strarray[3] = "Line 2"; 
strarray[4] = "Line 4";

would become:

strarray[0] = "Line 1"; 
strarray[1] = "Line 3"; 
strarray[2] = "Line 2"; 
strarray[3] = "Line 4";

Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

The algorithm presented below would be run every time a new string is added to strarray.

In the interest of showing that I am trying, I will include my proposed algorithm below:

  1. Search entire strarray for match to str
  2. If no match, do nothing
  3. If match found, put str in strarray
  4. Now we have a strarray with a max of 1 duplicate entry
  5. Add highest index strarray string to lowest index of temporary string array
  6. Continue downwards into strarray and check each element
  7. If duplicate found, skip it
  8. If not, add it to the next highest index of the temporary string array
  9. Reverse temporary string array and copy to strarray

Once again, this is untested (I am implementing it now). I just hope someone out there will have a much better solution.

The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

Thank you!

4 Answers

#1


3  

The typical efficient unique function is to:

  1. Sort the given array.
  2. Collapse each consecutive run of equal items so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.erase(std::unique(src.begin(), src.end()), src.end());

I know you can't use C++, but the implementation should essentially be the same.

Because you need to preserve the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

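Do your first sort with respect to string, remove the duplicate entries from each run of equal strings in the sorted array, then re-sort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order. Note that qsort is not guaranteed to be stable, so if it matters which duplicate survives, break ties on originalPosition in the string comparison.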

EDIT2: Simple C implementation example of std::unique:

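#include <string.h>

// Collapses each run of equal strings in [first, last) to its first entry.
// Returns a pointer one past the last entry kept.
tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  if (first == last)
    return last;  // Empty range: nothing to do.

  tempUniqueEntry *result=first;
  while (++first != last)
  {
    if (strcmp(result->string,first->string))  // Non-zero means the strings differ.
      *(++result)=*first;
  }
  return ++result;
}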
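
Putting the pieces of this answer together, the sort / unique / re-sort idea might look roughly like the sketch below. The function name dedupeKeepLast and the two comparators are made up for illustration, and the tie-break in byString is my own assumption about how to make the highest-index duplicate the one that survives. It only moves pointers around; it does not free the strings it drops.

```c
#include <stdlib.h>
#include <string.h>

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

/* Order by string; among equal strings, the highest original index comes first. */
static int byString(const void *a, const void *b)
{
    const tempUniqueEntry *x = a;
    const tempUniqueEntry *y = b;
    int c = strcmp(x->string, y->string);
    if (c != 0)
        return c;
    return (y->originalPosition > x->originalPosition)
         - (y->originalPosition < x->originalPosition);
}

/* Order by original index, ascending. */
static int byPosition(const void *a, const void *b)
{
    const tempUniqueEntry *x = a;
    const tempUniqueEntry *y = b;
    return (x->originalPosition > y->originalPosition)
         - (x->originalPosition < y->originalPosition);
}

/* Hypothetical helper: returns the new count, or -1 if malloc fails. */
int dedupeKeepLast(char *strarray[], int strcount)
{
    if (strcount <= 1)
        return strcount;

    tempUniqueEntry *tmp = malloc((size_t)strcount * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (int i = 0; i < strcount; ++i)
    {
        tmp[i].originalPosition = i;
        tmp[i].string = strarray[i];
    }

    /* 1. Sort so that equal strings are adjacent. */
    qsort(tmp, (size_t)strcount, sizeof *tmp, byString);

    /* 2. Keep the first entry of each run (the highest original index). */
    int kept = 1;
    for (int i = 1; i < strcount; ++i)
    {
        if (strcmp(tmp[kept - 1].string, tmp[i].string) != 0)
            tmp[kept++] = tmp[i];
    }

    /* 3. Restore the original relative order of the survivors. */
    qsort(tmp, (size_t)kept, sizeof *tmp, byPosition);

    for (int i = 0; i < kept; ++i)
        strarray[i] = tmp[i].string;
    for (int i = kept; i < strcount; ++i)
        strarray[i] = NULL;

    free(tmp);
    return kept;
}
```

You would call it as strcount = dedupeKeepLast(strarray, strcount); and check for a negative return before using the result.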

#2


1  

I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

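// n is the number of strings in strarray (strcount in the question).
unsigned int i;
for (i = n; i > 0; i--)
{
    unsigned int j;

    // Skip entries already flagged as duplicates.
    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    for (j = i - 1; j > 0; j--)
    {
        // Flagged entries are NULL and must not be passed to strcmp.
        if (strarray[j - 1] != NULL &&
            strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}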

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).
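
For completeness, the filtering step could be sketched like this. The name compactStrings is hypothetical, and the sketch assumes NULL is used only as the "removed" flag, never as a legitimate entry.

```c
#include <stddef.h>

/* Hypothetical helper: slides the non-NULL entries down to the front of the
   array, preserving their order, and returns how many entries remain. */
int compactStrings(char *strarray[], int n)
{
    int out = 0;

    for (int in = 0; in < n; ++in)
    {
        if (strarray[in] != NULL)
            strarray[out++] = strarray[in];
    }

    /* Clear the now-unused tail so no stale pointers are left behind. */
    for (int k = out; k < n; ++k)
        strarray[k] = NULL;

    return out;
}
```

After the marking loop you would write strcount = compactStrings(strarray, strcount);.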

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).
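
If you are on a POSIX system, one way to try the tree idea without writing your own tree is tsearch/tfind from <search.h>. This is only a sketch: markDuplicatesWithTree and compareKeys are names I made up, and POSIX does not promise that the tree is balanced (that depends on the C library), so the O(n log n) bound is an assumption about your implementation.

```c
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Keys stored in the tree are the string pointers themselves. */
static int compareKeys(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Hypothetical helper: walks the array from the highest index down and sets
   every string that was already seen to NULL. Returns the number of entries
   flagged, or -1 if the tree could not allocate a node. */
int markDuplicatesWithTree(char *strarray[], int n)
{
    void *root = NULL;
    int marked = 0;

    for (int i = n; i > 0; --i)
    {
        char *s = strarray[i - 1];
        if (s == NULL)
            continue;

        if (tfind(s, &root, compareKeys) != NULL)
        {
            /* Seen at a higher index already: flag this older copy. */
            strarray[i - 1] = NULL;
            ++marked;
        }
        else if (tsearch(s, &root, compareKeys) == NULL)
        {
            marked = -1;  /* Out of memory. */
            break;
        }
    }

    /* Free the tree nodes by repeatedly deleting the root's key. */
    while (root != NULL)
        tdelete(*(char **)root, &root, compareKeys);

    return marked;
}
```

The flagged entries still have to be filtered out afterwards, exactly as with the nested-loop version.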

#3


1  

Sort the array with a function like qsort (run man 3 qsort in the terminal to see how it should be used) and then use strcmp to compare adjacent strings and find duplicates.

If you want to maintain the original order, you could use an O(N^2) algorithm with two nested for loops: the outer loop picks an element, and the inner loop scans the rest of the array to find out whether the chosen element is a duplicate.
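
A rough sketch of the nested-loop version, written so that the highest-index copy is the one that survives, as the question asks. The function name removeEarlierDuplicates is hypothetical, and the sketch assumes no entry is NULL. It drops pointers without freeing them.

```c
#include <string.h>

/* Hypothetical helper: keeps an element only if no equal string appears at a
   higher index, compacting the array in place. Returns the new count. */
int removeEarlierDuplicates(char *strarray[], int n)
{
    int out = 0;

    for (int i = 0; i < n; ++i)
    {
        int repeatedLater = 0;

        /* Inner loop: scan the rest of the array for the same string. */
        for (int j = i + 1; j < n; ++j)
        {
            if (strcmp(strarray[i], strarray[j]) == 0)
            {
                repeatedLater = 1;
                break;
            }
        }

        /* out never exceeds i, so this write cannot clobber unread entries. */
        if (!repeatedLater)
            strarray[out++] = strarray[i];
    }

    return out;
}
```

For the array in the question this leaves "Line 1", "Line 3", "Line 2", "Line 4" in the first four slots and returns 4.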

#4


0  

Can you control the input as it is going into the array? If so, just do something like this:

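#include <stdlib.h>
#include <string.h>

// Appends a copy of toadd, removes any older duplicate, and returns the new count.
// The caller must make sure strarray has room for strcount + 1 entries, and
// every existing entry must have been allocated with malloc (it may be freed here).
int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const size_t toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(toaddlen + 1);
    if (strarray[strcount] == NULL)
        return strcount;  // Allocation failed; the array is unchanged.
    memcpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strcmp(strarray[i], toadd) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[k - 1] = strarray[k];

            strarray[strcount] = NULL;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}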

You can also call the function above in a loop over the elements of an existing array to build a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure and use a linked list instead. Linked lists are much more efficient for removing elements from a location other than the end.
