Best way to process a list of email addresses

I have got 3 text files (A, B and C), each with several hundred email addresses. I want to merge list A and list B into a single file, ignoring differences in case and white space. Then I want to remove all emails in the new list that are in list C, again ignoring differences in case and white space.

My programming language of choice is normally C++, but it seems poorly suited for this task. Is there a scripting language that could do this (and similar tasks) in relatively few lines?

Or is there software already out there (free or commercial) that would allow me to do it? Is it possible to do it in Excel, for example?

7 answers

#1

The fastest way to do this probably doesn't require any coding. You could import files A and B into one Excel worksheet, then (if necessary) filter the resulting list of addresses to remove any duplicates.

The next step would be to import file C into a second worksheet. In a third worksheet, you'd do a VLOOKUP to pick out all the addresses in your first list and remove them if they're in your "List C".

The VLOOKUP would look something like this:

=IF(ISNA(VLOOKUP(email_address_cell, Sheet2!email_duplicates_list, 1, FALSE)), email_address_cell, "")

The ISNA check catches the "Value Not Available" error that VLOOKUP returns when an address is not in List C; those addresses are kept, while any address that is found in List C shows as a blank cell. From there, you just need to remove the blank cells and there's your final list.

Now having said all that, you could still do a VBA macro to do much the same thing, but perhaps clean the lists up a bit, depending on what you need. Hope that helps!

#2

For text processing of the sort you describe, either Perl or Python is ideal.

You can use associative arrays (arrays with a string index in this case) to store the email addresses in a list.

Use the lowercased, un-whitespaced email address as a key and the real email address as the value.

Then it's a matter of reading in and storing the first file, reading in and storing the second (which will overwrite email addresses with the same key), then reading in the third file and deleting entries from the list with that key.

What you're then left with is the list you desire (A + B - C).

Pseudo-code here:

set list to empty
foreach line in file one:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file two:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file three:
    key = unwhitespace(tolowercase(line))
    if exists(list{key})
        delete list{key}
foreach key in list:
    print list{key}
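
The pseudo-code above could be sketched in Python roughly as follows. This is only an illustration: the function names normalize and merge_and_subtract and the sample addresses are made up for the example, and it assumes "ignoring white space" means dropping every whitespace character before comparing. It relies on dictionaries keeping insertion order (Python 3.7 or later).

```python
def normalize(address):
    """Return the comparison key: all whitespace removed, lowercased."""
    return "".join(address.split()).lower()


def merge_and_subtract(list_a, list_b, list_c):
    """Return (A + B) - C, comparing addresses by their normalized key."""
    merged = {}
    # Later entries overwrite earlier ones that share a key,
    # so duplicates across A and B collapse into one entry.
    for address in list(list_a) + list(list_b):
        key = normalize(address)
        if key:  # skip blank lines
            merged[key] = address.strip()
    # Drop every key that also appears in C.
    for address in list_c:
        merged.pop(normalize(address), None)
    return list(merged.values())


# Hypothetical sample data standing in for the three files.
a = ["Alice@Example.com ", "bob@example.com"]
b = ["alice@example.com", " Carol@Example.COM"]
c = ["BOB@example.com"]

for address in merge_and_subtract(a, b, c):
    print(address)
```

To run it on real files, each list could be read with something like open(path).read().splitlines(). Because the values keep their original spelling, the output is not forced to lower case.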

#3

As Excel was mentioned, you can also do this kind of thing with Jet and VBScript.

Set cn = CreateObject("ADODB.Connection")
strCon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=No;FMT=Delimited"";"

cn.Open strCon

strSQL = "SELECT F1 Into New.txt From EmailsA.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL

strSQL = "INSERT INTO New.txt ( F1 ) SELECT F1 FROM EmailsB.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL

#4

In Python, something like this:

Note, this will write lower-case emails to the final output. If that's not ok, then a dictionary-based solution would be necessary.

def read_file(filename):
    """Yield each non-blank line with surrounding whitespace stripped."""
    with open(filename, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def write_file(filename, lines):
    """Write one address per line."""
    with open(filename, "w") as f:
        for line in lines:
            f.write(line + "\n")


# Lower-case every address so that comparisons ignore case.
set_a = {line.lower() for line in read_file("file_a.txt")}
set_b = {line.lower() for line in read_file("file_b.txt")}
set_c = {line.lower() for line in read_file("file_c.txt")}

# Calculate (a + b) - c
write_file("result.txt", (set_a | set_b) - set_c)

#5

I think the answers above cover the technical how-to question; the only thing left to consider is how many times you will have to perform the task. If it's a one-time thing and you're more comfortable with Excel, start there. If you know you will have to perform this task at least twice and maybe more, then coding up a script or executable is the way to go.

#6

Sadly this answer probably won't help you, but if in fact you were using Unix (Linux for example) you could do something like:

cat filea >> fileb # append file a to file b

sort fileb | uniq > newFile # newFile now contains a merger of file a and file b, with sorted and unique email addresses

The above could all be done on one line, without modifying fileb, as follows: cat filea fileb | sort | uniq > newFile

Now you're left with simply removing the emails that also appear in file C. Some variation of "diff" could help there, though "comm" is probably a closer fit, such as perhaps: sort fileC | comm -23 newFile - > finalFile

diff lists the differences between the two files, but its output carries "<" and ">" markers and also includes lines found only in fileC, so it needs filtering before it is a clean list. comm -23 prints only the lines unique to its first input, so the output in "finalFile" should be a list of emails that are in "newFile" (the merger of A & B) but are NOT in fileC; both inputs must be sorted. Options to the various tools (or a pass through tr beforehand) allow you to ignore whitespace and case. I'd have to play with it a bit to get it exactly right but the above is the general idea.

I used to have an extra box running Linux for the sole purpose of doing stuff like the above which is a hassle under Windoze but a breeze under Unix type operating systems. When my hardware died I never got around to building another Linux box.

I believe the MKS toolkit for Windoze probably has all of the above utilities.

#7

Excel can do it, as above. The programming language most suited though is Perl.
