How to compare 10 big XML files?

Time: 2021-10-26 19:33:44

I have 10 big XML files, and some of them differ from the others (each file represents the data at one step of a process).

How can I compare them AUTOMATICALLY?

I know I can compare them manually with a tool like WinMerge, or by eye, but I don't like that approach.

I would like to do this on a Windows machine; I have Cygwin installed.

I think I can somehow use git diff to do that, but ... how?

4 answers

#1


2  

If all you want to know is which files differ, the simplest (not the fastest!) approach is to hash them and compare the results: run md5sum yourfile*.xml and see which entries are identical.
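
For anyone who would rather script the hashing step than read md5sum output by eye, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and the default algorithm is an assumption (SHA-256; pass "md5" to mirror md5sum). Files are hashed in chunks so that large files do not have to fit in memory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(paths, algorithm="sha256"):
    """Return lists of paths whose contents hash to the same digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path, algorithm)].append(str(path))
    return list(groups.values())


if __name__ == "__main__":
    # "yourfile*.xml" is the placeholder pattern used in the answer.
    for group in group_by_digest(sorted(Path(".").glob("yourfile*.xml"))):
        print(" == ".join(group))
```

Each printed line lists files with byte-identical content; a file that appears alone on its line differs from all the others.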

It would be more efficient to compare them in a different way, but I don't think there are standard tools for that - a small program would do, however.

Open all files to be compared
Loop over the character indices
    fetch the character at that index from each file and compare
    drop the files that differ from the rest / group the files that share the same character
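
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: `group_identical` is a made-up name, and it compares fixed-size chunks rather than single characters (an assumption made for speed). A group is split as soon as its members diverge, and a file that ends up alone is not read any further.

```python
def group_identical(paths, chunk_size=1 << 16):
    """Partition files into groups with identical content, reading in lockstep."""
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]  # groups whose members may still diverge
        finished = []            # groups that are final
        while pending:
            group = pending.pop()
            if len(group) == 1:
                finished.append(group)  # unique file: stop reading it
                continue
            # Read the next chunk of every file in the group and split by it.
            by_chunk = {}
            for p in group:
                by_chunk.setdefault(handles[p].read(chunk_size), []).append(p)
            for chunk, members in by_chunk.items():
                if chunk == b"":
                    finished.append(members)  # reached end of file together
                else:
                    pending.append(members)
        return finished
    finally:
        for h in handles.values():
            h.close()
```

Unlike hashing, this reads only as far as needed to tell the files apart, which is where the saving on large files comes from.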

That way you can stop, or narrow down your search, at the first difference, depending on what you want to do. Calculating a checksum/hash, by contrast, always reads each file in full, which matters because you wrote that the files are large.

I'd go with the md5sum (shasum, ...) for now, though.

#2


2  

Do you need an XML-aware comparison, e.g. one that recognizes that attribute order is not significant? If so, you can compare the files by parsing them and using the deep-equal() function in XPath or XQuery. Alternatively, you can turn the files into XML canonical form and then compare the canonicalized files bytewise.
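
As a rough illustration of the canonical-form route, the sketch below uses `xml.etree.ElementTree.canonicalize` from the Python standard library (Python 3.8+, C14N 2.0) and then compares digests of the canonical text. `canonical_digest` is a made-up helper name, and `strip_text=True` is an assumption that indentation-only differences should be ignored. Note that it builds the whole canonical document in memory, which may be a problem for very large files.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(path):
    """Return a SHA-256 digest of the file's canonical XML (C14N 2.0) form."""
    # strip_text=True drops leading/trailing whitespace in text nodes, so
    # indentation-only differences do not count; omit it if whitespace matters.
    canonical = canonicalize(from_file=path, strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two files whose digests match are identical after canonicalization, so differences in attribute order alone do not show up.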

If you need an analysis of the differences, rather than merely a boolean value telling you they are different, there is a product called DeltaXML that specializes in this. It's not free.

#3


1  

If you just want to determine quickly whether the files are the same or not, you might consider using a hashing algorithm: compute the MD5 hash of each file and compare the resulting hashes.

#4


1  

Well, the simplest way to compare two files is to use diff file1 file2. You can add the -b and -B options to ignore changes in the amount of whitespace and in blank lines: diff -bB file1 file2. See man diff.

If you want to do that for a lot of files, use a script.
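
Such a script might look like the Python sketch below, which compares every pair of files using the standard difflib module. It is only a rough approximation of diff -bB: whitespace runs are collapsed (leading and trailing whitespace is dropped entirely) and blank lines are skipped. The function names and the UTF-8 encoding are assumptions, and difflib can be slow on very large inputs.

```python
import difflib
import itertools


def normalized_lines(path):
    """Read a text file, collapsing whitespace runs and dropping blank lines."""
    with open(path, encoding="utf-8") as f:  # assumes UTF-8 input
        lines = (" ".join(line.split()) for line in f)
        return [line for line in lines if line]


def pairwise_diffs(paths):
    """Yield (path_a, path_b, diff_lines) for every pair of files that differ."""
    cache = {p: normalized_lines(p) for p in paths}
    for a, b in itertools.combinations(paths, 2):
        if cache[a] != cache[b]:
            diff = difflib.unified_diff(cache[a], cache[b], a, b, lineterm="")
            yield a, b, list(diff)
```

With 10 files this checks 45 pairs; printing the diff lines of each yielded tuple gives output in the familiar unified format.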

git diff is relevant if you are comparing two revisions of the same file.

my2c
