I have 10 big XML files, and some of them differ from the others (each file represents the data at a step of the process).
How can I compare them AUTOMATICALLY?
I know I can compare them manually with a tool like WinMerge or by eye, but I don't like that approach.
I would like to do this on a Windows machine, and I have Cygwin installed.
I think I can somehow use git diff to do that, but ... how?
4 answers
#1
2
If all you want to know is the difference, the simplest (not fastest!) will be to do a hash over them and compare the results. md5sum yourfile*.xml and see which entries are identical.
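As a rough sketch of the hashing idea, here is the same thing in Python, for the case where you want the identical files grouped for you rather than eyeballing md5sum output. The function names file_digest and group_identical are made up for this example, and SHA-256 is used instead of MD5 only as an assumption; any hash from hashlib would work the same way.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_identical(paths):
    """Group paths by digest; files in one group are byte-identical."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(str(p))
    # Sort for a stable, readable result.
    return sorted(sorted(g) for g in groups.values())
```

You could call it as group_identical(glob.glob("yourfile*.xml")) and print each group. Note that this only detects byte-for-byte equality; two files that differ in whitespace end up in different groups.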
It would be more efficient to compare them in a different way, but I don't think there are standard tools for that - a small program would do, however.
Open all files to be compared
Loop over the character indices
Fetch the character at that index from each file and compare
Remove from the list the files that differ from all others / group the files that still match
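The steps above could be sketched roughly like this in Python. It is only one way to read the idea: it compares chunks of bytes rather than single characters (an assumption made for speed), and the name partition_by_content is hypothetical. A file is dropped from further reading as soon as it differs from every other file.

```python
from contextlib import ExitStack


def partition_by_content(paths, chunk_size=1 << 16):
    """Split paths into groups of byte-identical files, reading in lockstep."""
    paths = [str(p) for p in paths]
    finished = []
    with ExitStack() as stack:
        handles = {p: stack.enter_context(open(p, "rb")) for p in paths}
        groups = [paths]  # start by assuming every file is identical
        while groups:
            still_open = []
            for group in groups:
                # Bucket the files of this group by their next chunk.
                buckets = {}
                for p in group:
                    buckets.setdefault(handles[p].read(chunk_size), []).append(p)
                for chunk, members in buckets.items():
                    if len(members) == 1 or chunk == b"":
                        # Unique file, or all members hit end of file together.
                        finished.append(sorted(members))
                    else:
                        still_open.append(members)
            groups = still_open
    return sorted(finished)
```

The benefit over hashing is that files which differ early are never read to the end. The cost is that all files are open at once, which is fine for 10 files.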
That way, at the first difference you can already narrow down your search, depending on what you want to do. Calculating a checksum/hash, by contrast, always reads each file in its entirety, which matters because you wrote about large files.
I'd go with the md5sum (shasum, ...) for now, though.
#2
2
Do you need an XML-aware comparison, e.g. one that recognizes that attribute order is not significant? If so, you can compare the files by parsing them and using the deep-equal() function in XPath or XQuery. Alternatively, you can turn the files into XML canonical form and then compare the canonicalized files bytewise.
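For the canonical-form route, a minimal sketch in Python could look like the following. It assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize is available; the helper names canonical_xml and same_xml are invented for this example. The strip_text option is an extra on top of plain canonicalization and is off by default here.

```python
import xml.etree.ElementTree as ET


def canonical_xml(path, strip_text=False):
    """Return the C14N 2.0 canonical form of an XML file as a string."""
    # strip_text=True additionally trims whitespace around text content,
    # which hides pure indentation differences (an assumption that this
    # whitespace is not significant in your data).
    return ET.canonicalize(from_file=path, strip_text=strip_text)


def same_xml(path_a, path_b, strip_text=False):
    """True if both files have the same canonical form."""
    return canonical_xml(path_a, strip_text) == canonical_xml(path_b, strip_text)
```

Canonicalization normalizes things like attribute order and empty-element syntax, but element order still counts as a difference. For very large files you may prefer to write the canonical output to disk (the out parameter) and compare the resulting files instead of holding the strings in memory.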
If you need an analysis of the differences, rather than merely a boolean value telling you they are different, there is a product called DeltaXML that specializes in this. It's not free.
#3
1
If you just want to determine quickly whether the files are the same or not, you might consider using a hashing algorithm: compute the MD5 of each file and compare the resulting hashes.
#4
1
Well, the simplest way to compare two files is to use diff file1 file2. You can add the -b and -B options to ignore whitespace and blank-line differences: diff -bB file1 file2. Try man diff.
If you want to do that for a lot of files, use a script.
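Such a script might look something like this. Note that it swaps in Python's difflib module instead of calling the diff command, purely so the example is self-contained; the function name pairwise_diffs and the UTF-8 encoding are assumptions. It compares every pair of files and yields a unified diff for each pair.

```python
import difflib
from itertools import combinations


def pairwise_diffs(paths, encoding="utf-8"):
    """Yield (file_a, file_b, diff_lines) for every pair of files."""
    paths = [str(p) for p in paths]
    # Read each file once; all files are held in memory for the comparison.
    texts = {}
    for p in paths:
        with open(p, encoding=encoding) as f:
            texts[p] = f.readlines()
    for a, b in combinations(paths, 2):
        diff = list(difflib.unified_diff(texts[a], texts[b], fromfile=a, tofile=b))
        # An empty list means the two files have identical lines.
        yield a, b, diff
```

To print only the pairs that differ, loop over the results and call sys.stdout.writelines(diff) when diff is non-empty. Be aware that difflib can be slow on very large files, so for big inputs calling diff itself from a shell loop may be the better choice.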
git diff is relevant if you compare two revisions of the same file.
my2c