Searching multiple XML files for a string

Time: 2023-01-15 00:17:09

I have a folder with 400k+ XML documents, with many more to come. Each file is named 'ID'.xml and belongs to a specific user. In a SQL Server database, the 'ID' from the XML file name is matched with a userID, which is how I link each XML document to its user. A user can have an unlimited number of XML documents attached (but let's say the maximum is somewhere above 10k documents).

All XML-documents have a few common elements, but the structure can vary a little.

Now, each user needs to search the XML documents that belong to her, and what I've tried so far (looping through each file and reading it with a StreamReader) is too slow. I don't care whether the search matches against the whole file, attributes and all, or just the text in each element. What should be returned in the first place is a list of the IDs taken from the file names.

What is the fastest and smartest method here, if there is one?

7 answers

#1


2  

I think LINQ-to-XML is probably the direction you want to go.

Assuming you know the names of the tags that you want, you would be able to do a search for those particular elements and return the values.

var xDoc = XDocument.Load("yourFile.xml");

var result = from dec in xDoc.Descendants()
             where dec.Name == "tagName"
             select dec.Value;

result would then contain an IEnumerable<string> with the value of every XML element whose name matches "tagName".

The query could also be written like this:

var result = from dec in xDoc.Descendants("tagName")
             select dec.Value;

or this:

var result = xDoc.Descendants("tagName").Select(tag => tag.Value);

The output is the same in all three cases; they are just different ways of filtering on the element name.
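
To tie this back to the question (a list of IDs rather than element values), here is a minimal sketch of how the query could be wrapped in a loop over one user's files. The class name, the method name and all four parameters are my own assumptions, not anything from the question; it also assumes the elements are not in an XML namespace and that the list of IDs for the user has already been fetched from SQL Server.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlSearch
{
    // Returns the IDs (file names without ".xml") of the files in which at least
    // one element named tagName has a value containing searchText.
    // folder, ids, tagName and searchText are hypothetical inputs.
    public static List<string> FindMatchingIds(
        string folder, IEnumerable<string> ids, string tagName, string searchText)
    {
        var matches = new List<string>();
        foreach (string id in ids)
        {
            string path = Path.Combine(folder, id + ".xml");
            if (!File.Exists(path)) continue; // skip IDs that have no file on disk

            XDocument xDoc = XDocument.Load(path);
            bool found = xDoc.Descendants(tagName)
                .Any(tag => tag.Value.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0);
            if (found) matches.Add(id);
        }
        return matches;
    }
}
```

Note that this still loads every file completely, so it is probably no faster than what you have now; it only shows the shape of the result.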

#2


2  

You'll have to open each file that contains relevant data, and if you don't know which files contain it, you'll have to open all that may match. So the only performance gain would be in the parsing routine.

When parsing XML, if speed is the requirement, you could use XmlReader, as it performs much better than the other parsers (most of which read the entire XML file before you can query it). The fact that it is forward-only should not be a limitation in this case.
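
A rough sketch of what that could look like, assuming a case-insensitive substring match against text nodes and attribute values is good enough. The class and method names are made up for illustration, and DtdProcessing.Ignore is my own choice so that a DOCTYPE declaration does not throw.

```csharp
using System;
using System.Xml;

public static class XmlReaderSearch
{
    // Returns true as soon as searchText appears in a text node, a CDATA section
    // or an attribute value; the rest of the file is not read after the first hit.
    public static bool FileContains(string path, string searchText)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreComments = true,
            IgnoreWhitespace = true,
            DtdProcessing = DtdProcessing.Ignore
        };
        using (XmlReader reader = XmlReader.Create(path, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text || reader.NodeType == XmlNodeType.CDATA)
                {
                    if (Contains(reader.Value, searchText)) return true;
                }
                else if (reader.NodeType == XmlNodeType.Element && reader.HasAttributes)
                {
                    while (reader.MoveToNextAttribute())
                    {
                        if (Contains(reader.Value, searchText)) return true;
                    }
                    reader.MoveToElement(); // step back from the attributes to the element
                }
            }
        }
        return false;
    }

    private static bool Contains(string haystack, string needle)
    {
        return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

You would call FileContains once per file belonging to the user and collect the IDs of the files for which it returns true.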

If parsing takes about as long as the disk I/O, you could try parsing files in parallel, so that one thread can wait for a file to be read while another parses data that has already been loaded. I don't think you will gain all that much there, though.
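
If you want to measure it, a parallel scan could be sketched with PLINQ roughly as follows. This is only an illustration under my own assumptions: it matches against the raw file text (markup included, which the question says is acceptable), and the names and the maxThreads parameter are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ParallelSearch
{
    // Scans the given files on up to maxThreads threads and returns the IDs
    // (file names without extension) of the files whose raw text contains searchText.
    public static List<string> FindMatchingIds(
        IEnumerable<string> paths, string searchText, int maxThreads)
    {
        return paths
            .AsParallel()
            .WithDegreeOfParallelism(maxThreads)
            .Where(path => File.ReadAllText(path)
                .IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .OrderBy(id => id, StringComparer.Ordinal) // stable output order
            .ToList();
    }
}
```

Try a few values for maxThreads; on a single spinning disk more threads can easily make things slower rather than faster.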

Also, what counts as "too slow", and what would be acceptable? Will this many-files approach become slower over time as more files are added?

#3


1  

Use LINQ to XML.

Check out this article over at MSDN.

XDocument doc = XDocument.Load(@"C:\file.xml"); // verbatim string, so "\f" is not read as an escape sequence

And don't forget that reading so many files will always be slow; you could try writing a multi-threaded program...

#4


1  

If I understood correctly, you don't want to open every XML file for a particular user, because that is too slow whether you use LINQ to XML or some other method. Have you considered saving some values (tags) both in the XML file and in a relational database, together with the XML ID? In that case you could search for the values in the DB first and open only the XML files that contain them.

For example:

ID, tagName1, tagName2
xmlDocID, value1, value2
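
As a sketch of how one such row could be produced from a file, assuming you pick a fixed list of tag names to index: the class and method names are hypothetical, only the first occurrence of each tag is kept, and writing the row to the database is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class TagIndex
{
    // Builds one row for the lookup table: the first cell is the document ID
    // (file name without ".xml"), followed by the value of the first element
    // found for each name in tagNames, or null when the tag is absent.
    public static string[] ExtractRow(string path, IList<string> tagNames)
    {
        XDocument xDoc = XDocument.Load(path);
        var row = new string[tagNames.Count + 1];
        row[0] = Path.GetFileNameWithoutExtension(path);
        for (int i = 0; i < tagNames.Count; i++)
        {
            XElement first = xDoc.Descendants(tagNames[i]).FirstOrDefault();
            row[i + 1] = first == null ? null : first.Value;
        }
        return row;
    }
}
```

You would run this once per file when the file is added, so each XML document is parsed only once instead of on every search.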

My other question is: why have you chosen to store the XML documents in the file system? If you are using SQL Server 2005/2008, it has very good support for storing and searching XML columns (and even for indexing some values inside the XML).

#5


0  

Are you just looking for files that have a specific string in the content somewhere?

WARNING - Not a pure .NET solution. If this scares you, then stick with the other answers. :)

If that's what you're doing, another option is to let something like grep do the heavy lifting for you. Shell out to it with the "-l" argument, which says you are only interested in file names, and you are on to a winner. (For more usage examples, see this link.)
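
Shelling out could look something like the sketch below. It assumes a grep that supports -r is on the PATH (GNU grep does) and a runtime that has ProcessStartInfo.ArgumentList (.NET Core 2.1 or later); the class and method names are invented. Besides -l it passes -F so the search text is treated as a fixed string rather than a regular expression.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class GrepSearch
{
    // Runs "grep -r -l -F -e <searchText> <folder>" and turns the file names it
    // prints into IDs (file names without extension).
    public static List<string> FindMatchingIds(string folder, string searchText)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "grep", // assumption: grep is installed and on the PATH
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        psi.ArgumentList.Add("-r"); // search the folder recursively
        psi.ArgumentList.Add("-l"); // print only the names of matching files
        psi.ArgumentList.Add("-F"); // fixed string, not a regular expression
        psi.ArgumentList.Add("-e");
        psi.ArgumentList.Add(searchText);
        psi.ArgumentList.Add(folder);

        var ids = new List<string>();
        using (Process grep = Process.Start(psi))
        {
            string line;
            while ((line = grep.StandardOutput.ReadLine()) != null)
            {
                ids.Add(Path.GetFileNameWithoutExtension(line));
            }
            grep.WaitForExit();
        }
        ids.Sort(StringComparer.Ordinal);
        return ids;
    }
}
```

This searches the whole folder, so you would still have to intersect the result with the IDs that belong to the user.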

#6


0  

L.B has already made a valid point. This is a case where Lucene.Net (or any indexer) is a must. It would give you steady, very fast performance across all searches; handling a very large amount of arbitrary data is one of the primary benefits of an indexer.

Or is there any reason why you wouldn't use Lucene?

#7


0  

Lucene.NET (and Lucene) support incremental indexing. If you can re-open the index for reading every so often, then you can keep adding documents to the index all day long -- your searches will be up-to-date with the last time you re-opened the index for searching.
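
To illustrate the idea only, here is a toy in-memory inverted index. To be clear, this is not Lucene.NET's API and nowhere near a replacement for it; every name in it is made up. It just shows why an index makes searching fast (one dictionary lookup per word instead of one file read per document) and how documents can keep being added between searches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed class TinyInvertedIndex
{
    // word -> IDs of the documents that contain it (words are case-insensitive)
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Adds one document; can be called at any time as new files arrive.
    public void Add(string docId, string text)
    {
        foreach (string word in Tokenize(text))
        {
            HashSet<string> ids;
            if (!postings.TryGetValue(word, out ids))
            {
                ids = new HashSet<string>();
                postings[word] = ids;
            }
            ids.Add(docId);
        }
    }

    // Returns the sorted IDs of the documents that contain every word of the query.
    public List<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            HashSet<string> ids;
            if (!postings.TryGetValue(word, out ids)) return new List<string>();
            if (result == null) result = new HashSet<string>(ids);
            else result.IntersectWith(ids);
        }
        return result == null
            ? new List<string>()
            : result.OrderBy(id => id, StringComparer.Ordinal).ToList();
    }

    // Splits text into runs of letters and digits; markup characters act as separators.
    private static IEnumerable<string> Tokenize(string text)
    {
        var word = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                word.Append(c);
            }
            else if (word.Length > 0)
            {
                yield return word.ToString();
                word.Clear();
            }
        }
        if (word.Length > 0) yield return word.ToString();
    }
}
```

A real indexer adds persistence, ranking, phrase queries and so on, which is why you would use Lucene rather than roll your own.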
