How to append to large XML files in C# while using memory efficiently

Time: 2023-01-14 22:48:51

Is there some way I can combine two XmlDocuments without holding the first in memory?

I have to cycle through a list of up to a hundred large (~300MB) XML files, appending up to 1000 nodes to each, and repeat the whole process several times (the new node list is cleared after each pass to save memory). At the moment I load the whole XmlDocument into memory before appending new nodes, which is not tenable.

What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:

  1. Never load the whole XmlDocument; instead use XmlReader and XmlWriter simultaneously to write to a temp file, which is subsequently renamed.

  2. Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" )).

  3. Something else?
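
The idea of writing new nodes manually into the existing file could look roughly like the sketch below. It is only an illustration under stated assumptions: the file is UTF-8, it ends with the root's closing tag (plus at most a little whitespace), and the new nodes are already serialized to a string. The names `XmlTailAppender`, `Append`, `closingTag`, and `newNodesXml` are hypothetical, and nothing here checks that the result is well-formed XML.

```csharp
using System;
using System.IO;
using System.Text;

static class XmlTailAppender
{
    // Inserts already-serialized nodes just before the closing root tag,
    // reading and rewriting only the last few bytes of the file.
    // Assumption: the file is UTF-8 and ends with closingTag, followed
    // at most by a small amount of whitespace.
    public static void Append(string path, string closingTag, string newNodesXml)
    {
        Encoding utf8 = new UTF8Encoding(false); // do not emit a BOM
        byte[] tagBytes = utf8.GetBytes(closingTag);

        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Read only a small tail of the file, never the whole document.
            int tailLength = (int)Math.Min(fs.Length, tagBytes.Length + 64);
            long tailStart = fs.Length - tailLength;
            byte[] tail = new byte[tailLength];
            fs.Seek(tailStart, SeekOrigin.Begin);
            int read = 0;
            while (read < tailLength)
            {
                int n = fs.Read(tail, read, tailLength - read);
                if (n == 0) break;
                read += n;
            }

            int index = LastIndexOf(tail, read, tagBytes);
            if (index < 0)
                throw new InvalidDataException("Closing tag not found near the end of " + path);

            // Cut the file just before the closing tag, then write the
            // new nodes followed by the closing tag again.
            fs.SetLength(tailStart + index);
            fs.Seek(0, SeekOrigin.End);
            byte[] payload = utf8.GetBytes(newNodesXml + closingTag);
            fs.Write(payload, 0, payload.Length);
        }
    }

    // Byte-wise search for the last occurrence of needle in haystack[0..count).
    static int LastIndexOf(byte[] haystack, int count, byte[] needle)
    {
        for (int i = count - needle.Length; i >= 0; i--)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

A call might look like `XmlTailAppender.Append("source1.xml", "</log>", "<entry/>")`, where the file name and element names are made up. The trade-off is that memory use stays tiny regardless of file size, but a malformed fragment silently corrupts the file.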

Any help will be much appreciated.

Edit: Some more details in answer to some of the comments:

The program parses several large logs into XML, grouping the output into different files by source. Once the XML is written, a lightweight proprietary reader program produces reports on the data. The program only needs to run once a day, so it can be slow, but it runs on a server that performs other tasks, mainly file compression and transfer, which must not be affected too much.

A database would probably be easier, but the company isn't going to do this any time soon!

As is, the program runs on the dev machine using a few GB of memory at most, but throws out-of-memory exceptions when run on the server.

Final Edit: The task is quite low-priority, which is why getting a database would only be an extra cost (though I will look into mongo).

The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.

I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.

1 solution

#1


If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
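
A minimal sketch of that XmlReader/XmlWriter route follows, assuming the new nodes belong at the end of the root element. The existing file is streamed into a temp file node by node, the new nodes are written through the same XmlWriter, and the temp file then replaces the original. `XmlStreamAppender`, `AppendToRoot`, the `writeNewNodes` callback, and the `.tmp` suffix are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlStreamAppender
{
    // Streams the existing document into a temp file, lets the caller
    // write extra nodes at the end of the root element, then swaps files.
    // Only the current node is held in memory at any time.
    public static void AppendToRoot(string path, Action<XmlWriter> writeNewNodes)
    {
        string tempPath = path + ".tmp"; // hypothetical temp-file naming

        using (XmlReader reader = XmlReader.Create(path))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            // Skip the declaration, comments, etc. that precede the root.
            // (This sketch does not copy them to the output.)
            reader.MoveToContent();
            if (reader.NodeType != XmlNodeType.Element)
                throw new InvalidDataException("No root element found in " + path);

            bool rootIsEmpty = reader.IsEmptyElement;
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false);
            reader.Read(); // move to the first child (or past an empty root)

            if (!rootIsEmpty)
            {
                // WriteNode copies the current node with its subtree and
                // advances the reader, so the loop ends at the root's end tag.
                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                    writer.WriteNode(reader, false);
            }

            writeNewNodes(writer);    // caller appends the new nodes
            writer.WriteEndElement(); // close the root element
        }

        // Swap the finished temp file in for the original.
        File.Replace(tempPath, path, null);
    }
}
```

For example, `XmlStreamAppender.AppendToRoot("source1.xml", w => w.WriteElementString("entry", "text"))` would append one element, with the file and element names being made up. Each pass rewrites the whole file, so this costs disk I/O rather than memory, and the writer refuses to produce malformed output.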

However, for the absolute highest performance, you may be able to get the same result quickly using direct string functions. You could do it like this, although you'd lose the ability to verify the XML structure (if one file had an error, you wouldn't be able to correct it):

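// Requires: using System; using System.IO;
// `files` (the input file names) and `max_performance` are supplied by the caller.
using (StreamWriter sw = new StreamWriter("out.xml")) {
    foreach (string filename in files) {
        // Note: filename is written unescaped into the attribute value
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        using (StreamReader sr = new StreamReader(filename)) {
            if (max_performance) {
                // .NET 4's CopyTo() exists on Stream, not on StreamReader, so copy
                // the text in fixed-size chunks; alternatively try http://bit.ly/RiovFX
                char[] buffer = new char[81920];
                int read;
                while ((read = sr.Read(buffer, 0, buffer.Length)) > 0) {
                    sw.Write(buffer, 0, read);
                }
            } else {
                string line;
                // Loop until ReadLine() returns null at end of file
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}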

Depending on how your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.
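
As a rough sketch of that line-by-line approach, the helper below copies text while dropping a leading XML declaration. It assumes the declaration sits entirely on the first line and starts with `<?xml ` followed by a space; `XmlHeaderStripper` and `CopyWithoutDeclaration` are hypothetical names, and removing the document element would need similar but file-specific checks.

```csharp
using System;
using System.IO;

static class XmlHeaderStripper
{
    // Copies all lines from input to output, dropping an XML declaration
    // such as <?xml version="1.0"?> if it appears on the first line.
    // Assumption: the declaration does not span multiple lines.
    public static void CopyWithoutDeclaration(TextReader input, TextWriter output)
    {
        bool firstLine = true;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (firstLine)
            {
                firstLine = false;
                string trimmed = line.TrimStart();
                // "<?xml " with a trailing space avoids matching other
                // processing instructions such as <?xml-stylesheet ...?>.
                if (trimmed.StartsWith("<?xml ", StringComparison.Ordinal))
                {
                    int end = trimmed.IndexOf("?>", StringComparison.Ordinal);
                    if (end >= 0)
                    {
                        // Keep whatever follows the declaration on this line.
                        line = trimmed.Substring(end + 2);
                        if (line.Length == 0) continue;
                    }
                }
            }
            output.Write(line);
            output.Write("\n");
        }
    }
}
```

In the loop above this could stand in for the line-by-line branch, for instance `XmlHeaderStripper.CopyWithoutDeclaration(sr, sw)`.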
