Why is this XML file loading slowly?

Time: 2023-01-21 09:01:21

I have some very simple code:

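        // 'url' holds the address (or local path) of the page to parse.
        XmlDocument doc = new XmlDocument();
        Console.WriteLine("loading");
        doc.Load(url);
        Console.WriteLine("loaded");

        XmlNodeList nodeList = doc.GetElementsByTagName("p");

        foreach (XmlNode node in nodeList)
        {
            // Print the value of the first child (the leading text of each <p>);
            // skip empty paragraphs, which have no child nodes.
            if (node.HasChildNodes)
            {
                Console.WriteLine(node.ChildNodes[0].Value);
            }
        }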

I'm working on this file and it takes two minutes to load. Why does it take so long? I tried both fetching the file from the net and loading a local copy.

2 solutions

#1


9  

I imagine it's the DTD of the page that's taking so long to load. Given that it defines entities, you shouldn't disable it, so you're probably better off not going down this path.

Given the inner workings of the Wikipedia parser (a right mess), I'd say it's a big leap to assume it's going to produce well-formed XHTML every time.

Use HTML Agility Pack to parse (then you can convert to XmlDocument a little more easily if required, IIRC).
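
A minimal sketch of that suggestion, assuming the third-party HtmlAgilityPack NuGet package is referenced. The class name ParagraphExtractor and the sample markup are made up for illustration. Unlike the question's code it returns the whole text of each <p> (InnerText) rather than only the first child node:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack" (assumed to be installed)

// Hypothetical helper, not part of the original question.
static class ParagraphExtractor
{
    // Returns the entity-decoded text of every <p> element in the given HTML.
    public static List<string> GetParagraphTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant HTML parser: no DTD is downloaded

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText keeps entities such as &nbsp; as written; DeEntitize decodes them.
            result.Add(HtmlEntity.DeEntitize(node.InnerText));
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample: the bare <br> and undeclared &nbsp; would trip a strict XML parser.
        string html = "<html><body><p>first&nbsp;paragraph<br></p>"
                    + "<p>second <b>paragraph</b></p></body></html>";

        foreach (string text in ParagraphExtractor.GetParagraphTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

To parse a live page instead of a string, the same library offers new HtmlWeb().Load(url), which returns an HtmlDocument.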

If you really want to go down the XmlDocument route you can keep a local cache of the HTML DTDs. See this post, this post and this post for details.
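
One way to get a local DTD cache without maintaining the files yourself, assuming .NET 4.0 or later, is the framework's XmlPreloadedResolver, which ships with the XHTML 1.0 DTDs and entity files built in. This is a sketch rather than the approach from the linked posts; the XhtmlLoader class and the sample document are made up for illustration:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

// Hypothetical helper, not part of the original question.
static class XhtmlLoader
{
    // Loads XHTML into an XmlDocument, serving the XHTML 1.0 DTDs and
    // entity files from the framework's built-in copies, not from w3.org.
    public static XmlDocument Load(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // DTD processing stays on so that entities like &nbsp; are defined.
            DtdProcessing = DtdProcessing.Parse,
            // No fallback resolver is given, so anything other than the
            // preloaded XHTML 1.0 resources will fail to resolve.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}

static class Program
{
    // Made-up sample page that uses two entities defined by the XHTML DTD.
    const string Sample =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" +
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>" +
        "<body><p>caf&eacute;&nbsp;menu</p></body></html>";

    static void Main()
    {
        XmlDocument doc = XhtmlLoader.Load(new StringReader(Sample));
        foreach (XmlNode node in doc.GetElementsByTagName("p"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

For a page fetched from the net, download it first and pass a StreamReader over the response to Load, so that only the DTD lookups go through the preloaded resolver.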

#2


5  

It is because XmlDocument doesn't just load your XML into a nice class hierarchy; it also goes and fetches the DTD referenced by the document, along with the entity files that DTD pulls in. Run Fiddler and you will see the calls to fetch:

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent

These all took me about 20 seconds to fetch.
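
If Fiddler isn't at hand, the same requests can be observed from inside the program. The sketch below is an illustration, not part of the original answer: LoggingResolver is a made-up subclass of XmlUrlResolver that records each URI before fetching it, and the page address is expected as the first command-line argument. The page itself may also show up in the log, since the resolver can be used to open it.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Xml;

// Hypothetical helper: logs every external resource the XML parser asks for.
class LoggingResolver : XmlUrlResolver
{
    public readonly List<Uri> Requested = new List<Uri>();

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        Requested.Add(absoluteUri);
        Console.WriteLine("fetching " + absoluteUri);
        // Delegate the actual download or file read to XmlUrlResolver.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

static class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the URL or path of the XHTML page");
            return;
        }

        var resolver = new LoggingResolver();
        var doc = new XmlDocument();
        doc.XmlResolver = resolver; // used for the DTD and entity lookups during Load

        Stopwatch timer = Stopwatch.StartNew();
        doc.Load(args[0]);
        timer.Stop();

        Console.WriteLine("loaded in " + timer.Elapsed
            + " after " + resolver.Requested.Count + " resolver requests");
    }
}
```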
