Validating and extracting XML records, record by record, into a database

Time: 2022-12-04 15:39:02

Here's the deal. I have an XML document with a lot of records. Something like this:

        <?xml version="1.0" encoding="utf-8" ?>
        <Orders>
          <Order>
            <Phone>1254</Phone>
            <City>City1</City>
            <State>State</State>
          </Order>
          <Order>
            <Phone>98764321</Phone>
            <City>City2</City>
            <State>State2</State>
          </Order>
        </Orders>

There's also an XSD schema file. I would like to extract data from this file and insert these records into a database table. First, of course, I would like to validate each order record. For example, if there are 5 orders in the file and 2 of them fail validation, I would like to insert the 3 that passed validation into the db and leave out the other 2. There can be thousands of records in one XML file. What would be the best approach here? And how would the validation work, since I need to discard the failed records and only use the ones that passed validation? At the moment I'm using XmlReaderSettings to validate the XML document records. Should I extract these records into another XML file, a DataSet, or a custom object before I insert them into the DB? I'm using .NET 3.5. Any code or link is welcome.

5 answers

#1


0  

You have a couple of options:

  1. XmlDataDocument or XmlDocument. The downside to this approach is that the data will be cached in memory, which is bad if you have a lot of it. On the other hand, you get good in-memory querying facilities with DataSet. XmlDocument requires that you use XPath queries to work on the data, whereas XmlDataDocument gives you an experience more like the DataSet functionality.

  2. XmlReader. This is a good, fast approach because the data isn't cached; you read it in a bit at a time as a stream. You move from one element to the next, and query information about that element in your application to decide what to do with it. This does mean that you maintain in your application's memory the tree level that you're at, but with a simple XML file structure like yours this should be very simple.

I recommend option 2 in your case. It should scale well in terms of memory usage, and should provide the simplest implementation for processing a file.
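
As a rough illustration of option 2, here is a minimal sketch, assuming the schema declares an Orders root containing repeated Order elements as in the question. The class name OrderSplitter and the method name Split are made up for this example. It reads the file with a validating XmlReader, loads one Order at a time through ReadSubtree, and treats any validation error raised while that order was being read as a failure of that order.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

// Hypothetical helper, not part of the framework.
public static class OrderSplitter
{
    // Streams through the XML, validating against the schema while reading,
    // and sorts each <Order> element into the "good" or "bad" list.
    public static void Split(TextReader xml, TextReader xsd,
                             List<XElement> good, List<XElement> bad)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(XmlSchema.Read(xsd, null));

        // With a handler attached, the reader reports errors instead of throwing.
        bool hasError = false;
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { hasError = true; };

        using (XmlReader reader = XmlReader.Create(xml, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "Order")
                {
                    XElement order;
                    // Only this one order is held in memory; reading it through
                    // the subtree reader still runs validation on the outer reader.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        order = XElement.Load(subtree);
                    }

                    if (hasError)
                        bad.Add(order);
                    else
                        good.Add(order);
                }

                // Start each following node with a clean slate.
                hasError = false;
            }
        }
    }
}
```

For a very large file you would probably insert each good order into the database (or batch it) inside the loop rather than collecting everything in lists; the lists are used here only to keep the sketch short.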

#2


1  

If the data maps fairly cleanly to an object model, you could try using xsd.exe to generate some classes from the .xsd, and process the classes into your DAL of choice. The problem is that if the volume is high (you mention thousands of records), you will most likely have a lot of round-trips.
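
To give an idea of what the object-model route looks like, here is a minimal sketch. The Order class below is a hand-written stand-in for whatever xsd.exe would actually generate from your schema (the real generated code will look different), and OrderMapper is a made-up helper name. It uses XmlSerializer to turn one Order element into an object that a DAL could then persist.

```csharp
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;

// Hand-written stand-in for a class generated by xsd.exe (an assumption;
// the generated class depends on the actual schema).
[XmlRoot("Order")]
public class Order
{
    public string Phone;
    public string City;
    public string State;
}

// Hypothetical helper, not part of the framework.
public static class OrderMapper
{
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Order));

    // Deserializes a single <Order> element into an Order object.
    public static Order FromElement(XElement element)
    {
        using (XmlReader reader = element.CreateReader())
        {
            return (Order)Serializer.Deserialize(reader);
        }
    }
}
```

This only covers the mapping step; how the resulting objects are saved, and how many round-trips that costs, depends on the DAL.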

Another option might be to pass the data "as is" through to the database and use SQL/XML to process the data in TSQL - presumably as a stored procedure that accepts a parameter of type xml (SQL Server 2005 etc).

#3


1  

I agree with the idea that you should use an XmlReader, but I thought I'd try something a little different.

Basically, I am first validating the whole XDocument, then if there are errors, I enumerate through the orders and bin them as needed. It's not pretty, but maybe it'll give you some ideas.

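        // Needs: System.Collections.Generic, System.Linq,
        //        System.Xml.Linq, System.Xml.Schema
        XDocument doc = XDocument.Load("sample.xml");
        XmlSchemaSet schemas = new XmlSchemaSet();
        schemas.Add("", "sample.xsd");

        // First pass: validate the whole document. Passing true for addSchemaInfo
        // annotates the elements with schema info; without it, GetSchemaInfo()
        // below returns null.
        bool errors = false;
        doc.Validate(schemas, (sender, e) =>
        {
            errors = true;
        }, true);

        List<XElement> good = new List<XElement>();
        List<XElement> bad = new List<XElement>();
        var orders = doc.Descendants("Order");
        if (errors)
        {
            // Second pass: validate each order on its own and sort it.
            foreach (var order in orders)
            {
                errors = false;
                order.Validate(order.GetSchemaInfo().SchemaElement, schemas, (sender, e) =>
                {
                    errors = true;
                });

                if (errors)
                    bad.Add(order);
                else
                    good.Add(order);
            }
        }
        else
        {
            // No errors anywhere, so every order is good.
            good = orders.ToList();
        }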

Instead of the lambda expressions, you could use a common function, but I just threw this together. Also, you could build two XDocuments instead of shoving the order elements into a list. I'm sure there are a ton of other problems here too, but maybe this will spark something.
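
Building the two XDocuments could look something like the sketch below. OrderDocuments and Wrap are made-up names, and the Orders root element name is taken from the sample file in the question. When an XElement that already has a parent is added to a new tree, LINQ to XML adds a copy, so the source document is left untouched.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of the framework.
public static class OrderDocuments
{
    // Wraps a set of <Order> elements in a new <Orders> root document.
    // Elements that already belong to another document are copied, not moved.
    public static XDocument Wrap(IEnumerable<XElement> orders)
    {
        return new XDocument(new XElement("Orders", orders));
    }
}
```

With lists of passing and failing orders in hand, you would call something like OrderDocuments.Wrap(good).Save("good.xml") and OrderDocuments.Wrap(bad).Save("bad.xml"); the file names are just placeholders.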

#4


0  

A lot of that depends on what "validation" means in your scenario. I assume, since you're using an .xsd, you are already validating that the data is syntactically correct. So, validation probably means you'll be calling other services or procedures to determine if an order is valid?

You might want to look at SQL Server Integration Services. The XML Task in SSIS lets you do things like XPath queries, merging, and likely anything else you'd need to do with that document. You could also use it to do all of your upfront validation against the schema file.

Marc's option of passing that data to a stored procedure might work in this scenario too, but SSIS (or, even DTS but you're going to give up too much related to XML to make it as nice of an option) will let you visually orchestrate all of this work. Plus, it'll make it easier for these things to run out of process so you should end up with a much more scalable solution.

#5


0  

By validation I mean validating each node. The nodes that have at least one error need to be inserted into a new xml document. Basically at the end I should have 2 xml documents. One containing the successful nodes and the other containing the failure nodes. Any way I can accomplish that? I'm using LINQ.
