SSIS uses too much memory when loading a large (40GB+) XML file into a SQL Server table

Date: 2022-09-15 16:52:41

I need to load a single large (40GB+) XML file into an SQL Server 2012 database table using SSIS. I'm having problems because SSIS seems to load the entire document into memory instead of streaming it.

Here are more details of my SSIS package.

I've created an XML Source with the following properties:

  • Data access mode: XML file from variable (but could be XML File Location)
  • Variable name: a variable that specifies the XML file path on my machine.
  • XSD location: the path to the XSD that defines the XML being read.

The XML structure is simple, with only 3 hierarchical levels:

  1. Root element with header information
  2. One level defining collections of objects
  3. The leaf level defining individual objects (each with a fixed set of fields)

I need to insert one database record per leaf element, repeating the fields from the higher hierarchy levels. In other words, I need to flatten the XML hierarchy.

How can I make SSIS stream load the data, instead of trying to load the entire document in memory?


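Outside of SSIS, the shape of the streaming flatten being asked for can be sketched with a pull parser. Below is a minimal Python illustration (not an SSIS component); the element names `Header`, `Collection`, and `Object` are assumptions about the schema:

```python
import xml.etree.ElementTree as ET

def flatten(xml_path):
    """Yield one flat record per leaf element, repeating ancestor fields.

    iterparse streams the document, so memory stays bounded as long as
    finished elements are cleared instead of being kept in the tree.
    """
    rows = []
    header = {}       # fields from the root-level header element
    collection = {}   # attributes of the collection currently being read
    for event, elem in ET.iterparse(xml_path, events=("start", "end")):
        if event == "start" and elem.tag == "Collection":
            collection = dict(elem.attrib)   # attribs are complete on "start"
        elif event == "end":
            if elem.tag == "Header":
                header = {c.tag: c.text for c in elem}
            elif elem.tag == "Object":
                row = dict(header)                        # repeat header fields
                row["collection_id"] = collection.get("id")
                row.update({c.tag: c.text for c in elem}) # leaf fields
                rows.append(row)
                elem.clear()   # free the processed leaf to keep memory flat
            elif elem.tag == "Collection":
                elem.clear()   # free the finished collection subtree
    return rows
```

An SSIS Script Component source would follow the same pattern with .NET's `XmlReader`: read forward, carry the header/collection fields along, and emit one output row per leaf.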
2 Solutions

#1


4  

The XML source always loads the entire file. It uses XmlDocument to do so (last I checked).

The only thing you can do is split up the file somehow, then iteratively run each piece through your data flow.


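The split-and-loop idea can be sketched as a pre-processing step that streams the big file and writes one small file per collection; an SSIS Foreach Loop container would then feed each small file to the data flow. A minimal Python sketch, assuming `Header`/`Collection` element names and a `Root` wrapper:

```python
import os
import xml.etree.ElementTree as ET

def split_by_collection(xml_path, out_dir):
    """Write one small XML file per <Collection>, each with the shared header."""
    os.makedirs(out_dir, exist_ok=True)
    header_xml = b""
    paths = []
    n = 0
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "Header":
            header_xml = ET.tostring(elem)   # keep the header for every part
        elif elem.tag == "Collection":
            path = os.path.join(out_dir, f"part_{n:05d}.xml")
            with open(path, "wb") as f:
                # each part is a well-formed document with the same shape
                f.write(b"<Root>" + header_xml + ET.tostring(elem) + b"</Root>")
            paths.append(path)
            n += 1
            elem.clear()   # drop the finished subtree to bound memory
    return paths
```

Because each part is a well-formed document with the same structure, the existing XSD and XML Source configuration can be reused unchanged inside the loop.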
Beyond that, you're looking at creating a custom data source, which is not trivial. It also represents a serious piece of code to maintain.

There may be third-party data sources which can do this. I had to write my own about five years ago.

#2


1  

Have you considered processing the files in smaller chunks?

I had the same issue before, so I created a script component to split that 1 big XML file into hundreds of smaller XML files, then used a Foreach Loop to iterate over all of the smaller XML files and process each one.

To do this you can't use StreamReader.ReadLine, because it will still do the same thing and read through that very large file; instead, use System.IO.MemoryMappedFiles.MemoryMappedFile, a class designed for this scenario.

Have a look here http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx


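For illustration only, here is the memory-mapped idea sketched with Python's `mmap` rather than the .NET class: the OS pages the file in on demand, so you can scan a 40GB file for record boundaries without pulling it all into process memory. The `</Collection>` delimiter is an assumption about the schema:

```python
import mmap

def find_record_ends(path, delimiter=b"</Collection>"):
    """Return byte offsets just past each delimiter, usable as split points."""
    offsets = []
    with open(path, "rb") as f:
        # length 0 maps the whole file; pages are loaded lazily by the OS
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(delimiter)
            while pos != -1:
                offsets.append(pos + len(delimiter))
                pos = mm.find(delimiter, pos + 1)
    return offsets
```

A splitter would then copy the byte ranges between consecutive offsets into the smaller files, wrapping each range in the root/header markup so every part stays well-formed.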