Need help designing a big database update process

Posted: 2022-03-04 21:12:02

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored across 15 tables. I have to get these objects, perform some transforms on them, and then write them to a different database (with the same schema). This is ADO.NET 3.5, SQL Server 2005.

We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.

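For context, that per-property writer presumably looks something like the sketch below. The method name, the LookUpTableFor helper, and the table/column names are invented for illustration; the real library will differ, but the cost profile (open connection, existence check, single-row write, close) is the point:

```csharp
using System;
using System.Data.SqlClient;

// Hypothetical sketch of the existing per-property library call described above.
static class PropertyWriter
{
    public static void WriteProperty(string connectionString, Guid objectId,
                                     string propertyName, object value)
    {
        string table = LookUpTableFor(propertyName);   // pick one of the 15 tables

        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Round trip 1: does the row already exist?
            bool exists;
            using (SqlCommand check = new SqlCommand(
                "SELECT COUNT(*) FROM " + table + " WHERE ObjectId = @id AND Name = @name", conn))
            {
                check.Parameters.AddWithValue("@id", objectId);
                check.Parameters.AddWithValue("@name", propertyName);
                exists = (int)check.ExecuteScalar() > 0;
            }

            // Round trip 2: insert or update accordingly.
            string sql = exists
                ? "UPDATE " + table + " SET Value = @value WHERE ObjectId = @id AND Name = @name"
                : "INSERT INTO " + table + " (ObjectId, Name, Value) VALUES (@id, @name, @value)";
            using (SqlCommand write = new SqlCommand(sql, conn))
            {
                write.Parameters.AddWithValue("@id", objectId);
                write.Parameters.AddWithValue("@name", propertyName);
                write.Parameters.AddWithValue("@value", value);
                write.ExecuteNonQuery();
            }
        }   // connection closed and disposed here
    }

    // Stub standing in for the real table-mapping logic.
    static string LookUpTableFor(string propertyName) { return "dbo.Properties1"; }
}
```

Called 40 times per object across 100,000 objects, that is on the order of 4 million open/check/write/close cycles, which is where the time goes.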

My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.

What are some good designs for handling this type of problem?

Thanks

5 Answers

#1


This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.

#2


Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.

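For example, if both databases sit on the same server (or are reachable through a linked server), each of the 15 tables can be copied with a single set-based statement, with the transform expressed in the SELECT list. A minimal sketch issued from the existing ADO.NET program; database, table, and column names are hypothetical, and the destination table is assumed to start empty:

```csharp
using System.Data.SqlClient;

// One INSERT ... SELECT per table instead of one round trip per property.
static void CopyTableSetBased(string destConnectionString)
{
    const string sql = @"
        INSERT INTO dbo.Properties1 (ObjectId, Name, Value)
        SELECT s.ObjectId, s.Name, UPPER(s.Value)   -- UPPER() stands in for the real transform
        FROM   SourceDb.dbo.Properties1 AS s;       -- same server; use a linked-server name otherwise";

    using (SqlConnection conn = new SqlConnection(destConnectionString))
    {
        conn.Open();
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.CommandTimeout = 0;   // a whole-table copy can exceed the default 30-second timeout
            cmd.ExecuteNonQuery();
        }
    }
}
```

Fifteen such statements replace the millions of single-row round trips of the current design.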

#3


How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up an SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.

You might want to research CREATE ASSEMBLY (SQL CLR), moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.

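A rough sketch of what that could look like: the transformation code compiled into an assembly, registered on the server with CREATE ASSEMBLY, and exposed as a SQL CLR stored procedure that runs inside SQL Server. The class, procedure, and table names here are invented for illustration:

```csharp
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class ObjectTransforms
{
    // Deployed with roughly:
    //   CREATE ASSEMBLY TransformLib FROM '...\TransformLib.dll';
    //   CREATE PROCEDURE dbo.CopyAndTransform AS
    //       EXTERNAL NAME TransformLib.ObjectTransforms.CopyAndTransform;
    [SqlProcedure]
    public static void CopyAndTransform()
    {
        // The context connection reuses the calling session, so there is no
        // network hop per statement once the procedure is running on the server.
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO TargetDb.dbo.Properties1 (ObjectId, Name, Value) " +
                "SELECT ObjectId, Name, Value FROM SourceDb.dbo.Properties1", conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```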

#4


Bad news: you have many options

use flat-file transformations: extract all the data into flat files, manipulate it with grep, awk, sed, C, or Perl into the required insert/update statements, and execute those against the target database.

PRO: Fast. CON: Extremely ugly; a nightmare for maintenance. Don't do this if you need it for longer than a week, or for more than a couple dozen runs.

use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within another, so one of the fastest ways to do this is to write it as a collection of insert/update/merge statements fed by select statements.

PRO: Fast, and only one technology involved. CON: Requires a direct connection between the databases, and you might hit the limits of SQL (or of the available SQL knowledge) pretty fast, depending on the kind of transformation.

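Worth noting: MERGE only arrived in SQL Server 2008, so on 2005 the usual stand-in for the merge statement mentioned above is an UPDATE of the rows that already exist followed by an INSERT of the ones that don't. A sketch with hypothetical names, run once per destination table from the client:

```csharp
using System.Data.SqlClient;

// Two set-based statements emulate an upsert on SQL Server 2005.
static void UpsertTable(SqlConnection destConn)   // an open connection to the destination DB
{
    const string sql = @"
        UPDATE d
        SET    d.Value = s.Value
        FROM   dbo.Properties1 AS d
        JOIN   SourceDb.dbo.Properties1 AS s
               ON s.ObjectId = d.ObjectId AND s.Name = d.Name;

        INSERT INTO dbo.Properties1 (ObjectId, Name, Value)
        SELECT s.ObjectId, s.Name, s.Value
        FROM   SourceDb.dbo.Properties1 AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Properties1 AS d
                           WHERE d.ObjectId = s.ObjectId AND d.Name = s.Name);";

    using (SqlCommand cmd = new SqlCommand(sql, destConn))
    {
        cmd.CommandTimeout = 0;
        cmd.ExecuteNonQuery();
    }
}
```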

use T-SQL, or whatever iterative language the database provides; everything else is similar to the pure SQL approach.

PRO: Pretty fast, since you never leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL it is not the nicest language for complex transformations.

use an ETL tool: there are special tools for extracting, transforming and loading data. They often support various databases, and many strategies for deciding whether an update or an insert is needed are readily available.

PRO: Sorry, you'll have to ask somebody else about that; so far I have nothing but bad experiences with those tools.

CON: A highly specialized tool that you need to master. In my personal experience: slower in implementation and execution of the transformation than handwritten SQL. A nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool vendor gives you, if anything.

use a high-level language (Java, C#, VB ...): you would load your data into proper business objects, manipulate those, and store them in the target database. Pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate.

PRO: Even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tools (good IDEs, testing frameworks, CI systems) to support you while developing the transformation.

CON: It adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling them back into the target database). I'd go this way if it is a process that is going to be around for a long time.
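If you do stay with the high-level-language route on .NET, one way to keep the object model while avoiding per-row writes is to transform in memory and push each of the 15 tables with SqlBulkCopy (available since ADO.NET 2.0). A sketch with hypothetical names; note that SqlBulkCopy only inserts, so an already-populated destination would need a staging table plus an upsert:

```csharp
using System.Data;
using System.Data.SqlClient;

// Read -> transform in memory -> bulk insert, one table at a time.
static void CopyWithBulkCopy(string sourceCs, string destCs)
{
    // Pull the source rows into a DataTable.
    DataTable rows = new DataTable();
    using (SqlConnection src = new SqlConnection(sourceCs))
    using (SqlDataAdapter adapter = new SqlDataAdapter(
               "SELECT ObjectId, Name, Value FROM dbo.Properties1", src))
    {
        adapter.Fill(rows);
    }

    // Apply the transform; Transform() stands in for the real business logic.
    foreach (DataRow row in rows.Rows)
    {
        row["Value"] = Transform(row["Value"]);
    }

    // Stream everything to the destination in one bulk operation.
    using (SqlConnection dest = new SqlConnection(destCs))
    {
        dest.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(dest))
        {
            bulk.DestinationTableName = "dbo.Properties1";
            bulk.BatchSize = 2000;        // commit in chunks rather than one giant batch
            bulk.WriteToServer(rows);
        }
    }
}

static object Transform(object value) { return value; }   // placeholder
```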

Building on the last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer to gain throughput. But I guess I am leaving the scope of your question.

#5


I'm with John: SSIS is the way to go for any repeatable process that imports large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need a hybrid of set-based and looping code that runs in batches (of, say, 2,000 records at a time) rather than locking up the table for the whole time a large insert would take.

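A sketch of that batched hybrid driven from the client side: keep issuing a set-based statement that copies the next 2,000 uncopied rows until nothing is left, so no single statement holds locks on the whole table for long. Names are hypothetical:

```csharp
using System.Data.SqlClient;

// Copy one table in batches of 2,000 rows per statement.
static void CopyInBatches(string destConnectionString)
{
    const string batchSql = @"
        INSERT INTO dbo.Properties1 (ObjectId, Name, Value)
        SELECT TOP (2000) s.ObjectId, s.Name, s.Value
        FROM   SourceDb.dbo.Properties1 AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.Properties1 AS d
                           WHERE d.ObjectId = s.ObjectId AND d.Name = s.Name);";

    using (SqlConnection conn = new SqlConnection(destConnectionString))
    {
        conn.Open();
        int copied;
        do
        {
            using (SqlCommand cmd = new SqlCommand(batchSql, conn))
            {
                copied = cmd.ExecuteNonQuery();   // rows copied by this batch
            }
        } while (copied == 2000);                 // a short batch means the source is exhausted
    }
}
```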
