Massive CROSS JOIN in SQL Server 2005

Time: 2022-10-19 01:04:06

I'm porting a process which creates a MASSIVE CROSS JOIN of two tables. The resulting table contains 15m records (it looks like the process makes a roughly 30m-row cross join of a 2,600-row table and a 12,000-row table, and then does some grouping which must cut it in half). The rows are relatively narrow: just 6 columns. My port has been running for 5 hours with no sign of completion. I only just noticed the count discrepancy between the known-good output and what I would expect from the cross join, so my output doesn't have the grouping or deduping which will halve the final table, but this still seems like it's not going to complete any time soon.
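A quick way to pin down the count discrepancy, as a minimal sketch using the table names from the DDL further down. The third query rests on an assumption: that the halving comes from ACCOUNT_NO repeating across LI_LNTM values, which the known-good table's primary key on CA_JCCAC would not allow.

```sql
-- Size of the unfiltered CROSS JOIN: the product of the two row counts.
SELECT (SELECT COUNT_BIG(*) FROM COSTCENT)
     * (SELECT COUNT_BIG(*) FROM JOINGLAC) AS expected_cross_join_rows;

-- Rows actually present in the known-good output, for comparison.
SELECT COUNT_BIG(*) AS known_good_rows FROM JOINCCAC;

-- Assumption being tested: if ACCOUNT_NO repeats in JOINGLAC, this is
-- smaller than the JOINGLAC row count and would explain the smaller output.
SELECT COUNT_BIG(*) AS glac_rows,
       COUNT_BIG(DISTINCT ACCOUNT_NO) AS distinct_accounts
FROM   JOINGLAC;
```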

First I'm going to look to eliminate this table from the process if at all possible - obviously it could be replaced by joining to both tables individually, but right now I do not have visibility into everywhere else it is used.

But given that the existing process does it (in less time, on a less powerful machine, using the FOCUS language), are there any options for improving the performance of large CROSS JOINs in SQL Server (2005) (hardware is not really an option, this box is a 64-bit 8-way with 32-GB of RAM)?

Details:

It's written this way in FOCUS (I'm trying to produce the same output, which is a CROSS JOIN in SQL):

JOIN CLEAR *
DEFINE FILE COSTCENT
  WBLANK/A1 = ' ';
  END
TABLE FILE COSTCENT
  BY WBLANK BY CC_COSTCENT
  ON TABLE HOLD AS TEMPCC FORMAT FOCUS
  END

DEFINE FILE JOINGLAC
  WBLANK/A1 = ' ';
  END
TABLE FILE JOINGLAC
  BY WBLANK BY ACCOUNT_NO BY LI_LNTM
  ON TABLE HOLD AS TEMPAC FORMAT FOCUS INDEX WBLANK
  END

JOIN CLEAR *
JOIN WBLANK IN TEMPCC TO ALL WBLANK IN TEMPAC
DEFINE FILE TEMPCC
  CA_JCCAC/A16=EDIT(CC_COSTCENT)|EDIT(ACCOUNT_NO);
  END
TABLE FILE TEMPCC
  BY CA_JCCAC BY CC_COSTCENT AS COST CENTER BY ACCOUNT_NO
  BY LI_LNTM
  ON TABLE HOLD AS TEMPCCAC
  END

So the required output really is a CROSS JOIN (it's joining a blank column from each side).
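To make that equivalence concrete, here is a minimal sketch in SQL. The derived-table aliases and the constant WBLANK column mirror the FOCUS DEFINE; it only illustrates the join shape, not the later sorting or grouping that FOCUS applies.

```sql
-- Joining on a constant column matches every row on one side with every
-- row on the other side ...
SELECT CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM
FROM   (SELECT ' ' AS WBLANK, COST_CTR_NUM FROM COSTCENT) AS CC
JOIN   (SELECT ' ' AS WBLANK, ACCOUNT_NO, LI_LNTM FROM JOINGLAC) AS GL
       ON GL.WBLANK = CC.WBLANK;

-- ... which returns the same rows as a plain CROSS JOIN.
SELECT CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM
FROM   COSTCENT AS CC
CROSS JOIN JOINGLAC AS GL;
```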

In SQL:

CREATE TABLE [COSTCENT](
       [COST_CTR_NUM] [int] NOT NULL,
       [CC_CNM] [varchar](40) NULL,
       [CC_DEPT] [varchar](7) NULL,
       [CC_ALSRC] [varchar](6) NULL,
       [CC_HIER_CODE] [varchar](20) NULL,
 CONSTRAINT [PK_LOOKUP_GL_COST_CTR] PRIMARY KEY NONCLUSTERED
(
       [ID] ASC -- note: no [ID] column appears in the column list as posted
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [JOINGLAC](
       [ACCOUNT_NO] [int] NULL,
       [LI_LNTM] [int] NULL,
       [PR_PRODUCT] [varchar](5) NULL,
       [PR_GROUP] [varchar](1) NULL,
       [AC_NAME_LONG] [varchar](40) NULL,
       [LI_NM_LONG] [varchar](30) NULL,
       [LI_INC] [int] NULL,
       [LI_MULT] [int] NULL,
       [LI_ANLZ] [int] NULL,
       [LI_TYPE] [varchar](2) NULL,
       [PR_SORT] [varchar](2) NULL,
       [PR_NM] [varchar](26) NULL,
       [PZ_SORT] [varchar](2) NULL,
       [PZNAME] [varchar](26) NULL,
       [WANLZ] [varchar](3) NULL,
       [OPMLNTM] [int] NULL,
       [PS_GROUP] [varchar](5) NULL,
       [PS_SORT] [varchar](2) NULL,
       [PS_NAME] [varchar](26) NULL,
       [PT_GROUP] [varchar](5) NULL,
       [PT_SORT] [varchar](2) NULL,
       [PT_NAME] [varchar](26) NULL
) ON [PRIMARY]

CREATE TABLE [JOINCCAC](
       [CA_JCCAC] [varchar](16) NOT NULL,
       [CA_COSTCENT] [int] NOT NULL,
       [CA_GLACCOUNT] [int] NOT NULL,
       [CA_LNTM] [int] NOT NULL,
       [CA_UNIT] [varchar](6) NOT NULL,
 CONSTRAINT [PK_JOINCCAC_KNOWN_GOOD] PRIMARY KEY CLUSTERED
(
       [CA_JCCAC] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

With the SQL Code:

INSERT  INTO [JOINCCAC]
       (
        [CA_JCCAC]
       ,[CA_COSTCENT]
       ,[CA_GLACCOUNT]
       ,[CA_LNTM]
       ,[CA_UNIT]
       )
       SELECT  Util.PADLEFT(CONVERT(varchar, CC.COST_CTR_NUM), '0', 7)
               + Util.PADLEFT(CONVERT(varchar, GL.ACCOUNT_NO), '0', 9) AS CA_JCCAC
              ,CC.COST_CTR_NUM AS CA_COSTCENT
              ,GL.ACCOUNT_NO % 900000000 AS CA_GLACCOUNT
              ,GL.LI_LNTM AS CA_LNTM
              ,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
       FROM   JOINGLAC AS GL
       CROSS JOIN COSTCENT AS CC

Depending on how this table is subsequently used, it should be able to be eliminated from the process, by simply joining to both the original tables used to build it. However, this is an extremely large porting effort, and I might not find the usage of the table for some time, so I was wondering if there were any tricks to CROSS JOINing big tables like that in a timely fashion (especially given that the existing process in FOCUS is able to do it more speedily). That way I could validate the correctness of my building of the replacement query and then later factor it out with views or whatever.

I am also considering factoring out the UDFs and string manipulation and performing the CROSS JOIN first to break the process up a bit.
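A rough sketch of what that split could look like. The staging table name #CROSSJOIN_STAGE is hypothetical, and the dbo schema on udf_BUPDEF is an assumption (the query above calls it without a schema, but scalar UDFs normally need a two-part name).

```sql
-- Step 1: materialize the bare cross join with no function calls.
SELECT CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM
INTO   #CROSSJOIN_STAGE              -- hypothetical staging table
FROM   JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC;

-- Step 2: apply the padding and the UDF in a separate pass, so each
-- step can be timed on its own.
INSERT INTO JOINCCAC (CA_JCCAC, CA_COSTCENT, CA_GLACCOUNT, CA_LNTM, CA_UNIT)
SELECT Util.PADLEFT(CONVERT(varchar, S.COST_CTR_NUM), '0', 7)
       + Util.PADLEFT(CONVERT(varchar, S.ACCOUNT_NO), '0', 9),
       S.COST_CTR_NUM,
       S.ACCOUNT_NO % 900000000,
       S.LI_LNTM,
       dbo.udf_BUPDEF(S.ACCOUNT_NO, S.COST_CTR_NUM, S.LI_LNTM, 'N')  -- dbo is assumed
FROM   #CROSSJOIN_STAGE AS S;
```

If step 1 finishes quickly and step 2 does not, the cross join itself is not the bottleneck.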

RESULTS SO FAR:

It turns out that the UDFs do contribute a lot (negatively) to the performance. But there also appears to be a big difference between a 15m row cross join and a 30m row cross join. I do not have SHOWPLAN rights (boo hoo), so I can't tell whether the plan it is using is better or worse after changing indexes. I have not refactored it yet, but am expecting the entire table to go away shortly.

3 Solutions

#1


2  

Examining that query shows only one column used from one table, and only two columns used from the other table. Due to the very low numbers of columns used, this query can be easily enhanced with covering indexes:

CREATE INDEX COSTCENTCoverCross ON COSTCENT(COST_CTR_NUM)
CREATE INDEX JOINGLACCoverCross ON JOINGLAC(ACCOUNT_NO, LI_LNTM)

Here are my questions for further optimization:

When you put the query in Query Analyzer (SQL Server Management Studio, in 2005) and whack the "Display Estimated Execution Plan" button, it will show a graphical representation of what it's going to do.

Join Type: There should be a nested loop join in there (the other options are merge join and hash join). If you see a nested loop, then OK. If you see a merge join or hash join, let us know.

Order of table access: Go all the way to the top and scroll all the way to the right. The first step should be accessing a table. Which table is that, and what method is used (index scan, clustered index scan)? What method is used to access the other table?

Parallelism: You should see the little jaggedy arrows on almost all icons in the plan indicating that parallelism is being used. If you don't see this, there is a major problem!
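If the plan can't be viewed, one indirect check is to look at the server-wide settings that decide whether a parallel plan is even considered. This is a minimal sketch; `sys.configurations` is a standard catalog view in SQL Server 2005, but whether your login can read it on this box is an assumption.

```sql
-- 'max degree of parallelism' = 1 forces serial plans server-wide;
-- 0 means "use all available schedulers".
SELECT name, value_in_use
FROM   sys.configurations
WHERE  name IN ('max degree of parallelism',
                'cost threshold for parallelism');
```

As far as I know, a T-SQL scalar UDF in the select list also keeps the query on a serial plan in this version, so udf_BUPDEF may rule out parallelism regardless of these settings.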

That udf_BUPDEF concerns me. Does it read from additional tables? Util.PADLEFT concerns me less, but still: what is it? If it isn't a database object, then consider using this instead:

RIGHT('0000000' + CONVERT(varchar(11), columnName), 7)
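As a worked example of that built-in padding, assuming Util.PADLEFT(value, pad, width) simply left-pads to the given width (its definition isn't shown in the question), the whole 16-character key can be built without any UDF call. The variable names and sample values here are made up for illustration.

```sql
DECLARE @cost_ctr int, @account int;   -- sample values, not real data
SET @cost_ctr = 1234;
SET @account  = 56789;

-- Convert the ints to varchar first; otherwise '+' attempts integer
-- addition instead of string concatenation.
SELECT RIGHT(REPLICATE('0', 7) + CONVERT(varchar(11), @cost_ctr), 7)
     + RIGHT(REPLICATE('0', 9) + CONVERT(varchar(11), @account), 9) AS CA_JCCAC;
-- returns '0001234000056789'
```

Note that RIGHT silently truncates values wider than the target width, so this only matches PADLEFT if the numbers always fit in 7 and 9 digits.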

Are there any triggers on JOINCCAC? How about indexes? With an insert this large, you'll want to drop all triggers and indexes on that table.
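A minimal sketch of that, using the constraint name from the JOINCCAC DDL in the question. Whether the table has any triggers at all is not stated, so the trigger statements may be no-ops.

```sql
-- Before the load: switch off triggers and drop the clustered primary key
-- so rows are written into a plain heap.
ALTER TABLE JOINCCAC DISABLE TRIGGER ALL;
ALTER TABLE JOINCCAC DROP CONSTRAINT PK_JOINCCAC_KNOWN_GOOD;

-- ... run the large INSERT ... SELECT here ...

-- After the load: rebuild the key in one pass and re-enable triggers.
-- This step fails if the loaded rows contain duplicate CA_JCCAC values.
ALTER TABLE JOINCCAC ADD CONSTRAINT PK_JOINCCAC_KNOWN_GOOD
    PRIMARY KEY CLUSTERED (CA_JCCAC);
ALTER TABLE JOINCCAC ENABLE TRIGGER ALL;
```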

#2


2  

Continuing on what others are saying, DB functions that contain queries and are used in a SELECT have always made my queries extremely slow. Off the top of my head, I believe I had a query that ran in 45 seconds; then I removed the function, and the result was 0 seconds :)

So check that udf_BUPDEF is not running any queries.
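One way to check, sketched under two assumptions: that the function lives in the dbo schema, and that your login has permission to view its definition (otherwise the first query returns NULL).

```sql
-- Return the source text of the function; look for SELECT statements
-- against tables inside the body.
SELECT OBJECT_DEFINITION(OBJECT_ID('dbo.udf_BUPDEF')) AS udf_source;

-- Alternative that prints the definition line by line.
EXEC sp_helptext 'dbo.udf_BUPDEF';
```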

#3


1  

Break down the query to make it a plain simple cross join.


   SELECT  CC.COST_CTR_NUM, GL.ACCOUNT_NO
              ,CC.COST_CTR_NUM AS CA_COSTCENT
              ,GL.ACCOUNT_NO AS CA_GLACCOUNT
              ,GL.LI_LNTM AS CA_LNTM
-- I don't know what BUPDEF is doing, but remove it from the query for the time being
--              ,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
       FROM   JOINGLAC AS GL
       CROSS JOIN COSTCENT AS CC

See how well the simple cross join performs (without any functions applied to it).
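To put numbers on it, here is a rough timing sketch on a capped sample, so the test finishes quickly. The temp table names and the 100,000-row cap are arbitrary choices, and the dbo schema on udf_BUPDEF is an assumption.

```sql
DECLARE @start datetime;

-- Plain cross join, limited to a sample.
SET @start = GETDATE();
SELECT TOP (100000) CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM
INTO   #PLAIN_SAMPLE                  -- hypothetical temp table
FROM   JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC;
SELECT DATEDIFF(millisecond, @start, GETDATE()) AS plain_ms;

-- Same sample size with the UDF added back in.
SET @start = GETDATE();
SELECT TOP (100000) CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM,
       dbo.udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
INTO   #UDF_SAMPLE                    -- hypothetical temp table
FROM   JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC;
SELECT DATEDIFF(millisecond, @start, GETDATE()) AS with_udf_ms;
```

Scaling the difference between the two timings up to the full row count gives a rough estimate of how much of the 5 hours the function accounts for.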
