I have a requirement to calculate the installed base for units with different placements/shipments in different countries with different "environments" over many years given a set of certain "retirement rates" assigned to each unit. The placements, curve definitions, and curve assignments are stored in different database tables (with DDL and sample data below, also on SQLFiddle.com). The formula for calculating installed base is as follows:
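
One reading of that formula, written out from the join-based query in answer #2. The symbols are notation assumed here, not taken from the original: $IB_y$ is the installed base in year $y$, $P_k$ is the placement in year $k$, $c(k)$ is the curve (RateId) assigned to year $k$, and $r_c(j)$ is that curve's rate at year offset $j$, all for one fixed unit, country, and environment:

\[
IB_{y} = \sum_{k=1990}^{y} P_{k} \cdot r_{c(k)}(y - k)
\]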

where 1990 is the first year for which we have placement data.

The problem:

Doing these calculations with datasets of 3 to 16 million rows of unit/country/environment/year placement combinations takes much more time than the target load/calculation time of 30 seconds to 1 minute.

SQL Server approach:

When PIVOTed so that each year becomes its own column, I get anywhere from 100,000 to 400,000 returned rows of raw data (placements + rates), which takes about 8-15 seconds. However, if I calculate the installed base directly in SQL with a statement like the one included below, it takes at least 10 minutes.
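
For reference, a minimal sketch of what "each year becomes its own column" could look like for the placements alone. The two-row sample and the hard-coded year list are assumptions for illustration; a real query would need one bracketed column per year in the data.

```sql
-- Self-contained sample: same shape as the @Placements table in the DDL.
DECLARE @Placements TABLE
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    Placement decimal(18,2) not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
);

INSERT INTO @Placements (UnitId, Country, YearColumn, Environment, Placement)
VALUES
    (1, 'United States', 1990, 'Windows', 100),
    (1, 'United States', 1991, 'Windows', 100);

-- One output row per unit/environment/country, one column per year.
SELECT
    UnitId,
    Environment,
    Country,
    [1990],
    [1991]
FROM
    (
        SELECT UnitId, Environment, Country, YearColumn, Placement
        FROM @Placements
    ) AS src
    PIVOT
    (
        SUM(Placement) FOR YearColumn IN ([1990], [1991])
    ) AS pvt;
```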

We've also tried an SQL trigger solution that updated the installed base each time a placement or rate was modified, but that made database updates unreasonably slow on batch updates, and was also unreliable. I suppose this could merit more investigation if this were really the best option.

Excel-VSTO approach (so far, the fastest approach):

This data ultimately ends up in a C# VSTO-powered Excel workbook, where it was calculated via a series of VLOOKUPs. But when loading 150,000 placements across 6 years with about 20 VLOOKUPs per cell (about 20 million VLOOKUPs), Excel crashes. When the VLOOKUPs are done in smaller batches and the formulas are converted into values, it doesn't crash, but it still takes much longer than one minute to calculate.

The question:

Is there some mathematical or programmatic construct that would help me to calculate this data via C# or SQL more efficiently than I've been doing? Brute force iteration is also too slow, so that's not an option either.

DECLARE @Placements TABLE 
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    Placement decimal(18,2) not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
)


DECLARE @CurveAssignments TABLE 
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    RateId int not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
)

DECLARE @CurveDefinitions TABLE
(
    RateId int not null,
    YearOffset int not null,
    Rate decimal(18,2) not null,
    PRIMARY KEY (RateId, YearOffset)
)

INSERT INTO
    @Placements
    (
        UnitId,
        Country,
        YearColumn,
        Environment,
        Placement
    )
VALUES
    (
        1,
        'United States',
        1991,
        'Windows',
        100
    ),
    (
        1,
        'United States',
        1990,
        'Windows',
        100
    )

INSERT INTO
    @CurveAssignments
    (
        UnitId,
        Country,
        YearColumn,
        Environment,
        RateId
    )
VALUES
    (
        1,
        'United States',
        1991,
        'Windows',
        1
    )

INSERT INTO
    @CurveDefinitions
    (
        RateId,
        YearOffset,
        Rate
    )
VALUES
    (
        1,
        0,
        1
    ),
    (
        1,
        1,
        0.5
    )

SELECT
    P.UnitId,
    P.Country,
    P.YearColumn,
    P.Placement *
    (
        SELECT
            Rate
        FROM
            @CurveDefinitions CD
            INNER JOIN @CurveAssignments CA ON
                CD.RateId = CA.RateId
        WHERE
            CA.UnitId = P.UnitId
            AND CA.Environment = P.Environment
            AND CA.Country = P.Country
            AND CA.YearColumn = P.YearColumn - 0
            AND CD.YearOffset = 0
    )
    +
    (
        SELECT
            Placement
        FROM
            @Placements PP
        WHERE
            PP.UnitId = P.UnitId
            AND PP.Environment = P.Environment
            AND PP.Country = P.Country
            AND PP.YearColumn = P.YearColumn - 1
    )
    *
    (
        SELECT
            Rate
        FROM
            @CurveDefinitions CD
            INNER JOIN @CurveAssignments CA ON
                CD.RateId = CA.RateId
        WHERE
            CA.UnitId = P.UnitId
            AND CA.Environment = P.Environment
            AND CA.Country = P.Country
            AND CA.YearColumn = P.YearColumn
            AND CD.YearOffset = 1
    ) [Installed Base - 1991]
FROM
    @Placements P
WHERE
    P.UnitId = 1
    AND P.Country = 'United States'
    AND P.YearColumn = 1991
    AND P.Environment = 'Windows'

2 solutions

#1

In response to the following statement:

We've also tried an SQL trigger solution that updated the installed base each time a placement or rate was modified, but that made database updates unreasonably slow on batch updates, and was also unreliable. I suppose this could merit more investigation if this were really the best option.

Have you heard of SQL Service Broker? One of the things it does really well is let you queue data for asynchronous processing. If doing the work inside the trigger is too slow, you could have the trigger only queue the changed records and do the recalculation asynchronously.
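
A rough sketch of that idea, under several assumptions: Service Broker is enabled on the database, placements live in a permanent table (called dbo.Placements here), and every object name below is hypothetical. The trigger only sends the changed keys as an XML message; the reader that receives from the target queue and recalculates the installed base is not shown.

```sql
-- Hypothetical permanent table with the same shape as @Placements.
CREATE TABLE dbo.Placements
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    Placement decimal(18,2) not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
);
GO

-- Hypothetical Service Broker objects: one message type, one contract,
-- and an initiator/target pair of queues and services.
CREATE MESSAGE TYPE [//InstalledBase/PlacementChanged]
    VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//InstalledBase/RecalcContract]
    ([//InstalledBase/PlacementChanged] SENT BY INITIATOR);
CREATE QUEUE dbo.RecalcInitiatorQueue;
CREATE QUEUE dbo.RecalcTargetQueue;
CREATE SERVICE [//InstalledBase/RecalcInitiator]
    ON QUEUE dbo.RecalcInitiatorQueue;
CREATE SERVICE [//InstalledBase/RecalcTarget]
    ON QUEUE dbo.RecalcTargetQueue ([//InstalledBase/RecalcContract]);
GO

CREATE TRIGGER dbo.trg_Placements_QueueRecalc
ON dbo.Placements
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Collect the distinct unit/environment/country keys touched by this statement.
    DECLARE @body xml =
    (
        SELECT UnitId, Environment, Country
        FROM
        (
            SELECT UnitId, Environment, Country FROM inserted
            UNION
            SELECT UnitId, Environment, Country FROM deleted
        ) AS changed
        FOR XML PATH('Key'), ROOT('Changed'), TYPE
    );

    -- Nothing changed (for example an UPDATE that matched no rows).
    IF @body IS NULL RETURN;

    -- Queue one message per statement instead of recalculating inline.
    DECLARE @handle uniqueidentifier;
    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [//InstalledBase/RecalcInitiator]
        TO SERVICE '//InstalledBase/RecalcTarget'
        ON CONTRACT [//InstalledBase/RecalcContract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle
        MESSAGE TYPE [//InstalledBase/PlacementChanged] (@body);
END;
GO
```

Because the trigger sends one message per statement rather than per row, a batch update adds a single message to the queue. Whatever reads dbo.RecalcTargetQueue would also need to end each conversation once it has processed the message, otherwise conversation endpoints accumulate.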

#2

Looks like this might turn out to be a case where asking the question leads to the right answer. It turns out the answer mostly lies in the query I'd given above, which was thoroughly inefficient. I've been able to get load times in the range I'm looking for just by optimizing the query as below.

SELECT
    P.UnitId,
    P.Country,
    P.YearColumn,
    P.Environment,
    P.Placement,
    sum(IBP.Placement * FRR.Rate) InstalledBase
FROM
    @Placements P
    INNER JOIN @Placements IBP ON
        P.UnitId = IBP.UnitId
        AND P.Country = IBP.Country
        AND P.Environment = IBP.Environment
        AND P.YearColumn >= IBP.YearColumn
    INNER JOIN @CurveAssignments RR ON
        IBP.UnitId = RR.UnitId
        AND IBP.Country = RR.Country
        AND IBP.Environment = RR.Environment
        AND IBP.YearColumn = RR.YearColumn
    INNER JOIN @CurveDefinitions FRR ON
        RR.RateId = FRR.RateId
        AND P.YearColumn - IBP.YearColumn = FRR.YearOffset
GROUP BY
    P.UnitId,
    P.YearColumn,
    P.Country,
    P.Environment,
    P.Placement
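
One thing to watch when trying this against the sample data from the question: the join to @CurveAssignments matches on the placement year (IBP.YearColumn), and the sample data only assigns a curve for 1991, so the 1990 placement drops out of the sum. The batch below is a self-contained sketch that repeats the DDL and sample rows and adds one assumed row, a 1990 assignment to the same curve (RateId 1), so that both years contribute.

```sql
DECLARE @Placements TABLE
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    Placement decimal(18,2) not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
);

DECLARE @CurveAssignments TABLE
(
    UnitId int not null,
    Environment varchar(50) not null,
    Country varchar(100) not null,
    YearColumn smallint not null,
    RateId int not null,
    PRIMARY KEY (UnitId, Environment, Country, YearColumn)
);

DECLARE @CurveDefinitions TABLE
(
    RateId int not null,
    YearOffset int not null,
    Rate decimal(18,2) not null,
    PRIMARY KEY (RateId, YearOffset)
);

INSERT INTO @Placements (UnitId, Country, YearColumn, Environment, Placement)
VALUES
    (1, 'United States', 1990, 'Windows', 100),
    (1, 'United States', 1991, 'Windows', 100);

INSERT INTO @CurveAssignments (UnitId, Country, YearColumn, Environment, RateId)
VALUES
    (1, 'United States', 1990, 'Windows', 1),  -- assumed row, not in the original sample
    (1, 'United States', 1991, 'Windows', 1);

INSERT INTO @CurveDefinitions (RateId, YearOffset, Rate)
VALUES
    (1, 0, 1),
    (1, 1, 0.5);

-- Same join-based query as above.
-- InstalledBase: 1990 -> 100 * 1 = 100; 1991 -> 100 * 0.5 + 100 * 1 = 150.
SELECT
    P.UnitId,
    P.Country,
    P.YearColumn,
    P.Environment,
    P.Placement,
    SUM(IBP.Placement * FRR.Rate) InstalledBase
FROM
    @Placements P
    INNER JOIN @Placements IBP ON
        P.UnitId = IBP.UnitId
        AND P.Country = IBP.Country
        AND P.Environment = IBP.Environment
        AND P.YearColumn >= IBP.YearColumn
    INNER JOIN @CurveAssignments RR ON
        IBP.UnitId = RR.UnitId
        AND IBP.Country = RR.Country
        AND IBP.Environment = RR.Environment
        AND IBP.YearColumn = RR.YearColumn
    INNER JOIN @CurveDefinitions FRR ON
        RR.RateId = FRR.RateId
        AND P.YearColumn - IBP.YearColumn = FRR.YearOffset
GROUP BY
    P.UnitId,
    P.YearColumn,
    P.Country,
    P.Environment,
    P.Placement
ORDER BY
    P.YearColumn;
```

If a placement year can legitimately have no curve assigned, the inner joins would need to become outer joins with a default rate, which is a design decision the original does not address.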
