How to avoid OOM in HSQLDB when doing a merge?

Date: 2022-09-15 20:51:46

I have two tables where the first is very large (>50M rows):

CREATE CACHED TABLE Alldistances (
    word1 VARCHAR(70), 
    word2 VARCHAR(70), 
    distance INTEGER, 
    distcount INTEGER
);

and a second that can also be quite large (>5M rows):

CREATE CACHED TABLE tempcach (
    word1 VARCHAR(70), 
    word2 VARCHAR(70), 
    distance INTEGER, 
    distcount INTEGER
);

Both tables have indexes:

CREATE INDEX mulalldis ON Alldistances (word1, word2, distance);
CREATE INDEX multem ON tempcach (word1, word2, distance);

In my Java program I am using prepared statements to fill/preorganize data in the tempcach table, and then I merge that table into Alldistances with:

MERGE INTO Alldistances alld USING ( 

    SELECT word1, 
           word2, 
           distance, 
           distcount FROM tempcach 

    ) AS src (

        newword1, 
        newword2, 
        newdistance, 
        newcount

    ) ON (

            alld.word1 = src.newword1 
        AND alld.word2 = src.newword2 
        AND alld.distance = src.newdistance 

    ) WHEN MATCHED THEN 

        UPDATE SET alld.distcount = alld.distcount+src.newcount 

    WHEN NOT MATCHED THEN 

        INSERT (

            word1, 
            word2, 
            distance, 
            distcount

        ) VALUES (

            newword1, 
            newword2, 
            newdistance, 
            newcount
        );

The tempcach table is then dropped or truncated and filled with new data. During the merge I get an OOM, which I guess is because the whole table is loaded into memory during the merge. So I will have to merge in batches, but can I do that in SQL, or do I have to do it in my Java program? Or is there a smarter way to avoid the OOM while merging?

2 Answers

#1

It is possible to merge in chunks (batches) in SQL. You need to:

  • limit the number of rows taken from the temp table in each chunk
  • delete those same rows from the temp table after the merge
  • repeat until the temp table is empty

The SELECT statement inside the MERGE should use ORDER BY and LIMIT:

    SELECT word1, 
           word2, 
           distance, 
           distcount FROM tempcach
           ORDER BY <primary key or unique columns>
           LIMIT 1000

    ) AS src (

After the merge, a DELETE statement using the same ORDER BY and LIMIT selects the same rows for deletion:

DELETE FROM tempcach WHERE <primary key or unique columns> IN
      (SELECT <primary key or unique columns> FROM tempcach 
       ORDER BY <primary key or unique columns> LIMIT 1000)
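
Since the question asks whether the batching can be driven from the Java program, the merge/delete/repeat steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes tempcach has a unique `id` column (the schema in the question has none, so one would have to be added), and `BatchMerge`, `SqlRunner`, `mergeSql`, `deleteSql` and `mergeInBatches` are hypothetical names, not part of HSQLDB or JDBC.

```java
import java.sql.SQLException;

/** Sketch of a chunked merge: MERGE a limited slice, DELETE the same slice, repeat. */
class BatchMerge {

    /** Runs one SQL statement and returns its update count (hypothetical helper type). */
    interface SqlRunner {
        int run(String sql) throws SQLException;
    }

    // ASSUMPTION: tempcach has a unique "id" column; it is not in the original schema.
    static String mergeSql(int batchSize) {
        return "MERGE INTO Alldistances alld USING ("
             + " SELECT word1, word2, distance, distcount FROM tempcach"
             + " ORDER BY id LIMIT " + batchSize
             + ") AS src (newword1, newword2, newdistance, newcount)"
             + " ON (alld.word1 = src.newword1"
             + " AND alld.word2 = src.newword2"
             + " AND alld.distance = src.newdistance)"
             + " WHEN MATCHED THEN UPDATE SET alld.distcount = alld.distcount + src.newcount"
             + " WHEN NOT MATCHED THEN INSERT (word1, word2, distance, distcount)"
             + " VALUES (newword1, newword2, newdistance, newcount)";
    }

    // Same ORDER BY and LIMIT as the merge, so the same rows are selected for deletion.
    static String deleteSql(int batchSize) {
        return "DELETE FROM tempcach WHERE id IN"
             + " (SELECT id FROM tempcach ORDER BY id LIMIT " + batchSize + ")";
    }

    /** Merges and deletes chunk by chunk; returns the number of non-empty chunks processed. */
    static int mergeInBatches(SqlRunner runner, int batchSize) throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        while (true) {
            runner.run(mergeSql(batchSize));
            int deleted = runner.run(deleteSql(batchSize));
            if (deleted > 0) {
                batches++;
            }
            // A short (or empty) chunk means the temp table is now empty.
            if (deleted < batchSize) {
                return batches;
            }
        }
    }
}
```

With a JDBC `Statement st`, the call could look like `BatchMerge.mergeInBatches(st::executeUpdate, 1000)`. Running each MERGE/DELETE pair in a single transaction (autocommit off, commit after each pair) would probably be wise, so that a failure between the two statements does not cause a chunk to be counted twice on retry.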

#2

First, just because this kind of thing annoys me, why are you selecting all the fields of the temporary table in a subselect? Why not the simpler SQL:

MERGE INTO Alldistances alld USING tempcach AS src (
    newword1, 
    newword2, 
    newdistance, 
    newcount
) ON (
        alld.word1 = src.newword1 
    AND alld.word2 = src.newword2 
    AND alld.distance = src.newdistance 
) WHEN MATCHED THEN 
    UPDATE SET alld.distcount = alld.distcount+src.newcount 
WHEN NOT MATCHED THEN 
    INSERT (
        word1, 
        word2, 
        distance, 
        distcount
    ) VALUES (
        newword1, 
        newword2, 
        newdistance, 
        newcount
    );

To keep the database from loading the whole table into memory, both tables need an index on the join columns:

CREATE INDEX all_data ON Alldistances (word1, word2, distance);
CREATE INDEX tempcach_data ON tempcach (word1, word2, distance);
