Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)

Time: 2022-09-16 11:59:50

Edit 3:

I created a much shorter example of the memory leak. I hope it makes it much easier to reason about what's going on. As the iterations proceed, you see steadily increasing gc() VCell memory use, while memory use reported by tables() stays the same. Somehow, the unlist(.SD) call seems to be responsible. Here it is:

library(data.table)
DT = data.table(k = 1:100, g = 1:20, val = rnorm(2e6))
for (i in 1:100){
  tmp = DT[ , unlist(.SD), by = 'k']
  print(gc())
  tables()
}
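
To make the trend easier to read than a stream of print(gc()) tables, the same loop can record just the Vcells "used" count on each iteration. This is a minimal sketch and not part of the original post: the table is much smaller, the name vcells_used is made up for illustration, and whether the count keeps growing depends on the data.table version installed.

```r
library(data.table)

set.seed(1)
n  <- 2e4                                   # smaller than the 2e6 rows in the post
DT <- data.table(k   = rep(1:100, length.out = n),
                 g   = rep(1:20,  length.out = n),
                 val = rnorm(n))

n_iter      <- 10
vcells_used <- numeric(n_iter)              # hypothetical name, for illustration
for (i in seq_len(n_iter)) {
  tmp <- DT[, unlist(.SD), by = "k"]
  # gc() returns a matrix; row "Vcells", column "used" is the cell count
  vcells_used[i] <- gc()["Vcells", "used"]
}

# change in Vcells between consecutive iterations
print(diff(vcells_used))
```

A run of consistently positive differences would correspond to the steady growth described above.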

Original post:

I am seeing some memory behavior I don't understand when using the data.table package. I am using R-2.13.0 with data.table 1.8.8. I run on 64-bit suse linux.

My ultimate objective is to reshape a data.table from "long" to "wide" format using as little memory as possible. I followed a suggestion in another SO post (Nested if else statements over a number of columns). Basically, I try to reshape a data.table by returning a named list in the j expression.

I see a steadily increasing memory use, which seems like a memory leak. The total memory used by data.tables or other objects does not account for what is shown in gc(). In particular, Vcells starts at around 17 MB and ends at nearly 30 MB, while the total memory use reported by tables() is 19 MB (at the end). There are no other objects (that I can see) using any meaningful amount of memory. Running the code below repeatedly shows the increasing memory use in the print(gc()) output.

Am I missing something or is there an issue with some of the memory allocation in dogroups.c ?

Here is code to reproduce the problem I see. Any ideas? I really would like to be able to reshape a data.table relatively efficiently, with memory use being a bigger consideration than speed.

library(data.table)

if(!exists('DT')){
  cat('creating DT\n')
  # make a "long" matrix with 300 columns and keys v,d
  v = 1:250
  d = 1:50
  grid = expand.grid(v,d)
  DT = data.table(v = grid[,1], d = grid[,2])    
  # now add many columns
  DT[,sprintf('col%s',1:100) := 1:nrow(DT)]; 
  # set d as key, we don't care much about v for this example
  setkey(DT,'d')
}

# The following code attempts to cast a "long" data.table to "wide" format
# it is the equivalent of the reshape2 call:
#
#   dcast(melt(DT, c('d','v')), d ~ v + variable, value_var='value')
#
# When I run the code I see ever-increasing memory use.  sourcing the file
# repeatedly shows that as well. The total memory used by the input
# and result data.table or any other objects do not account for the total use.


# casting patterned after
# https://*.com/questions/15510566/nested-if-else-statements-over-a-number-of-columns/15511689?noredirect=1#comment21968080_15511689

paste.dash <- function(...){ paste(..., sep='-')}    

# assumes keys is  a vector of characters
dt.melt <- function(dt, keys) {
  dt[, list(variable = names(.SD), value = unlist(.SD)), by = keys]
}

# assumes keys is  a vector of characters.
# all.names is all the column names we expect in the wide data.table
# we accommodate for the possibility of missing wide table values 
# for some groups by appending NAs for any column names not present.
# in the particular example above there are no missing values,
# but the data I intend to run this on does.
dt.recast<- function(dat, keys, all.names,verbose=FALSE){

  if (verbose){
    cat(sprintf('dt.recast(): keys = %s\n', paste(keys, collapse=',')))
    print(gc())
  }
  # id, variable, value
  m = dt.melt(dat, keys)

  # m.names will be the wide table column names.
  m.names = do.call(paste.dash, m[, c(keys,'variable'),  with=FALSE])

  #append anything that's missing in this group to end of list with NA values
  missing.names = setdiff(all.names, m.names)
  missing.vals = rep(NA_real_, length(missing.names))
  ret.val = c(m$value, missing.vals)
  # set names and make a list as required by data.table to generate a wide row
  ret.val = as.list(setattr(ret.val,'names', c(m.names,missing.names)))

  if (verbose){
    print(gc())
  }

  return(ret.val)
}

# turn to wide format row key 'd': columns are cartesian product of v and
# current non-key columns

all.wide.names = do.call(paste.dash, expand.grid(unique(DT$v), tail(names(DT),-2)))

print (gc())

DT.wide = DT[ , dt.recast(.SD, 'v', all.wide.names, verbose = TRUE),
  by = 'd',
  verbose=TRUE ]

print (gc())
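
The comment in the script describes the cast as equivalent to a reshape2 dcast(melt(...)) call. Later data.table releases ship their own melt() and dcast() methods for data.tables, so the same reshape can be written without the per-group named list. The following is a sketch under that assumption (it will not work on the 1.8.8 release used here), run on a much smaller stand-in table; memory behaviour on the real data is not measured.

```r
library(data.table)

# small stand-in for the DT built above: 5 values of v, 3 of d, 4 value columns
grid <- expand.grid(v = 1:5, d = 1:3)
DT   <- data.table(v = grid$v, d = grid$d)
for (j in 1:4) set(DT, j = sprintf("col%s", j), value = seq_len(nrow(DT)))
setkey(DT, d)

# long format: columns d, v, variable, value
long <- melt(DT, id.vars = c("d", "v"))

# wide format: one row per d, one column per (v, original column) pair
wide <- dcast(long, d ~ v + variable, value.var = "value")

dim(wide)
```

Here wide has one row per value of d and one column for d plus one for every combination of v and original column.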

Edit:

#Here is the output of sessionInfo
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C   \
               LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.8
>

Edit2: Here is some output from two consecutive runs.

> source('memory-leak.R')
data.table 1.8.8  For help type: help("data.table")
creating DT
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 231906 12.4     407500 21.8   350000 18.7
Vcells 272022  2.1     786432  6.0   773683  6.0
Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'dt.recast(.SD, "v", all.wide.names, verbose = TRUE)'
Starting dogroups ... dt.recast(): keys = v
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 233168 12.5     467875   25   350000 18.7
Vcells 292303  2.3     786432    6   773683  6.0
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 258224 13.8     531268 28.4   350000 18.7
Vcells 474776  3.7     905753  7.0   773683  6.0
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283206 15.2     531268 28.4   350000 18.7
Vcells 1699595 13.0    2029708 15.5  1699607 13.0
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308232 16.5     597831   32   350000 18.7
Vcells 1882303 14.4    2221551   17  2029708 15.5
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1732347 13.3    2412628 18.5  2029708 15.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831   32   350000 18.7
Vcells 1915666 14.7    2613259   20  2284358 17.5
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1764847 13.5    2823921 21.6  2284358 17.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 1948166 14.9    3045117 23.3  2316858 17.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1797347 13.8    3045117 23.3  2316858 17.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 1980666 15.2    3277372 25.1  2349358 18.0
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1829847 14.0    3277372 25.1  2349358 18.0
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2013166 15.4    3277372 25.1  2381858 18.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1862347 14.3    3277372 25.1  2381858 18.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2045666 15.7    3277372 25.1  2414358 18.5
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1894847 14.5    3277372 25.1  2414358 18.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2078166 15.9    3277372 25.1  2446858 18.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1927347 14.8    3277372 25.1  2446858 18.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2110666 16.2    3277372 25.1  2479358 19.0
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1959847 15.0    3277372 25.1  2479358 19.0
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2143166 16.4    3521240 26.9  2511858 19.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 1992347 15.3    3521240 26.9  2511858 19.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2175666 16.6    3521240 26.9  2544358 19.5
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2024847 15.5    3521240 26.9  2544358 19.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2208166 16.9    3521240 26.9  2576858 19.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2057347 15.7    3521240 26.9  2576858 19.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2240666 17.1    3521240 26.9  2609358 20.0
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2089847 16.0    3521240 26.9  2609358 20.0
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2273166 17.4    3521240 26.9  2641858 20.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2122347 16.2    3521240 26.9  2641858 20.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2305666 17.6    3521240 26.9  2674358 20.5
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2154847 16.5    3521240 26.9  2674358 20.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2338166 17.9    3777302 28.9  2706858 20.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2187347 16.7    3777302 28.9  2706858 20.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2370666 18.1    3777302 28.9  2739358 20.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2219847 17.0    3777302 28.9  2739358 20.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2403166 18.4    3777302 28.9  2771858 21.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2252347 17.2    3777302 28.9  2771858 21.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2435666 18.6    3777302 28.9  2804358 21.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2284847 17.5    3777302 28.9  2804358 21.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2468166 18.9    3777302 28.9  2836858 21.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2317347 17.7    3777302 28.9  2836858 21.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2500666 19.1    4046167 30.9  2869358 21.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2349847 18.0    4046167 30.9  2869358 21.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2533166 19.4    4046167 30.9  2901858 22.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2382347 18.2    4046167 30.9  2901858 22.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2565666 19.6    4046167 30.9  2934358 22.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2414847 18.5    4046167 30.9  2934358 22.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2598166 19.9    4046167 30.9  2966858 22.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2447347 18.7    4046167 30.9  2966858 22.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2630666 20.1    4046167 30.9  2999358 22.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2479847 19.0    4046167 30.9  2999358 22.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2663166 20.4    4046167 30.9  3031858 23.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2512347 19.2    4046167 30.9  3031858 23.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2695666 20.6    4328475 33.1  3064358 23.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2544847 19.5    4328475 33.1  3064358 23.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2728166 20.9    4328475 33.1  3096858 23.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2577347 19.7    4328475 33.1  3096858 23.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2760666 21.1    4328475 33.1  3129358 23.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2609847 20.0    4328475 33.1  3129358 23.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2793166 21.4    4328475 33.1  3161858 24.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2642347 20.2    4328475 33.1  3161858 24.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2825666 21.6    4328475 33.1  3194358 24.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2674847 20.5    4328475 33.1  3194358 24.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2858166 21.9    4328475 33.1  3226858 24.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2707347 20.7    4328475 33.1  3226858 24.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2890666 22.1    4624898 35.3  3259358 24.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2739847 21.0    4624898 35.3  3259358 24.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2923166 22.4    4624898 35.3  3291858 25.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2772347 21.2    4624898 35.3  3291858 25.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2955666 22.6    4624898 35.3  3324358 25.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2804847 21.4    4624898 35.3  3324358 25.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 2988166 22.8    4624898 35.3  3356858 25.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2837347 21.7    4624898 35.3  3356858 25.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 3020666 23.1    4624898 35.3  3389358 25.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 2869847 21.9    4624898 35.3  3389358 25.9

... <snip> ...

dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 3162347 24.2    5262949 40.2  3681858 28.1
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 3345666 25.6    5262949 40.2  3714358 28.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 3194847 24.4    5262949 40.2  3714358 28.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 3378166 25.8    5262949 40.2  3746858 28.6
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 3227347 24.7    5262949 40.2  3746858 28.6
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 3410666 26.1    5262949 40.2  3779358 28.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  283211 15.2     597831 32.0   350000 18.7
Vcells 3259847 24.9    5262949 40.2  3779358 28.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  308247 16.5     597831 32.0   350000 18.7
Vcells 3443166 26.3    5262949 40.2  3811858 29.1
done dogroups in 10.972 secs
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  258292 13.8     597831 32.0   350000 18.7
Vcells 3247919 24.8    5262949 40.2  3811858 29.1
> tables()
     NAME      NROW MB COLS                                                                             KEY
[1,] DT      12,500  5 v,d,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,c d  
[2,] DT.wide     50 14 d,1-col1,1-col2,1-col3,1-col4,1-col5,1-col6,1-col7,1-col8,1-col9,1-col10,1-col11 d  
Total: 19MB
> source('/memory-leak.R')
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  260024 13.9     597831 32.0   350000 18.7
Vcells 3279245 25.1    5262949 40.2  3859228 29.5
Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'dt.recast(.SD, "v", all.wide.names, verbose = TRUE)'
Starting dogroups ... dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  260400 14.0     597831 32.0   350000 18.7
Vcells 3297670 25.2    5262949 40.2  3859228 29.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  285438 15.3     597831 32.0   350000 18.7
Vcells 3480986 26.6    5262949 40.2  3859228 29.5
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   350000 18.7
Vcells 4705194 35.9    5606096 42.8  4781165 36.5
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 4888513 37.3    5966400 45.6  5257204 40.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   374617 20.1
Vcells 4737694 36.2    6344720 48.5  5257204 40.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 4921013 37.6    6741956 51.5  5289704 40.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   374617 20.1
Vcells 4770194 36.4    7159053 54.7  5289704 40.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 4953513 37.8    7597005   58  5322204 40.7
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4802694 36.7    7597005   58  5322204 40.7
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 4986013 38.1    7597005   58  5354704 40.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4835194 36.9    7597005   58  5354704 40.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 5018513 38.3    7597005   58  5387204 41.2
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4867694 37.2    7597005   58  5387204 41.2
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 5051013 38.6    7597005   58  5419704 41.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4900194 37.4    7597005   58  5419704 41.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 5083513 38.8    7597005   58  5452204 41.6
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4932694 37.7    7597005   58  5452204 41.6
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 5116013 39.1    7597005   58  5484704 41.9
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4965194 37.9    7597005   58  5484704 41.9
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831   32   374617 20.1
Vcells 5148513 39.3    7597005   58  5517204 42.1
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831   32   374617 20.1
Vcells 4997694 38.2    7597005   58  5517204 42.1
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 5181013 39.6    8056855 61.5  5549704 42.4
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   374617 20.1
Vcells 5030194 38.4    8056855 61.5  5549704 42.4
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 5213513 39.8    8056855 61.5  5582204 42.6
dt.recast(): keys = v
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   374617 20.1
Vcells 5062694 38.7    8056855 61.5  5582204 42.6
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 5246013 40.1    8056855 61.5  5614704 42.9
dt.recast(): keys = v

 ... <snip> ...

          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  310409 16.6     597831 32.0   374617 20.1
Vcells 6265194 47.8    9579015 73.1  6784704 51.8
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  335445 18.0     597831 32.0   374617 20.1
Vcells 6448513 49.2    9579015 73.1  6817204 52.1
done dogroups in 11.53 secs
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  260003 13.9     597831 32.0   374617 20.1
Vcells 4978149 38.0    9579015 73.1  6817204 52.1
> tables()
     NAME      NROW MB COLS                                                                             KEY
[1,] DT      12,500  5 v,d,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,c d  
[2,] DT.wide     50 14 d,1-col1,1-col2,1-col3,1-col4,1-col5,1-col6,1-col7,1-col8,1-col9,1-col10,1-col11 d  
Total: 19MB
> 
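
For reading the output above: gc() counts Vcells in units of 8 bytes, so the "used" column converts directly to the "(Mb)" column. The sketch below redoes that arithmetic for the two Vcells figures printed after "done dogroups" in each run; the helper name to_mb is made up for illustration.

```r
# Vcells "used" counts copied from the gc() output above
end_run1 <- 3247919    # after "done dogroups" in the first run
end_run2 <- 4978149    # after "done dogroups" in the second run

bytes_per_vcell <- 8
to_mb <- function(cells) cells * bytes_per_vcell / 1024^2   # hypothetical helper

mb <- round(to_mb(c(end_run1, end_run2)), 1)
print(mb)                                      # 24.8 38.0, as in the (Mb) column

retained <- round(to_mb(end_run2 - end_run1), 1)
print(retained)                                # 13.2 MB more after the second run
```

tables() reports the same 19MB total after both runs, so the roughly 13 MB difference is not explained by DT or DT.wide.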

1 answer

UPDATE - Now fixed in v1.8.11. From NEWS :

Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel) may have suffered, #2648. Tests added.

Many thanks to vc273, Y T and others.


The particular (great) example at the top of this question is considered a "non-trivial" aggregation where the result of each group can be a different number of rows, not just a single aggregate in one row. Adding verbose=TRUE reveals :

Wrote less rows (4000000) than allocated (4488000).

and that's where the leak was in this case. Only matters if you need to repeat grouping many times, as is needed sometimes. The result was correct.
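
The 4000000 in that message matches the size of the result in the short example: .SD holds the two non-grouping columns (g and val), so unlist(.SD) returns two values for every input row, and 2e6 input rows give 4e6 output rows. A minimal sketch of that count on a smaller table (the size n is an assumption chosen to keep it quick):

```r
library(data.table)

n  <- 2000                                   # the post uses 2e6
DT <- data.table(k   = rep(1:100, length.out = n),
                 g   = rep(1:20,  length.out = n),
                 val = rnorm(n))

res <- DT[, unlist(.SD), by = "k"]

# each group returns 2 * (rows in the group) values, one result row per value
print(nrow(res) == 2 * n)
```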


Previous answer retained for posterity ...

Consider this part :

#now add many columns
for (i in 1:100){
    DT[[sprintf('col%s',i)]] = 1:nrow(DT);
}

That isn't using := or set() which are the data.table provided ways of adding columns by reference. = is the same as <-; i.e., on each and every iteration of this for loop the entire DT will be copied to make room for the single extra column. The memory leak you describe would be consistent with this for loop.

Some options are :

  • Add the many columns in one go using cbind
  • Add the columns in one go using := e.g. DT[,sprintf('col%s',1:100):=1:nrow(DT)]
  • Keep the for loop but use := or set() on each iteration
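
The by-reference options in the list above can be sketched as follows. This is an illustration on a tiny table, not the poster's code; the column prefixes a, b and c are made up so the three variants can run side by side.

```r
library(data.table)

DT <- data.table(v = 1:10)
n  <- nrow(DT)

# (a) all columns in one := call; the parenthesised LHS is a character
#     vector of new column names, the RHS a list with one item per column
cols <- sprintf("a%s", 1:3)
DT[, (cols) := rep(list(seq_len(n)), length(cols))]

# (b) keep the for loop, but add each column with set()
for (i in 1:3) set(DT, j = sprintf("b%s", i), value = seq_len(n))

# (c) keep the for loop, but add each column with :=
for (i in 1:3) {
  col <- sprintf("c%s", i)
  DT[, (col) := seq_len(n)]
}

print(names(DT))
```

All three variants add the columns to the existing DT rather than assigning a modified copy back to it, which is the difference from DT[[name]] = value that the answer points out.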

I haven't actually run your code to check so there may be other problems later as well.


UPDATE : I have now run your code and I think I might be able to guess what you mean about memory use. But guessing can use up a lot of time, especially in areas like this. Can you please expand significantly upon this :

I see a steadily increasing memory use, which seems like a memory leak.

What precisely do you see; i.e., what are the numbers? What does it start at and what does it end at? How many times did you run it? Please also provide the output of sessionInfo(); although you give the version of R (2.13.0) which is helpful, it helps to know if you are 32bit or 64bit Linux, Mac or Windows as well.
