Options for caching / memoization / hashing in R

Time: 2022-02-02 03:55:03

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.


Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?



As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).


It looks like the options are:


Hashing

  • digest - provides hashing for arbitrary R objects.
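
For orientation, here is a minimal sketch of what digest provides (assuming the digest package is installed); the hash is a plain character string, so it can serve as a key for any of the structures below.

library(digest)

# digest() returns a hash (MD5 by default) of any R object, so
# structurally identical inputs map to the same string.
digest(list(mean = 2.3, sd = 3.0))
digest(list(mean = 2.3, sd = 3.0))   # same hash as the line above
digest(list(mean = 2.3, sd = 3.5))   # different hash
digest(mtcars, algo = "sha1")        # other algorithms are available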

Memoization

  • memoise - a very simple tool for memoization of functions (see the sketch after this list).
  • R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
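
To give a feel for the interface, here is a minimal memoise sketch (slow_square is a made-up placeholder; R.cache equivalents appear in the answers below).

library(memoise)

slow_square <- function(x) {
    Sys.sleep(1)    # stand-in for a monstrous calculation
    x^2
}

fast_square <- memoise(slow_square)

system.time(fast_square(10))   # ~1 s: computed and cached
system.time(fast_square(10))   # ~0 s: returned from the cache
forget(fast_square)            # discard the cached results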

Caching

  • hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
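
As I read its documentation, the hash package offers roughly the following Perl/Python-style interface (a hedged sketch; the fruit keys are only an example).

library(hash)

h <- hash()                        # empty hash, akin to a Perl hash / Python dict
h[["apple"]] <- 1L
h[["apple"]] <- h[["apple"]] + 1L  # increment a count, Perl-style
h[["banana"]] <- 1L

has.key("apple", h)                # TRUE
keys(h)                            # "apple" "banana"
h[["apple"]]                       # 2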

Key/value storage

These are basic options for external storage of R objects.

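A bare-bones illustration of what I mean, using only base R's saveRDS()/readRDS() with one file per key (kv_put/kv_get are made-up helper names); packages such as filehash, mentioned in Note 2, do this far more robustly.

# One RDS file per key: the crudest possible key/value store.
store_dir <- tempdir()

kv_put <- function(key, value) {
    saveRDS(value, file = file.path(store_dir, paste0(key, ".rds")))
}
kv_get <- function(key) {
    readRDS(file.path(store_dir, paste0(key, ".rds")))
}

kv_put("fit_wt", lm(mpg ~ wt, data = mtcars))
coef(kv_get("fit_wt"))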

Checkpointing

  • cacher - this seems to be more akin to checkpointing.
  • CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
  • DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.

Other

  • Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
  • The data.table package supports rapid lookups of elements in a data table (see the sketch after this list).
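
For comparison, a small sketch of both approaches (the counts/str/n names are made up for illustration).

# Base R: a named vector already behaves like a read-mostly dictionary.
counts <- c("Str 1" = 5L, "Str 2" = 12L)
counts[["Str 1"]]                # 5

# data.table: a keyed table allows fast binary-search lookups.
library(data.table)
dt <- data.table(str = paste("Str", 1:1000), n = 0L)
setkey(dt, str)
dt["Str 3"]                      # lookup by key
dt["Str 3", n := n + 1L]         # update the count in place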

Use case

Although I'm mostly interested in knowing the options, I have two basic use cases that arise:


  1. Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
  2. Memoization of monstrous calculations.

These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.

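As a concrete version of that last point, here is a rough sketch: collect the argument sets seen while profiling (call_args below is made up) and hash them with digest to see how often identical inputs recur, i.e. how many calls memoization could skip.

library(digest)

# Hypothetical record of the argument sets passed to the slow function.
call_args <- list(
    list(mean = 2.3, sd = 3.0),
    list(mean = 2.3, sd = 3.5),
    list(mean = 2.3, sd = 3.0)   # a repeat: memoization would save this call
)

hashes <- vapply(call_args, digest, character(1))
table(hashes)                               # how often each distinct input occurs
sum(duplicated(hashes)) / length(hashes)    # share of calls memoization could avoid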


Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.


Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)


  • Dirk Eddelbuettel: digest - a lot of other packages depend on this.
  • Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
  • Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
  • Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

Note 3: Some people use memoise/memoisation, others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".


3 Answers

#1 (9 votes)

For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.


# Define the insert function for a multiset
msetInsert <- function(mset, s) {
    if (exists(s, mset, inherits=FALSE)) {
        mset[[s]] <- mset[[s]] + 1L
    } else {
        mset[[s]] <- 1L 
    }
}

# First we generate a bunch of strings
n <- 1e5L  # Total number of strings
nus <- 1e3L  # Number of unique strings
ustrs <- paste("Str", seq_len(nus))

set.seed(42)
strs <- sample(ustrs, n, replace=TRUE)


# Now we use an environment as our multiset    
mset <- new.env(TRUE, emptyenv()) # Ensure hashing is enabled

# ...and insert the strings one by one...
for (s in strs) {
    msetInsert(mset, s)
}

# Now we should have nus unique strings in the multiset    
identical(nus, length(mset))

# And the names should be correct
identical(sort(ustrs), sort(names(as.list(mset))))

# ...And an example of getting the count for a specific string
mset[["Str 3"]] # "Str 3" instance count (97)

#2 (9 votes)

I did not have luck with memoise because it caused a too-deep-recursion problem for a function from a package I tried it with. With R.cache I had better luck. Below is more heavily annotated code that I adapted from the R.cache documentation; it shows different options for doing the caching.


# Workaround to avoid question when loading R.cache library
dir.create(path="~/.Rcache", showWarnings=F) 
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache at current working dir
# In case we need the cache path, but not used in this example.
cache.root = getCacheRootPath() 
simulate <- function(mean, sd) {
    # 1. Try to load cached data, if already generated
    key <- list(mean, sd)
    data <- loadCache(key)
    if (!is.null(data)) {
        cat("Loaded cached data\n")
        return(data);
    }
    # 2. If not available, generate it.
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok\n")
    saveCache(data, key=key, comment="simulate()")
    data;
}
data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a = 2.3
b = 3.0
data <- simulate(a, b) # Will load cached data, params are checked by value
# Clean up
file.remove(findCache(key=list(2.3,3.0)))
file.remove(findCache(key=list(2.3,3.5)))

simulate2 <- function(mean, sd) {
    data <- rnorm(1000, mean=mean, sd=sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("Done generating data from scratch\n")
    data;
}
# Easy step to memoize a function; this works with any
# function from an external package as well.
mzs <- addMemoization(simulate2)

data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data
# It is also possible to reassign the original function name,
# but different memoizations of the same function will return
# the same cached result if the input parameters are the same.
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)

# If the expression being evaluated depends on
# "input" objects, then these must be be specified
# explicitly as "key" objects.
for (ii in 1:2) {
    for (kk in 1:3) {
        cat(sprintf("Iteration #%d:\n", kk))
        res <- evalWithMemoization({
            cat("Evaluating expression...")
            a <- kk
            Sys.sleep(1)
            cat("done\n")
            a
        }, key=list(kk=kk))
        # expressions inside 'res' are skipped on the repeated run
        print(res)
        # Sanity checks
        stopifnot(a == kk)
        # Clean up
        rm(a)
    } # for (kk ...)
} # for (ii ...)

#3 (1 vote)

Related to @biocyperman's solution: R.cache provides a wrapper that handles the loading, evaluating and saving of the cache for you. The modified function below shows how this simplifies the code:

simulate <- function(mean, sd) {
    key <- list(mean, sd)
    data <- evalWithMemoization(key = key, expr = {
        cat("Generating data from scratch...")
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("ok\n")
        data
    })
}
