Generating data from a truncated normal distribution in R

Time: 2023-02-09 19:35:52

I am struggling with the following task: I need to generate data from a truncated normal distribution. The sample mean and standard deviation should exactly match the values specified for the population. This is what I have so far:

    mean <- 100
    sd <- 5
    lower <- 40
    upper <- 120
    n <- 100   

    library(msm)    
    data <- as.numeric(mean+sd*scale(rtnorm(n, lower=40, upper=120)))

The sample that's created takes on exactly the mean and sd specified in the population. But some values exceed the intended bounds. Any idea how to fix this? I was thinking of just cutting off all values outside these bounds, but then mean and sd don't resemble those of the population anymore.
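
To make the symptom concrete, here is a minimal sketch of a check that reports whether a sample hits the target mean and sd and how many values fall outside the bounds. The helper name `check_sample` and the toy input vector are assumptions made for illustration; they are not part of `msm` or of the code above.

```r
# Hypothetical helper: compare a sample against the target moments and bounds.
check_sample <- function(x, target_mean, target_sd, lower, upper) {
  list(
    mean_ok = isTRUE(all.equal(mean(x), target_mean)),
    sd_ok   = isTRUE(all.equal(sd(x), target_sd)),
    n_below = sum(x < lower),
    n_above = sum(x > upper)
  )
}

# Toy input for illustration only: 29 identical values and one outlier,
# rescaled to mean 100 and sd 5 in the same way as in the question.
z <- c(rep(0, 29), 10)
x <- 100 + 5 * as.numeric(scale(z))

res <- check_sample(x, 100, 5, lower = 40, upper = 120)
res$n_above  # 1: the rescaled outlier lands above 120
```

The mean and sd checks pass here, yet one value still exceeds the upper bound, which is the same situation described above.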

1 Solution

#1


You could use an iterative approach. Here I add samples one by one to the vector, but only if the resulting scaled dataset remains within the boundaries that you set. It takes longer, but it works:

    library(msm)  # provides rtnorm()

    n <- 10000
    mean <- 100
    sd <- 15
    lower <- 40
    upper <- 120

    # Start with one draw from the standard normal truncated to the standardized bounds
    data <- rtnorm(1, lower=((lower - mean)/sd), upper=((upper - mean)/sd))
    while (length(data) < n) {
      sample <- rtnorm(1, lower=((lower - mean)/sd), upper=((upper - mean)/sd))
      data_copy <- c(data, sample)
      data_copy_scaled <- mean + sd * scale(data_copy)
      # Keep the new draw only if the rescaled data stay within the bounds
      if (min(data_copy_scaled) >= lower && max(data_copy_scaled) <= upper) {
        data <- c(data, sample)
      }
    }

    # Rescale so that the sample mean and sd match the targets exactly
    scaled_data <- as.numeric(mean + sd * scale(data))

    summary(scaled_data)

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      40.38   91.61  104.35  100.00  111.28  120.00

    sd(scaled_data)

    15
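
The loop above can also be wrapped in a function. The following is a minimal sketch rather than the answerer's code: the names `rtnorm_std1` and `exact_truncated_sample` and the `max_tries` cap are assumptions made for illustration, and inverse-CDF sampling in base R stands in for `msm::rtnorm` so that the block runs without extra packages.

```r
# Hypothetical stand-in for msm::rtnorm(1, lower = a, upper = b):
# one standard-normal draw truncated to [a, b] via the inverse CDF.
rtnorm_std1 <- function(a, b) qnorm(runif(1, pnorm(a), pnorm(b)))

# Hypothetical wrapper around the accept/reject loop.
exact_truncated_sample <- function(n, target_mean, target_sd, lower, upper,
                                   max_tries = 100 * n) {
  a <- (lower - target_mean) / target_sd  # standardized bounds
  b <- (upper - target_mean) / target_sd
  z <- rtnorm_std1(a, b)
  tries <- 0
  while (length(z) < n && tries < max_tries) {
    tries <- tries + 1
    cand <- c(z, rtnorm_std1(a, b))
    x <- target_mean + target_sd * as.numeric(scale(cand))
    # Accept the draw only if the rescaled candidate set stays in bounds
    if (min(x) >= lower && max(x) <= upper) z <- cand
  }
  if (length(z) < n) stop("gave up after max_tries draws")
  target_mean + target_sd * as.numeric(scale(z))
}

set.seed(1)  # arbitrary seed, for reproducibility only
x <- exact_truncated_sample(200, 100, 15, lower = 40, upper = 120)
c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
```

The `max_tries` cap is a design choice of this sketch: when the bounds are tight relative to the target sd, the loop may reject most draws and run for a long time, so it seems safer to stop with an error than to loop indefinitely.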

Below is my old answer, which doesn't quite work:

How about scaling the lower and upper limits of rtnorm with the mean and sd that you want?

    n <- 1000000
    mean <- 100
    sd <- 5

    library(msm)

    data <- as.numeric(mean+sd*scale(rtnorm(n, lower=((40 - mean)/sd), upper=((120 - mean)/sd))))

    summary(data)

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      76.91   96.63  100.00  100.00  103.37  120.00

    sd(data)

    5

In this case, even with a sample of 1000000 you get the exact mean and sd, and the max and min values remain within your boundaries.
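
With sd = 5 the limits 40 and 120 lie 12 and 4 standard deviations from the mean, so the truncation rarely comes into play in this run. With a larger sd, rescaling with `scale()` can move values past the bounds even though every draw was inside them. The sketch below shows this on a hand-picked vector; the values in `z` and the `target_*` names are assumptions chosen for illustration, not output from `rtnorm`.

```r
target_mean <- 100
target_sd   <- 15
lower <- 40
upper <- 120

a <- (lower - target_mean) / target_sd  # -4
b <- (upper - target_mean) / target_sd  # about 1.33

# Hand-picked standardized values, all inside [a, b]
z <- c(-1, 0, 0, 0, 1.3)
inside_before <- all(z >= a & z <= b)   # TRUE

# Rescaling to the exact mean and sd stretches the values
x <- target_mean + target_sd * as.numeric(scale(z))
inside_after <- all(x >= lower & x <= upper)  # FALSE: max(x) is about 122.8
```

This is why the rescaled data need to be checked against the bounds, as the iterative version does.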
