Normally, a random number generator returns a stream of bits for which the probability to observe a 0 or a 1 in each position is equal (i.e. 50%). Let's call this an unbiased PRNG.
I need to generate a string of pseudo-random bits with the following property: the probability to see a 1 in each position is p (i.e. the probability to see a 0 is 1-p). The parameter p is a real number between 0 and 1; in my problem it happens that it has a resolution of 0.5%, i.e. it can take the values 0%, 0.5%, 1%, 1.5%, ..., 99.5%, 100%.
Note that p is a probability and not an exact fraction. The actual number of bits set to 1 in a stream of n bits must follow the binomial distribution B(n, p).
There is a naive method that can use an unbiased PRNG to generate the value of each bit (pseudocode):
generate_biased_stream(n, p):
    result = []
    for i in 1 to n:
        if random_uniform(0, 1) < p:
            result.append(1)
        else:
            result.append(0)
    return result
Such an implementation is much slower than one generating an unbiased stream, since it calls the random number generator function once per bit, whereas an unbiased stream generator calls it only once per word (e.g. it can generate 32 or 64 random bits with a single call).
I want a faster implementation, even if it sacrifices randomness slightly. An idea that comes to mind is to precompute a lookup table: for each of the 200 possible values of p, compute C 8-bit values using the slower algorithm and save them in a table. Then the fast algorithm would just pick one of these at random to generate 8 skewed bits.
A back-of-the-envelope calculation of how much memory is needed: C should be at least 256 (the number of possible 8-bit values), probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024. Since there are 200 values of p, the total memory usage is 200 × 1024 bytes = 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
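As a sketch, here is what the table-based generator could look like (my own illustration with hypothetical names, showing one row of the table for a single value of p):

#include <cstdint>
#include <random>
#include <vector>

// Build one row of the table: C bytes whose bits are each 1 with
// probability p, filled offline with the slow per-bit method.
constexpr int C = 1024;

std::vector<uint8_t> build_row(double p, std::mt19937_64& rng) {
    std::vector<uint8_t> row(C);
    std::bernoulli_distribution bit(p);
    for (auto& byte : row)
        for (int i = 0; i < 8; ++i)
            byte = static_cast<uint8_t>((byte << 1) | bit(rng));
    return row;
}

// Fast path: pick one precomputed byte at random (C is a power of two).
uint8_t biased_byte(const std::vector<uint8_t>& row, std::mt19937_64& rng) {
    return row[rng() & (C - 1)];
}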
A deficiency of this solution is that it can generate only 8 bits at once, and even that takes a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions.
I would like to know if there is a faster method, based on bit operations instead of lookup tables. For example modifying the random number generation code directly to introduce a bias for each bit. This would achieve the same performance as an unbiased PRNG.
Edit March 5
Thank you all for your suggestions, I got a lot of interesting ideas and suggestions. Here are the top ones:
- Change the problem requirements so that p has a resolution of 1/256 instead of 1/200. This allows using bits more efficiently, and also gives more opportunities for optimization. I think I can make this change.
- Use arithmetic coding to efficiently consume bits from an unbiased generator. With the above change of resolution this becomes much easier.
- A few people suggested that PRNGs are very fast, thus using arithmetic coding might actually make the code slower due to the introduced overhead. Instead I should always consume the worst-case number of bits and optimize that code. See the benchmarks below.
- @rici suggested using SIMD. This is a nice idea, which works only if we always consume a fixed number of bits.
Benchmarks (without arithmetic decoding)
Note: as many of you have suggested, I changed the resolution from 1/200 to 1/256.
I wrote several implementations of the naive method that simply takes 8 random unbiased bits and generates 1 biased bit:
- Without SIMD
- With SIMD, using Agner Fog's vectorclass library, as suggested by @rici
- With SIMD using intrinsics
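For reference, a minimal sketch of the scalar naive method (my own illustration, not the benchmarked code): with p quantized to k/256, each output bit consumes one unbiased byte, so every 64-bit draw yields 8 biased bits:

#include <cstdint>
#include <random>

// 1/8 efficiency: every unbiased byte is compared against the
// 8-bit threshold k, producing one biased bit with p = k/256.
uint64_t biased_word_scalar(uint32_t k, std::mt19937_64& rng) {
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {   // 8 draws of 64 bits -> 64 biased bits
        uint64_t r = rng();
        for (int j = 0; j < 8; ++j) {
            out = (out << 1) | ((r & 0xFF) < k ? 1u : 0u);
            r >>= 8;
        }
    }
    return out;
}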
I use two unbiased pseudo-random number generators:
- xorshift128plus
- Ranvec1 (Mersenne Twister-like) from Agner Fog's library.
I also measure the speed of the unbiased PRNG for comparison. Here are the results:
RNG: Ranvec1 (Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Throughput: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600
SIMD increases performance by a factor of 3 compared to the scalar method. It is 8 times slower than the unbiased generator, as expected.
The fastest biased generator achieves 2.1 Gb/s.
RNG: xorshift128plus

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Throughput: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600
For xorshift, SIMD increases performance by a factor of 5 compared to the scalar method. It is 4 times slower than the unbiased generator. Note that this is a scalar implementation of xorshift.
The fastest biased generator achieves 4.9 Gb/s.
RNG: xorshift128plus_avx2

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Throughput: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical   : 104,857,600
This implementation uses AVX2 to run 4 unbiased xorshift generators in parallel.
The fastest biased generator achieves 19.5 Gb/s.
Benchmarks for arithmetic decoding
Simple tests show that the arithmetic decoding code is the bottleneck, not the PRNG. So I am only benchmarking the most expensive PRNG.
RNG: Ranvec1 (Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Arithmetic decoding (floating point)
Throughput: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical   : 10,240,000

Method: Arithmetic decoding (fixed point)
Throughput: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical   : 10,240,000

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Throughput: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600
The simple fixed point method achieves 0.25 Gb/s, while the naive scalar method is 3x faster, and the naive SIMD method is 8x faster. There might be ways to optimize and/or parallelize the arithmetic decoding method further, but due to its complexity I have decided to stop here and choose the naive SIMD implementation.
Thank you all for the help.
10 Answers
#1
23
If you're prepared to approximate p based on 256 possible values, and you have a PRNG which can generate uniform values in which the individual bits are independent of each other, then you can use vectorized comparison to produce multiple biased bits from a single random number.
That's only worth doing if (1) you worry about random number quality and (2) you are likely to need a large number of bits with the same bias. The second requirement seems to be implied by the original question, which criticizes a proposed solution, as follows: "A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions." Here, the implication seems to be that it is useful to generate a large block of biased bits in a single call.
Random-number quality is a difficult subject. It's hard if not impossible to measure, and therefore different people will propose different metrics which emphasize and/or devalue different aspects of "randomness". It is generally possible to trade off speed of random-number generation for lower "quality"; whether this is worth doing depends on your precise application.
The simplest possible tests of random number quality involve the distribution of individual values and the cycle length of the generator. Standard implementations of the C library rand and Posix random functions will typically pass the distribution test, but the cycle lengths are not adequate for long-running applications.
These generators are typically extremely fast, though: the glibc implementation of random requires only a few cycles, while the classic linear congruential generator (LCG) requires a multiply and an addition. (Or, in the case of the glibc implementation, three of the above to generate 31 bits.) If that's sufficient for your quality requirements, then there is little point trying to optimize, particularly if the bias probability changes frequently.
Bear in mind that the cycle length should be a lot longer than the number of samples expected; ideally, it should be greater than the square of that number, so a linear-congruential generator (LCG) with a cycle length of 2^31 is not appropriate if you expect to generate gigabytes of random data. Even the Gnu trinomial nonlinear additive-feedback generator, whose cycle length is claimed to be approximately 2^35, shouldn't be used in applications which will require millions of samples.
Another quality issue, which is much harder to test, relates to the independence of consecutive samples. Short cycle lengths completely fail on this metric, because once the repeat starts, the generated random numbers are precisely correlated with historical values. The Gnu trinomial algorithm, although its cycle is longer, has a clear correlation as a result of the fact that the i-th random number generated, r_i, is always one of the two values r_{i-3} + r_{i-31} or r_{i-3} + r_{i-31} + 1. This can have surprising or at least puzzling consequences, particularly with Bernoulli experiments.
Here's an implementation using Agner Fog's useful vector class library, which abstracts away a lot of the annoying details in SSE intrinsics, and also helpfully comes with a fast vectorized random number generator (found in special.zip inside the vectorclass.zip archive), which lets us generate 256 bits from eight calls to the 256-bit PRNG. You can read Dr. Fog's explanation of why he finds even the Mersenne twister to have quality issues, and his proposed solution; I'm not qualified to comment, really, but it does at least appear to give expected results in the Bernoulli experiments I have tried with it.
#include "vectorclass/vectorclass.h"
#include "vectorclass/ranvec1.h"
class BiasedBits {
public:
// Default constructor, seeded with fixed values
BiasedBits() : BiasedBits(1) {}
// Seed with a single seed; other possibilities exist.
BiasedBits(int seed) : rng(3) { rng.init(seed); }
// Generate 256 random bits, each with probability `p/256` of being 1.
Vec8ui random256(unsigned p) {
if (p >= 256) return Vec8ui{ 0xFFFFFFFF };
Vec32c output{ 0 };
Vec32c threshold{ 127 - p };
for (int i = 0; i < 8; ++i) {
output += output;
output -= Vec32c(Vec32c(rng.uniform256()) > threshold);
}
return Vec8ui(output);
}
private:
Ranvec1 rng;
};
In my test, that produced and counted 268435456 bits in 260 ms, or one bit per nanosecond. The test machine is an i5, so it doesn't have AVX2; YMMV.
In the actual use case, with 201 possible values for p, the computation of 8-bit threshold values will be annoyingly imprecise. If that imprecision is undesired, you could adapt the above to use 16-bit thresholds, at the cost of generating twice as many random numbers.
Alternatively, you could hand-roll a vectorization based on 10-bit thresholds, which would give you a very good approximation to 0.5% increments, using the standard bit-manipulation hack of doing the vectorized threshold comparison by checking for borrow on every 10th bit of the subtraction of the vector of values and the repeated threshold. Combined with, say, std::mt19937_64, that would give you an average of six bits per 64-bit random number.
#2
27
One thing you can do is to sample from the underlying unbiased generator multiple times, getting several 32-bit or 64-bit words, and then performing bitwise boolean arithmetic. As an example, for 4 words b1, b2, b3, b4, you can get the following distributions:
expression             | p(bit is 1)
-----------------------+-------------
b1 & b2 & b3 & b4      |  6.25%
b1 & b2 & b3           | 12.50%
b1 & b2 & (b3 | b4)    | 18.75%
b1 & b2                | 25.00%
b1 | (b2 & (b3 | b4))  | 31.25%
b1 & (b2 | b3)         | 37.50%
b1 & (b2 | b3 | b4)    | 43.75%
b1                     | 50.00%
Similar constructions can be made for finer resolutions. It gets a bit tedious and still requires more generator calls, but at least not one per bit. This is similar to a3f's answer, but is probably easier to implement and, I suspect, faster than scanning words for 0xF nybbles.
Note that for your desired 0.5% resolution, you would need 8 unbiased words for one biased word, which would give you a resolution of (0.5^8) = 0.390625%.
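One systematic way to build such expressions (my own sketch, not from the answer): write p = k/2^n, then scan the bits of k from least to most significant, drawing one fresh unbiased word per bit; OR it in for a 1-bit and AND it in for a 0-bit. Starting from all-zeros, every bit of the result is 1 with probability exactly k/2^n:

#include <cstdint>
#include <random>

// p = k / 2^n; uses n unbiased 64-bit words per biased word.
uint64_t biased_word_boolean(uint32_t k, int n, std::mt19937_64& rng) {
    uint64_t r = 0;
    for (int i = 0; i < n; ++i, k >>= 1)
        r = (k & 1) ? (r | rng()) : (r & rng());
    return r;
}

With n = 8 this matches the 8-words-per-word cost and the 1/256 resolution noted above; for example k = 1 reproduces b1 & b2 & ... & b8.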
#3
17
From an information-theoretic point of view, a biased stream of bits (with p != 0.5) has less information in it than an unbiased stream, so in theory it should take (on average) less than 1 bit of the unbiased input to produce a single bit of the biased output stream. For example, the entropy of a Bernoulli random variable with p = 0.1 is -0.1 * log2(0.1) - 0.9 * log2(0.9) bits, which is around 0.469 bits. That suggests that for the case p = 0.1 we should be able to produce a little over two bits of the output stream per unbiased input bit.
Below, I give two methods for producing the biased bits. Both achieve close to optimal efficiency, in the sense of requiring as few input unbiased bits as possible.
Method 1: arithmetic (de)coding
A practical method is to decode your unbiased input stream using arithmetic (de)coding, as already described in the answer from alexis. For a case this simple, it's not hard to code something up. Here's some unoptimised pseudocode (cough, Python) that does this:
import random

def random_bits():
    """
    Infinite generator generating a stream of random bits,
    with 0 and 1 having equal probability.
    """
    global bit_count  # keep track of how many bits were produced
    while True:
        bit_count += 1
        yield random.choice([0, 1])

def bernoulli(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    """
    bits = random_bits()
    low, high = 0.0, 1.0
    while True:
        if high <= p:
            # Generate 1, rescale to map [0, p) to [0, 1)
            yield 1
            low, high = low / p, high / p
        elif low >= p:
            # Generate 0, rescale to map [p, 1) to [0, 1)
            yield 0
            low, high = (low - p) / (1 - p), (high - p) / (1 - p)
        else:
            # Use the next random bit to halve the current interval.
            mid = 0.5 * (low + high)
            if next(bits):
                low = mid
            else:
                high = mid
Here's an example usage:
import itertools
bit_count = 0
# Generate a million deviates.
results = list(itertools.islice(bernoulli(0.1), 10**6))
print("First 50:", ''.join(map(str, results[:50])))
print("Biased bits generated:", len(results))
print("Unbiased bits used:", bit_count)
print("mean:", sum(results) / len(results))
The above gives the following sample output:
First 50: 00000000000001000000000110010000001000000100010000
Biased bits generated: 1000000
Unbiased bits used: 469036
mean: 0.100012
As promised, we've generated 1 million bits of our output biased stream using fewer than five hundred thousand from the source unbiased stream.
For optimisation purposes, when translating this into C / C++ it may make sense to code this up using integer-based fixed-point arithmetic rather than floating-point.
Method 2: integer-based algorithm
Rather than trying to convert the arithmetic decoding method to use integers directly, here's a simpler approach. It's not quite arithmetic decoding any more, but it's not totally unrelated, and it achieves close to the same output-biased-bit / input-unbiased-bit ratio as the floating-point version above. It's organised so that all quantities fit into an unsigned 32-bit integer, so should be easy to translate to C / C++. The code is specialised to the case where p is an exact multiple of 1/200, but this approach would work for any p that can be expressed as a rational number with reasonably small denominator.
def bernoulli_int(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    p should be an integer multiple of 1/200.
    """
    bits = random_bits()
    # Assuming that p has a resolution of 0.005, find p / 0.005.
    p_int = int(round(200*p))
    value, high = 0, 1
    while True:
        if high < 2**31:
            high = 2 * high
            value = 2 * value + next(bits)
        else:
            # Throw out everything beyond the last multiple of 200, to
            # avoid introducing a bias.
            discard = high - high % 200
            split = high // 200 * p_int
            if value >= discard:  # rarer than 1 time in 10 million
                value -= discard
                high -= discard
            elif value >= split:
                yield 0
                value -= split
                high = discard - split
            else:
                yield 1
                high = split
The key observation is that every time we reach the beginning of the while loop, value is uniformly distributed amongst all integers in [0, high), and is independent of all previously-output bits. If you care about speed more than perfect correctness, you can get rid of discard and the value >= discard branch: that's just there to ensure that we output 0 and 1 with exactly the right probabilities. Leave that complication out, and you'll just get almost the right probabilities instead. Also, if you make the resolution for p equal to 1/256 rather than 1/200, then the potentially time-consuming division and modulo operations can be replaced with bit operations.
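As a sketch of that last remark, here's my own C++ port of the integer algorithm with the resolution changed to 1/256, so the modulo becomes a mask and the division a shift (names and bit-buffering are mine):

#include <cstdint>
#include <random>

// Biased bits with p = k/256; a straight port of bernoulli_int above.
class Bernoulli256 {
    std::mt19937_64 rng;
    uint64_t pool = 0;               // buffered unbiased bits
    int pool_left = 0;
    uint32_t value = 0, high = 1;    // value is uniform in [0, high)
    uint32_t next_bit() {
        if (pool_left == 0) { pool = rng(); pool_left = 64; }
        uint32_t b = static_cast<uint32_t>(pool & 1);
        pool >>= 1; --pool_left;
        return b;
    }
public:
    explicit Bernoulli256(uint64_t seed) : rng(seed) {}
    int operator()(uint32_t k) {     // 0 <= k <= 256
        for (;;) {
            if (high < (1u << 31)) {
                high <<= 1;
                value = (value << 1) | next_bit();
            } else {
                uint32_t discard = high & ~255u;   // high - high % 256
                uint32_t split = (high >> 8) * k;  // high / 256 * k
                if (value >= discard) {            // rare: retry region
                    value -= discard; high -= discard;
                } else if (value >= split) {
                    value -= split; high = discard - split;
                    return 0;
                } else {
                    high = split;
                    return 1;
                }
            }
        }
    }
};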
With the same test code as before, but using bernoulli_int in place of bernoulli, I get the following results for p=0.1:
First 50: 00000010000000000100000000000000000000000110000100
Biased bits generated: 1000000
Unbiased bits used: 467997
mean: 0.099675
#4
9
Let's say the probability of a 1 appearing is 6.25% (1/16). There are 16 possible bit patterns for a 4-bit number: 0000, 0001, ..., 1110, 1111.
Now, just generate a random number like you used to, replace every 1111 at a nibble boundary with a 1, and turn everything else into a 0.
Adjust accordingly for other probabilities.
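A branch-free sketch of this for 64-bit words (my own illustration): AND each nibble's two halves together twice, so that bit 0 of a nibble ends up 1 exactly when the whole nibble was 1111, then gather those bits:

#include <cstdint>
#include <random>

// 16 biased bits with p = 1/16 from one unbiased 64-bit word.
uint32_t sixteen_bits_one_in_16(std::mt19937_64& rng) {
    uint64_t w = rng();
    w &= w >> 2;   // bit 0 of each nibble = b0 & b2, bit 1 = b1 & b3
    w &= w >> 1;   // bit 0 of each nibble = b0 & b1 & b2 & b3
    uint32_t out = 0;
    for (int i = 0; i < 16; ++i)
        out |= static_cast<uint32_t>((w >> (4 * i)) & 1u) << i;
    return out;
}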
#5
9
You'll get theoretically optimal behavior, i.e. make truly minimal use of the random number generator and be able to model any probability p exactly, if you approach this using arithmetic coding.
Arithmetic coding is a form of data compression that represents the message as a sub-interval of a number range. It provides theoretically optimal encoding, and can use a fractional number of bits for each input symbol.
The idea is this: Imagine that you have a sequence of random bits, which are 1 with probability p. For convenience, I will instead use q for the probability of the bit being zero (q = 1-p). Arithmetic coding assigns to each bit a part of the number range. For the first bit, assign the interval [0, q) if the input is 0, and the interval [q, 1) if the input is 1. Subsequent bits assign proportional sub-intervals of the current range. For example, suppose that q = 1/3. The input 1 0 0 will be encoded like this:
Initially [0, 1), range = 1
After 1 [0.333, 1), range = 0.6666
After 0 [0.333, 0.5555), range = 0.2222
After 0 [0.333, 0.407407), range = 0.074074
The first digit, 1, selects the top two-thirds (1-q) of the range; the second digit, 0, selects the bottom third of that, and so on. After the first and second step, the interval straddles the midpoint; but after the third step it is entirely below the midpoint, so the first compressed digit can be output: 0. The process continues, and a special EOF symbol is added as a terminator.
What does this have to do with your problem? The compressed output will have random zeros and ones with equal probability. So, to obtain bits with probability p, just pretend that the output of your RNG is the result of arithmetic coding as above, and apply the decoder process to it. That is, read bits as if they subdivide the line interval into smaller and smaller pieces. For example, after we read 01 from the RNG, we will be in the range [0.25, 0.5). Keep reading bits until enough output is "decoded". Since you're mimicking decompressing, you'll get more random bits out than you put in. Because arithmetic coding is theoretically optimal, there's no possible way to turn the RNG output into more biased bits without sacrificing randomness: you're getting the true maximum.
The catch is that you can't do this in a couple of lines of code, and I don't know of a library I can point you to (though there must be some you could use). Still, it's pretty simple. The above article provides code for a general-purpose encoder and decoder, in C. It's pretty straightforward, and it supports multiple input symbols with arbitrary probabilities; in your case a far simpler implementation is possible (as Mark Dickinson's answer now shows), since the probability model is trivial. For extended use, a bit more work would be needed to produce a robust implementation that does not do a lot of floating-point computation for each bit.
Wikipedia also has an interesting discussion of arithmetic encoding considered as change of radix, which is another way to view your task.
#6
8
Uh, pseudo-random number generators are generally quite fast. I'm not sure what language this is (Python, perhaps), but "result.append" (which almost certainly contains memory allocation) is likely slower than "random_uniform" (which just does a little math).
If you want to optimize the performance of this code:
- Verify that it is a problem. Optimizations are a bit of work and make the code harder to maintain. Don't do them unless necessary.
- Profile it. Run some tests to determine which parts of the code are actually the slowest. Those are the parts you need to speed up.
- Make your changes, and verify that they actually are faster. Compilers are pretty smart; often clear code will compile into better code than something complex that might appear faster.
If you are working in a compiled language (even JIT compiled), you take a performance hit for every transfer of control (if, while, function call, etc). Eliminate what you can. Memory allocation is also (usually) quite expensive.
If you are working in an interpreted language, all bets are off. The simplest code is very likely the best. The overhead of the interpreter will dwarf whatever you are doing, so reduce its work as much as possible.
I can only guess where your performance problems are:
- Memory allocation. Pre-allocate the array at its full size and fill in the entries later. This ensures that the memory won't need to be reallocated while you're adding the entries.
- Branches. You might be able to avoid the "if" by casting the result or something similar. This will depend a lot on the compiler. Check the assembly (or profile) to verify that it does what you want.
- Numeric types. Find out the type your random number generator uses natively, and do your arithmetic in that type. For example, if the generator naturally returns 32-bit unsigned integers, scale "p" to that range first, then use it for the comparison.
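As a quick sketch of that last point (my own example): scale p once into the generator's native 32-bit range, so each bit becomes a single integer comparison:

#include <cstdint>
#include <random>

// Precompute once per p; p = 1.0 needs special-casing because
// p * 2^32 does not fit in 32 bits.
uint32_t threshold_from_p(double p) {
    return static_cast<uint32_t>(p * 4294967296.0);   // p * 2^32
}

// mt19937 natively yields 32-bit values, so no per-call scaling.
int biased_bit(uint32_t threshold, std::mt19937& rng) {
    return rng() < threshold ? 1 : 0;
}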
By the way, if you really want to use the least bits of randomness possible, use "arithmetic coding" to decode your random stream. It won't be fast.
#7
7
One way that would give a precise result is to first randomly generate, for a k-bit block, the number of 1-bits following the binomial distribution, and then generate a k-bit word with exactly that many bits set using one of the methods here. For example the method by mic006 needs only about log k k-bit random numbers, and mine needs only one.
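A sketch of this approach with the standard library (my own illustration): std::binomial_distribution draws the number of ones, and a partial Fisher-Yates shuffle of the bit positions places exactly that many 1-bits:

#include <cstdint>
#include <random>
#include <utility>

// One 64-bit block: the popcount follows B(64, p) exactly, and the
// chosen positions are uniform among all words with that popcount.
uint64_t biased_word_binomial(double p, std::mt19937_64& rng) {
    std::binomial_distribution<int> count(64, p);
    int ones = count(rng);
    int pos[64];
    for (int i = 0; i < 64; ++i) pos[i] = i;
    uint64_t word = 0;
    for (int i = 0; i < ones; ++i) {   // partial Fisher-Yates
        std::uniform_int_distribution<int> pick(i, 63);
        std::swap(pos[i], pos[pick(rng)]);
        word |= uint64_t{1} << pos[i];
    }
    return word;
}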
#8
6
If p is close to 0, you can calculate the probability that the n-th bit is the first bit that is 1; then you calculate a random number between 0 and 1 and pick n accordingly. For example if p = 0.005 (0.5%), and the random number is 0.638128, you might calculate (I'm guessing here) n = 321, so you fill with 321 0 bits and one bit set.
If p is close to 1, use 1-p instead of p, and fill with 1-bits plus one 0-bit.
If p isn't close to 1 or 0, make a table of all 256 sequences of 8 bits, calculate their cumulative probabilities, then get a random number, do a binary search in the array of cumulative probabilities, and you can set 8 bits.
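For the small-p case, the gap between consecutive 1-bits follows a geometric distribution, so it can be drawn in a single call per 1-bit (a sketch using std::geometric_distribution; names are mine):

#include <cstdint>
#include <random>
#include <vector>

// std::geometric_distribution(p) returns the number of failures (0-bits)
// before the first success (the next 1-bit); requires 0 < p < 1.
std::vector<uint8_t> biased_bits_small_p(double p, size_t n, std::mt19937_64& rng) {
    std::vector<uint8_t> out(n, 0);
    std::geometric_distribution<long> gap(p);
    for (size_t i = gap(rng); i < n; i += gap(rng) + 1)
        out[i] = 1;
    return out;
}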
#9
6
Assuming that you have access to a generator of random bits, you can generate a value to compare with p bit by bit, and abort as soon as you can prove that the generated value is less-than or greater-or-equal-to p.
Proceed as follows to create one item in a stream with given probability p:
- Start with 0. in binary
- Append a random bit; assuming that a 1 has been drawn, you'll get 0.1
- If the result (in binary notation) is provably smaller than p, output a 1
- If the result is provably larger than or equal to p, output a 0
- Otherwise (if neither can be ruled out), proceed with step 2.
Let's assume that p in binary notation is 0.1001101...; if this process generates any of 0.0, 0.1000, 0.10010, ..., the value cannot become larger than or equal to p anymore; if any of 0.11, 0.101, 0.100111, ... is generated, the value cannot become smaller than p.
To me, it looks like this method uses about two random bits in expectation. Arithmetic coding (as shown in the answer by Mark Dickinson) consumes at most one random bit per biased bit (on average) for fixed p; the cost of modifying p is unclear.
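A sketch of this procedure (my own C++ port; a real implementation would buffer unbiased bits rather than making one RNG call per bit):

#include <random>

// Walk the binary expansions of a uniform value U and of p in lockstep;
// the first position where they differ decides the output bit.
// Expected cost: about two unbiased bits per biased bit.
int biased_bit_compare(double p, std::mt19937_64& rng) {
    for (;;) {
        int b = static_cast<int>(rng() & 1);  // next bit of U
        p *= 2.0;
        int d = (p >= 1.0) ? 1 : 0;           // next bit of p
        if (d) p -= 1.0;
        if (b < d) return 1;                  // U provably smaller than p
        if (b > d) return 0;                  // U provably larger or equal
    }                                         // bits equal so far: continue
}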
#10
5
What it does
This implementation makes a single call to the random device kernel module via the "/dev/urandom" special character file interface to get the amount of random data needed to represent all values at a given resolution. The maximum possible resolution is 1/256^2, so that 0.005 can be represented by:
328/256^2,

i.e.:

resolution: 256*256
x: 328

with error 0.000004883.
How it does that
The implementation calculates the number of bits bits_per_byte, which is the number of uniformly distributed bits needed to handle the given resolution, i.e. to represent all @resolution values. It then makes a single call to the randomization device ("/dev/urandom" if URANDOM_DEVICE is defined; otherwise it uses additional noise from device drivers via a call to "/dev/random", which may block if there is not enough entropy in bits) to get the required number of uniformly distributed bytes, and fills in the array rnd_bytes of random bytes. Finally it reads the number of needed bits per Bernoulli sample from each bytes_per_byte bytes of the rnd_bytes array and compares the integer value of these bits to the probability of success in a single Bernoulli outcome, given by x/resolution. If the value hits, i.e. it falls in the segment of x/resolution length, which we arbitrarily choose to be the [0, x/resolution) segment, then we note success and insert 1 into the resulting array.
Read from random device:
#include <fcntl.h>
#include <unistd.h>

/* if defined use /dev/urandom (will not block),
 * if not defined use /dev/random (may block) */
#define URANDOM_DEVICE 1

/*
 * @brief Read @outlen bytes from random device
 * to array @out.
 */
int
get_random_samples(char *out, size_t outlen)
{
    ssize_t res;
#ifdef URANDOM_DEVICE
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) return -1;
    res = read(fd, out, outlen);
    if (res < 0) {
        close(fd);
        return -2;
    }
#else
    size_t read_n;
    int fd = open("/dev/random", O_RDONLY);
    if (fd == -1) return -1;
    read_n = 0;
    while (read_n < outlen) {
        res = read(fd, out + read_n, outlen - read_n);
        if (res < 0) {
            close(fd);
            return -3;
        }
        read_n += res;
    }
#endif /* URANDOM_DEVICE */
    close(fd);
    return 0;
}
Fill in vector of Bernoulli samples:
#include <stdint.h>
#include <stdlib.h>

/* Helpers assumed by this excerpt; their definitions are not part of the
 * original snippet, so these are minimal reconstructions. */
#define BITS_PER_BYTE 8

/* integer log2; @resolution is a power of 2 per the contract below */
static uint8_t
log_int(uint32_t resolution)
{
    uint8_t bits = 0;
    while (resolution >>= 1)
        ++bits;
    return bits;
}

/*
 * @brief Draw vector of Bernoulli samples.
 * @details @x and @resolution determine the probability
 *          of success in the Bernoulli distribution
 *          and the accuracy of results: p = x/resolution.
 * @param resolution: number of segments per sample of output array
 *        as a power of 2: max resolution supported is 2^24=16777216
 * @param x: determines used probability, x = [0, resolution - 1]
 * @param n: number of samples in result vector
 */
int
get_bernoulli_samples(char *out, uint32_t n, uint32_t resolution, uint32_t x)
{
    int res;
    size_t i, j;
    uint32_t bytes_per_byte, word;
    unsigned char *rnd_bytes;
    uint32_t uniform_byte;
    uint8_t bits_per_byte;

    if (out == NULL || n == 0 || resolution == 0 || x > (resolution - 1))
        return -1;

    bits_per_byte = log_int(resolution);
    bytes_per_byte = bits_per_byte / BITS_PER_BYTE +
                     (bits_per_byte % BITS_PER_BYTE ? 1 : 0);
    rnd_bytes = malloc(n * bytes_per_byte);
    if (rnd_bytes == NULL)
        return -2;

    res = get_random_samples((char *)rnd_bytes, n * bytes_per_byte);
    if (res < 0)
    {
        free(rnd_bytes);
        return -3;
    }

    i = 0;
    while (i < n)
    {
        /* get Bernoulli sample */
        /* read byte */
        j = 0;
        word = 0;
        while (j < bytes_per_byte)
        {
            word |= (rnd_bytes[i * bytes_per_byte + j] << (BITS_PER_BYTE * j));
            ++j;
        }
        uniform_byte = word & ((1u << bits_per_byte) - 1);
        /* decision */
        if (uniform_byte < x)
            out[i] = 1;
        else
            out[i] = 0;
        ++i;
    }

    free(rnd_bytes);
    return 0;
}
Usage:
int
main(void)
{
    int res;
    char c[256];

    res = get_bernoulli_samples(c, sizeof(c), 256*256, 328); /* 328/(256^2) = 0.0050 */
    if (res < 0) return -1;

    return 0;
}
Full code and results.
#1
23
If you're prepared to approximate p
based on 256 possible values, and you have a PRNG which can generate uniform values in which the individual bits are independent of each other, then you can use vectorized comparison to produce multiple biased bits from a single random number.
如果你准备基于256个可能的值来近似p,并且你有一个PRNG,它可以生成统一的值,在这个值中,每个位是相互独立的,那么你可以使用矢量化比较从一个随机数生成多个有偏位。
That's only worth doing if (1) you worry about random number quality and (2) you are likely to need a large number of bits with the same bias. The second requirement seems to be implied by the original question, which criticizes a proposed solution, as follows: "A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions." Here, the implication seems to be that it is useful to generate a large block of biased bits in a single call.
只有当(1)你担心随机数质量(2)你可能需要大量具有相同偏差的比特时,这才值得去做。第二个需求似乎暗示了原始的问题,批评建议的解决方案,如下:“这个解决方案的不足是它只能产生8位,甚至,在很多工作,而一个无偏PRNG可以产生64几算术指令。”这里的含义似乎是,在一次调用中生成一大块有偏差的比特是有用的。
Random-number quality is a difficult subject. It's hard if not impossible to measure, and therefore different people will propose different metrics which emphasize and/or devalue different aspects of "randomness". It is generally possible to trade off speed of random-number generation for lower "quality"; whether this is worth doing depends on your precise application.
随机数质量是一个难题。很难衡量,因此不同的人会提出不同的衡量标准,强调和/或贬低“随机性”的不同方面。通常可以用随机数生成的速度换取较低的“质量”;是否值得这样做取决于您的精确应用。
The simplest possible tests of random number quality involve the distribution of individual values and the cycle length of the generator. Standard implementations of the C library rand
and Posix random
functions will typically pass the distribution test, but the cycle lengths are not adequate for long-running applications.
对随机数质量的最简单的测试涉及单个值的分布和生成器的周期长度。C库rand和Posix随机函数的标准实现通常会通过分发测试,但是对于长时间运行的应用程序来说,循环长度不够。
These generators are typically extremely fast, though: the glibc implementation of random
requires only a few cycles, while the classic linear congruential generator (LCG) requires a multiply and an addition. (Or, in the case of the glibc implementation, three of the above to generate 31 bits.) If that's sufficient for your quality requirements, then there is little point trying to optimize, particularly if the bias probability changes frequently.
这些生成器通常非常快:随机的glibc实现只需要几个周期,而经典的线性一致生成器(LCG)需要一个乘法和一个加法。(或者,在glibc实现的情况下,上面的三个可以生成31位。)如果这足够满足您的质量需求,那么尝试优化是没有意义的,特别是当偏差概率频繁变化时。
Bear in mind that the cycle length should be a lot longer than the number of samples expected; ideally, it should be greater than the square of that number, so a linear-congruential generator (LCG) with a cycle length of 231 is not appropriate if you expect to generate gigabytes of random data. Even the Gnu trinomial nonlinear additive-feedback generator, whose cycle length is claimed to be approximately 235, shouldn't be used in applications which will require millions of samples.
记住,循环长度应该比预期的样本数量长得多;理想情况下,它应该大于这个数字的平方,因此,如果你期望生成千兆字节的随机数据,那么一个周期长度为231的线性同余发生器(LCG)是不合适的。即使是号称周期长度约为235的Gnu三叉非线性附加反馈生成器,也不应该在需要数百万个样本的应用程序中使用。
Another quality issue, which is much harder to test, relates to the independence on consecutive samples. Short cycle lengths completely fail on this metric, because once the repeat starts, the generated random numbers are precisely correlated with historical values. The Gnu trinomial algorithm, although its cycle is longer, has a clear correlation as a result of the fact that the ith random number generated, ri, is always one of the two values ri−3+ri−31 or ri−3+ri−31+1. This can have surprising or at least puzzling consequences, particularly with Bernoulli experiments.
另一个更难测试的质量问题与连续样本的独立性有关。短周期长度在这个度量上完全失败,因为一旦重复开始,生成的随机数就与历史值精确相关。Gnu三叉树算法,尽管它的周期较长,有一个明确的相关结果的第i个随机数生成,ri,总是两个值之一ri−3 + ri−31或扶轮扶轮−−3 + 31 + 1。这可能会让人感到惊讶,或者至少是令人困惑的结果,尤其是伯努利实验。
Here's an implementation using Agner Fog's useful vector class library, which abstracts away a lot of the annoying details in SSE intrinsics, and also helpfully comes with a fast vectorized random number generator (found in special.zip
inside the vectorclass.zip
archive), which lets us generate 256 bits from eight calls to the 256-bit PRNG. You can read Dr. Fog's explanation of why he finds even the Mersenne twister to have quality issues, and his proposed solution; I'm not qualified to comment, really, but it does at least appear to give expected results in the Bernoulli experiments I have tried with it.
这里有一个使用Agner Fog的有用的向量类库的实现,该库抽象出了SSE intrinsic中很多烦人的细节,并且还附带了一个快速矢量随机数生成器(在special中可以找到)。邮政vectorclass内。它允许我们从对256位PRNG的8次调用中生成256位。你可以读到福格博士的解释,为什么他发现即使是梅尔森·特威斯特也有质量问题,以及他提出的解决方案;我没有资格评论,真的,但它至少在伯努利实验中给出了预期的结果,我已经尝试过了。
#include "vectorclass/vectorclass.h"
#include "vectorclass/ranvec1.h"
class BiasedBits {
public:
// Default constructor, seeded with fixed values
BiasedBits() : BiasedBits(1) {}
// Seed with a single seed; other possibilities exist.
BiasedBits(int seed) : rng(3) { rng.init(seed); }
// Generate 256 random bits, each with probability `p/256` of being 1.
Vec8ui random256(unsigned p) {
if (p >= 256) return Vec8ui{ 0xFFFFFFFF };
Vec32c output{ 0 };
Vec32c threshold{ 127 - p };
for (int i = 0; i < 8; ++i) {
output += output;
output -= Vec32c(Vec32c(rng.uniform256()) > threshold);
}
return Vec8ui(output);
}
private:
Ranvec1 rng;
};
In my test, that produced and counted 268435456 bits in 260 ms, or one bit per nanosecond. The test machine is an i5, so it doesn't have AVX2; YMMV.
在我的测试中,它在260毫秒内产生并计数了268435456位,或者说每纳秒1位。测试机器是i5,所以没有AVX2;YMMV。
In the actual use case, with 201 possible values for p
, the computation of 8-bit threshold values will be annoyingly imprecise. If that imprecision is undesired, you could adapt the above to use 16-bit thresholds, at the cost of generating twice as many random numbers.
在实际的用例中,如果p有201个可能的值,那么8位阈值的计算将非常不精确。如果不希望出现这种不精确,可以使用上面的16位阈值,代价是生成两倍的随机数。
Alternatively, you could hand-roll a vectorization based on 10-bit thresholds, which would give you a very good approximation to 0.5% increments, using the standard bit-manipulation hack of doing the vectorized threshold comparison by checking for borrow on every 10th bit of the subtraction of the vector of values and the repeated threshold. Combined with, say, std::mt19937_64
, that would give you an average of six bits each 64-bit random number.
或者,您可以使用一个基于10位向量化阈值,这将给你一个很好的近似增量的0.5%,使用标准的位操作黑客做矢量化的阈值比较,检查在每10日借的减法和重复阈值向量的值。结合std::mt19937_64,每个64位随机数平均有6位。
#2
27
One thing you can do is to sample from the underlying unbiased generator multiple times, getting several 32-bit or 64-bit words, and then performing bitwise boolean arithmetic. As an example, for 4 words b1,b2,b3,b4
, you can get the following distributions:
您可以做的一件事是从底层的无偏生成器中抽取多次样本,获取几个32位或64位的单词,然后执行位布尔运算。例如,对于4个单词b1,b2,b3,b4,你可以得到以下分布:
expression | p(bit is 1) -----------------------+------------- b1 & b2 & b3 & b4 | 6.25% b1 & b2 & b3 | 12.50% b1 & b2 & (b3 | b4) | 18.75% b1 & b2 | 25.00% b1 | (b2 & (b3 | b4)) | 31.25% b1 & (b2 | b3) | 37.50% b1 & (b2 | b3 | b4)) | 43.75% b1 | 50.00%
Similar constructions can be made for finer resolutions. It gets a bit tedious and still requires more generator calls, but at least not one per bit. This is similar to a3f's answer, but is probably easier to implement and, I suspect, faster than scanning words for 0xF
nybbles.
类似的结构可以用于更精细的分辨率。它变得有点乏味,并且仍然需要更多的生成器调用,但至少不是每一点。这类似于a3f的答案,但可能比扫描0xF nybbles更快、更容易实现。
Note that for your desired 0.5% resolution, you would need 8 unbiased words for one biased word, which would give you a resolution of (0.5^8) = 0.390625%.
注意,0.5%所需的分辨率,需要8无偏词一个偏见,这将给你一个解决(0.5 ^ 8)= 0.390625%。
#3
17
From an information-theoretic point of view, a biased stream of bits (with p != 0.5
) has less information in it than an unbiased stream, so in theory it should take (on average) less than 1 bit of the unbiased input to produce a single bit of the biased output stream. For example, the entropy of a Bernoulli random variable with p = 0.1
is -0.1 * log2(0.1) - 0.9 * log2(0.9)
bits, which is around 0.469
bits. That suggests that for the case p = 0.1
we should be able to produce a little over two bits of the output stream per unbiased input bit.
从信息理论的观点来看,一个有偏置的比特流(p != 0.5)比一个无偏置的流含有更少的信息,所以在理论上,它应该(平均)使用小于1位的无偏置输入来产生一个有偏置的输出流。例如,p = 0.1的伯努利随机变量的熵为-0.1 * log2(0.1) - 0.9 * log2(0.9)位,约为0.469位。这表明,对于p = 0.1的情况,我们应该能够在每个不带偏见的输入位上产生一点超过两比特的输出流。
Below, I give two methods for producing the biased bits. Both achieve close to optimal efficiency, in the sense of requiring as few input unbiased bits as possible.
下面,我将给出两种产生偏置位的方法。两者都能达到最优效率,即要求输入尽可能少的无偏位。
Method 1: arithmetic (de)coding
A practical method is to decode your unbiased input stream using arithmetic (de)coding, as already described in the answer from alexis. For this simple a case, it's not hard to code something up. Here's some unoptimised pseudocode (cough, Python) that does this:
一个实用的方法是使用算术(去)编码解码您的无偏输入流,正如亚历克西斯的答案中已经描述的那样。对于这个简单的例子来说,编写代码并不难。下面是一些未优化的伪代码(咳,Python),它们可以做到这一点:
import random
def random_bits():
"""
Infinite generator generating a stream of random bits,
with 0 and 1 having equal probability.
"""
global bit_count # keep track of how many bits were produced
while True:
bit_count += 1
yield random.choice([0, 1])
def bernoulli(p):
"""
Infinite generator generating 1-bits with probability p
and 0-bits with probability 1 - p.
"""
bits = random_bits()
low, high = 0.0, 1.0
while True:
if high <= p:
# Generate 1, rescale to map [0, p) to [0, 1)
yield 1
low, high = low / p, high / p
elif low >= p:
# Generate 0, rescale to map [p, 1) to [0, 1)
yield 0
low, high = (low - p) / (1 - p), (high - p) / (1 - p)
else:
# Use the next random bit to halve the current interval.
mid = 0.5 * (low + high)
if next(bits):
low = mid
else:
high = mid
Here's an example usage:
这里有一个例子使用方法:
import itertools
bit_count = 0
# Generate a million deviates.
results = list(itertools.islice(bernoulli(0.1), 10**6))
print("First 50:", ''.join(map(str, results[:50])))
print("Biased bits generated:", len(results))
print("Unbiased bits used:", bit_count)
print("mean:", sum(results) / len(results))
The above gives the following sample output:
上面给出了以下示例输出:
First 50: 00000000000001000000000110010000001000000100010000
Biased bits generated: 1000000
Unbiased bits used: 469036
mean: 0.100012
As promised, we've generated 1 million bits of our output biased stream using fewer than five hundred thousand from the source unbiased stream.
如前所述,我们已经使用不到50万的源无偏流产生了100万比特的输出流。
For optimisation purposes, when translating this into C / C++ it may make sense to code this up using integer-based fixed-point arithmetic rather than floating-point.
出于优化目的,当将其转换为C / c++时,使用基于整数的定点算法而不是浮点数来编写代码是有意义的。
Method 2: integer-based algorithm
Rather than trying to convert the arithmetic decoding method to use integers directly, here's a simpler approach. It's not quite arithmetic decoding any more, but it's not totally unrelated, and it achieves close to the same output-biased-bit / input-unbiased-bit ratio as the floating-point version above. It's organised so that all quantities fit into an unsigned 32-bit integer, so should be easy to translate to C / C++. The code is specialised to the case where p
is an exact multiple of 1/200
, but this approach would work for any p
that can be expressed as a rational number with reasonably small denominator.
与其尝试将算术解码方法转换为直接使用整数,这里有一个更简单的方法。它不再是算术解码,但也不是完全不相关的,它与上面的浮点版本的输出-偏比特/输入-非偏比特比接近。它是有组织的,以便所有的量都适合一个无符号的32位整数,因此应该很容易转换为C / c++。这段代码专门适用于p是1/200的精确倍数的情况,但是这种方法适用于任何能以合理的最小分母表示为有理数的p。
def bernoulli_int(p):
"""
Infinite generator generating 1-bits with probability p
and 0-bits with probability 1 - p.
p should be an integer multiple of 1/200.
"""
bits = random_bits()
# Assuming that p has a resolution of 0.05, find p / 0.05.
p_int = int(round(200*p))
value, high = 0, 1
while True:
if high < 2**31:
high = 2 * high
value = 2 * value + next(bits)
else:
# Throw out everything beyond the last multiple of 200, to
# avoid introducing a bias.
discard = high - high % 200
split = high // 200 * p_int
if value >= discard: # rarer than 1 time in 10 million
value -= discard
high -= discard
elif value >= split:
yield 0
value -= split
high = discard - split
else:
yield 1
high = split
The key observation is that every time we reach the beginning of the while
loop, value
is uniformly distributed amongst all integers in [0, high)
, and is independent of all previously-output bits. If you care about speed more than perfect correctness, you can get rid of discard
and the value >= discard
branch: that's just there to ensure that we output 0
and 1
with exactly the right probabilities. Leave that complication out, and you'll just get almost the right probabilities instead. Also, if you make the resolution for p
equal to 1/256
rather than 1/200
, then the potentially time-consuming division and modulo operations can be replaced with bit operations.
关键的观察是每次我们到达while循环的开始时,值在[0,high)中的所有整数中均匀分布,并且独立于所有先前输出的位。如果你关心的是速度,而不是完美的正确性,你可以去掉丢弃和值>=丢弃的分支:这只是为了确保我们输出0和1的概率是正确的。把这个复杂的问题抛在脑后,你会得到几乎正确的概率。同样,如果你让p的分辨率为1/256而不是1/200,那么可能耗时的除法和模块操作就可以用位操作代替。
With the same test code as before, but using bernoulli_int
in place of bernoulli
, I get the following results for p=0.1
:
使用与之前相同的测试代码,但是使用bernoulli_int代替bernoulli,我得到了p=0.1的如下结果:
First 50: 00000010000000000100000000000000000000000110000100
Biased bits generated: 1000000
Unbiased bits used: 467997
mean: 0.099675
#4
9
Let's say the probability of a 1 appearing is 6,25% (1/16). There are 16 possible bit patterns for a 4 bit-number: 0000,0001, ..., 1110,1111
.
假设1出现的概率是6 25% (1/16)4位数字有16种可能的位模式:0000,0001,…,1110,1111。
Now, just generate a random number like you used to and replace every 1111
at a nibble-boundary with a 1
and turn everything else to a 0
.
现在,像以前一样生成一个随机数,把每个1111在一个nibbll边界上替换为1,把其他的都变成0。
Adjust accordingly for other probabilities.
根据其他可能性进行相应调整。
#5
9
You'll get theoretically optimal behavior, i.e. make truly minimal use of the random number generator and be able to model any probability p exactly, if you approach this using arithmetic coding.
你会得到理论上的最优行为,也就是说,尽可能少地使用随机数生成器,如果你用算术编码的方法,你就能精确地建模任意概率p。
Arithmetic coding is a form of data compression that represents the message as a sub-interval of a number range. It provides theoretically optimal encoding, and can use a fractional number of bits for each input symbol.
算术编码是一种数据压缩形式,它将消息表示为数字范围的子区间。它提供了理论上的最优编码,并且可以为每个输入符号使用分数位数。
The idea is this: Imagine that you have a sequence of random bits, which are 1 with probability p. For convenience, I will instead use q
for the probability of the bit being zero. (q = 1-p). Arithmetic coding assigns to each bit part of the number range. For the first bit, assign the interval [0, q) if the input is 0, and the interval [q, 1) if the input is 1. Subsequent bits assign proportional sub-intervals of the current range. For example, suppose that q = 1/3 The input 1 0 0 will be encoded like this:
这个想法是这样的:假设你有一个随机位序列,它是1,概率p,为了方便,我用q来表示比特为0的概率。(q = 1 - p)。算术编码分配给每个位的数字范围。对于第一个位,如果输入为0,则分配区间[0,q],如果输入为1,则分配区间[q, 1]。后续的位分配当前范围的比例子区间。例如,假设q = 1/3时输入的100将被这样编码:
Initially [0, 1), range = 1
After 1 [0.333, 1), range = 0.6666
After 0 [0.333, 0.5555), range = 0.2222
After 0 [0.333, 0.407407), range = 0.074074
The first digit, 1, selects the top two-thirds (1-q) of the range; the second digit, 0, selects the bottom third of that, and so on. After the first and second step, the interval stradles the midpoint; but after the third step it is entirely below the midpoint, so the first compressed digit can be output: 0
. The process continues, and a special EOF
symbol is added as a terminator.
第一个数字1,选择范围的前三分之二(1-q);第二个数字,0,选择了底部的三分之一,以此类推。第一步和第二步后,间隔跨越中点;但在第三步之后,它完全低于中点,所以第一个压缩数字可以输出:0。进程继续,并添加一个特殊的EOF符号作为终止符。
What does this have to do with your problem? The compressed output will have random zeros and ones with equal probability. So, to obtain bits with probability p, just pretend that the output of your RNG is the result of arithmetic coding as above, and apply the decoder process to it. That is, read bits as if they subdivide the line interval into smaller and smaller pieces. For example, after we read 01
from the RNG, we will be in the range [0.25, 0.5). Keep reading bits until enough output is "decoded". Since you're mimicking decompressing, you'll get more random bits out than you put in. Because arithmetic coding is theoretically optimal, there's no possible way to turn the RNG output into more biased bits without sacrificing randomness: you're getting the true maximum.
这和你的问题有什么关系?压缩后的输出将有随机的0和等概率的1。因此,要获得具有概率p的比特,只需假设RNG的输出是上述算术编码的结果,并将解码器进程应用到它。也就是说,读位就好像他们把线间隔细分成越来越小的块。例如,当我们从RNG中读取01后,我们将在[0.25,0.5]范围内。一直读位直到有足够的输出被“解码”。因为你在模仿减压,你会得到更多的随机比特。因为算术编码在理论上是最优的,所以不可能在不牺牲随机性的情况下将RNG输出转换成更有偏置的比特:您将得到真正的最大值。
The catch is that you can't do this in a couple of lines of code, and I don't know of a library I can point you to (though there must be some you could use). Still, it's pretty simple. The above article provides code for a general-purpose encoder and decoder, in C. It's pretty straightforward, and it supports multiple input symbols with arbitrary probabilities; in your case a far simpler implementation is possible (as Mark Dickinson's answer now shows), since the probability model is trivial. For extended use, a bit more work would be needed to produce a robust implementation that does not do a lot of floating-point computation for each bit.
问题是,您不能在几行代码中这样做,而且我也不知道我可以向您指出的库(尽管一定有一些您可以使用的)。不过,很简单。上面的文章为c语言的通用编码器和解码器提供了代码,它非常简单,并且支持具有任意概率的多个输入符号;在您的例子中,更简单的实现是可能的(正如Mark Dickinson现在的答案所示),因为概率模型是微不足道的。对于扩展使用,需要做更多的工作来生成一个健壮的实现,它不需要为每个比特做大量的浮点计算。
Wikipedia also has an interesting discussion of arithmetic encoding considered as change of radix, which is another way to view your task.
*还对算术编码进行了有趣的讨论,认为它是对基数的改变,这是查看任务的另一种方式。
#6
8
Uh, pseudo-random number generators are generally quite fast. I'm not sure what language this is (Python, perhaps), but "result.append" (which almost certainly contains memory allocation) is likely slower than "random_uniform" (which just does a little math).
伪随机数生成器通常都是非常快的。我不确定这是什么语言(也许是Python),而是“结果”。append(几乎可以肯定包含内存分配)可能比“random_uniform”(只做一点数学运算)要慢。
If you want to optimize the performance of this code:
如果您想优化此代码的性能:
- Verify that it is a problem. Optimizations are a bit of work and make the code harder to maintain. Don't do them unless necessary.
- 确认它是一个问题。优化是一项工作,使代码更难维护。除非必要,否则不要做。
- Profile it. Run some tests to determine which parts of the code are actually the slowest. Those are the parts you need to speed up.
- 概要文件。运行一些测试来确定代码的哪些部分实际上是最慢的。这些是你需要加速的部分。
- Make your changes, and verify that they actually are faster. Compilers are pretty smart; often clear code will compile into better code that something complex than might appear faster.
- 进行更改,并验证它们实际上更快。编译器很聪明;通常,清晰的代码会编译成更好的代码,而复杂的代码可能会出现得更快。
If you are working in a compiled language (even JIT compiled), you take a performance hit for every transfer of control (if, while, function call, etc). Eliminate what you can. Memory allocation is also (usually) quite expensive.
If you are working in an interpreted language, all bets are off. The simplest code is very likely the best. The overhead of the interpreter will dwarf whatever you are doing, so reduce its work as much as possible.
I can only guess where your performance problems are:
- Memory allocation. Pre-allocate the array at its full size and fill in the entries later. This ensures that the memory won't need to be reallocated while you're adding the entries.
- Branches. You might be able to avoid the "if" by casting the result or something similar. This will depend a lot on the compiler. Check the assembly (or profile) to verify that it does what you want.
- Numeric types. Find out the type your random number generator uses natively, and do your arithmetic in that type. For example, if the generator naturally returns 32-bit unsigned integers, scale "p" to that range first, then use it for the comparison (see the sketch after this list).
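As a sketch of the last two points combined (native integer comparison, no per-bit branch), assuming a hypothetical rand32() that natively returns uniform 32-bit integers; the caller pre-allocates the output at its full size, per the first point above:

#include <stddef.h>
#include <stdint.h>

extern uint32_t rand32(void);  /* assumed: uniform 32-bit generator */

void generate_biased_stream(uint8_t *result, size_t n, double p)
{
    /* scale p to the generator's native range once, outside the loop;
     * note p == 1.0 would need special-casing (2^32 does not fit) */
    uint32_t threshold = (uint32_t)(p * 4294967296.0);
    for (size_t i = 0; i < n; i++)
        result[i] = (uint8_t)(rand32() < threshold);  /* branch-free */
}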
By the way, if you really want to use the least bits of randomness possible, use "arithmetic coding" to decode your random stream. It won't be fast.
#7
7
One way that would give a precise result is to first randomly generate, for a k-bit block, the number of 1 bits following the binomial distribution, and then generate a k-bit word with exactly that many bits set using one of the methods here. For example the method by mic006 needs only about log k k-bit random numbers, and mine needs only one.
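A sketch of this two-step scheme for k = 64, with simple stand-ins for both steps: CDF inversion for the binomial draw, and a rejection fill for the fixed-popcount word (the fixed-popcount methods referenced above would replace the latter). rand64() and rand_double() are hypothetical unbiased sources, and 0 < p < 1 is assumed:

#include <stdint.h>
#include <math.h>

extern uint64_t rand64(void);      /* assumed: uniform 64-bit generator */
extern double   rand_double(void); /* assumed: uniform in [0, 1) */

/* Sample m ~ B(64, p) by walking the CDF; assumes 0 < p < 1. */
static int binomial_64(double p)
{
    double u = rand_double();
    double pmf = pow(1.0 - p, 64);  /* P(m = 0) */
    double cdf = pmf;
    int m = 0;
    while (cdf < u && m < 64) {
        /* pmf(m+1) = pmf(m) * (64-m)/(m+1) * p/(1-p) */
        pmf *= (double)(64 - m) / (double)(m + 1) * p / (1.0 - p);
        cdf += pmf;
        ++m;
    }
    return m;
}

/* Uniform 64-bit word with exactly m bits set: draw distinct positions.
 * (Slow when m is large; for m > 32 fill the complement instead.) */
uint64_t biased_word(double p)
{
    int m = binomial_64(p);
    uint64_t w = 0;
    while (m > 0) {
        uint64_t bit = 1ull << (rand64() & 63);
        if (!(w & bit)) { w |= bit; --m; }  /* skip already-set positions */
    }
    return w;
}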
#8
6
If p is close to 0, you can calculate the probability that the n-th bit is the first bit that is 1; then you calculate a random number between 0 and 1 and pick n accordingly. For example if p = 0.005 (0.5%), and the random number is 0.638128, you might calculate (I'm guessing here) n = 321, so you fill with 321 0 bits and one bit set.
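In other words, the gap length follows a geometric distribution, so n can be obtained from a single uniform draw by inversion. A sketch, with rand_double() as a hypothetical uniform [0, 1) source:

#include <math.h>

extern double rand_double(void);  /* assumed: uniform in [0, 1) */

/* Number of 0 bits to emit before the next 1 bit, for 0 < p < 1:
 * P(gap = n) = (1-p)^n * p, inverted via n = floor(log(1-u)/log(1-p)). */
long next_gap(double p)
{
    double u = rand_double();
    return (long)floor(log1p(-u) / log1p(-p));
}

Each uniform draw then yields about 1/p output bits on average.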
If p is close to 1, use 1-p instead of p, and fill with 1 bits plus one 0 bit.
If p isn't close to 1 or 0, make a table of all 256 sequences of 8 bits, calculate their cumulative probabilities, then get a random number, do a binary search in the array of cumulative probabilities, and you can set 8 bits.
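A sketch of that table approach, with rand_double() again standing in for a uniform [0, 1) source:

#include <stdint.h>

extern double rand_double(void);  /* assumed: uniform in [0, 1) */

static double cdf[256];  /* cumulative probability of each byte value */

/* Build the table once per p:
 * P(byte b) = p^popcount(b) * (1-p)^(8-popcount(b)). */
void build_table(double p)
{
    double acc = 0.0;
    for (int b = 0; b < 256; b++) {
        double prob = 1.0;
        for (int i = 0; i < 8; i++)
            prob *= ((b >> i) & 1) ? p : 1.0 - p;
        acc += prob;
        cdf[b] = acc;
    }
}

/* Draw 8 biased bits: find the first entry with cdf[b] >= u. */
uint8_t biased_byte(void)
{
    double u = rand_double();
    int lo = 0, hi = 255;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1;
        else hi = mid;
    }
    return (uint8_t)lo;
}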
#9
6
Assuming that you have access to a generator of random bits, you can generate a value to compare with p bit by bit, and abort as soon as you can prove that the generated value is less than p, or greater than or equal to p.
Proceed as follows to create one item in a stream with given probability p:
- Start with "0." in binary
- Append a random bit; assuming that a 1 has been drawn, you'll get "0.1"
- If the result (in binary notation) is provably smaller than p, output a 1
- If the result is provably larger than or equal to p, output a 0
- Otherwise (if neither can be ruled out), proceed with step 2.
Let's assume that p in binary notation is 0.1001101...; if this process generates any of 0.0, 0.1000, 0.10010, ..., the value cannot become larger than or equal to p anymore; if any of 0.11, 0.101, 0.100111, ... is generated, the value cannot become smaller than p.
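A sketch of this loop, assuming p is supplied as an exact 16-bit fraction p_num/65536 and next_random_bit() is a hypothetical unbiased bit source:

#include <stdint.h>

extern int next_random_bit(void);  /* assumed: returns an unbiased 0/1 */

/* One biased bit with P(1) = p_num/65536: compare the random value
 * against the binary digits of p, most significant first. The first
 * position where they differ settles the comparison. */
int biased_bit(uint16_t p_num)
{
    for (int i = 15; i >= 0; i--) {
        int p_bit = (p_num >> i) & 1;
        int r_bit = next_random_bit();
        if (r_bit < p_bit) return 1;  /* value provably smaller than p */
        if (r_bit > p_bit) return 0;  /* value provably >= p */
    }
    return 0;  /* all digits matched: value == p exactly, i.e. not below p */
}

Because p_num is exact, the output probability is exactly p_num/65536.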
To me, it looks like this method uses about two random bits in expectation. Arithmetic coding (as shown in the answer by Mark Dickinson) consumes at most one random bit per biased bit (on average) for fixed p; the cost of modifying p is unclear.
#10
5
What it does
This implementation makes a single call to the random device kernel module via the "/dev/urandom" special character file to get the amount of random data needed to represent all values at the given resolution. The maximum possible resolution is 1/256^2, so 0.005 can be represented by:
328/256^2, i.e.:

resolution: 256*256
x: 328

with error 0.000004883.
How it does that
The implementation calculates bits_per_byte, the number of uniformly distributed bits needed to handle the given resolution, i.e. to represent all @resolution values. It then makes a single call to the randomization device ("/dev/urandom" if URANDOM_DEVICE is defined; otherwise "/dev/random", which uses additional noise from device drivers and may block if there is not enough entropy) to get the required number of uniformly distributed bytes, and fills in the array rnd_bytes of random bytes. Finally it reads the needed number of bits per Bernoulli sample from each group of bytes_per_byte bytes of the rnd_bytes array and compares the integer value of these bits to the probability of success in a single Bernoulli outcome given by x/resolution. If the value hits, i.e. it falls in the segment of length x/resolution, which we arbitrarily choose to be the [0, x/resolution) segment, then we note a success and insert 1 into the resulting array.
Read from random device:
#include <fcntl.h>   /* open, O_RDONLY */
#include <unistd.h>  /* read, close */

/* if defined use /dev/urandom (will not block),
 * if not defined use /dev/random (may block) */
#define URANDOM_DEVICE 1

/*
 * @brief Read @outlen bytes from random device
 *        to array @out.
 */
int
get_random_samples(char *out, size_t outlen)
{
    ssize_t res;
#ifdef URANDOM_DEVICE
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) return -1;
    res = read(fd, out, outlen);
    if (res < 0) {
        close(fd);
        return -2;
    }
#else
    size_t read_n;
    int fd = open("/dev/random", O_RDONLY);
    if (fd == -1) return -1;
    read_n = 0;
    while (read_n < outlen) {
        res = read(fd, out + read_n, outlen - read_n);
        if (res < 0) {
            close(fd);
            return -3;
        }
        read_n += res;
    }
#endif /* URANDOM_DEVICE */
    close(fd);
    return 0;
}
Fill in vector of Bernoulli samples:
#include <stdint.h>  /* uint32_t, uint8_t */
#include <stdlib.h>  /* malloc, free */

/* assumed definition, not shown in the excerpt */
#define BITS_PER_BYTE 8

/* integer log2 of the power-of-2 @resolution
 * (assumed definition, not shown in the excerpt) */
static uint8_t
log_int(uint32_t resolution)
{
    uint8_t bits = 0;
    while (resolution >>= 1)
        ++bits;
    return bits;
}

/*
 * @brief Draw vector of Bernoulli samples.
 * @details @x and @resolution determine probability
 *          of success in Bernoulli distribution
 *          and accuracy of results: p = x/resolution.
 * @param resolution: number of segments per sample of output array
 *                    as power of 2: max resolution supported is 2^24=16777216
 * @param x: determines used probability, x = [0, resolution - 1]
 * @param n: number of samples in result vector
 */
int
get_bernoulli_samples(char *out, uint32_t n, uint32_t resolution, uint32_t x)
{
    int res;
    size_t i, j;
    uint32_t bytes_per_byte, word;
    unsigned char *rnd_bytes;
    uint32_t uniform_byte;
    uint8_t bits_per_byte;

    if (out == NULL || n == 0 || resolution == 0 || x > (resolution - 1))
        return -1;

    bits_per_byte = log_int(resolution);
    bytes_per_byte = bits_per_byte / BITS_PER_BYTE +
                     (bits_per_byte % BITS_PER_BYTE ? 1 : 0);
    rnd_bytes = malloc(n * bytes_per_byte);
    if (rnd_bytes == NULL)
        return -2;
    res = get_random_samples((char *)rnd_bytes, n * bytes_per_byte);
    if (res < 0) {
        free(rnd_bytes);
        return -3;
    }
    i = 0;
    while (i < n) {
        /* get Bernoulli sample: assemble bytes_per_byte bytes into a word */
        j = 0;
        word = 0;
        while (j < bytes_per_byte) {
            word |= (uint32_t)rnd_bytes[i * bytes_per_byte + j] << (BITS_PER_BYTE * j);
            ++j;
        }
        uniform_byte = word & ((1u << bits_per_byte) - 1);
        /* decision: success iff the uniform value falls in [0, x) */
        if (uniform_byte < x)
            out[i] = 1;
        else
            out[i] = 0;
        ++i;
    }
    free(rnd_bytes);
    return 0;
}
Usage:
int
main(void)
{
    int res;
    char c[256];

    res = get_bernoulli_samples(c, sizeof(c), 256*256, 328); /* 328/(256^2) = 0.0050 */
    if (res < 0) return -1;
    return 0;
}
Full code, results.