Change of sign when going from int to float and back

Time: 2022-10-06 16:38:05

Consider the following code, which is an SSCCE of my actual problem:

#include <iostream>

int roundtrip(int x)
{
    return int(float(x));
}

int main()
{
    int a = 2147483583;
    int b = 2147483584;
    std::cout << a << " -> " << roundtrip(a) << '\n';
    std::cout << b << " -> " << roundtrip(b) << '\n';
}

The output on my computer (Xubuntu 12.04.3 LTS) is:

2147483583 -> 2147483520
2147483584 -> -2147483648

Note how the positive number b ends up negative after the roundtrip. Is this behavior well-specified? I would have expected int-to-float round-tripping to at least preserve the sign correctly...

Hm, on ideone, the output is different:

2147483583 -> 2147483520
2147483584 -> 2147483647

Did the g++ team fix a bug in the meantime, or are both outputs perfectly valid?

2 answers

#1


69  

Your program is invoking undefined behavior because of an overflow in the conversion from floating-point to integer. What you see is only the usual symptom on x86 processors.

The float that 2147483584 rounds to is exactly 2^31. The conversion from integer to floating-point usually rounds to nearest, which can be up, and is up in this case. To be specific, when the integer cannot be represented exactly, the behavior of the conversion is implementation-defined; most implementations define the rounding as being “according to the FPU rounding mode”, and the FPU's default rounding mode is round-to-nearest.
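The claim is easy to check. The following is a minimal sketch, assuming a 32-bit int, an IEEE 754 binary32 float, the default round-to-nearest mode, and a target that does not keep excess precision in registers (for example x86-64 with SSE). The helper names to_float and conversion_is_inexact are mine, not part of the answer.

```cpp
#include <cstdint>

// Converts through a volatile so that the conversion happens at run time
// instead of being folded by the compiler.
float to_float(std::int32_t x)
{
    volatile std::int32_t v = x;
    return static_cast<float>(v);
}

// True when the conversion changed the value, i.e. x has no exact float form.
bool conversion_is_inexact(std::int32_t x)
{
    // double holds every 32-bit int and every float exactly,
    // so the comparison itself introduces no rounding.
    return static_cast<double>(to_float(x)) != static_cast<double>(x);
}
```

Under those assumptions, to_float(2147483584) compares equal to 2147483648.0f, which is 2^31 and one more than INT_MAX.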

Then, while converting the float representing 2^31 back to int, an overflow occurs, because 2^31 is one more than INT_MAX. This overflow is undefined behavior. Some processors raise an exception, others saturate. The x86 SSE instruction that compilers typically generate for this conversion (cvttss2si for float, cvttsd2si for double) happens to always return INT_MIN in case of overflow, regardless of whether the value is positive or negative.

You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating-point to integer, instruction sequences that take advantage of the undefined behavior and return results other than what you might otherwise expect for the destination integer type.
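One way to stay clear of the undefined behavior is to test the float against the representable range before converting. This is only a sketch under the assumption of a 32-bit int; the function name checked_float_to_int and its bool-plus-out-parameter interface are hypothetical, chosen for illustration.

```cpp
#include <cstdint>

// Stores the converted value in out and returns true when f is inside the
// range of a 32-bit int. Returns false for NaN and for out-of-range values,
// so the undefined conversion is never executed.
bool checked_float_to_int(float f, std::int32_t& out)
{
    // Both bounds are powers of two and therefore exact as float.
    const float lower = -2147483648.0f; // INT32_MIN, inclusive
    const float upper = 2147483648.0f;  // INT32_MAX + 1, exclusive
    if (!(f >= lower && f < upper))     // the negated form also rejects NaN
        return false;
    out = static_cast<std::int32_t>(f);
    return true;
}
```

With this check, the round trip of 2147483584 reports a failure instead of silently producing INT_MIN or INT_MAX.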

#2


10  

Pascal's answer is fine, but it lacks the details some readers need to follow it ;-). If you are interested in how this looks at a lower level (assuming a coprocessor, not software, handles the floating-point operations), read on.

A 32-bit float (IEEE 754) can store every integer in the range [-2^24, 2^24]. Integers outside that range may also have an exact float representation, but not all of them do. The limitation is that a float has only 24 significant bits to work with.
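The 2^24 limit can be found empirically. The sketch below assumes an IEEE 754 binary32 float and the default round-to-nearest mode; the function name is hypothetical. It scans upward until an integer no longer survives a trip through float.

```cpp
#include <cstdint>

// Returns the smallest positive integer that float cannot represent exactly,
// found by scanning upward until a round trip through float changes the value.
std::int64_t first_unrepresentable_integer()
{
    for (std::int64_t i = 1; ; ++i) {
        // volatile forces the value to be rounded to a real 32-bit float
        volatile float f = static_cast<float>(i);
        if (static_cast<std::int64_t>(f) != i)
            return i;
    }
}
```

The scan stops at 2^24 + 1 = 16777217, the first integer that needs 25 significant bits.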

Here is what the int->float conversion typically looks like at a low level:

fild dword ptr[your int]
fstp dword ptr[your float]

It is just a sequence of two coprocessor instructions. The first loads the 32-bit int onto the coprocessor's stack and converts it into an 80-bit wide float.

Intel® 64 and IA-32 Architectures Software Developer’s Manual (PROGRAMMING WITH THE X87 FPU):

When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format).

Since FPU registers hold 80-bit wide floats, there is no issue with fild here: a 32-bit int fits perfectly in the 64-bit significand of that format.

So far so good.

The second part, fstp, is a bit tricky and may be surprising. It has to store an 80-bit floating-point value in a 32-bit float. Although the question is all about integer values, the coprocessor may actually perform 'rounding'. Huh? How do you round an integer value, even if it is stored in floating-point format? ;-)

I'll explain shortly. Let's first see which rounding modes the x87 provides (they are its incarnation of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes, controlled by bits #10 and #11 of the FPU control word:

  • 00 - to nearest even - The rounded result is the closest to the infinitely precise result. If two values are equally close, the result is the even value (that is, the one whose least-significant bit is zero). This is the default.
  • 01 - toward -Inf
  • 10 - toward +Inf
  • 11 - toward 0 (i.e. truncate)

You can play with the rounding modes using this simple code (it can be done differently; this shows the low-level way):

enum ROUNDING_MODE
{
    RM_TO_NEAREST  = 0x00,
    RM_TOWARD_MINF = 0x01,
    RM_TOWARD_PINF = 0x02,
    RM_TOWARD_ZERO = 0x03 // TRUNCATE
};

void set_round_mode(enum ROUNDING_MODE rm)
{
    short csw;
    short tmp = rm;

    _asm
    {
        push ax
        fstcw [csw]
        mov ax, [csw]
        and ax, ~(3<<10)
        shl [tmp], 10
        or ax, tmp
        mov [csw], ax
        fldcw [csw]
        pop ax
    }
}
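The inline assembly above is specific to 32-bit MSVC. A portable alternative, not taken from the answer, is the standard <cfenv> interface. The sketch below assumes the hardware int-to-float conversion honors the current rounding mode (true for x87 and SSE) and that the compiler does not reorder the conversion across the mode changes; strictly conforming code would also need `#pragma STDC FENV_ACCESS ON`, which not every compiler supports. The function name is hypothetical.

```cpp
#include <cfenv>

// Converts value to float under the given rounding mode (FE_TONEAREST,
// FE_DOWNWARD, FE_UPWARD or FE_TOWARDZERO), then restores the previous mode.
float int_to_float_with_mode(int value, int mode)
{
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile int input = value;                        // prevents compile-time folding
    volatile float result = static_cast<float>(input); // conversion happens here
    std::fesetround(saved);
    return result;
}
```

Called with 67108871, this should give 67108864 under FE_TOWARDZERO and 67108872 under FE_TONEAREST, the two results discussed further down.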

OK, nice, but how is that related to integer values? Patience... To understand why rounding modes are involved in int to float conversion, look at the most obvious way of converting int to float, truncation (which is not the default). It may look like this:

  • record the sign
  • negate the int if it is less than zero
  • find the position of the leftmost 1
  • shift the int right or left so that the 1 found above lands on bit #23
  • record the number of shifts made during the process so that you can calculate the exponent

And code simulating this behavior may look like this:

#include <cstdint>
#include <cstring>

float int2float(int value)
{
    // assumes a 32-bit int and an IEEE 754 binary32 float
    // exact for all values in [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method uses truncation 'rounding mode' during conversion

    // zero has no leading 1, so return it directly
    if (value == 0) return 0.0f;

    std::uint32_t sign = 0;
    std::uint32_t abs_value = static_cast<std::uint32_t>(value);

    // handle negative values; negating in unsigned arithmetic is well defined
    // even for -2^31, whose magnitude does not fit in int
    if (value < 0)
    {
        sign = 1u << 31;
        abs_value = 0u - abs_value;
    }

    // find leading one (bit 0 must be included, otherwise value == 1 is mishandled)
    int bit_num = 31;

    for (; bit_num >= 0; bit_num--)
    {
        if (abs_value & (1u << bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right; the low bits are chopped (truncation)
                abs_value >>= (bit_num - 23);
            }
            else
            {
                // need to shift left; nothing is lost
                abs_value <<= (23 - bit_num);
            }
            break;
        }
    }

    // exponent is biased by 127; move it to the right place
    std::uint32_t exp = static_cast<std::uint32_t>(bit_num + 127) << 23;

    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    std::uint32_t coeff = abs_value & ~(1u << 23);

    std::uint32_t bits = sign | exp | coeff;

    // copy the bit pattern into a float; memcpy avoids the aliasing
    // violation of *((float*)&bits)
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

Now an example: truncation mode converts 2147483583 to 2147483520.

2147483583 = 01111111_11111111_11111111_10111111

During int->float conversion you must shift the leftmost 1 to bit #23. Here the leading 1 is in bit #30, so placing it in bit #23 requires a right shift by 7 positions. In the process you lose the 7 least-significant bits (they do not fit in the 32-bit float format); you truncate, or chop, them. They were:

0111111 = 63

And 63 is exactly what the original number lost:

2147483583 = 2147483520 + 63
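The same arithmetic can be written with plain bit masks. This is a small illustrative sketch; the struct and function names are hypothetical, and shift is assumed to be less than 32.

```cpp
#include <cstdint>

// The part of a value that survives chopping its low bits, and the part lost.
struct Truncated {
    std::uint32_t kept;
    std::uint32_t lost;
};

// Splits value into the bits that remain after discarding the `shift`
// least-significant bits and the discarded bits themselves.
Truncated truncate_low_bits(std::uint32_t value, unsigned shift)
{
    const std::uint32_t mask = (1u << shift) - 1u; // ones in the low `shift` bits
    return Truncated{ value & ~mask, value & mask };
}
```

For 2147483583 and a shift of 7, kept is 2147483520 and lost is 63, matching the figures above.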

Truncating is easy, but it is not necessarily what you want, nor is it best in all cases. Consider the example below:

67108871 = 00000100_00000000_00000000_00000111

The above value cannot be represented exactly by a float, but check what truncation does to it. As before, we need to shift the leftmost 1 to bit #23. This requires shifting the value right by exactly 3 positions, losing the 3 least-significant bits (from now on I'll write numbers differently, showing where the implicit 24th bit of the float is and bracketing the 23 explicit bits of the significand):

00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)

Truncation chops the 3 trailing bits, leaving us with 67108864 (67108864 + 7, the value of the 3 chopped bits, = 67108871). Remember that although we shift, we compensate by adjusting the exponent (omitted here).

Is that good enough? Hey, 67108872 is perfectly representable as a 32-bit float and is much closer to the original than 67108864, right? Correct, and this is where rounding comes in when converting an int to a 32-bit float.

Now let's see how the default 'round to nearest even' mode works and what its implications are in the OP's case. Consider the same example one more time.

67108871 = 00000100_00000000_00000000_00000111

As we know, we need 3 right shifts to place the leftmost 1 in bit #23:

00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)

The 'round to nearest even' procedure involves finding the 2 representable numbers that bracket the input value 67108871 from below and from above as closely as possible. Keep in mind that inside the FPU we still operate on 80 bits, so although I show some bits as shifted out, they are still in the FPU register; they are only removed by the rounding operation when the output value is stored.

00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out)

The 2 values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:

from top:

  00000000_1.[0000000_00000000_00000000] 111 * 2^26
                                     +1
= 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872

and from below:

  00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864

Obviously 67108872 is much closer to 67108871 than 67108864 is, hence converting the 32-bit int value 67108871 gives 67108872 (in round to nearest even mode).

Now the OP's numbers (still rounding to nearest even):

 2147483583 = 01111111_11111111_11111111_10111111
= 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30

bracket values:

top:

  00000000_1.[1111111_11111111_11111111] 0111111 * 2^30
                                     +1
= 00000000_10.[0000000_00000000_00000000] * 2^30
=  00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648

bottom:

00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520

Keep in mind that the word 'even' in 'round to nearest even' matters only when the input value is exactly halfway between the bracket values. Only then does 'even' decide which bracket value is selected. In the above case it does not matter, and we simply choose the nearer value, which is 2147483520.

The OP's last case shows where the word 'even' does matter:

 2147483584 = 01111111_11111111_11111111_11000000
= 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30

bracket values are the same as previously:

top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648

bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520

There is no nearer value now (2147483648 - 2147483584 = 64 = 2147483584 - 2147483520), so we must rely on 'even' and select the top (even) value, 2147483648.
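The bracketing procedure walked through above can be sketched in software. This is an illustration of round to nearest even on a 24-bit significand, not what the FPU literally executes; the function name is hypothetical, and the result is returned as a 64-bit integer because rounding up can exceed 32 bits.

```cpp
#include <cstdint>

// Rounds a 32-bit magnitude to 24 significant bits using round to nearest,
// ties to even, and returns the rounded integer value.
std::uint64_t round_to_24_bits_nearest_even(std::uint32_t value)
{
    // Position of the leading 1.
    int top = 31;
    while (top > 0 && !(value & (1u << top)))
        --top;
    if (top < 24)
        return value; // already fits in 24 bits, nothing to round

    const int drop = top - 23;                           // low bits to discard
    const std::uint64_t unit = std::uint64_t{1} << drop; // spacing between floats here
    const std::uint64_t half = unit >> 1;
    const std::uint64_t below = value & ~(unit - 1);     // lower bracket value
    const std::uint64_t rest = value & (unit - 1);       // discarded bits

    if (rest > half) return below + unit;                // nearer to the upper bracket
    if (rest < half) return below;                       // nearer to the lower bracket
    // Exact tie: pick the bracket whose 24-bit significand is even.
    return ((below >> drop) & 1u) ? below + unit : below;
}
```

For the three inputs discussed here it reproduces the results above: 67108871 gives 67108872, 2147483583 gives 2147483520, and the tie at 2147483584 goes up to 2147483648.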

And here is the OP's problem, the one Pascal briefly described. The FPU works only on signed values, and 2147483648 cannot be stored as a signed 32-bit int, whose maximum value is 2147483647; hence the issue.

A simple proof (without quoting documentation) that the FPU works only on signed values, i.e. treats every value as signed, is to debug this:

unsigned int test = (1u << 31);

_asm
{
    fild [test]
}

Although it looks like the test value should be treated as unsigned, it is loaded as -2^31, because there are no separate instructions for loading signed and unsigned values into the FPU. Likewise, you will not find instructions that store an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, regardless of how you declared it in your program.
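For readers without an MSVC debugger at hand, the effect can be imitated in portable C++. The sketch below does not execute fild; it only reinterprets the same 32 bits as a signed integer, which is the interpretation fild applies to its memory operand, and then converts the result to float. The function name is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets the 32 bits of an unsigned value as a signed integer
// and converts that signed value to float.
float load_as_signed(std::uint32_t bits)
{
    std::int32_t as_signed;
    std::memcpy(&as_signed, &bits, sizeof as_signed); // same bits, signed view
    return static_cast<float>(as_signed);
}
```

Passing 1u << 31 yields -2147483648.0f, i.e. -2^31, even though the argument was declared unsigned.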

This was long, but I hope someone will learn something from it.
