How can Unicode iostream i/o be made to work in both Windows and Unix-land?

Time: 2020-12-08 20:43:35

Note: This is a question-with-answer in order to document a technique that others might find useful, and in order to perhaps become aware of others’ even better solutions. Do feel free to add critique or questions as comments. Also do feel free to add additional answers. :)


Problem #1:

  • Console support for Unicode via streams is severely limited at the Windows API level. The only relevant codepage available for ordinary desktop applications is 65001, UTF-8. And then interactive input fails at the API level, and even output of non-ASCII characters fails – and the C++ standard library implementations do not provide work-arounds for this problem.

#include <iostream>
#include <string>
using namespace std;

auto main() -> int
{
    wstring username;
    wcout << L"Hi, what’s your name? ";
    getline( wcin, username );
    wcout << "Pleased to meet you, " << username << "!\n";
}

H:\personal\web\blog alf on programming at wordpress\002\code>chcp 65001
Active code page: 65001

H:\personal\web\blog alf on programming at wordpress\002\code>g++ problem.input.cpp -std=c++14

H:\personal\web\blog alf on programming at wordpress\002\code>a
Hi, whatSøren Moskégård
                             ← No visible output.
H:\personal\web\blog alf on programming at wordpress\002\code>_

At the Windows API level a solution is to use non-stream-based direct console i/o when the relevant standard stream is bound to the console. For example, using the WriteConsole API function. And as an extension supported by both Visual C++ and MinGW g++ standard libraries, a mode can be set for the standard wide streams where WriteConsole is used, and there is also a mode for converting to/from UTF-8 as the external encoding.
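
As a rough illustration of the direct console i/o idea, here is a minimal sketch. The helper name `write_wide_text` is hypothetical and not part of the code in this post. It assumes that a successful `GetConsoleMode` call is an adequate test for "this handle is a console", and outside Windows it simply falls back to `std::wcout`.

```cpp
#include <iostream>
#include <string>

#ifdef _WIN32
#   include <windows.h>     // GetStdHandle, GetConsoleMode, WriteConsoleW
#endif

// Hypothetical helper (an assumption for this sketch): writes wide text to
// standard output, bypassing the stream layer when the output is a console.
inline auto write_wide_text( const std::wstring& text )
    -> bool
{
#ifdef _WIN32
    const HANDLE output = GetStdHandle( STD_OUTPUT_HANDLE );
    DWORD console_mode = 0;
    if( output != INVALID_HANDLE_VALUE && GetConsoleMode( output, &console_mode ) )
    {
        // GetConsoleMode succeeded, so the handle refers to a console.
        DWORD n_written = 0;
        return !!WriteConsoleW(
            output, text.data(), static_cast<DWORD>( text.size() ), &n_written, nullptr
            );
    }
#endif
    // Redirected output, or not Windows: use the ordinary wide stream.
    std::wcout << text;
    return !std::wcout.fail();
}
```

With redirected output the text goes through the stream instead, so the external encoding is then whatever the stream is configured for.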

And in Unix-land, a single call to setlocale( LC_ALL, "" ), or its higher level C++ equivalent, suffices to make the wide streams work.
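
The higher level C++ equivalent can be sketched as follows. The helper name `adopt_user_locale` is hypothetical. The sketch assumes that it is acceptable to keep the current locale when the environment names a locale that is not installed, in which case `std::locale( "" )` throws.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for this sketch): adopts the locale
// specified by the environment and reports the resulting global locale's name.
inline auto adopt_user_locale()
    -> std::string
{
    try
    {
        // For a named locale this also effects a `setlocale( LC_ALL, ... )`.
        std::locale::global( std::locale( "" ) );
    }
    catch( const std::runtime_error& )
    {
        // The environment's locale is not available; keep the current one.
    }
    return std::locale().name();
}
```

Calling this once at the start of `main`, before any wide stream i/o, corresponds to the single `setlocale` call mentioned above.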

But how can such modes be set transparently & automatically, so that the same ordinary standard C++ code using the wide streams will work both in Windows and Unix-land?

Noting, for the readers who shudder at the thought of using wide text in a Unix-land program, that this is in effect a pre-requisite for portable code that uses UTF-8 narrow text console i/o in Unix-land. Namely, code that automatically uses UTF-8 narrow text in Unix-land and wide text in Windows becomes possible and can be built on top of support for Unicode in Windows. But without such support, no portability for the general case.


Problem #2:

  • With use of wide streams, default conversion of output items to wchar_t const* doesn't work.

#include <iostream>
using namespace std;

struct Byte_string
{ operator char const* () const { return "Hurray, it works!"; } };

struct Wide_string
{ operator wchar_t const* () const { return L"Hurray, it works!"; } };

auto main() -> int
{
    wcout << "Byte string pointer: " << Byte_string() << endl;
    wcout << "Wide string pointer: " << Wide_string() << endl;
}

Byte string pointer: Hurray, it works!
Wide string pointer: 0x4ad018

This is a defect of the inconsistency type at the implementation level in the standard, that I reported long ago. I'm not sure of the status, it may have been forgotten (I never got any mailings about it), or maybe a fix will be applied in C++17. Anyway, how can one work around that?
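
Absent a general fix, one call-site workaround is an explicit conversion, sketched below. The helper functions `implicit_output` and `explicit_output` are hypothetical names that exist only to make the two behaviors comparable; they write to a `std::wostringstream` instead of `wcout`.

```cpp
#include <sstream>
#include <string>

struct Wide_string
{ operator wchar_t const* () const { return L"Hurray, it works!"; } };

// Hypothetical helper: relies on the implicit conversion. The templated
// operator<< for wchar_t const* is not considered, so the member overload
// taking a void pointer is chosen and the address is what gets formatted.
inline auto implicit_output( const Wide_string& x )
    -> std::wstring
{
    std::wostringstream stream;
    stream << x;
    return stream.str();
}

// Hypothetical helper: converts explicitly, so that template argument
// deduction sees a wchar_t const* and the string itself is output.
inline auto explicit_output( const Wide_string& x )
    -> std::wstring
{
    std::wostringstream stream;
    stream << static_cast<wchar_t const*>( x );
    return stream.str();
}
```

The explicit cast works, but it must be repeated at every output statement, which is why a fix that applies automatically is preferable.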


In short, how can one make standard C++ code that uses Unicode wide text console i/o, work and be practical in both Windows and Unix-land?

1 answer

#1


Fix for the conversion problem:

cppx/stdlib/impl/iostreams_conversion_defect.fix.hpp
#pragma once
//----------------------------------------------------------------------------------------
//    PROBLEM DESCRIPTION.
//
//    Output of wchar_t const* is only supported via an operator<< template. User-defined
//    conversions are not considered for template matching. This results in actual argument
//    with user conversion to wchar_t const*, for a wide stream, being presented as the
//    pointer value instead of the string.

#include <iostream>

#ifndef CPPX_NO_IOSTREAM_CONVERSION_FIX
    namespace std{
        template< class Char_traits >
        inline auto operator<<(
            basic_ostream<wchar_t, Char_traits>&    stream,
            wchar_t const                           ch
            )
            -> basic_ostream<wchar_t, Char_traits>&
        { return operator<< <wchar_t, Char_traits>( stream, ch ); }

        template< class Char_traits >
        inline auto operator<<(
            basic_ostream<wchar_t, Char_traits>&    stream,
            wchar_t const* const                    s
            )
            -> basic_ostream<wchar_t, Char_traits>&
        { return operator<< <wchar_t, Char_traits>( stream, s ); }
    }  // namespace std
#endif

Setting direct i/o mode in Windows:

This is a standard library extension that's supported by both Visual C++ and MinGW g++.

First, just because it's used in the code, definition of the Ptr type builder (the main drawback of library-provided type builders is that ordinary type inference doesn't kick in, i.e. it's necessary in some cases to still use the raw operator notation):

cppx/core_language/type_builders.hpp
⋮
    template< class T >         using Ptr           = T*;
⋮

A helper definition, because it's used in more than one file:

cppx/stdlib/Iostream_mode.hpp
#pragma once
// Mode for a possibly console-attached iostream, such as std::wcout.

namespace cppx {
    enum Iostream_mode: int { unknown, utf_8, direct_io };
}  // namespace cppx

Mode setters (base functionality):

cppx/stdlib/impl/utf8_mode.for_windows.hpp
#pragma once
// UTF-8 mode for a stream in Windows.
#ifndef _WIN32
#   error This is a Windows only implementation.
#endif

#include <cppx/core_language/type_builders.hpp>     // cppx::Ptr
#include <cppx/stdlib/Iostream_mode.hpp>

#include <stdio.h>      // FILE, stdin, stdout, stderr, etc.

// Non-standard headers, which are de facto standard in Windows:
#include <io.h>         // _setmode, _isatty, _fileno etc.
#include <fcntl.h>      // _O_WTEXT etc.

namespace cppx {

    inline
    auto set_utf8_mode( const Ptr< FILE > f )
        -> Iostream_mode
    {
        const int file_number = _fileno( f );       // See docs for error handling.
        if( file_number == -1 ) { return Iostream_mode::unknown; }
        const int new_mode = (_isatty( file_number )? _O_WTEXT : _O_U8TEXT);
        const int previous_mode = _setmode( file_number, new_mode );
        return (0?Iostream_mode()
            : previous_mode == -1?      Iostream_mode::unknown
            : new_mode == _O_WTEXT?     Iostream_mode::direct_io
            :                           Iostream_mode::utf_8
            );
    }

}  // namespace cppx

cppx/stdlib/impl/utf8_mode.generic.hpp
#pragma once
#include <stdio.h>      // FILE, stdin, stdout, stderr, etc.
#include <cppx/core_language/type_builders.hpp>     // cppx::Ptr
#include <cppx/stdlib/Iostream_mode.hpp>

namespace cppx {

    inline
    auto set_utf8_mode( const Ptr< FILE > )
        -> Iostream_mode
    { return Iostream_mode::unknown; }

}  // namespace cppx

cppx/stdlib/utf8_mode.hpp
#pragma once
// UTF-8 mode for a stream. For Unix-land this is a no-op & the locale must be UTF-8.

#include <stdio.h>      // FILE
#include <cppx/core_language/type_builders.hpp>     // cppx::Ptr
#include <cppx/stdlib/Iostream_mode.hpp>

namespace cppx {
    inline
    auto set_utf8_mode( const Ptr< FILE > ) -> Iostream_mode;
}  // namespace cppx

#ifdef _WIN32   // This also covers 64-bit Windows.
#   include "impl/utf8_mode.for_windows.hpp"    // Using Windows-specific _setmode.
#else
#   include "impl/utf8_mode.generic.hpp"        // A do-nothing implementation.
#endif

Configuring the standard streams.

In addition to setting direct console i/o mode or UTF-8 as appropriate in Windows, this fixes the implicit conversion defect; (indirectly) calls setlocale so that wide streams work in Unix-land; sets boolalpha just for good measure, as a more reasonable default; and includes all standard library headers to do with iostreams (I don't show the separate header file that does that, and it is to a degree a personal preference how much to include or whether to do such inclusion at all):

cppx/stdlib/iostreams.hpp
#pragma once
// Standard iostreams but configured to work, plus, as utility, with boolalpha set.

#include <raw_stdlib/iostreams.hpp>         // <iostream>, <sstream>, <fstream> etc. for convenience.

#include <cppx/core_language/type_builders.hpp>     // cppx::Ptr
#include <cppx/stdlib/utf8_mode.hpp>        // stdin etc., cppx::set_utf8_mode
#include <initializer_list>                 // std::initializer_list, used by the range-based loops
#include <locale>                           // std::locale
#include <string>                           // std::string

#include <cppx/stdlib/impl/iostreams_conversion_defect.fix.hpp> // Support arg conv.

inline auto operator<< ( std::wostream& stream, const std::string& s )
    -> std::wostream&
{ return (stream << s.c_str()); }

// The following code's sole purpose is to automatically initialize the streams.
namespace cppx { namespace utf8_iostreams {
    using std::locale;
    using std::ostream;
    using std::cin; using std::cout; using std::cerr; using std::clog;
    using std::wostream;
    using std::wcin; using std::wcout; using std::wcerr; using std::wclog;
    using std::boolalpha;

    namespace detail {
        using std::wstreambuf;

        // Based on "Filtering streambufs" code by James Kanze published at
        // <url: http://gabisoft.free.fr/articles/fltrsbf1.html>.
        class Correcting_input_buffer
            : public wstreambuf
        {
        private:
            wstreambuf*     provider_;
            wchar_t         buffer_;

        protected:
            auto underflow()
                -> int_type override
            {
                if( gptr() < egptr() )  { return *gptr(); }

                const int_type result = provider_->sbumpc();
                if( result == L'\n' )
                {
                    // Ad hoc workaround for g++ extra newline undesirable behavior:
                    provider_->pubsync();
                }

                if( traits_type::not_eof( result ) )
                {
                    buffer_ = result;
                    setg( &buffer_, &buffer_, &buffer_ + 1 );
                }
                return result ;
            }

        public:
            Correcting_input_buffer( wstreambuf* a_provider )
                : provider_( a_provider )
            {}
        };
    }  // namespace detail

    class Usage
    {
    private:
        static
        void init_once()
        {
            // In Windows there is no UTF-8 encoding spec for the locale, in Unix-land
            // it's the default. From Microsoft's documentation: "If you provide a code
            // page like UTF-7 or UTF-8, setlocale will fail, returning NULL". Still
            // this call is essential for making the wide streams work correctly in
            // Unix-land.
            locale::global( locale( "" ) ); // Effects a `setlocale( LC_ALL, "" )`.

            for( const Ptr<FILE> c_stream : {stdin, stdout, stderr} )
            {
                const auto new_mode = set_utf8_mode( c_stream );
                if( c_stream == stdin && new_mode == Iostream_mode::direct_io )
                {
                    static detail::Correcting_input_buffer  correcting_buffer( wcin.rdbuf() );
                    wcin.rdbuf( &correcting_buffer );
                }
            }

            for( const Ptr<ostream> stream_ptr : {&cout, &cerr, &clog} )
            {
                *stream_ptr << boolalpha;
            }

            for( const Ptr<wostream> stream_ptr : {&wcout, &wcerr, &wclog} )
            {
                *stream_ptr << boolalpha;
            }
        }

    public:
        Usage()
        { static const bool dummy = (init_once(), true); (void) dummy; }
    };

    namespace detail {
        const Usage usage;
    }  // namespace detail

}}  // namespace cppx::utf8_iostreams

The two example programs in the question are fixed simply by including the above header instead of, or in addition to, <iostream>. When it is included in addition, that can be done in a separate translation unit, except for the implicit conversion defect fix: if that fix is desired, its header must somehow be included in the code that relies on it. Alternatively the header can be applied as a forced include in the build command.
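
The reason a mere inclusion suffices is the automatic initialization at the end of the header. That idiom can be sketched in isolation as follows. The namespace `demo` and the counter `init_count` are hypothetical and exist only to make the effect observable, and the sketch sets `boolalpha` for `wcout` only.

```cpp
#include <iostream>

namespace demo {    // Hypothetical namespace for this sketch.

    // Counts how many times the initialization has actually run.
    inline auto init_count()
        -> int&
    {
        static int count = 0;
        return count;
    }

    class Usage
    {
    private:
        static
        void init_once()
        {
            ++init_count();
            std::wcout << std::boolalpha;
        }

    public:
        // The function-local static is initialized exactly once, no matter how
        // many Usage objects are constructed or in how many translation units.
        Usage()
        { static const bool dummy = (init_once(), true); (void) dummy; }
    };

    // One object per translation unit that includes the header. Its constructor
    // triggers the initialization before main starts.
    const Usage usage;

}  // namespace demo
```

Constructing further `Usage` objects has no additional effect, which is what makes it safe to include such a header from many source files.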
