Where can I get started with Unicode-friendly programming in C?

Time: 2022-08-30 15:06:42

So, I’m working on a plain-C (ISO/IEC 9899:1999, i.e. C99) project, and am trying to figure out where to get started with Unicode, UTF-8, and all that jazz.

Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.

I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.

Any links, manpages, Wikipedia articles, or example code are all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.


  • A must-read before considering anything else, if you’re unfamiliar with Unicode and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
  • The UTF-8 home-page: http://www.utf-8.com/
  • man 3 iconv (as well as iconv_open and iconvctl)
  • International Components for Unicode (via Geoff Reedy)
  • libbasekit, which seems to include light Unicode-handling tools
  • GLib has some Unicode functions
  • A basic UTF-8 detector function, by Christoph
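
For a rough idea of what a UTF-8 detector involves, here is a minimal C99 sketch. It is not Christoph's function; `utf8_is_valid` is a name made up for this illustration. It walks a byte buffer and checks that it is well-formed UTF-8, rejecting overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed
 * UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest value allowed for this length */
        size_t extra;       /* continuation bytes expected */
        size_t k;

        if (b < 0x80) {                     /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {    /* 110xxxxx: 2-byte form */
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {    /* 1110xxxx: 3-byte form */
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {    /* 11110xxx: 4-byte form */
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        if (n - i <= extra)
            return 0;   /* sequence is cut off by the end of the buffer */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a 10xxxxxx byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;                       /* beyond the Unicode range */

        i += extra + 1;
    }
    return 1;
}
```

A buffer that passes this check is structurally valid UTF-8, but that alone does not prove the file was meant as UTF-8; pure ASCII, for instance, passes too.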

3 Answers

#1


10  

International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:

The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).

#2


3  

GLib has some Unicode functions and is a pretty lightweight library. It offers nowhere near the level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.

GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (this list is not a comprehensive list):

  • Object and type system
  • Main loop
  • Dynamic loading of modules (i.e. plug-ins)
  • Thread support
  • Timer support
  • Memory allocator
  • Threaded Queues (synchronous and asynchronous)
  • Lists (singly linked, doubly linked, double ended)
  • Hash tables
  • Arrays
  • Trees (N-ary and binary balanced)
  • String utilities and charset handling
  • Lexical scanner and XML parser
  • Base64 (encoding & decoding)

#3


0  

I think one of the interesting questions is: what should your canonical internal format for strings be? The two obvious choices (to me at least) are:

a) UTF-8 in vanilla C strings
b) UTF-16 in unsigned short arrays
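
To make the difference between the two formats concrete, here is a small sketch of how a single code point becomes UTF-8 bytes versus UTF-16 code units. The names `utf8_encode` and `utf16_encode` are hypothetical, and `uint16_t` stands in for the unsigned short array (assuming unsigned short is 16 bits wide on your platform).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes cp as UTF-8 into out (room for 4 bytes). Returns the number
 * of bytes written, or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encodes cp as UTF-16 into out (room for 2 units). Returns the number
 * of 16-bit units written, or 0 if cp is not encodable. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;              /* BMP: one unit, stored as-is */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                      /* 20 bits left to split */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For instance, U+20AC (the euro sign) takes three bytes in UTF-8 but a single unit in UTF-16, while U+1F600 takes four bytes in UTF-8 and a surrogate pair in UTF-16. Neither format gives you one array element per character.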

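In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine, since those functions simply pass bytes through; just keep in mind that they count bytes, not characters.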
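
One caveat is worth a sketch: string.h works on bytes, so strlen reports the byte length of a UTF-8 string, not the number of code points. The helpers below are hypothetical names for illustration and assume the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated string assumed to be valid
 * UTF-8: every byte that is not a continuation byte (10xxxxxx) starts
 * a new code point. */
size_t utf8_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* How far strlen() overshoots the code point count for s. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_count(s);
}
```

With the string "h\xc3\xa9" "llo" (that is, "héllo"), strlen returns 6 while utf8_count returns 5. Byte-wise searches such as strstr still behave correctly on valid UTF-8, because one character's encoding never appears inside another's.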

Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the first few bytes (byte order marks help).
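
The peeking step could look roughly like this. It is only a sketch: `detect_bom` and the `bom_kind` enum are names invented here, and a file without a BOM still needs some other guess (or a documented default such as UTF-8).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum bom_kind {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
};

/* Looks at the first n bytes of a file and reports which byte order
 * mark, if any, they start with. *bom_len receives the number of bytes
 * to skip before the actual text. */
enum bom_kind detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    assert(buf != NULL || n == 0);
    assert(bom_len != NULL);

    /* UTF-32LE must be tested before UTF-16LE: both start with FF FE.
     * (FF FE 00 00 could also be UTF-16LE text beginning with U+0000;
     * this sketch assumes source files never do that.) */
    if (n >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) {
        *bom_len = 4;
        return BOM_UTF32_LE;
    }
    if (n >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) {
        *bom_len = 4;
        return BOM_UTF32_BE;
    }
    if (n >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (n >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (n >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }

    *bom_len = 0;
    return BOM_NONE;
}
```

A reader would typically fread the first four bytes, call this, skip bom_len bytes, and then convert the remainder to the internal format if it is not already UTF-8.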
