Reading a non-UTF-8 text file in Go

Time: 2023-01-05 15:25:35

I need to read a text file that is encoded in GBK. Go's standard library assumes that all text is encoded in UTF-8.

How can I read files in other encodings?

2 answers

#1


14  

Previously (as mentioned in an older answer), the "easy" way to do this was to use third-party packages that needed cgo and wrapped the iconv library. That is undesirable for several reasons; for one, cgo requires a C toolchain and complicates cross-compilation. Thankfully, for quite a while now there has been a better pure-Go way of doing this, using only packages provided by the Go Authors (not in the standard library, but in the Go sub-repositories).

The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to/from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides GB18030, GBK and HZ-GB2312 encoding implementations.

Here is an example of reading and writing a GBK-encoded file. Note that the reader and writer returned by transform.NewReader and transform.NewWriter do the conversion "on the fly" as data is read or written.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/simplifiedchinese"
    "golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encoders,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
    const filename = "example_GBK_file"
    exampleWriteGBK(filename)
    exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
    // Read UTF-8 from a GBK encoded file.
    f, err := os.Open(filename)
    if err != nil {
        log.Fatal(err)
    }
    r := transform.NewReader(f, enc.NewDecoder())

    // Read converted UTF-8 from `r` as needed.
    // As an example we'll read line-by-line showing what was read:
    sc := bufio.NewScanner(r)
    for sc.Scan() {
        fmt.Printf("Read line: %s\n", sc.Bytes())
    }
    if err = sc.Err(); err != nil {
        log.Fatal(err)
    }

    if err = f.Close(); err != nil {
        log.Fatal(err)
    }
}

func exampleWriteGBK(filename string) {
    // Write UTF-8 to a GBK encoded file.
    f, err := os.Create(filename)
    if err != nil {
        log.Fatal(err)
    }
    w := transform.NewWriter(f, enc.NewEncoder())

    // Write UTF-8 to `w` as desired.
    // As an example we'll write some text from the Wikipedia
    // GBK page that includes Chinese.
    _, err = fmt.Fprintln(w,
        `In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
    if err != nil {
        log.Fatal(err)
    }

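    // Close the transform.Writer first so that any pending input is
    // converted and written; this does not close the underlying file.
    if err = w.Close(); err != nil {
        log.Fatal(err)
    }
    if err = f.Close(); err != nil {
        log.Fatal(err)
    }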
}

#2


5  

Try go-iconv. It wraps iconv and implements io.Reader and io.Writer.

This message in the golang-china discussion group mentions a few examples of go-iconv usage.

