iconv any encoding to UTF-8

Time: 2023-01-06 21:46:34

I am trying to point iconv at a directory so that every file in it is converted to UTF-8, regardless of its current encoding.

I am using this script, but you have to specify the encoding you are converting FROM. How can I make it autodetect the current encoding?

dir_iconv.sh

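#!/bin/bash

ICONVBIN='/usr/bin/iconv' # path to iconv binary

# Require all three arguments: directory, source charset, target charset.
if [ $# -lt 3 ]
then
    echo "Usage: $0 dir from_charset to_charset"
    exit 1
fi

for f in "$1"/*
do
    if test -f "$f"
    then
        echo -e "\nConverting $f"
        # Keep the original as $f.old and write the converted text to $f.
        /bin/mv "$f" "$f.old"
        "$ICONVBIN" -f "$2" -t "$3" "$f.old" > "$f"
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done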

Terminal command:

sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8

6 answers

#1


17  

Maybe you are looking for enca:

Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.

Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently of language.

Note that in general, autodetection of current encoding is a difficult process (the same byte sequence can be correct text in multiple encodings). enca uses heuristics based on the language you tell it to detect (to limit the number of encodings). You can use enconv to convert text files to a single encoding.

#2


8  

You can get what you need using the standard utilities file and awk. Example:

file -bi .xsession-errors gives me: "text/plain; charset=us-ascii"

so file -bi .xsession-errors | awk -F "=" '{print $2}' gives me "us-ascii"

I use it in scripts like so:

CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
if [ "$CHARSET" != utf-8 ]; then
        iconv -f "$CHARSET" -t utf8 "$i" -o outfile
fi
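The awk split above works when charset= is the last parameter of the MIME string. A slightly more defensive sketch, using only shell parameter expansion, is shown below. The function names charset_of_mime and needs_conversion are hypothetical, and the list of charsets to skip (us-ascii, binary, unknown-8bit) is an assumption about what file may report for plain ASCII, binary, and unidentifiable files.

```shell
#!/bin/bash

# charset_of_mime MIME_STRING
# Print the charset from a string such as "text/plain; charset=us-ascii".
# Returns 1 if the string carries no charset parameter.
charset_of_mime() {
    local mime="$1" cs
    case $mime in
        *charset=*) ;;
        *) return 1 ;;
    esac
    cs="${mime##*charset=}"      # drop everything up to and including "charset="
    cs="${cs%%;*}"               # drop any parameters that follow
    cs="${cs%%[[:space:]]*}"     # drop trailing whitespace
    printf '%s\n' "$cs"
}

# needs_conversion CHARSET
# Return 0 if a file in this charset should be handed to iconv.
# utf-8 and us-ascii are already valid UTF-8; binary and unknown-8bit
# cannot be converted meaningfully (assumed skip list).
needs_conversion() {
    case $1 in
        utf-8|us-ascii|binary|unknown-8bit|'') return 1 ;;
        *) return 0 ;;
    esac
}
```

It could be used as: cs=$(charset_of_mime "$(file -bi "$i")") && needs_conversion "$cs" && iconv -f "$cs" -t utf8 "$i" -o outfile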

#3


6  

Combining all of the above: go to the directory and create dir2utf8.sh:

#!/bin/bash
# Convert all files in the current directory to UTF-8.

for f in *
do
    if test -f "$f"; then
        echo -e "\nConverting $f"
        CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
        if [ "$CHARSET" != utf-8 ]; then
            # Write to a temporary file first: reading and writing the same
            # file with iconv can truncate it.
            iconv -f "$CHARSET" -t utf8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f"
        fi
    else
        echo -e "\nSkipping $f - not a regular file"
    fi
done
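Converting a file onto itself is the risky step in a loop like this, because a failed or partial conversion can leave the file damaged. A minimal sketch of a safer in-place conversion follows. The function name convert_in_place is hypothetical, and it assumes mktemp and iconv are available. The original file is replaced only if iconv succeeds.

```shell
#!/bin/bash

# convert_in_place FILE FROM_CHARSET
# Convert FILE to UTF-8, replacing it only when the conversion succeeds.
convert_in_place() {
    local file="$1" from="$2" tmp
    # Create the temporary file next to the original so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        # Conversion failed: discard the partial output, keep the original.
        rm -f -- "$tmp"
        return 1
    fi
}
```

Note that mktemp creates the temporary file with restrictive permissions, so after the mv the converted file may have different permissions than the original.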

#4


3  

Here is my solution to convert all files in place:

#!/bin/bash

apt-get -y install recode uchardet > /dev/null
find "$1" -type f | while IFS= read -r FFN # "$1" is the directory to convert
do
    encoding=$(uchardet "$FFN")
    echo "$FFN: $encoding"
    # Workaround: rewrite uchardet's "x-mac-" prefix to "mac" for recode.
    enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
    recode "$enc..UTF-8" "$FFN"
done

https://gist.github.com/demofly/25f856a96c29b89baa32

Put it into convert-dir-to-utf8.sh and run:

bash convert-dir-to-utf8.sh /path/to/my/trash/dir

Note that the sed call is a workaround for Mac encodings here. Many uncommon encodings need workarounds like this.

#5


0  

Check out the tools available for data conversion on the Linux command line: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html

Also, it is worth looking up the full list of encodings that iconv supports. Just run iconv --list and you will find that its encoding names differ from the names returned by the uchardet tool (for example, x-mac-cyrillic in uchardet vs. mac-cyrillic in iconv).
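The name mismatch can be handled with a small mapping step before calling iconv. The sketch below is illustrative only: the function names iconv_name_for and iconv_knows are hypothetical, the only mapping rule is the x-mac- example given here, and the parsing of iconv --list assumes names separated by commas, spaces, or slashes, which may differ between iconv implementations.

```shell
#!/bin/bash

# iconv_name_for DETECTED_NAME
# Translate a detector-style name into the spelling iconv is assumed to use.
# Only the "x-mac-*" case from the example is handled; other names pass through.
iconv_name_for() {
    case $1 in
        x-mac-*) printf '%s\n' "${1#x-}" ;;   # x-mac-cyrillic -> mac-cyrillic
        *)       printf '%s\n' "$1" ;;
    esac
}

# iconv_knows NAME
# Return 0 if NAME appears (case-insensitively) in the output of iconv --list.
iconv_knows() {
    iconv --list | tr ',/ ' '\n\n\n' | grep -qixF -- "$1"
}
```

A caller could check iconv_knows "$(iconv_name_for "$detected")" and skip the file with a warning when the name is still not recognized, rather than letting iconv fail halfway through a directory.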

#6


0  

The enca command doesn't work for my Simplified Chinese text file with GB2312 encoding.

Instead, I use the following function to convert the text file for me. You could of course redirect the output into a file.

It requires the chardet and iconv commands.

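detection_cat () 
{
    # Ask chardet for the encoding, then extract the name from its output.
    DET_OUT=$(chardet "$1");
    ENC=$(echo "$DET_OUT" | sed "s|^.*: \(.*\) (confid.*$|\1|");
    # iconv writes the converted text to standard output.
    iconv -f "$ENC" "$1"
}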
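The sed pattern above depends on the exact wording of chardet's output line. As a hedged sketch, the parsing can be isolated in a helper that accepts two output shapes: "FILE: ENC (confidence: N)", which the sed expression implies, and "FILE: ENC with confidence N", which some chardet versions appear to print. The function name encoding_from_chardet is hypothetical, and both formats are assumptions to verify against the installed chardet.

```shell
#!/bin/bash

# encoding_from_chardet LINE
# Print the encoding name found in one line of chardet output.
# Returns 1 when no usable encoding is present (empty or "None").
encoding_from_chardet() {
    local rest="$1"
    rest="${rest% with confidence*}"   # strip the "with confidence N" suffix
    rest="${rest% (confid*}"           # strip the "(confidence: N)" suffix
    rest="${rest##*: }"                # keep what follows the last ": "
    case $rest in
        ''|None) return 1 ;;
    esac
    printf '%s\n' "$rest"
}
```

Inside detection_cat this would replace the sed call: ENC=$(encoding_from_chardet "$DET_OUT") || return 1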
