Remove non-ASCII characters from a CSV

Time: 2022-02-01 05:05:04

I want to remove all the non-ASCII characters from a file in place.

I found one solution with tr, but I guess I need to write back that file after modification.

I need to do it in place with relatively good performance.

Any suggestions?

11 solutions

#1


30  

# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME
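
In a UTF-8 locale, GNU sed may fail on this byte range or not match it, so it can help to force the C locale for the command. A minimal sketch, assuming GNU sed (the `\dNNN` escape and `-i` without a suffix are GNU extensions); `sample.csv` is a made-up file name:

```shell
# Hypothetical sample: "café,1" encoded as UTF-8 (é is the two bytes \303\251).
tmpdir=$(mktemp -d)
printf 'caf\303\251,1\nplain,2\n' > "$tmpdir/sample.csv"

# LC_ALL=C makes sed treat the input as single bytes, so the range
# \d128-\d255 covers every byte outside 7-bit ASCII.
LC_ALL=C sed -i 's/[\d128-\d255]//g' "$tmpdir/sample.csv"

cat "$tmpdir/sample.csv"
```

Each multi-byte UTF-8 character is dropped as a whole, because all of its bytes fall in the 128-255 range.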

#2


57  

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i says that the file is going to be edited in place, and the backup is going to be saved with the extension .bak.

#3


12  

sed -i 's/[^[:print:]]//g' FILENAME

Also, this acts like dos2unix
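
The dos2unix effect comes from the carriage return being a non-printable character. A quick check, assuming GNU sed and the C locale (the file name is made up). Tabs are not in `[:print:]` either, so tab-separated data would lose its delimiters:

```shell
tmpdir=$(mktemp -d)
# One CRLF-terminated line containing a tab.
printf 'a\tb\r\n' > "$tmpdir/crlf.txt"

# Remove every character outside [:print:]; the g flag handles all matches on a line.
LC_ALL=C sed -i 's/[^[:print:]]//g' "$tmpdir/crlf.txt"

# Dump the bytes to see what is left.
od -c "$tmpdir/crlf.txt"
```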

#4


11  

I found the following solution to be working:

perl -i.bk -pe 's/[^[:ascii:]]//g;' filename

#5


4  

I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:

sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE

#6


3  

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.

Note: ed(1) reads the entire file into memory to edit it in place, so for really large files you should use sed -i ... or perl -i ... instead.

# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l' 
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'

#7


3  

This worked for me:

sed -i 's/[^[:print:]]//g' FILENAME

#8


2  

awk '{ gsub(/[^]a-zA-Z0-9"!@#$%^&*|_[(){}]/, ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt

#9


1  

I tried all the solutions and nothing worked. The following, however, does:

tr -cd '\11\12\15\40-\176'

Which I found here:

https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix

My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
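
Since tr only reads standard input and writes standard output, applying it to a file means writing to a temporary file and moving that over the original. A sketch of that pattern, where `data.csv` is a placeholder path:

```shell
tmpdir=$(mktemp -d)
file="$tmpdir/data.csv"        # placeholder path for illustration
printf 'na\303\257ve,\001x\r\n' > "$file"

# Keep tab (\11), LF (\12), CR (\15) and printable ASCII (\40-\176); drop the rest.
tmp=$(mktemp "$file.XXXXXX")
if LC_ALL=C tr -cd '\11\12\15\40-\176' < "$file" > "$tmp"; then
    mv "$tmp" "$file"          # replace the original only if tr succeeded
else
    rm -f "$tmp"               # on failure, leave the original untouched
fi
```

The replaced file takes on the ownership and permissions of the temporary file created by mktemp, so those may need to be restored afterwards.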

#10


1  

Try tr instead of sed

tr -cd '[:print:]\n' < file.txt
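
One detail worth checking with this approach: newline is not in `[:print:]`, so a set without `\n` also deletes the line breaks. A small comparison, using made-up input and the C locale so that non-ASCII bytes count as non-printable:

```shell
# Two lines of input; the first contains a UTF-8 "é" (\303\251).
without_nl=$(printf 'a\303\251\nb\n' | LC_ALL=C tr -cd '[:print:]' | wc -l)
with_nl=$(printf 'a\303\251\nb\n' | LC_ALL=C tr -cd '[:print:]\n' | wc -l)

echo "line breaks left with '[:print:]' only: $without_nl"
echo "line breaks left with '[:print:]\\n': $with_nl"
```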

#11


-1  

I appreciate the tips I found on this site.

But, on my Windows 10, I had to use double quotes for this to work ...

sed -i "s/[\d128-\d255]//g" FILENAME

Noticed these things ...

  1. For FILENAME, the entire path\name needs to be quoted. This didn't work -- %TEMP%\"FILENAME". This did -- "%TEMP%\FILENAME".

  2. sed leaves behind temp files in the current directory, named sed*
