grep / regex can't find accented words

Time: 2022-01-27 20:23:05

I'm trying to build a regex that picks out the words in a file whose letters all come from a given word pattern.

My problem is that the regex can't find accented words, but my text file contains a lot of them.

My command line is:

cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the content of file is:

carroça
éra
éssa
roça
roco
rato
onça
orça
roca

How can I fix it?

4 answers

#1


8  

If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

Either convert the file to UTF-8 or change your system locale to ISO-8859-1.

# convert from ISO-8859-1 to the environmental locale before grepping
# output will be in the current locale
$ iconv -f 8859_1 input/words.txt | grep ...

# run grep with an ISO-8859-1 locale
# output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
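
To tell which case you are in, it can help to look at the raw bytes. The following is only a sketch, assuming `iconv`, `od` and `mktemp` are available; the scratch file names are made up for illustration. It writes "éra" as ISO-8859-1, dumps the bytes, then converts the file to UTF-8 and dumps them again:

```shell
# Hypothetical scratch directory; not part of the original question.
tmp=$(mktemp -d)

# "éra" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf '\351ra\n' > "$tmp/latin1.txt"
od -An -tx1 "$tmp/latin1.txt"

# Convert to UTF-8: é becomes the two bytes 0xC3 0xA9.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/latin1.txt" > "$tmp/utf8.txt"
od -An -tx1 "$tmp/utf8.txt"
```

If the same dump of your real file shows a lone e9 where é should be, the file is probably ISO-8859-1 (or a close relative) and one of the two fixes above should apply.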

#2


1  

Assuming everything is UTF-8, I’d usually just use something like

perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it’s doing.

#3


1  

I found a related question here whose answer seems to work.

So if you try something like:

cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does that produce what you expect?
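
For what it's worth, here is a sketch of why the C locale can make this match. In the C locale grep compares single bytes, so it only needs the pattern and the file to be in the same encoding. The sketch assumes both are UTF-8 and spells the bytes as octal escapes so that it does not depend on how your terminal or editor is set up. One caveat: the \{1,4\} then counts bytes rather than letters, and é takes two bytes in UTF-8.

```shell
# Assumption: pattern and input are both UTF-8 (é = octal 303 251).
# The pattern built here is the byte-for-byte equivalent of '^[éra]\{1,4\}$'.
pattern=$(printf '^[\303\251ra]\\{1,4\\}$')

# Three sample words: "éra", "éssa", "rato".
# "éra" is 4 bytes long, so it only just fits inside \{1,4\}.
matches=$(printf '\303\251ra\n\303\251ssa\nrato\n' | LC_ALL=C grep -c "$pattern")

echo "matching lines: $matches"
```

If the file is ISO-8859-1 while the pattern is typed in a UTF-8 terminal, the bytes differ and this byte-wise comparison would not find é either.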

#4


0  

Try what @dule suggested, but with LANG=en_US.iso88591:

cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
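
This only helps if that locale is actually installed; when it is not, programs typically fall back to the C locale without saying so. Below is a rough sketch for checking first. `have_locale` is a helper name invented here, and it assumes a `locale -a` command that lists the installed locales one per line.

```shell
# have_locale NAME: succeed if NAME appears in the output of `locale -a`.
# Names are compared lowercased and without hyphens, because `locale -a`
# commonly prints e.g. "en_US.utf8" for what is requested as "en_US.UTF-8".
have_locale() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF -- "$want"
}

for loc in en_US.iso88591 en_US.UTF-8; do
    if have_locale "$loc"; then
        echo "$loc: available"
    else
        echo "$loc: not installed"
    fi
done
```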
