Sed: remove tags from an HTML file

I need to remove all tags from an HTML file with a bash script using the sed command. I tried this:

sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1

and with this:

sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1

but I am still missing something. Any suggestions?

1 solution

#1

You can either use one of the many HTML-to-text converters, use the Perl regex <.+?> if possible, or, if it must be sed, use <[^>]*>:

sed -e 's/<[^>]*>//g' file.html
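
Since the question asks for a bash script that takes the file as $1, the command above could be wrapped roughly like this. This is only a sketch: the function name strip_tags and the sample file content are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: delete everything that looks like a tag from one file.
strip_tags() {
    # Quote "$1" so file names containing spaces are not split into words.
    sed -e 's/<[^>]*>//g' "$1"
}

# Demo on a throwaway file (the sample content is invented for this example).
sample=$(mktemp)
printf '%s\n' '<p>Hello <b>world</b>!</p>' > "$sample"
stripped=$(strip_tags "$sample")
printf '%s\n' "$stripped"
rm -f "$sample"
```

In a real script you would call strip_tags "$1" instead of the demo lines.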

If there's no room for error, use an HTML parser instead. For example, when an element is spread over two lines

<div
>Lorem ipsum</div>

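this regular expression will not work, because sed processes its input one line at a time, so the pattern never sees the opening < and its closing > together.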
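
If it still has to be sed, one possible workaround is to read the whole input into the pattern space before substituting, so that a tag may span lines. The sketch below assumes a sed in which [^>] also matches an embedded newline (GNU sed does), and it holds the entire file in memory. It is still not a parser: comments, scripts, or a > inside an attribute value will trip it up.

```shell
# Sample input from the example above: one element split over two lines.
html=$(printf '%s\n' '<div' '>Lorem ipsum</div>')

# ':a' marks a label; on every line except the last ('$!'), 'N' appends the
# next input line to the pattern space and 'ba' jumps back to the label.
# Once the last line has been read, the substitution runs over the whole text.
joined=$(printf '%s\n' "$html" |
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g')

printf '%s\n' "$joined"
```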

This regular expression consists of three parts: <, [^>]*, and >

search for the opening <
followed by zero or more characters (*) that are not the closing >
[...] is a character class; when it starts with ^, it matches any character not in the class
and finally look for the closing >
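
To see what the three parts match together, you can print the matches instead of deleting them. This sketch assumes a grep that supports the -o option (GNU and BSD grep do; it is not required by POSIX), and the sample line is invented.

```shell
line='<a href="x.html">link</a> and <br/> text'

# -o prints each match of the pattern on its own line.
matches=$(printf '%s\n' "$line" | grep -o '<[^>]*>')

printf '%s\n' "$matches"
```

Each printed line starts at a < and stops at the first > after it, which is exactly the span the sed command deletes.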

The simpler regular expression <.*> will not work, because * is greedy: it looks for the longest possible match, which extends to the last closing > in an input line. For example, an input line with more than one tag, such as

<name>Olaf</name> answers questions.

will result in

answers questions.

instead of

Olaf answers questions.
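
The difference can be checked by running both expressions on the same line. The variable names here are only for illustration.

```shell
line='<name>Olaf</name> answers questions.'

# Greedy: .* runs on to the last > in the line and swallows "Olaf" with it.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: [^>]* cannot cross a >, so each tag is matched separately.
per_tag=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf 'greedy:  [%s]\n' "$greedy"
printf 'per tag: [%s]\n' "$per_tag"
```

The brackets in the output make the leading space left by the greedy version visible.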

See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.
