How do I extract the value of a tag's attribute from an HTML file?

Time: 2021-10-19 08:11:12

I know, don't parse using curl, grep and sed. But I am looking for an easy approach, not a very safe one.

So I fetch an HTML file with curl, and I need the value of a certain attribute from one tag in it. I use grep to find the line containing "token", which occurs only once. That gives me a whole div:

<div class="userlinks">
  <span class="arrow flleft profilesettings">settings</span>
  <form class="logoutform" method="post" action="/logout">
    <input class="logoutbtn arrow flright" type="submit" value="Log out">
    <input type="hidden" name="ltoken" value="a5fc8828a42277538f1352cf9ea27a71">
  </form>
</div>

How can I get just the value attribute (e.g. "a5fc8828a42277538f1352cf9ea27a71")?

5 solutions

#1


10  

There's no need to grep:

sed -n '/token/s/.*name="ltoken"\s\+value="\([^"]\+\).*/\1/p' input_file
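The `\s` and `\+` escapes in the command above are GNU sed extensions, so it may not work with other sed implementations. As a rough sketch, the same extraction can be written with POSIX basic regular expressions only. The function name `extract_ltoken` and the file `sample.html` are made up for illustration, and the pattern still assumes that `name` comes directly before `value` on the same line, as in the snippet from the question.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the relevant lines from the question.
cat > sample.html <<'EOF'
<div class="userlinks">
  <form class="logoutform" method="post" action="/logout">
    <input class="logoutbtn arrow flright" type="submit" value="Log out">
    <input type="hidden" name="ltoken" value="a5fc8828a42277538f1352cf9ea27a71">
  </form>
</div>
EOF

# extract_ltoken [FILE...]   (hypothetical helper name)
# Prints the value attribute of the input named "ltoken".
# [[:space:]][[:space:]]* stands in for GNU's \s\+, and [^"]* for [^"]\+
# (so, unlike the original, an empty value would also match).
# With no FILE argument, sed reads standard input, e.g. the output of curl.
extract_ltoken() {
  sed -n 's/.*name="ltoken"[[:space:]][[:space:]]*value="\([^"]*\)".*/\1/p' "$@"
}

extract_ltoken sample.html
```

Because the function passes its arguments straight to sed, it can also sit at the end of a pipeline instead of reading a saved file.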

#2


8  

One way, using sed:

sed -n '/ltoken/s/.* value="\([^"]*\)".*/\1/p' file.txt

Results:

a5fc8828a42277538f1352cf9ea27a71

HTH
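When capturing a quoted attribute value with sed, it matters whether the capture group is written as `\(.*\)` or as `\([^"]*\)`. The sketch below compares the two on a made-up input line: the token is the one from the question, but the trailing `id` attribute is hypothetical and only there to show the difference.

```shell
#!/bin/sh
# Made-up line: same token as in the question, plus a hypothetical trailing id attribute.
line='<input type="hidden" name="ltoken" value="a5fc8828a42277538f1352cf9ea27a71" id="tok">'

# Greedy capture: runs through to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/.* value="\(.*\)".*/\1/')

# Bounded capture: stops at the first double quote after value=".
bounded=$(printf '%s\n' "$line" | sed 's/.* value="\([^"]*\)".*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The greedy form returns the token only when `value` happens to be the last attribute on the line, which is the case in the question's HTML but not something to rely on.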

#3


2  

Use XPath Expression and a Dash of Grep

You can actually parse the HTML properly from the command line. For example, you can use xgrep to evaluate an XPath expression, and then use GNU sed (or your grep of choice) to extract just the text you care about:

$ xgrep -x '//input[@name="ltoken"][1]/@value' /tmp/foo |
      sed -rn '/value/ s/.*"([[:xdigit:]]+)"/\1/p'
a5fc8828a42277538f1352cf9ea27a71

#4


2  

Another way using awk

grep "ltoken" file.txt | awk -F"\"" '{print $6}'

To get a different attribute's value, change the field number: with `"` as the field separator, the quoted values on that line are the even-numbered fields ($2, $4, $6).
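Picking the field by number breaks as soon as the attribute order changes. A minimal sketch of a variant that looks the attribute up by name instead: with `"` as the field separator, each attribute name ends the odd-numbered field just before its value. The function `attr_value` and the file `sample_line.html` are hypothetical names for illustration, and the approach still assumes double-quoted attributes on a single line and a plain attribute name with no regex metacharacters.

```shell
#!/bin/sh
# Hypothetical sample file holding the line that grep would select.
cat > sample_line.html <<'EOF'
    <input type="hidden" name="ltoken" value="a5fc8828a42277538f1352cf9ea27a71">
EOF

# attr_value ATTR PATTERN FILE   (hypothetical helper)
# On the first line matching PATTERN, print the value of attribute ATTR.
attr_value() {
  awk -F'"' -v attr="$1" -v pat="$2" '
    $0 ~ pat {
      # Odd fields hold text such as " value=", even fields hold the values.
      for (i = 1; i < NF; i += 2) {
        if ($i ~ ("(^|[ \t])" attr "=$")) {
          print $(i + 1)
          exit
        }
      }
    }' "$3"
}

attr_value value ltoken sample_line.html
```

Calling `attr_value name ltoken sample_line.html` or `attr_value type ltoken sample_line.html` selects the other attributes without recounting fields.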

#5


0  

There is one problem with the xgrep solution: it expects well-formed XML. The HTML in the question is not well-formed XML because the 'input' elements are never closed. xmllint has an HTML parser option (--html) and also supports the XPath string() function, which extracts the value without any need for sed.

$ xmllint --html --xpath 'string(//input[@name="ltoken"][1]/@value)' foo
a5fc8828a42277538f1352cf9ea27a71
