Parsing a table inside a table with jsoup

Time: 2022-01-26 19:35:42

I am trying to extract some data from a table by parsing the HTML using jsoup.

Here is an example:

// Java 15+ text block; on older versions, concatenate quoted lines with +
String tableHtml = """
     <table>
           <thead>
                <tr><th>
                     <table>
                         <tr><td>asdf</td></tr>
                     </table>
                     <table>
                          <tr><td>asdf</td></tr>
                     </table>
                 </th></tr>
           </thead>
           <tfoot>
                <tr><td>
                   THE TEXT I WANT TO GET
                </td></tr>
           </tfoot>
     </table>""";

Document doc = Jsoup.parseBodyFragment(tableHtml);
Element table = doc.select("table").first();
Element r = table.select("tfoot").first(); // r is null here. Why?
System.out.println("-----------" + r.text());

I get a NullPointerException!

However, if I remove one of the inner tables, the exception goes away and the code works. It also works if I change the <th> tag to <td>. This seems like strange behavior. This is just an example of the real HTML that I am trying to parse. I would appreciate it if anyone could point out why I am getting this exception. Thank you.

NOTE: Please assume that I cannot modify the HTML. I just want to parse it as it is.

1 solution

#1

Instead of the HTML parser (which apparently doesn't fully support this kind of table nesting), maybe use the XML parser. Try:

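// Requires: import org.jsoup.parser.Parser;
// The XML parser builds the tree as written, without HTML table fix-ups.
Document doc = Jsoup.parse(tableHtml, "", Parser.xmlParser());
Element table = doc.select("table").first();
Element r = table.select("tfoot").first();
System.out.println("->" + r.text());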
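
To illustrate why an XML parser sidesteps the problem, here is a minimal sketch that uses the JDK's built-in DOM parser rather than jsoup. The swap is only so the example runs without extra dependencies. It is not the jsoup call from this answer. The class name TfootXmlDemo and the helper tfootText are hypothetical, and the sketch assumes the markup is well-formed XML, as the sample in the question is. An XML parser keeps the tree exactly as written, so <tfoot> stays a direct child of the outer <table>. The helper returns null instead of throwing when no <tfoot> is found.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TfootXmlDemo {
    // Same markup as in the question, written with string concatenation.
    static final String TABLE_HTML =
        "<table>"
        + "<thead><tr><th>"
        + "<table><tr><td>asdf</td></tr></table>"
        + "<table><tr><td>asdf</td></tr></table>"
        + "</th></tr></thead>"
        + "<tfoot><tr><td> THE TEXT I WANT TO GET </td></tr></tfoot>"
        + "</table>";

    // Hypothetical helper: trimmed text of the first <tfoot> that is a
    // direct child of the root element, or null if there is none.
    static String tfootText(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        Element root = doc.getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && "tfoot".equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String text = tfootText(TABLE_HTML);
        // Guard against null rather than dereferencing it directly.
        System.out.println(text == null ? "no <tfoot> found" : "->" + text);
    }
}
```

The null check in main mirrors what would avoid the NullPointerException in the question's code: test the result of first() before calling text() on it.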
