How to parse HTML on the command line using XPath with Saxon-HE?

Date: 2023-01-14 11:45:31

I use Saxon-HE 9.6, and it's great for playing with XPath 3 when parsing well-formed XML files.

But I would like to know how to combine expath-http-client (or any other working solution) with Saxon so that I can parse realLife©®™ (possibly broken) HTML. (Java is not my strongest skill.)

I searched Google for many hours without finding a working solution. I tried something like:

xquery_file.xsl :

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

let $url := 'http://*.com'
let $response := http:send-request(
   <http:request href="{$url}" method="get"/>
) return
    <echo-results>
        {$response}
    </echo-results>

Shell command taken from the README of expath-http-client-saxon-0.10.0

saxon --repo /usr/share/java/expath/repo -xsl:sample/simple-get.xsl -it:main

or

saxon --repo /usr/share/java/expath/repo -xsl:xquery_file.xsl -it:main

without success. I get: Transformation failed: Unknown configuration property http://saxon.sf.net/feature/repo

Ultimately, what I would ideally like to do is query a URL directly from the command line with an XPath expression rather than an XQuery file (if possible). I'm pretty sure some XML/Java/XPath guru around here has the solution I'm looking for.

/usr/share/java/expath/repo contains :

/usr/share/java/expath/repo
├── expath-http-client-saxon-0.10.0
│   ├── cxan.xml
│   ├── expath-http-client-saxon
│   │   ├── jar
│   │   │   ├── expath-http-client-java.jar
│   │   │   └── expath-http-client-saxon.jar
│   │   ├── lib
│   │   │   ├── apache-mime4j-0.6.jar
│   │   │   ├── commons-codec-1.4.jar
│   │   │   ├── commons-logging-1.1.1.jar
│   │   │   ├── httpclient-4.0.1.jar
│   │   │   ├── httpcore-4.0.1.jar
│   │   │   └── tagsoup-1.2.jar
│   │   ├── xq
│   │   │   └── expath-http-client-saxon.xq
│   │   └── xsl
│   │       └── expath-http-client-saxon.xsl
│   ├── expath-pkg.xml
│   └── saxon.xml
└── hello-1.1
    ├── expath-pkg.xml
    └── hello
        ├── hello.xq
        └── hello.xsl

EDIT:

My best attempt (Linux-based solution):

java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
    net.sf.saxon.Query \
   -x:org.ccil.cowan.tagsoup.Parser \
   -s:myrealLife.html \
   -qs://*:body

This works, but now I am trying to figure out how to set the default namespace so that I can query directly with, for example, //a

EDIT 2

I have created a whole GitHub project based on this post; see https://github.com/sputnick-dev/saxon-lint

2 answers

#1


5  

I don't think you need any HTTP client for this. You can read the file using the doc() function, or supply it as the primary input document, provided you configure it to be parsed using an HTML SAX parser rather than an XML parser. If you put John Cowan's TagSoup on the classpath, then invoking Saxon with

-x:org.ccil.cowan.tagsoup.Parser -s:myrealLife.html

should do the trick.

I think you can also use validator.nu, which is rather more up-to-speed with HTML5 than TagSoup, but I haven't tried it myself.
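On the follow-up about querying with a plain //a: TagSoup reports HTML elements in the XHTML namespace (http://www.w3.org/1999/xhtml), which is why //*:body matches and //body does not. Since -qs takes an XQuery string, one option is to put a default element namespace declaration in front of the expression. The following is only a sketch under assumptions: the jar paths are the ones from the question, and build_query and html_xpath are hypothetical helper names invented for this example, not part of Saxon or TagSoup.

```shell
#!/bin/sh
# Assumed jar locations, copied from the classpath used in the question.
CLASSPATH_JARS="./tagsoup-1.2.jar:./saxon9he.jar"

# Namespace in which TagSoup reports HTML elements.
XHTML_NS="http://www.w3.org/1999/xhtml"

# build_query EXPR (hypothetical helper): prefix EXPR with an XQuery prolog
# that makes the XHTML namespace the default element namespace, so that
# unprefixed names such as "a" or "body" match TagSoup's output.
build_query() {
    printf 'declare default element namespace "%s"; %s' "$XHTML_NS" "$1"
}

# html_xpath FILE EXPR (hypothetical helper): parse FILE with TagSoup and
# evaluate EXPR against it using Saxon's command-line query interface.
html_xpath() {
    java -classpath "$CLASSPATH_JARS" \
        net.sf.saxon.Query \
        -x:org.ccil.cowan.tagsoup.Parser \
        -s:"$1" \
        -qs:"$(build_query "$2")"
}

# Example call (needs Java and both jars in the current directory):
#   html_xpath myrealLife.html '//a'
```

With this in place, html_xpath myrealLife.html '//a' should behave like the //*:a form, but without the wildcard prefix on every step.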

#2


1  

If you look at the documentation for the EXPath HTTP Client, you will see that if you retrieve HTML with it and the server responds with an HTML Internet media type, the HTML will be automatically tidied up into well-formed XML for you; see http://expath.org/spec/http-client#d2e517.

As such you will not need to write any Java code to achieve your goal.

Your XQuery is incorrect, as you are trying to use eXist-db's HTTP Client, whereas you state that you want to use the EXPath HTTP Client. So you should change your XQuery to this:

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

let $url := 'http://*.com'
let $response := http:send-request(
   <http:request href="{$url}" method="get"/>
) return
    <echo-results>
        {$response}
    </echo-results>

However, you will also need to convince Saxon to load and use the EXPath HTTP Client module; by default, Saxon has no native support for the HTTP Client. See http://saxonica.com/documentation/index.html#!functions.

You can find the EXPath HTTP Client implementation for Saxon here: https://code.google.com/p/expath-http-client/downloads/list. If you download the latest Zip file, it contains a README file that explains how to use it with Saxon.
