Getting a website's source code in Java

Date: 2023-01-18 23:57:55

I would like to use Java to get the source of a website (secure) and then parse that website for the links that are in it. I have found how to connect to the URL, but how can I easily get just the source, preferably as a DOM Document, so that I can easily get the info I want?

Or is there a better way to connect to an https site, get the source (which I need to do to get a table of data... it's pretty simple)? Those links are files I am going to download.

I wish it were FTP, but these are files stored on my TiVo (I want to programmatically download them to my computer).

8 Solutions

#1


You can get low level and just request it with a socket. In Java it looks like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SecureReader {
    // Arg[0] = Hostname
    // Arg[1] = File like /index.html
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

        // Make sure the peer actually presented a certificate before talking to it.
        SSLSession session = sslsock.getSession();
        X509Certificate cert;
        try {
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        // A Host header is included so virtually hosted servers answer correctly.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = in.readLine()) != null) {
            // Using Oscar's RegEx.
            Matcher m = p.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }

        sslsock.close();
    }
}
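
A quick usage note for the sketch above (SecureReader is just the name chosen here for the wrapper class): compile it and pass a hostname and an absolute path, e.g.

javac SecureReader.java
java SecureReader www.example.com /index.html

Because the request is plain HTTP/1.0, the server closes the connection after sending the page, which is what ends the read loop.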

#2


Extremely similar questions:

#3


You could probably get better results from Pete's or sktrdie's options. Here's an additional way, if you would like to know how to do it "by hand".

I'm not very good at regex, so in this case it returns the last link in a line. Well, it's a start.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links { 
    public static void main( String [] args ) throws IOException  { 

        URL url = new URL( args[0] );
        InputStream is = url.openConnection().getInputStream();

        BufferedReader reader = new BufferedReader( new InputStreamReader( is )  );

        String line = null;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        while( ( line = reader.readLine() ) != null )  {
            Matcher m = p.matcher( line );  
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }
        reader.close();
    }
}
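
If you want every link on a line instead of only the last one, a small tweak (just a sketch, not part of the original answer) is to make the capture non-greedy and scan with Matcher.find() instead of matches():

        String regExp = "<a href=\"(.*?)\"";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        while( ( line = reader.readLine() ) != null )  {
            Matcher m = p.matcher( line );
            // find() walks forward through the line, reporting each href in turn.
            while( m.find() ) {
                System.out.println( m.group(1) );
            }
        }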

EDIT

Oops, I totally missed the "secure" part. Anyway, I couldn't help it; I had to write this sample :P

#4


Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():

WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://*.com/questions/422970/");
WebLink[] links = resp.getLinks();
// Loop over array of links...
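
To finish that loop, something like the following should print each link's target (hedged: getURLString() is, as far as I recall, HttpUnit's accessor for the link's href):

for (WebLink link : links) {
    System.out.println(link.getURLString());
}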

#5


You can use javacurl to get the site's HTML and the Java DOM APIs to analyze it.

#6


Try using the jsoup library.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class ParseHTML {

    public static void main(String args[]) throws IOException{
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        String text = doc.body().text();

        System.out.print(text);
    }
}
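
Since the goal here is extracting links, jsoup's selector API is a natural fit. A minimal sketch (the Wikipedia URL is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListLinks {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        // Select every anchor element that carries an href attribute.
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // "abs:href" resolves relative URLs against the page's base URL.
            System.out.println(link.attr("abs:href") + " : " + link.text());
        }
    }
}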

You can download the jsoup library from the jsoup website (https://jsoup.org).

#7


There are two meanings of source in a web context:

The HTML source: If you request a webpage by URL, you always get the HTML source code. In fact, there is nothing else you could get from the URL. Webpages are always transmitted in source form; there is no such thing as a compiled webpage. For what you are trying to do, this should be enough to fulfill your task.

Script source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). There is also source code at this level, but you cannot get it over an HTTP connection. This is not a missing feature but entirely by design.

Parsing: That said, you will need to somehow parse the HTML code. If you just need the links, using a regex (as Oscar Reyes showed) will be the most practical approach, but you could also write a simple parser "manually". It would be slower and more code... but it works.

If you want to access the code on a more logical level, parsing it into a DOM would be the way to go. If the code is valid XHTML, you can just parse it into an org.w3c.dom.Document and do anything with it. If it is at least valid HTML, you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> with <br/> and changing the doctype is enough) and use it as XML.
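
If the page really is well-formed XHTML, a minimal sketch with the JDK's built-in XML parser looks like this (note that it will throw on ordinary, non-well-formed HTML):

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlLinks {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new URL(args[0]).openStream()) {
            Document doc = builder.parse(in);
            // Walk every <a> element and print its href attribute.
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                System.out.println(a.getAttribute("href"));
            }
        }
    }
}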

If it's not valid XML, you would need an HTML DOM parser. I have no idea whether such a thing exists for Java, or whether it performs well.

#8


There is an FTP server that can be installed on your TiVo to allow for show downloads; see http://dvrpedia.com/MFS_FTP

The question is formulated differently (how to handle HTTP/HTML in Java), but at the end you mention that what you want is to download shows. TiVo uses a unique file system of its own (MFS, the Media File System), so it is not easy to mount the drive on another machine; instead, it is easier to run an HTTP or FTP server on the TiVo and download from those.
