Java - Scraping Youku video playback pages (parsing HTML with jsoup, handling strings with regular expressions)

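I have been looking into video playback recently and plan to build a video aggregation app. Let's get Youku aggregation working first!

The first step is to extract the addresses of Youku's video playback pages. Youku hosts a lot of user-uploaded ("paike") videos, which are not what we want. After analyzing Youku's pages, I decided the crawl should start from the program list page, shown below.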

http://www.youku.com/v_olist/c_96_a__s__g__r__lg__im__st__mt__tg__d_1_et_0_ag_0_fv_0_fl__fc__fe__o_7_p_1.html

We use the jsoup library to parse the HTML. The complete code is as follows:

package com.gavin.video.down;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

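public class DownYouku {

    // Splits a list URL such as "..._p_1.html" into prefix, page number and suffix.
    // \d+ keeps multi-digit page numbers intact.
    private static final Pattern PAGE_URL = Pattern.compile("(.*_p_)(\\d+)(\\.html)$");
    // Last list page to crawl.
    private static final int LAST_PAGE = 30;
    // Maximum number of entries read from one list page.
    private static final int MAX_ENTRIES = 40;

    public static void main(String[] args) {
        String ss = "http://www.youku.com/v_olist/c_96_a__s__g__r__lg__im__st__mt__tg__d_1_et_0_ag_0_fv_0_fl__fc__fe__o_7_p_1.html";
        new DownYouku().downMovie(ss);
    }

    public void downMovie(String str) {
        // Parse the list URL with a regular expression.
        Matcher m_url = PAGE_URL.matcher(str);
        if (!m_url.find()) {
            System.out.println("Unrecognized list URL: " + str);
            return;
        }
        String homeurl = m_url.group(1);
        int page = Integer.parseInt(m_url.group(2));
        String suf = m_url.group(3);

        // Crawl the list pages one by one.
        for (int i = page; i <= LAST_PAGE; i++) {
            String path = homeurl + i + suf;
            Document doc1;
            try {
                doc1 = Jsoup.connect(path).get();
            } catch (IOException e2) {
                System.out.println(">>>>>>IO error 1: " + path);
                continue;
            }
            Element listofficial = doc1.getElementById("listofficial");
            if (listofficial == null) {
                System.out.println("No program list found: " + path);
                continue;
            }
            // The crawler treats each <ul> under #listofficial as one program entry.
            Elements p_pvs = listofficial.getElementsByTag("ul");
            int count = Math.min(MAX_ENTRIES, p_pvs.size());
            for (int j = 0; j < count; j++) {
                Elements anchors = p_pvs.get(j).getElementsByTag("a");
                // "abs:href" resolves relative links against the page URL.
                String a = anchors.attr("abs:href");
                String title = anchors.attr("title");
                if (a.isEmpty()) {
                    continue;
                }
                Document doc2;
                try {
                    doc2 = Jsoup.connect(a).get();
                } catch (IOException e1) {
                    System.out.println(">>>>>>IO error 2: " + a);
                    continue;
                }
                // The play link is looked up under #showInfo, then .baseinfo, then .link.
                Element showInfo = doc2.getElementById("showInfo");
                Element baseinfo = showInfo == null ? null : showInfo.getElementsByClass("baseinfo").first();
                Elements link = baseinfo == null ? new Elements() : baseinfo.getElementsByClass("link");
                Element url_a = link.isEmpty() ? null : link.get(0).getElementsByTag("a").first();
                if (url_a == null) {
                    System.out.println(title + " : " + "no video link");
                    continue;
                }
                String url = url_a.attr("href");
                System.out.print(url + "\t\t");
                System.out.println(title);
            }
        }
    }

}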
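
The URL-splitting step can also be tried on its own, without any network access. The following is a minimal sketch, assuming the page number always follows `_p_` at the end of the list URL (as it does in the URL above). The class name `PageUrlDemo`, its two helper methods, and the shortened sample URL are made up for illustration. Note that a single `\d` would capture only the last digit of a multi-digit page number, so the sketch uses `\d+`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageUrlDemo {
    // Prefix up to "_p_", then the page number, then the ".html" suffix.
    private static final Pattern PAGE_URL = Pattern.compile("(.*_p_)(\\d+)(\\.html)$");

    /** Returns the page number embedded in the URL, or -1 if the URL does not match. */
    static int pageOf(String listUrl) {
        Matcher m = PAGE_URL.matcher(listUrl);
        return m.find() ? Integer.parseInt(m.group(2)) : -1;
    }

    /** Returns the URL of the given page, or null if the URL does not match. */
    static String pageUrl(String listUrl, int page) {
        Matcher m = PAGE_URL.matcher(listUrl);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + page + m.group(3);
    }

    public static void main(String[] args) {
        // Shortened, hypothetical list URL used only for this demo.
        String url = "http://www.youku.com/v_olist/c_96_p_12.html";
        System.out.println(pageOf(url));      // prints 12
        System.out.println(pageUrl(url, 13)); // prints http://www.youku.com/v_olist/c_96_p_13.html
    }
}
```

Returning -1 or null for an unrecognized URL lets the caller stop early instead of failing later in `Integer.parseInt`.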

Crawl results: each output line contains the playback page URL followed by the program title.

Only these two fields (URL and title) are extracted here. Other fields can be extracted just as easily by following Youku's HTML DOM.
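
As an illustration of pulling out more fields, here is a rough sketch that parses an inline HTML string with jsoup instead of fetching a live page. Only the `showInfo`, `baseinfo` and `link` names come from the crawler above. The `rating` and `area` spans, the sample play URL, and the `ExtractFieldsDemo` class are invented for this example, so the real Youku markup will differ and the selectors would need adjusting. It requires the jsoup jar on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractFieldsDemo {
    // Hypothetical markup: the "rating" and "area" spans are assumptions, not Youku's real DOM.
    static final String HTML =
        "<div id=\"showInfo\"><div class=\"baseinfo\">"
        + "<span class=\"rating\">8.5</span>"
        + "<span class=\"area\">USA</span>"
        + "<div class=\"link\"><a href=\"http://v.youku.com/v_show/id_EXAMPLE.html\">Play</a></div>"
        + "</div></div>";

    /** Returns the text of the first element with the given class inside #showInfo, or "" if absent. */
    static String field(Document doc, String className) {
        Element showInfo = doc.getElementById("showInfo");
        if (showInfo == null) {
            return "";
        }
        Element e = showInfo.getElementsByClass(className).first();
        return e == null ? "" : e.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(field(doc, "rating")); // prints 8.5
        System.out.println(field(doc, "area"));   // prints USA
        // A CSS selector is an alternative way to reach the play link.
        System.out.println(doc.select("#showInfo .link a").attr("href"));
    }
}
```

Returning an empty string for a missing element keeps one absent field from aborting the whole crawl.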
