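Java Multithreaded Crawler Example

I have understood how crawlers work for a long time but had never implemented one. Writing one today turned out to be harder than expected, especially the thread synchronization. The real cause is that I am not familiar enough with multithreading and have not practiced it much.

Here is the result first: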

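Starting crawl.........................................
Threads currently waiting: 1
Threads currently waiting: 2
Threads currently waiting: 3
Threads currently waiting: 4
Threads currently waiting: 5
.....................

Crawled http://dev.yesky.com successfully, depth 2, by thread thread-9
Threads currently waiting: 7
Crawled http://www.cnblogs.com/rexyoung/archive/2012/05/01/2477960.html successfully, depth 2, by thread thread-2
Threads currently waiting: 8
Crawled http://www.hjenglish.com  successfully, depth 2, by thread thread-0
Threads currently waiting: 9
Crawled http://www.cnblogs.com/snandy/archive/2012/05/01/2476675.html successfully, depth 2, by thread thread-5
Threads currently waiting: 10
Crawled 159 pages in total
Total time: 53 seconds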

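The run above crawled the cnblogs (博客园) home page to a depth of two levels with 10 threads and took 53 seconds in total, which seems reasonably fast. The complete code follows: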

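import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler {
	ArrayList<String> allurlSet = new ArrayList<String>();//every URL seen so far; a HashSet would make duplicate checks faster
	ArrayList<String> notCrawlurlSet = new ArrayList<String>();//URLs that have not been crawled yet
	HashMap<String, Integer> depth = new HashMap<String, Integer>();//depth of every URL
	int crawDepth  = 2; //maximum crawl depth
	int threadCount = 10; //number of worker threads
	int count = 0; //number of threads currently in the wait state
	public static final Object signal = new Object();   //lock object used for communication between threads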

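	public static void main(String[] args) throws InterruptedException {
		final WebCrawler wc = new WebCrawler();
//		wc.addUrl("http://www.126.com", 1);
		wc.addUrl("http://www.cnblogs.com", 1);
		long start= System.currentTimeMillis();
		System.out.println("Starting crawl.........................................");
		wc.begin();

		while(true){
			boolean allWaiting;
			synchronized(signal) { //count is updated under this lock, so read it under the same lock
				allWaiting = wc.count==wc.threadCount;
			}
			if(allWaiting){
				long end = System.currentTimeMillis();
				System.out.println("Crawled "+wc.allurlSet.size()+" pages in total");
				System.out.println("Total time: "+(end-start)/1000+" seconds");
				System.exit(0); //the workers are still blocked in wait(), so exit the JVM explicitly
			}
			Thread.sleep(100); //poll periodically instead of spinning at full speed
		}
	}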
	private void begin() {
		for(int i=0;i<threadCount;i++){
			new Thread(new Runnable(){
				public void run() {
//					System.out.println("entered "+Thread.currentThread().getName());
//					while(!notCrawlurlSet.isEmpty()){ ----------------------------------(1)
//						String tmp = getAUrl();
//						crawler(tmp);
//					}
					while (true) { 
//						System.out.println("entered "+Thread.currentThread().getName());
						String tmp = getAUrl();
						if(tmp!=null){
							crawler(tmp);
						}else{
							synchronized(signal) {  //------------------(2)
								try {
									count++;
									System.out.println("Threads currently waiting: "+count);
									signal.wait();
								} catch (InterruptedException e) {
									e.printStackTrace();
								}
							}
						}
					}
				}
			},"thread-"+i).start();
		}
	}
	public synchronized  String getAUrl() {
		if(notCrawlurlSet.isEmpty())
			return null;
		String tmpAUrl;
//		synchronized(notCrawlurlSet){
			tmpAUrl= notCrawlurlSet.get(0);
			notCrawlurlSet.remove(0);
//		}
		return tmpAUrl;
	}
//	public synchronized  boolean isEmpty() {
//		boolean f = notCrawlurlSet.isEmpty();
//		return f;
//	}

	public synchronized void  addUrl(String url,int d){
			notCrawlurlSet.add(url);
			allurlSet.add(url);
			depth.put(url, d);
	}

	//crawl the page at sUrl
	public  void crawler(String sUrl){
		URL url;
		try {
				url = new URL(sUrl);
//				HttpURLConnection urlconnection = (HttpURLConnection)url.openConnection(); 
				URLConnection urlconnection = url.openConnection();
				urlconnection.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
				InputStream is = urlconnection.getInputStream();//read through the connection so the User-Agent header is actually sent
				BufferedReader bReader = new BufferedReader(new InputStreamReader(is));
				StringBuffer sb = new StringBuffer();//sb holds the content of the fetched page
				String rLine = null;
				try {
					while((rLine=bReader.readLine())!=null){
						sb.append(rLine);
						sb.append("\r\n");
					}
				} finally {
					bReader.close();//release the connection even if reading fails
				}

				int d;
				synchronized(this) { //addUrl writes depth under this lock, so read it under the same lock
					d = depth.get(sUrl);
				}
				System.out.println("Crawled "+sUrl+" successfully, depth "+d+", by thread "+Thread.currentThread().getName());
				if(d<crawDepth){
					//parse the page content and extract links from it
					parseContext(sb.toString(),d+1);
				}
//				System.out.println(sb.toString());

		} catch (IOException e) {
//			crawlurlSet.add(sUrl);
//			notCrawlurlSet.remove(sUrl);
			e.printStackTrace();
		}
	}

	//extract URLs from context
	public  void parseContext(String context,int dep) {
	    String regex = "<a href.*?/a>";
//		String regex = "<title>.*?</title>";
		// String regex ="http://.*?>";
		Pattern pt = Pattern.compile(regex);
		Matcher mt = pt.matcher(context);
		while (mt.find()) {
//			System.out.println(mt.group());
			Matcher myurl = Pattern.compile("href=\".*?\"").matcher(
					mt.group());
			while(myurl.find()){
				String str = myurl.group().replaceAll("href=\"|\"", "");
//				System.out.println("URL: "+ str);
				if(str.contains("http:")){ //skip href values that are not absolute http URLs
					if(!allurlSet.contains(str)){
						addUrl(str, dep);//add a new URL
						if(count>0){ //wake a thread if any are waiting
							synchronized(signal) {  //---------------------(2)
								count--;
								signal.notify();
							}
						}

					}
				}
			}
		}
	}
}

I was stuck for a long time at the two places marked (1) and (2) above. Both come down to the same point about multithreading.

At first I used:

//					while(!notCrawlurlSet.isEmpty()){ ----------------------------------(1)
//						String tmp = getAUrl();
//						crawler(tmp);
//					}

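Each thread checked whether notCrawlurlSet was empty as soon as it started. With multiple threads, notCrawlurlSet is not empty at the beginning, so every thread entered the loop. Although getAUrl() was declared synchronized, as soon as one thread left getAUrl() another one entered it. Here is the original getAUrl method: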

	public synchronized  String getAUrl() {
		String tmpAUrl;
//		synchronized(notCrawlurlSet){
			tmpAUrl= notCrawlurlSet.get(0);
			notCrawlurlSet.remove(0);
//		}
		return tmpAUrl;
	}

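Every call removes one element from the notCrawlurlSet list. If the first thread's call leaves notCrawlurlSet empty, the next thread to enter fails, because get(0) on an empty list throws an IndexOutOfBoundsException. The emptiness check in the loop condition and the removal inside getAUrl are two separate steps, so another thread can empty the list between them. I later changed getAUrl to: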

	public synchronized  String getAUrl() {
		if(notCrawlurlSet.isEmpty())
			return null;
		String tmpAUrl;
//		synchronized(notCrawlurlSet){
			tmpAUrl= notCrawlurlSet.get(0);
			notCrawlurlSet.remove(0);
//		}
		return tmpAUrl;
	}
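
The point of the fix is that the emptiness check and the removal happen under one lock. As a minimal alternative sketch, not the code this crawler uses, a deque's pollFirst() folds both steps into a single call that returns null when nothing is left. The class name UrlFrontier and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: a thread-safe queue of URLs that have not been crawled yet. */
final class UrlFrontier {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Adds a URL to the tail of the queue. */
    synchronized void add(String url) {
        pending.addLast(url);
    }

    /** Removes and returns the head of the queue, or null if the queue is empty. */
    synchronized String next() {
        // pollFirst() returns null on an empty deque instead of throwing.
        return pending.pollFirst();
    }
}
```

Under this assumption getAUrl() could simply delegate to next(). A deque also removes its head in constant time, whereas ArrayList.remove(0) shifts every remaining element.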

The run method of each thread was changed to:

					while (true) { 
//						System.out.println("entered "+Thread.currentThread().getName());
						String tmp = getAUrl();
						if(tmp!=null){
							crawler(tmp);
						}else{
							synchronized(signal) {
								try {
									count++;
									System.out.println("Threads currently waiting: "+count);
									signal.wait();
								} catch (InterruptedException e) {
									e.printStackTrace();
								}
							}
						}
					}

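As soon as a thread starts, it calls getAUrl to take a URL from the notCrawlurlSet list. If it gets nothing, it waits on signal. But where should it be woken up? It must be when notCrawlurlSet has elements again, that is, when notCrawlurlSet is no longer empty. An important variable here is count, the number of threads currently waiting. A thread is woken only when count is greater than 0, so signal.notify() is called only when some thread is actually waiting. This is implemented in the parseContext function: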

				if(str.contains("http:")){ //skip href values that are not absolute http URLs
					if(!allurlSet.contains(str)){
						addUrl(str, dep);//add a new URL
						if(count>0){ //wake a thread if any are waiting
							synchronized(signal) {
								count--;
								signal.notify();
							}
						}
					}
				}
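
The wait/notify handshake described above is commonly written as a guarded wait: the condition is re-checked in a loop while holding the lock, which protects against spurious wakeups and against a notify that arrives before the wait. The following is a minimal sketch of that idiom, not the crawler's actual code. The class WaitingQueue and its methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of the guarded-wait idiom for a shared URL queue. */
final class WaitingQueue {
    private final Queue<String> urls = new ArrayDeque<>();
    private int waiting = 0; // number of threads currently blocked in take()

    /** Adds a URL and wakes one waiting thread, if there is one. */
    synchronized void put(String url) {
        urls.add(url);
        notify();
    }

    /** Blocks until a URL is available, then removes and returns it. */
    synchronized String take() throws InterruptedException {
        while (urls.isEmpty()) { // re-check the condition after every wakeup
            waiting++;
            try {
                wait(); // releases the lock while blocked
            } finally {
                waiting--;
            }
        }
        return urls.remove();
    }

    /** Returns how many threads are blocked in take() right now. */
    synchronized int waitingCount() {
        return waiting;
    }
}
```

Because the queue, the counter, and the wait all share one lock, the counter cannot be read or changed while another thread is halfway through updating it.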

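The count variable also solved another problem for me. Once all the threads were started and crawling pages correctly, I did not know how to stop them, because each thread loops forever. With count I know how many threads are waiting. When the number of waiting threads equals threadCount, the crawl is finished: every thread is waiting, so no new URL will be added to notCrawlurlSet, and all pages up to the specified depth have been crawled.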
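
Counting waiting threads is one way to detect the end of the crawl. A different technique, shown here only as a hedged sketch, is to count URLs that are queued or still being processed: the crawl is complete when that number drops to zero. The class PendingTracker and its method names are hypothetical and do not appear in the crawler above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: tracks URLs that are queued or currently being crawled. */
final class PendingTracker {
    private final AtomicInteger pending = new AtomicInteger();

    /** Call once for every URL added to the queue. */
    void urlQueued() {
        pending.incrementAndGet();
    }

    /**
     * Call once a URL is fully processed, after all links found on it
     * have been reported through urlQueued().
     * Returns true if this was the last outstanding URL.
     */
    boolean urlFinished() {
        return pending.decrementAndGet() == 0;
    }

    /** Returns true when nothing is queued or in progress. */
    boolean isDone() {
        return pending.get() == 0;
    }
}
```

The ordering matters: a page's links must be counted before the page itself is marked finished, otherwise the counter could touch zero while work is still being added.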

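One takeaway: understanding the principle is one thing, but implementing it can still take real effort.

The code has been revised several times. Here is what still needs improvement, along with my ideas:

1: Crawled pages need to be stored, and how to store them is itself a question. How should directories be generated? Should pages be classified automatically? For classification, a Bayesian classifier could be used, with pages stored by category once classified.

2: URL deduplication. What if there are too many URLs to fit in memory? One idea is to shrink each URL first, for example by hashing it with MD5 into a fixed 128-bit digest, which also serves as the hash value. The simplest approach is hash-based deduplication; a Bloom filter is another option. A further approach is to use a key-value database. I do not know key-value databases well, but they should work much like a hash table, with the efficiency problems already solved by the database.

3: Pages with different URLs may have the same content, so how should page similarity be judged? One approach is to extract the main text of each page first, for example with the line-block function method, and then compute similarity with the cosine of the two term vectors.

4: Incremental crawling. After a page is crawled, when should it be crawled again? This can be decided by each page's update frequency. For example, news on the Sina (新浪) home page probably updates more often, so it would be recrawled more frequently.

That is all I can think of for now; I will keep improving it.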
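
Items 2 and 3 above can each be sketched in a few lines. The code below is an illustration under stated assumptions, not part of the crawler: SeenUrls stores the MD5 digest of each URL in a HashSet, and CosineSimilarity compares two texts by the cosine of their term-count vectors. Both class names and all method names are hypothetical. The tokenizer splits on non-word characters, which suits space-delimited text only; Chinese text would need a real word segmenter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for item 2: remembers URLs by their MD5 digest. */
final class SeenUrls {
    private final Set<String> digests = new HashSet<>();

    /** Returns true if the URL had not been seen before. */
    synchronized boolean markSeen(String url) {
        return digests.add(md5Hex(url));
    }

    /** Returns the MD5 digest of the text as 32 lowercase hex characters. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}

/** Hypothetical helper for item 3: cosine similarity of term-count vectors. */
final class CosineSimilarity {
    /** Counts how often each lowercase token occurs in the text. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns a value in [0, 1]; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double x = e.getValue();
            normA += x * x;
            Integer y = b.get(e.getKey());
            if (y != null) {
                dot += x * y;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Storing digests bounds the memory used per URL but does not remove the limit: the set still grows with every new URL, which is where a Bloom filter or an external key-value store would come in.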
