Save the first image from a URL

Here's my problem. I have a text file called "sites.txt" in which I list arbitrary websites. My goal is to save the first image of each site. I tried filtering the server response for the img tag, which works for some sites but not for others.

On the sites where it works, the img src starts with http://. On the sites where it doesn't work, the src starts with something else.

I also tried prepending http:// to the img src values that didn't have it, but I still get the same error:

    Exception in thread "main" java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(Unknown Source)

My current code is:

    public static void main(String[] args) throws IOException{
    try {
        File file = new File ("sites.txt");
        Scanner scanner = new Scanner (file);
        String url;
        int counter = 0;
            while(scanner.hasNext()) 
                {   
                    url=scanner.nextLine();
                    URL page = new URL(url);
                    URLConnection yc = page.openConnection();
                       BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                       String inputLine = in.readLine();
                       while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
                       in.close();
                       String[] parts = inputLine.split(" ");
                       int i=0;
                       while(!parts[i].contains("src"))i++;
                       String destinationFile = "image"+(counter++)+".jpg";
                       saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
                       String tmp=scanner.nextLine();
                       System.out.println(url);

                }
        scanner.close();
        }
            catch (FileNotFoundException e) 
            {
                System.out.println ("File not found!");
                System.exit (0);
            }

}

public static void saveImage(String imageUrl, String destinationFile) throws IOException {
    // TODO Auto-generated method stub
    URL url = new URL(imageUrl);
    String fileName = url.getFile();
    String destName = fileName.substring(fileName.lastIndexOf("/"));
    System.out.println(destName);
    InputStream is = url.openStream();
    OutputStream os = new FileOutputStream(destinationFile);

    byte[] b = new byte[2048];
    int length;

    while ((length = is.read(b)) != -1) {
        os.write(b, 0, length);
    }

    is.close();
    os.close();
}

I was also advised to use the Apache Jakarta HTTP client libraries, but I have no idea how to use them. I would appreciate any help.

3 Solutions

#1

A URL (a type of URI) requires a scheme in order to be valid. In this case, http.

When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.

Make sure the URL always starts with http://. You can fix this with a regex, or more simply with a startsWith check:

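// Option 1: regex. Prepends http:// when the first seven characters are not "http://".
// Note: it also rewrites https:// URLs and leaves strings shorter than seven characters untouched.
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");

// Option 2: a plain prefix check.
if(!stringUrl.startsWith("http://"))
    stringUrl = "http://" + stringUrl;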
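
Both variants above only recognise http://, so an https:// URL would get a second scheme prepended. The following is a minimal sketch of a prefix check that also accepts https:// and protocol-relative values. The class name UrlSchemeFixer and the method ensureScheme are hypothetical, and defaulting to http is an assumption rather than something the original answer specifies.

```java
import java.util.Locale;

final class UrlSchemeFixer {

    // Returns the input unchanged if it already starts with http:// or https://,
    // otherwise prepends a default "http" scheme.
    static String ensureScheme(String rawUrl) {
        String trimmed = rawUrl.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return trimmed;
        }
        if (trimmed.startsWith("//")) {
            // Protocol-relative value: only the scheme is missing.
            return "http:" + trimmed;
        }
        return "http://" + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(ensureScheme("www.google.com"));      // http://www.google.com
        System.out.println(ensureScheme("https://google.com/")); // https://google.com/
    }
}
```

This only repairs a missing scheme. It does not validate the host, and it does not resolve paths such as /images/logo.png, which need the page URL as context.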

#2

An alternative solution

Try ImageIO, which provides static convenience methods for locating ImageReaders and ImageWriters and for performing simple encoding and decoding.

Sample code:

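import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;
import javax.imageio.ImageIO;

// read an image from the URL (ImageIO.read returns null if no reader can decode it)
// I used the URL that is your profile pic on *
BufferedImage image = ImageIO
        .read(new URL(
                "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));

// save the image (the resources directory must already exist)
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));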
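
ImageIO decodes the image and re-encodes it when writing, so the saved file is not byte-for-byte what the server sent. If you only want the file as served, you can copy the stream instead. This is a minimal sketch of that alternative, not part of the answer above. RawImageSaver and saveRaw are hypothetical names, and the demonstration uses a local file: URL so it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class RawImageSaver {

    // Copies the bytes behind imageUrl to target unchanged, replacing any
    // existing file, and returns the number of bytes written.
    static long saveRaw(URL imageUrl, Path target) throws IOException {
        try (InputStream in = imageUrl.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a remote image: a small temporary file.
        Path source = Files.createTempFile("source", ".bin");
        Files.write(source, new byte[] {1, 2, 3, 4});
        Path target = Files.createTempFile("copy", ".bin");

        long copied = saveRaw(source.toUri().toURL(), target);
        System.out.println(copied + " bytes copied");
    }
}
```

The try-with-resources block closes the stream even when the copy fails, which the manual read/write loop in the question does not guarantee.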

#3

When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.

Some examples are:

resource = https://google.com/images/srpr/logo9w.png

resource = google.com/images/srpr/logo9w.png

resource = //google.com/images/srpr/logo9w.png

resource = /images/srpr/logo9w.png

resource = images/srpr/logo9w.png

For the second through fifth ones, you'll need to build the rest of the URL.

The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.

The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:

url = site.getProtocol() + ":" + resource

For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.

Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.

import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class SiteScraper {

    public static void main(String[] args) throws IOException {
        URL site = new URL("https://google.com/");
        Document doc = Jsoup.connect(site.toString()).get();
        Elements images = doc.select("img");
        for (Element image : images) {
            String src = image.attr("src");
            System.out.println(buildResourceUrl(site, src));
        }
    }

    static URL buildResourceUrl(URL site, String resource) 
            throws MalformedURLException {
        if (!resource.matches("^(http|https|ftp)://.*$")) {
            if (resource.startsWith("//")) {
                return new URL(site.getProtocol() + ":" + resource);
            } else {
                return new URL(site.getProtocol() + "://" + site.getHost() + "/" 
                        + resource.replaceAll("^/", ""));
            }
        }
        return new URL(resource);
    }
}

This obviously won't cover everything, but it's a start. You may run into problems when the page you're scraping is in a subdirectory of the site root (e.g., http://some.place/under/the/rainbow.html), because a relative src such as images/logo.png is then relative to that subdirectory rather than to the root. You may even encounter base64-encoded data URIs in the src attribute. It really depends on the individual case and how far you're willing to go.
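
For the subdirectory problem, one option is to let java.net.URI do the resolution against the full page URL instead of concatenating strings. This is a minimal sketch under that assumption, not the method the sample application uses. ResourceResolver and resolve are hypothetical names, and the page URL is the example from the paragraph above with https as an assumed scheme.

```java
import java.net.URI;

final class ResourceResolver {

    // Resolves a src attribute value against the URL of the page it was found on.
    // Absolute URLs come back unchanged; "//host/...", "/path" and "path" forms
    // are completed from the page URL.
    static URI resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim());
    }

    public static void main(String[] args) {
        String page = "https://some.place/under/the/rainbow.html";
        String[] samples = {
            "https://google.com/images/srpr/logo9w.png",
            "//google.com/images/srpr/logo9w.png",
            "/images/srpr/logo9w.png",
            "images/srpr/logo9w.png"
        };
        for (String src : samples) {
            System.out.println(src + " -> " + resolve(page, src));
        }
    }
}
```

A scheme-less value such as google.com/images/srpr/logo9w.png (the second case) is still treated as a relative path here, so it needs separate handling. URI.create throws IllegalArgumentException for values that are not valid URIs, for example a src containing unescaped spaces.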
