
Time: 2022-10-16 09:00:54

Here's my problem. I have a text file called "sites.txt" in which I list arbitrary websites. My goal is to save the first image of each site. I tried filtering the server response for the img tag, and it works for some sites but not for others.


On the sites where it works, the img src starts with http://. On the sites where it fails, the src starts with something else.

I also tried prepending http:// to the img src values that lacked it, but I still get the same error:

    Exception in thread "main" java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(Unknown Source)

My current code is:


    public static void main(String[] args) throws IOException {
        try {
            File file = new File("sites.txt");
            Scanner scanner = new Scanner(file);
            String url;
            int counter = 0;
            // One site URL per line of sites.txt
            while (scanner.hasNextLine()) {
                url = scanner.nextLine();
                URL page = new URL(url);
                URLConnection yc = page.openConnection();
                BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                // Skip ahead to the first line of HTML that mentions "img"
                String inputLine = in.readLine();
                while (!inputLine.toLowerCase().contains("img")) inputLine = in.readLine();
                String[] parts = inputLine.split(" ");
                int i = 0;
                String destinationFile = "image" + (counter++) + ".jpg";
                // Strip the leading src=" and the trailing quote (assumes parts[i] is the src attribute)
                saveImage(parts[i].substring(5, parts[i].length() - 1), destinationFile);
                // Consumes the following line of sites.txt, as in the original
                String tmp = scanner.nextLine();
            }
        } catch (FileNotFoundException e) {
            System.out.println("File not found!");
            System.exit(0);
        }
    }

    public static void saveImage(String imageUrl, String destinationFile) throws IOException {
        URL url = new URL(imageUrl);
        String fileName = url.getFile();
        // Not used below; kept from the original
        String destName = fileName.substring(fileName.lastIndexOf("/"));
        // try-with-resources closes both streams, even if the copy fails
        try (InputStream is = url.openStream();
             OutputStream os = new FileOutputStream(destinationFile)) {
            byte[] b = new byte[2048];
            int length;
            // Copy the response body to the destination file in 2 KB chunks
            while ((length = is.read(b)) != -1) {
                os.write(b, 0, length);
            }
        }
    }

I was also advised to use the Apache Jakarta HTTP client libraries, but I have no idea how to use them. I would appreciate any help.

3 solutions



A URL (a type of URI) requires a scheme in order to be valid. In this case, http.


When you type www.google.com into your browser, the browser infers that you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.

Make sure the URL always starts with http://. You can fix this with a regex:

    String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");

or with a plain check:

    if (!stringUrl.startsWith("http://"))
        stringUrl = "http://" + stringUrl;
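Note that the regex above only matches inputs of at least seven characters, and it turns an https:// URL into http://https://... because the lookahead only rules out http://. A minimal sketch of a more forgiving check follows; the class name `SchemeFixer` and the method `ensureScheme` are made up for illustration, and defaulting to http is an assumption rather than something the question requires.

```java
import java.util.Locale;

public class SchemeFixer {

    // Returns the input unchanged when it already has an http/https scheme,
    // otherwise prepends "http" (assumed default scheme).
    static String ensureScheme(String raw) {
        String s = raw.trim();
        String lower = s.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return s;
        }
        if (s.startsWith("//")) {
            // Protocol-relative form such as //host/path
            return "http:" + s;
        }
        return "http://" + s;
    }

    public static void main(String[] args) {
        System.out.println(ensureScheme("www.google.com"));
        System.out.println(ensureScheme("https://example.org/a.png"));
        System.out.println(ensureScheme("//example.org/a.png"));
    }
}
```

This only repairs strings that name a host. A path-only src such as /images/logo.png still has to be resolved against the page it came from.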



An alternative solution


Alternatively, try ImageIO, which provides static convenience methods for locating ImageReaders and ImageWriters and for performing simple encoding and decoding.


Sample code:

    // read an image from the URL
    // (the original used the URL of your profile picture; the URL itself is missing here)
    BufferedImage image = ImageIO.read(new URL("..."));

    // save the image
    ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
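Since the URL in the sample above is missing, here is a self-contained sketch of the same ImageIO round trip that runs offline. It first writes a small image to a temporary file and then reads it back through a file: URL, which stands in for the remote image. The class `ImageCopy` and the method `copyImage` are illustrative names, not part of the original answer. It writes PNG instead of JPEG because some JPEG writers may reject images with an alpha channel.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

public class ImageCopy {

    // Reads an image from a URL and re-encodes it as PNG.
    // Returns false when the source cannot be decoded or no PNG writer is available.
    static boolean copyImage(URL source, File target) throws IOException {
        // ImageIO.read returns null when no registered reader can decode the data
        BufferedImage image = ImageIO.read(source);
        if (image == null) {
            return false;
        }
        return ImageIO.write(image, "png", target);
    }

    public static void main(String[] args) throws IOException {
        // Create a 4x3 sample image locally so the example needs no network access
        File original = File.createTempFile("sample", ".png");
        ImageIO.write(new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB), "png", original);

        File copy = File.createTempFile("copy", ".png");
        boolean ok = copyImage(original.toURI().toURL(), copy);
        System.out.println("copied: " + ok + ", width: " + ImageIO.read(copy).getWidth());
    }
}
```

The null check matters when scraping: an src that points at an SVG or an HTML error page will not decode, and passing null to ImageIO.write throws an IllegalArgumentException.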



When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.


Some examples are:


  1. resource = https://google.com/images/srpr/logo9w.png
  2. resource = google.com/images/srpr/logo9w.png
  3. resource = //google.com/images/srpr/logo9w.png
  4. resource = /images/srpr/logo9w.png
  5. resource = images/srpr/logo9w.png

For the second through fifth ones, you'll need to build the rest of the URL.


The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
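One possible workaround, sketched below, is to guess from the first path segment: if it looks like a dotted host name ending in an alphabetic top-level domain, treat the string as the second case, otherwise as a relative path. This is only a heuristic and the names `HostGuess` and `looksLikeHost` are hypothetical. A directory that happens to be called something like assets.io would be misclassified.

```java
public class HostGuess {

    // Heuristic only: true when a scheme-less resource appears to begin with a host name.
    static boolean looksLikeHost(String resource) {
        if (resource.startsWith("/") || resource.contains("://")) {
            return false; // root-relative, protocol-relative, or already absolute
        }
        int slash = resource.indexOf('/');
        if (slash <= 0) {
            return false; // a bare file name such as logo.png
        }
        String first = resource.substring(0, slash);
        // Labels separated by dots, ending in a TLD made of at least two letters
        return first.matches("(?i)[a-z0-9-]+(\\.[a-z0-9-]+)*\\.[a-z]{2,}");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHost("google.com/images/srpr/logo9w.png"));
        System.out.println(looksLikeHost("images/srpr/logo9w.png"));
    }
}
```

When the check returns true, the resource can be treated like the third case by prepending the scheme and "//".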


The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:


    url = site.getProtocol() + ":" + resource;

For the fourth and fifth cases, you'll need to prepend the site's base URL to the resource.


Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.


    import java.io.*;
    import java.net.*;
    import org.jsoup.*;
    import org.jsoup.nodes.*;
    import org.jsoup.select.*;

    public class SiteScraper {

        public static void main(String[] args) throws IOException {
            URL site = new URL("https://google.com/");
            Document doc = Jsoup.connect(site.toString()).get();
            Elements images = doc.select("img");
            // Print the absolute URL of every image on the page
            for (Element image : images) {
                String src = image.attr("src");
                System.out.println(buildResourceUrl(site, src));
            }
        }

        static URL buildResourceUrl(URL site, String resource)
                throws MalformedURLException {
            if (!resource.matches("^(http|https|ftp)://.*$")) {
                if (resource.startsWith("//")) {
                    // Protocol-relative: reuse the site's scheme
                    return new URL(site.getProtocol() + ":" + resource);
                } else {
                    // Root-relative or relative: resolve against the site root
                    return new URL(site.getProtocol() + "://" + site.getHost() + "/"
                            + resource.replaceAll("^/", ""));
                }
            }
            // Already absolute
            return new URL(resource);
        }
    }

This obviously won't cover everything, but it's a start. You may run into problems when the page you're scraping is in a subdirectory of the site root (e.g., http://some.place/under/the/rainbow.html), because relative paths should then resolve against that subdirectory rather than the root. You may even encounter base64-encoded data URIs in the src attribute. It really depends on the individual case and how far you're willing to go.
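For the subdirectory problem, one option is to let java.net.URI do the resolution against the full page URL instead of concatenating strings. The sketch below is an alternative to buildResourceUrl, not the answer's own method. The class `ResourceResolver` and the choice to return null for data: URIs are assumptions made for illustration.

```java
import java.net.URI;

public class ResourceResolver {

    // Resolves an img src against the URL of the page it was found on.
    // Returns null for data: URIs, which embed the image instead of pointing to it.
    static URI resolve(URI page, String src) {
        String s = src.trim();
        if (s.regionMatches(true, 0, "data:", 0, 5)) {
            return null;
        }
        // Throws IllegalArgumentException if src is not a syntactically valid URI
        return page.resolve(s);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://some.place/under/the/rainbow.html");
        // Relative path: resolved against the page's directory
        System.out.println(resolve(page, "images/logo.png"));
        // Root-relative path: resolved against the site root
        System.out.println(resolve(page, "/images/logo.png"));
        // Protocol-relative: inherits the page's scheme
        System.out.println(resolve(page, "//google.com/images/srpr/logo9w.png"));
        // Already absolute: returned unchanged
        System.out.println(resolve(page, "https://google.com/images/srpr/logo9w.png"));
    }
}
```

This covers the first, third, fourth, and fifth representations. The scheme-less host form is still indistinguishable from a relative path here. If you are already parsing with jsoup, `image.absUrl("src")` performs a comparable resolution against the document's base URI.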



