Scraping data from a string of HTML markup with CSS selectors in the browser, without creating DOM elements?

Time: 2021-12-26 09:47:05

I have been struggling with this seemingly simple task for hours. None of the available libraries seems to help, and no existing question here seems to cover this scenario.

It's fairly simple:


  • I have an entire page's markup as a string.

  • I need to use CSS selectors to point to the elements I want to scrape data from.

  • I DO NOT want to create actual HTML DOM elements, only scrape the data from them. The page might contain image, audio, video and other elements that I don't want to create.

  • It needs to cope with markup errors and HTML5-style tagging. Currently, trying to parse the markup as XML throws an "Invalid XML" exception.

  • It needs to happen in the browser, so no Node.js modules.

In Java I've been able to do exactly this using jsoup, but there doesn't seem to be an equivalent library for JavaScript running in a browser.

Thanks for your time.


2 solutions

#1



@JaromandaX's suggestion was correct. One way to do this is with a DOMParser object. It builds the elements in a separate document, where you can use .querySelector or .querySelectorAll on them, without loading any external resources or running any scripts.

This is what worked for me:


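// markup is the string holding the page's HTML
var parser = new DOMParser();
// "text/html" selects the error-tolerant HTML parser instead of the XML one
var doc = parser.parseFromString(markup, "text/html");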
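
Once the markup is parsed, the resulting document can be queried like any other. The following is a minimal sketch, assuming the page markup is already available as a string. The helper names `parseMarkup`, `scrapeText` and `scrapeAttribute` and the sample markup are hypothetical and not part of the original answer.

```javascript
// Parse an HTML string into a separate Document using the HTML parser,
// which repairs malformed markup instead of rejecting it.
function parseMarkup(markup) {
  return new DOMParser().parseFromString(markup, "text/html");
}

// Return the trimmed text of every element matching a CSS selector.
function scrapeText(markup, selector) {
  const doc = parseMarkup(markup);
  return Array.from(doc.querySelectorAll(selector), (el) => el.textContent.trim());
}

// Return one attribute's value for every matching element
// (null for elements that lack the attribute).
function scrapeAttribute(markup, selector, attribute) {
  const doc = parseMarkup(markup);
  return Array.from(doc.querySelectorAll(selector), (el) => el.getAttribute(attribute));
}

// Example usage; guarded so it only runs where DOMParser exists (a browser).
if (typeof DOMParser !== "undefined") {
  // Hypothetical sample with unclosed <li> tags, which the HTML parser tolerates.
  const sample = '<ul><li class="item">One<li class="item">Two</ul><img src="a.png">';
  console.log(scrapeText(sample, "li.item"));          // ["One", "Two"]
  console.log(scrapeAttribute(sample, "img", "src"));  // ["a.png"]
}
```

The parsed document is never attached to the page, so the scraped elements stay out of the visible DOM. Reading `getAttribute("src")` returns the raw attribute string as written in the markup.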

#2



You can use PHP's Goutte or Python's BeautifulSoup4 library. Both let you select elements with CSS selectors, and Goutte also supports XPath, so use whichever you are comfortable with.

Here are some simple examples to get started.


PHP Goutte:

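<?php
require_once 'vendor/autoload.php';

use Goutte\Client;

$url = 'your URL here';

$client = new Client();
// request() returns a crawler for the fetched page
$crawler = $client->request('GET', $url);

// Iterating the filtered crawler yields the matching DOM nodes
foreach ($crawler->filter('your css selector here') as $node) {
    // your logic here, e.g. echo $node->textContent;
}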

Python BeautifulSoup example:


import random
import time

import requests
from bs4 import BeautifulSoup

timeout_time = 30  # request timeout in seconds

# Pool of User-Agent headers; one is picked at random for each request
header = [{"User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1"},
{"User-Agent":"Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"},
{"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"},
{"User-Agent":"Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"}]


def tryAgain(passed_url):
    """Fetch passed_url and return its HTML, retrying until a request succeeds."""
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while True:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                # Wait 20 seconds before the next attempt
                time.sleep(20)
                continue


main_url = " your URL here "

main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Print the text of the first nested match inside each matched element
for a in main_page_soup.select(' css selector here '):
    matches = a.select(' your css selector here ')
    if matches:
        print(matches[0].text)
