How do I do a breadth-first search with Beautiful Soup?

Time: 2022-11-29 13:11:26

I am trying to do a breadth-first search on a Beautiful Soup tree. I know we can do a depth-first search with Beautiful Soup like this:

from bs4 import BeautifulSoup

html = """SOME HTML FILE"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

for child in soup.recursiveChildGenerator():
    # do some stuff here
    pass

But I have no idea how to do a breadth-first search. Does anyone have an idea or suggestion?

Thanks for your help.

2 solutions

#1



Use the .children generator of each element to add its children to a breadth-first queue:

from collections import deque

import requests
from bs4 import BeautifulSoup

html = requests.get("https://*.com/questions/44798715/").text
soup = BeautifulSoup(html, "html5lib")
queue = deque([([], soup)])  # FIFO queue of (path, element) pairs
while queue:
    path, element = queue.popleft()  # O(1), unlike list.pop(0)
    if hasattr(element, 'children'):  # leaf nodes (strings) have no children
        for child in element.children:
            # label each step with the tag name, or the node type for strings
            queue.append((path + [child.name if child.name is not None else type(child)],
                          child))
    # do stuff
    print(path, repr(element.string[:50]) if element.string else type(element))
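
To try the loop above without a network request, here is a minimal sketch that runs the same queue-based traversal on an inline HTML string. The function name `bfs_paths`, the sample markup, and the choice to record only tags are assumptions made for illustration; they are not part of the answer.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag


def bfs_paths(html):
    """Return the tag-name path of every tag, in breadth-first order."""
    soup = BeautifulSoup(html, "html.parser")
    queue = deque([([], soup)])  # FIFO queue of (path, element) pairs
    paths = []
    while queue:
        path, element = queue.popleft()
        if path:  # skip the document root, whose path is empty
            paths.append("/".join(path))
        for child in element.children:
            if isinstance(child, Tag):  # ignore text nodes and comments
                queue.append((path + [child.name], child))
    return paths


# Hypothetical sample markup, chosen so that depth and document order differ.
sample = "<ul><li><b>one</b></li><li>two</li></ul><p>three</p>"
for p in bfs_paths(sample):
    print(p)
```

The paths come out level by level: `ul` and `p` first, then both `ul/li` entries, and only then `ul/li/b`, even though `<b>` appears before `<p>` in the document.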

#2



To traverse an HTML document parsed by BeautifulSoup with DFS or BFS, do the following.

solution.py:

import bs4
from bs4 import BeautifulSoup

html = """
<div>root
     <div>child1
          <div>child4
          </div>
          <div>child5
          </div>
     </div>
     <div>child2
     </div>
     <div>child3
          <div>child6
          </div>
     </div>
</div>
"""

Append these lines to solution.py:

def visit(node):
    if isinstance(node, bs4.element.Tag):
        # careful: bs4.BeautifulSoup is a subclass of Tag and also lands here
        print(type(node), 'tag:', node.name)
    elif isinstance(node, bs4.element.NavigableString):
        # careful: bs4.element.CData and bs4.element.Comment are subclasses
        print(type(node), repr(node.string))
    else:
        print(type(node), 'UNKNOWN')
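
The subclass caveat in the comments of `visit` matters when the node kinds need to be told apart. As a rough sketch (the function name `classify` and the sample markup are made up for illustration), testing the most specific class first keeps the document root and comments from being lumped in with ordinary tags and text:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def classify(node):
    """Label a node, testing the most specific classes first."""
    if isinstance(node, BeautifulSoup):  # subclass of Tag
        return "document"
    if isinstance(node, Tag):
        return "tag"
    if isinstance(node, Comment):  # subclass of NavigableString
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "unknown"


soup = BeautifulSoup("<p>hi<!-- note --></p>", "html.parser")
labels = [classify(soup)] + [classify(n) for n in soup.descendants]
print(labels)  # ['document', 'tag', 'text', 'comment']
```

If the `Tag` and `NavigableString` checks came first, the root would be reported as a plain tag and the comment as plain text.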

And:

from collections import deque


def dfs(html):
    bs = BeautifulSoup(html, 'html.parser')
    # the root is <class 'bs4.BeautifulSoup'>, named [document]
    visit(bs)
    # .descendants is the current name of recursiveChildGenerator()
    for child in bs.descendants:
        visit(child)


def bfs(html):
    bs = BeautifulSoup(html, 'html.parser')
    # the root is <class 'bs4.BeautifulSoup'>, named [document]
    visit(bs)
    for child in recursiveChildGeneratorBfs(bs):
        visit(child)


def recursiveChildGeneratorBfs(bs):
    queue = deque([bs])  # FIFO queue; popleft() is O(1), unlike list.pop(0)
    while queue:
        node = queue.popleft()
        if node is not bs:  # the caller visits the root itself
            yield node
        if hasattr(node, 'children'):  # strings have no children
            for child in node.children:
                queue.append(child)

In the IPython console:

In [1]: run solution.py

BFS:

In [2]: bfs(html)
<class 'bs4.BeautifulSoup'> tag: [document]
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'root\n     '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child1\n          '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child2\n     '
<class 'bs4.element.NavigableString'> 'child3\n          '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child4\n          '
<class 'bs4.element.NavigableString'> 'child5\n          '
<class 'bs4.element.NavigableString'> 'child6\n          '

DFS:

In [3]: dfs(html)
<class 'bs4.BeautifulSoup'> tag: [document]
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'root\n     '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child1\n          '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child4\n          '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child5\n          '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child2\n     '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child3\n          '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child6\n          '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
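
The whitespace-only strings make the two listings above hard to compare. The following sketch, which is not part of the answer, visits only the tags and labels each `<div>` by its own leading text. The helper names `label`, `dfs_labels` and `bfs_labels` are hypothetical, and the markup is a condensed copy of the `html` string in solution.py.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div>root
     <div>child1
          <div>child4</div>
          <div>child5</div>
     </div>
     <div>child2</div>
     <div>child3
          <div>child6</div>
     </div>
</div>
"""


def label(tag):
    """Return the tag's own leading text, e.g. 'child1'."""
    return next(tag.strings).strip()


def dfs_labels(soup):
    # .descendants walks the tree depth-first, in document order
    return [label(t) for t in soup.descendants if isinstance(t, Tag)]


def bfs_labels(soup):
    queue, labels = deque([soup]), []
    while queue:
        node = queue.popleft()
        for child in node.children:
            if isinstance(child, Tag):  # skip strings, keep only tags
                labels.append(label(child))
                queue.append(child)
    return labels


soup = BeautifulSoup(html, "html.parser")
print("BFS:", bfs_labels(soup))
print("DFS:", dfs_labels(soup))
```

BFS lists `root, child1, child2, child3, child4, child5, child6`, while DFS lists `root, child1, child4, child5, child2, child3, child6`: the breadth-first order finishes each level before descending, and the depth-first order finishes each subtree before moving to the next sibling.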

See: Documentation
