Finding the direct children of an element

Time: 2023-01-22 20:59:48

I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.


The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case, (to extract this link), you have to skip over the parentheses, and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or if they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>


The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element as it goes, and first checking whether the string contains either a '(' or an '<a' (the start of an anchor tag).


Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?

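To make the failure mode concrete, here is a minimal self-contained sketch (with made-up HTML modeled on the snippet above) showing that a plain descendant search finds the embedded "Coordinates" anchor rather than the first link that is actually a direct child of the <p>:

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the snippet above: the first anchor in
# document order is buried inside nested spans, while the link we actually
# want is a direct child of the <p>.
html = """
<p>
  <span style="font-size: small;"><span id="coordinates">
    <a href="/wiki/Geographic_coordinate_system">Coordinates</a>
  </span></span>
  Sierra Leone is a country in
  <a href="/wiki/West_Africa">West Africa</a>.
</p>
"""

p = BeautifulSoup(html, "html.parser").p

# Searches all descendants, so it returns the embedded 'Coordinates' link
# instead of the direct-child 'West Africa' link.
print(p.find("a")["href"])   # /wiki/Geographic_coordinate_system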

Below is the function in question, for reference:

def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False

    # First pass: walk the parse tree to see whether a '(' appears
    # before the first anchor tag.
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    # Second pass: if a parenthesis was found, skip past the balanced
    # parentheses and take the next anchor tag after them.
    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print("\nReached article with no main body!\n")
            return None
    try:
        return str(link.attrs['href'])
    except KeyError:
        print("\nReached article with no main body\n")
        return None

1 Answer

#1



I think you are seriously overcomplicating the problem.


There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:


In [1]: import requests  

In [2]: from bs4 import BeautifulSoup   

In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"    

In [4]: response = requests.get(url)    

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]: 
['West Africa',
 'Guinea',
 'Liberia',
 ...
 'Allen Iverson',
 'Magic Johnson',
 'Victor Oladipo',
 'Frances Tiafoe']

Here we've found the a elements located directly under the p elements, which are themselves directly under the element with id="mw-content-text" - from what I understand, this is where the main Wikipedia article body is located.

If you need a single element, use select_one() instead of select().

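For example, a minimal sketch (reusing the same URL and selector as above) that grabs only the first such link might look like this:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Sierra_Leone"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_link = soup.select_one("#mw-content-text > p > a")
if first_link is not None:
    print(first_link.get("href"))  # href of the first direct-child link ('West Africa' in the session above)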

Also, if you want to solve it via find*(), pass the recursive=False argument.

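For instance, a rough equivalent using find()/find_all() with recursive=False (assuming the page keeps the same structure as in the session above) could look like:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Sierra_Leone"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

content = soup.find(id="mw-content-text")
# recursive=False restricts the search to direct children only.
first_paragraph = content.find("p", recursive=False)
if first_paragraph is not None:
    direct_links = first_paragraph.find_all("a", recursive=False)
    print([a.get("href") for a in direct_links])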
