
Time: 2022-08-22 21:36:43


Sometimes the HTML is:


<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it's just:


<div id="1">
    this is the text i want here
</div>

I want only the text that sits directly inside the outer tag, ignoring the text of any child tags. If I use the .text property, I get both.


2 Solutions


Updated to use a more generic method (see edit history for original answer):


You can pick out the text that belongs directly to the outer div by iterating over its children and keeping only those that are instances of NavigableString.


from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
outer = soup.div
# Iterating over a tag yields its direct children; keep only the text nodes.
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This results in a list of the strings that are direct children of the outer div element.


>>> inner_text
['\n    ', '\n    this is the text i want here\n']
>>> ''.join(inner_text)
'\n    \n    this is the text i want here\n'

For your second example:


html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
['\n    this is the text i want here\n']

This also works when the outer div's text comes before any child tags, sits between child tags, is split into several text nodes, or is absent altogether.
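
Those other cases can be checked with a small sketch. The helper name direct_text and the three test snippets below are made up for illustration and are not part of the original answer; the helper also strips whitespace and drops empty strings, which the list comprehension above does not do.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Return the non-empty, stripped text nodes that are direct children of tag."""
    pieces = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:
                pieces.append(text)
    return pieces


# Hypothetical test documents, not taken from the question.
before = BeautifulSoup('<div>first<p>skip</p></div>', 'html.parser').div
between = BeautifulSoup('<div><p>a</p>middle<p>b</p>last</div>', 'html.parser').div
empty = BeautifulSoup('<div><p>only nested</p></div>', 'html.parser').div

print(direct_text(before))   # ['first']
print(direct_text(between))  # ['middle', 'last']
print(direct_text(empty))    # []
```

One caveat: HTML comments are also NavigableString instances in bs4 (Comment is a subclass), so a comment that is a direct child of the tag would be picked up by this test as well.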



Another possible approach (I would wrap it in a function):


def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

recursive=False indicates that you want only direct children, not nested ones. And text=True indicates that you want only text nodes.


Usage example:


from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>"""
soup = BeautifulSoup(html, 'html.parser')
print(getText(soup.div))
#this is the text i want here
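
A quick way to see what recursive=False changes is to run the same search with and without it. This is only a sketch: it assumes the nested HTML from the question with its closing tags restored, passes 'html.parser' explicitly, and uses string=True, which is the newer spelling of text=True (available from Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>"""

outer = BeautifulSoup(html, "html.parser").div

# Only text nodes that are direct children of the outer div.
direct = outer.find_all(string=True, recursive=False)
# Every text node below the outer div (recursive=True is the default).
nested = outer.find_all(string=True)

direct_text = "".join(direct).strip()
all_text = " ".join(s.strip() for s in nested if s.strip())

print(direct_text)  # this is the text i want here
print(all_text)     # this is the text i do NOT want this is the text i want here
```

Without recursive=False the search behaves much like .text and returns the nested div's text too, which is exactly the problem described in the question.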

