Get the text of an HTML tag without the text of its inner child tags

Date: 2022-08-22 21:36:43

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it's just:

<div id="1">
    this is the text i want here
</div>

I want only the text that belongs directly to the outer tag, ignoring the text of any child tags. If I use the .text property, I get both.
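A minimal reproduction of the problem, assuming Beautiful Soup 4 and the built-in "html.parser" backend (the parser choice and the variable names are assumptions for illustration, not part of the original question):

```python
from bs4 import BeautifulSoup

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="1")

# .text (an alias for get_text()) concatenates every descendant string,
# so the nested div's text comes back along with the outer div's own text.
all_text = outer.text
print("NOT want" in all_text)   # True
print("want here" in all_text)  # True
```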

2 Solutions

#1


Updated to use a more generic method (see edit history for original answer):

You can extract the text that sits directly inside the outer div by iterating over its children and keeping only those that are instances of NavigableString.

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This results in a list of strings contained in the outer div element.

>>> inner_text
['\n    ', '\n    this is the text i want here\n']
>>> ''.join(inner_text)
'\n    \n    this is the text i want here\n'

For your second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
['\n    this is the text i want here\n']

This also works in other cases: when the outer div's text appears before any child tags, between child tags, in several separate pieces, or not at all.
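The cases listed above can be checked with a small sketch. The helper name direct_strings and the four HTML snippets are hypothetical, made up for illustration; the sketch also assumes the built-in "html.parser" backend and strips surrounding whitespace, which the answer's own code does not do.

```python
from bs4 import BeautifulSoup, NavigableString

def direct_strings(tag):
    """Return the non-empty, stripped strings that are direct children of tag."""
    return [child.strip() for child in tag.children
            if isinstance(child, NavigableString) and child.strip()]

# Hypothetical snippets, one per case mentioned above.
cases = {
    "before": "<div>lead <b>x</b></div>",
    "between": "<div><b>x</b> mid <i>y</i></div>",
    "multiple": "<div>one <b>x</b> two</div>",
    "none": "<div><b>x</b></div>",
}

results = {
    name: direct_strings(BeautifulSoup(markup, "html.parser").div)
    for name, markup in cases.items()
}
print(results)
```

For these snippets the lists are ['lead'], ['mid'], ['one', 'two'] and [] respectively. One caveat: Comment is a subclass of NavigableString, so an HTML comment placed directly inside the tag would be picked up by this test as well.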

#2


Another possible approach (I would wrap it in a function):

def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

recursive=False restricts the search to direct children, not nested ones, and text=True matches only text nodes. In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.

Usage example:

from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(getText(soup.div))
# this is the text i want here
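The same idea applied to the question's second example, as a sketch only: the function name get_direct_text is hypothetical, the parser is assumed to be the built-in "html.parser", and string=True is used as the newer spelling of text=True (available from Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

def get_direct_text(parent):
    # Collect only the strings that are direct children of parent,
    # join them, and trim the surrounding whitespace.
    return "".join(parent.find_all(string=True, recursive=False)).strip()

html = '''<div id="1">
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, "html.parser")
result = get_direct_text(soup.div)
print(result)  # this is the text i want here
```

When the tag has no nested children, the function simply returns the tag's own text, so the same call covers both shapes of HTML from the question.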
