Python code to remove HTML tags from a string [duplicate]

This question already has an answer here:

Strip HTML from strings in Python 20 answers

I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

Using pure Python, with no external modules, I want to get this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, for 2.6+.

How can I do that?

5 answers

#1

114

Using a regex

Using a regex, you can remove everything between < and >:

import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext
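
To see what this does with the text from the question, here is a minimal sketch. The names TAG_PATTERN and clean_and_collapse are hypothetical, and the whitespace-collapsing step is an addition (not part of the answer above) so that the result resembles the single-line output the question asks for. It assumes Python 3.

```python
import re

# Hypothetical name; the pattern is the same non-greedy one used above.
TAG_PATTERN = re.compile(r'<.*?>')

def clean_and_collapse(raw_html):
    # Remove anything that looks like a tag.
    no_tags = TAG_PATTERN.sub('', raw_html)
    # Collapse newlines and repeated spaces into single spaces.
    return ' '.join(no_tags.split())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

print(clean_and_collapse(text))  # Title A long text........ a link
```

Because `.` does not match a newline by default, a tag that spans several lines is left in place unless the pattern is compiled with re.DOTALL.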

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers; it is much more robust than the default one, 'html.parser', which is available without any additional install.

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But this still relies on external libraries, so I recommend the first solution.
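
The 'html.parser' module named above ships with Python, so it can also be used directly, without BeautifulSoup. The following is only a sketch under assumptions: it targets Python 3 (in Python 2 the class lives in the HTMLParser module instead), and TextExtractor and strip_tags are hypothetical names, not part of any answer here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        self.chunks.append(data)

def strip_tags(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

print(strip_tags('<p>Hello <b>world</b> &amp; more</p>'))  # Hello world & more
```

Unlike a regex, this goes through a real tokenizer, at the cost of a little more code.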

#2

Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
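
One caveat worth illustrating: xml.etree.ElementTree is an XML parser, so it only accepts input that is well-formed XML. The sketch below assumes Python 3 (where the error type is ElementTree.ParseError) and uses made-up sample strings; the alias ET is just a convenience.

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # Parse the string as XML and join every text node in document order.
    return ''.join(ET.fromstring(text).itertext())

well_formed = '<div><h1>Title</h1><p>A long text</p></div>'
print(remove_tags(well_formed))  # TitleA long text

try:
    # <br> is never closed, which is fine in HTML but not in XML.
    remove_tags('<p>line one<br>line two</p>')
except ET.ParseError as exc:
    print('not well-formed XML:', exc)
```

The snippet in the question happens to be well-formed, so it parses; typical real-world HTML often is not.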

#3

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
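
The failure case mentioned above can be reproduced with a short sketch (the sample strings are made up for illustration; Python 3 is assumed for print):

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Ordinary markup is stripped as expected.
print(remove_tags('<p>Hello <b>world</b></p>'))  # Hello world

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the output.
print(remove_tags('<a title=">">link</a>'))  # ">link
```

The pattern stops at the first '>' it sees, which here is the one inside the quoted attribute value.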

However, as lvc mentions, xml.etree is available in the Python Standard Library, so you could probably just adapt it to work like your existing lxml version:

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

#4

There's a simple way to do this that works in any C-like language. The style is not Pythonic, but it works in pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
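
A short usage sketch, with made-up inputs, shows how the two flags interact. The function is repeated so the example runs on its own; Python 3 is assumed for print.

```python
def remove_html_markup(s):
    tag = False    # True while we are between '<' and '>'
    quote = False  # True while we are inside a quoted attribute value
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out

# A '>' inside a quoted attribute no longer ends the tag.
print(remove_html_markup('<a title=">">link</a>'))   # link

# Quotes outside of tags are ordinary text and are kept.
print(remove_html_markup('<b>"quoted" text</b>'))    # "quoted" text

# Limitation: one flag serves both quote characters, so an apostrophe
# inside a double-quoted value confuses the state and the text is lost.
print(repr(remove_html_markup('<a title="it\'s">x</a>')))  # ''
```

Tracking which quote character opened the value, instead of a single boolean, would be one way to address the last case.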

The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

#5

-5

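temp = ''   # accumulates the extracted text across recursive calls
s = ' '     # separator inserted between text fragments

def remove_strings(text):
    global temp
    if text == '':
        return temp
    start = text.find('<')
    end = text.find('>')
    if start == -1 and end == -1:
        # No tags left: keep the remaining text as-is.
        temp = temp + text
        return temp
    # Skip past the current tag. Note that any text before the first '<'
    # is dropped, and temp is not reset between top-level calls.
    newstring = text[end+1:]
    fresh_start = newstring.find('<')
    if newstring[:fresh_start] != '':
        temp += s + newstring[:fresh_start]
    remove_strings(newstring[fresh_start:])
    return temp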
