Remove all HTML tags from a string

Time: 2022-08-27 16:20:14

In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:

<html><head></head><body><p>My text.</p></body></html>

I could attempt to solve the problem by doing the following:

REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')

However, this format is not strictly followed: some of the entries break W3C standards and, for example, omit the <head> tags. Even worse, closing tags may be missing. So I would need a REPLACE call for every opening and closing tag that could appear.

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    Table.HtmlData,
    '<html>', ''),
    '</html>', ''),
    '<head>', ''),
    '</head>', ''),
    '<body>', ''),
    '</body>', ''),
    '<p>', ''),
    '</p>', '')

I was wondering if there was a better way to accomplish this than using multiple nested REPLACE functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).

7 answers

#1


7  

DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'

SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)

Update - For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'

SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r
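
A note on why the update works: the two REPLACE calls turn every opening and closing tag into a self-closing empty element, so nothing is left unbalanced when the string is cast to XML. A minimal sketch of the intermediate string, using a shortened sample value chosen here purely for illustration:

```sql
-- Shortened sample (an assumption for illustration, not taken from the question's data)
DECLARE @x NVARCHAR(MAX) = N'<p>My text.<br>More text.</p>';

-- Step 1: '>'  becomes '/>'  ->  <p/>My text.<br/>More text.</p/>
-- Step 2: '</' becomes '<'   ->  <p/>My text.<br/>More text.<p/>
SELECT REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS Rewritten;
```

The same rewriting shows the limits of the trick: a tag that is already self-closing, such as <br/>, becomes <br//>, which is not valid XML, and a literal > in the text picks up an extra slash.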

#2


5  

If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).

Here's an example to output the text from all tags:

declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
   <p>My text.</p>
   <p>My other text.</p>
</body>
</html>';

select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');

select cast(@htmlData as XML).value('.', 'nvarchar(max)');

Note that the whitespace in the output differs between the two: CONVERT with style 1 preserves insignificant whitespace, whereas CAST discards it.

To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)

For example:

select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');

select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');

Result : My text.

Of course, this still assumes valid XML.
If, for example, a closing tag is missing, this raises an XML parsing error.
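
To keep a single malformed row from failing the whole query, one option is TRY_CAST, which returns NULL instead of raising an error when the conversion fails. A minimal sketch, assuming SQL Server 2012 or later; the table variable @samples and its rows are made up for illustration:

```sql
-- Hypothetical sample rows: one well-formed, one with an unclosed <br>
DECLARE @samples TABLE (HtmlData NVARCHAR(MAX));
INSERT INTO @samples VALUES
 (N'<html><head></head><body><p>My text.</p></body></html>'),
 (N'<html><body><p>My text.<br>More text.</p></body></html>');

-- TRY_CAST yields NULL for rows that are not well-formed XML,
-- so PlainText is expected to be NULL for the second row
SELECT s.HtmlData,
       x.doc.value('.', 'NVARCHAR(MAX)') AS PlainText
FROM @samples s
CROSS APPLY (SELECT TRY_CAST(s.HtmlData AS XML) AS doc) x;
```

Rows that come back as NULL can then be handled by one of the string-based fallbacks.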

If the HTML isn't well-formed XML, one could use PATINDEX and SUBSTRING to extract the first p element, and then cast that to an XML type to get the value.

select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');

or via a funky recursive way:

declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
  SELECT 1, 1, CHARINDEX(char(13), @htmlData)
  UNION ALL
  SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
  WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;

select 
@xmlData as xmlData, 
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;

#3


2  

First, create a user-defined function that strips out the HTML, like so:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
     BEGIN
         DECLARE @Start INT;
         DECLARE @End INT;
         DECLARE @Length INT;
         SET @Start = CHARINDEX('<', @HTMLText);
         SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
         SET @Length = (@End - @Start) + 1;
         WHILE @Start > 0
               AND @End > 0
               AND @Length > 0
             BEGIN
                 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
                 SET @Start = CHARINDEX('<', @HTMLText);
                 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
                 SET @Length = (@End - @Start) + 1;
             END;
         RETURN LTRIM(RTRIM(@HTMLText));
     END;
GO

Then call it in your SELECT:

SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE

This lets you avoid several nested REPLACE calls.

Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

#4


1  

One more solution, just to demonstrate a trick: replacing many values taken from a table (easy to maintain!) in a single statement:

--add any replace templates here:

CREATE TABLE ReplaceTags (HTML VARCHAR(100));
INSERT INTO ReplaceTags VALUES
 ('<html>'),('<head>'),('<body>'),('<p>'),('<br>')
,('</html>'),('</head>'),('</body>'),('</p>'),('</br>');
GO

--This function will perform the "trick"

CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    SELECT @Content=REPLACE(@Content,HTML,'')
    FROM ReplaceTags;

    RETURN @Content;
END
GO

--All examples I found in your question and in comments

DECLARE @content TABLE(Content VARCHAR(MAX));
INSERT INTO @content VALUES
 ('<html><head></head><body><p>My text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.</p></html>');

--this is the actual query

SELECT dbo.DoReplace(Content) FROM @content;
GO

--Clean-Up

DROP FUNCTION dbo.DoReplace;
DROP TABLE ReplaceTags;

UPDATE

If you add a replacement value to the template table, you can even use a different replacement per tag, such as replacing a <br> with an actual line break.
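
A minimal sketch of that idea, assuming a hypothetical Replacement column and using a table variable in place of the permanent ReplaceTags table. It relies on the same multi-row variable assignment as DoReplace, a pattern whose row order I would not treat as guaranteed:

```sql
-- Hypothetical variant of ReplaceTags: each tag carries its own replacement
DECLARE @ReplaceTags TABLE (HTML VARCHAR(100), Replacement VARCHAR(100));
INSERT INTO @ReplaceTags VALUES
 ('<html>', ''), ('</html>', ''),
 ('<head>', ''), ('</head>', ''),
 ('<body>', ''), ('</body>', ''),
 ('<p>', ''),    ('</p>', ''),
 ('<br>', CHAR(13) + CHAR(10));  -- <br> becomes a CR/LF line break

DECLARE @Content VARCHAR(MAX) =
    '<html><head></head><body><p>My text.<br>More text.</p></body></html>';

-- One REPLACE is applied per row of the template table
SELECT @Content = REPLACE(@Content, HTML, Replacement)
FROM @ReplaceTags;

PRINT @Content;
```

Because none of these patterns overlaps another, the result here should not depend on the order in which the rows are processed.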

#5


0  

This is just an example. You can use this in a script to remove any HTML tags:

 DECLARE @VALUE VARCHAR(MAX),@start INT,@end int,@remove varchar(max)
SET @VALUE='<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>sql - Converting INT to DATE then using GETDATE on conversion? - Stack Overflow</title>
<html>
</html>
'

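set @start=charindex('<',@value)
while @start>0
begin
-- look for the closing '>' after the current '<'
set @end=charindex('>',@VALUE,@start)
if @end=0 break -- unclosed '<': stop rather than loop forever

-- SUBSTRING takes a length, not an end position
set @remove=substring(@VALUE,@start,@end-@start+1)
set @value=replace(@value,@remove,'')
set @start=charindex('<',@value)
end
print @value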

#6


0  

This is the simplest way.

DECLARE @str VARCHAR(299)

SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'

SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')

#7


0  

You mention the XML is not always valid, but does it always contain the <p> and </p> tags?

In that case the following would work:

SUBSTRING(Table.HtmlData, 
    CHARINDEX('<p>', Table.HtmlData) + 3, 
    CHARINDEX('</p>', Table.HtmlData) - CHARINDEX('<p>', Table.HtmlData) - 3)

For finding all positions of a <p> within an HTML string, there's already a good post here: https://dba.stackexchange.com/questions/41961/how-to-find-all-positions-of-a-string-within-another-string

Alternatively, I suggest using Visual Basic, since you mentioned that it is also an option.
