What is the best way to search HTML in a C# string for specific text and mark that text?

Time: 2022-09-13 07:56:34

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?

Thanks,

Jeff

6 solutions

#1


I like using Html Agility Pack. It is very easy to use, and although it hasn't had many updates lately, it is still usable. For example, grabbing all the links:

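// Requires the HtmlAgilityPack package.
using System;
using HtmlAgilityPack;

HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// SelectNodes returns null when nothing matches, so check before iterating.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

if (nodes != null)
{
    foreach (HtmlNode link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}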

#2


Regular expressions would be my way. ;)

#3


If the HTML you're using is XHTML-compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?

An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.

Regular expressions would do it, but things could get complicated once you try stripping out tags, image names, etc. to remove false positives.
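
As a rough sketch of the XML route, assuming the input is already well-formed XHTML with a single root element, the text nodes can be walked with LINQ to XML and each match wrapped in an element. The class name XhtmlHighlighter, the method Highlight, and the choice of <strong> as the wrapper are illustrative assumptions, not something from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: names are illustrative only.
public static class XhtmlHighlighter
{
    // Wraps each case-insensitive occurrence of `term` inside a text node
    // with a <strong> element. Assumes `xhtml` parses as well-formed XML.
    public static string Highlight(string xhtml, string term)
    {
        if (string.IsNullOrEmpty(term)) return xhtml;
        XDocument doc = XDocument.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Snapshot the text nodes first, because the tree is modified below.
        List<XText> textNodes = doc.DescendantNodes().OfType<XText>().ToList();

        foreach (XText node in textNodes)
        {
            string text = node.Value;
            int start = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            if (start < 0) continue;

            // Reuse the parent's namespace so XHTML documents stay consistent.
            XNamespace ns = node.Parent?.Name.Namespace ?? XNamespace.None;
            var pieces = new List<object>();
            int pos = 0;
            while (start >= 0)
            {
                if (start > pos) pieces.Add(new XText(text.Substring(pos, start - pos)));
                pieces.Add(new XElement(ns + "strong", text.Substring(start, term.Length)));
                pos = start + term.Length;
                start = text.IndexOf(term, pos, StringComparison.OrdinalIgnoreCase);
            }
            if (pos < text.Length) pieces.Add(new XText(text.Substring(pos)));

            // Swap the original text node for the text/element sequence.
            node.ReplaceWith(pieces.ToArray());
        }
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because only text nodes are visited, attribute values and tag names are never touched. A word that is split across elements would still be missed by this sketch.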

#4


In simple cases, regular expressions will do.

string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");

will yield: "tttttt<strong>go</strong>ttttttt"

But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:

<span class="firstLetter">B</span>ook

To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
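
One possible middle ground, sketched here under the assumption that tags never contain a literal '>' inside attribute values, is to let the regular expression consume whole tags so that only text between them is ever wrapped. StripTags shows the "remove all tags" simplification mentioned above. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class TagAwareHighlighter
{
    // Removes anything that looks like a tag, leaving only the text content.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // Wraps `term` in <strong>, but only where it occurs outside of tags.
    public static string HighlightOutsideTags(string html, string term)
    {
        if (string.IsNullOrEmpty(term)) return html;

        // Alternation: group 1 consumes a whole tag; otherwise the term matches.
        string pattern = "(<[^>]*>)|" + Regex.Escape(term);
        return Regex.Replace(
            html,
            pattern,
            m => m.Groups[1].Success ? m.Value : "<strong>" + m.Value + "</strong>",
            RegexOptions.IgnoreCase);
    }
}
```

With this sketch, the "go" in an href attribute is left alone while the "go" in the link text is wrapped. It does not solve the split-word case: StripTags turns the example above into "Book", but HighlightOutsideTags cannot find "Book" in the original markup because the span tag sits in the middle of the word.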

#5


You could look at using Html DOM, an open source project on SourceForge.net. This way you could programmatically manipulate your text instead of relying on regular expressions.

#6


To search for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
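
A minimal sketch of that idea, treating the input as plain text: find the match positions with Regex.Matches, then insert the wrapper from the last match backwards so the earlier positions stay valid. The class name SubstringWrapper, the method Wrap, and its prefix/suffix parameters are assumptions made for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class SubstringWrapper
{
    // Wraps every case-insensitive occurrence of `phrase` with prefix and suffix.
    public static string Wrap(string text, string phrase, string prefix, string suffix)
    {
        if (string.IsNullOrEmpty(phrase)) return text;

        // Escape the phrase so it is matched literally, not as a pattern.
        MatchCollection matches = Regex.Matches(text, Regex.Escape(phrase), RegexOptions.IgnoreCase);
        var result = new StringBuilder(text);

        // Work backwards so that insertions do not shift positions still to be used.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            result.Insert(m.Index + m.Length, suffix);
            result.Insert(m.Index, prefix);
        }
        return result.ToString();
    }
}
```

For example, Wrap("Go where you want to go.", "go", "<mark>", "</mark>") wraps both "Go" and "go". The sketch knows nothing about tags, so on real HTML it would also wrap matches inside tag names and attribute values.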
