JavaScript regular expression to extract anchor text and URL from anchor tags

Time: 2022-01-14 21:15:31

I have a paragraph of text in a javascript variable called 'input_content' and that text contains multiple anchor tags/links. I would like to match all of the anchor tags and extract anchor text and URL, and put it into an array like (or similar to) this:

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!

6 answers

#1


41  

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4))
});

This assumes that your anchors will always be in the form <a href="...">...</a> i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.
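One possible way to relax that assumption is sketched below. It allows other attributes before or after href and accepts single or double quotes. The function name extractAnchors and the pattern itself are illustrative assumptions, not part of the original answer, and this is still a regex rather than a real HTML parser, so unusual markup (an attribute such as data-href, a ">" inside an attribute value, nested tags) can still trip it up.

```javascript
// Sketch: tolerate extra attributes and either quote style around href.
// extractAnchors and its pattern are illustrative, not from the original answer.
function extractAnchors(html) {
    var pattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a>/gi;
    var matches = [];
    var m;

    // With the g flag, exec() resumes after the previous match on each call.
    while ((m = pattern.exec(html)) !== null) {
        // m[0] = whole anchor, m[1] = quote character, m[2] = href, m[3] = text
        matches.push([m[0], m[2], m[3]]);
    }
    return matches;
}
```

Each entry has the same shape as in the answer above: the whole anchor, the href, then the text inside.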

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function will receive three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:

  • arguments[1] is the entire anchor
  • arguments[2] is the href part
  • arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The built-in arguments variable is not a true JavaScript Array, so we have to borrow the Array slice method and apply it to arguments to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.
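To see the index range on its own, here is a small sketch separate from the regex. The function names middleThree and middleThreeRest are made up for illustration. The second function shows the rest-parameter form, which gives a real array in engines that support ES2015 and so needs no borrowing.

```javascript
// Sketch: copy a sub-range of `arguments` into a real array.
function middleThree() {
    // arguments is array-like (indexed, has length) but has no Array methods,
    // so slice is borrowed from Array.prototype.
    return Array.prototype.slice.call(arguments, 1, 4);
}

// Same idea with a rest parameter (ES2015+): args is already a true Array.
function middleThreeRest(...args) {
    return args.slice(1, 4);
}

var picked = middleThree("a", "b", "c", "d", "e");
// picked holds the items at indexes 1, 2 and 3; index 4 ("e") is excluded.
```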

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google

#2


9  

Since you're presumably running the javascript in a web browser, regex seems like a bad idea for this. If the paragraph came from the page in the first place, get a handle for the container, call .getElementsByTagName() to get the anchors, and then extract the values you want that way.

If that's not possible, then create a new HTML element object, assign your text to its .innerHTML property, and then call .getElementsByTagName().

#3


6  

I think Joel has the right of it — regexes are notorious for playing poorly with markup, as there are simply too many possibilities to consider. Are there other attributes to the anchor tags? What order are they in? Is the separating whitespace always a single space? Seeing as you already have a browser's HTML parser available, best to put that to work instead.

function getLinks(html) {
    var container = document.createElement("p");
    container.innerHTML = html;

    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text = anchors[i].textContent;

        if (text === undefined) text = anchors[i].innerText;

        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }

    return list;
}

This will return an array like the one you describe regardless of how the links are stored. Note that you could change the function to work with a passed element instead of text by changing the parameter name to "container" and removing the first two lines. The textContent/innerText property gets the text displayed for the link, stripped of any markup (bold/italic/font/…). You could replace .textContent with .innerHTML and remove the inner if() statement if you want to preserve the markup.
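The passed-element variant mentioned above might look roughly like this. The name getLinksFromElement is hypothetical. Keep in mind that the .href property returns the resolved absolute URL; use getAttribute("href") instead if you need the value exactly as written in the markup.

```javascript
// Sketch of the variant described above: take a container element directly
// instead of an HTML string. getLinksFromElement is a hypothetical name.
function getLinksFromElement(container) {
    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var anchor = anchors[i];
        var text = anchor.textContent;

        // Fallback for old browsers that only expose innerText.
        if (text === undefined) text = anchor.innerText;

        list.push(['<a href="' + anchor.href + '">' + text + '</a>', anchor.href, text]);
    }

    return list;
}
```

In a page you would call it with any element that wraps the links, for example the result of document.getElementById() with whatever id your container has.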

#4


2  

I think jQuery would be your best bet. This isn't the best script and I'm sure others can give something better. But this creates an array of exactly what you're looking for.

<script type="text/javascript">
    // From http://brandonaaron.net Thanks!
    jQuery.fn.outerHTML = function() {
        return $('<div>').append( this.eq(0).clone() ).html();
    };    

    var items = new Array();
    var i = 0;

    $(document).ready(function(){
        $("a").each(function(){
            items[i] = {el:$(this).outerHTML(),href:this.href,text:this.text};
            i++;      
        });
    });

    function showItems(){
        alert(items);
    }

</script>

#5


1  

To extract the url:

var pattern = /.*href="(.*)".*/; var url = string.replace(pattern, '$1');

Demo:

//var string = '<a id="btn" target="_blank" class="button" href="https://yourdomainame.com:4089?param=751&amp;2ndparam=2345">Buy Now</a>';
//uncomment the above as an example of link.outerHTML

var string = link.outerHTML
var pattern = /.*href="(.*)".*/;
var href = string.replace(pattern,'$1');
alert(href)

For "anchor text", why not use: link.innerHTML (property names are case-sensitive, so link.innerHtml would be undefined).
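One caveat with the greedy (.*) in the pattern above: it runs to the last double quote on the line, so it only yields a clean URL when href is the final attribute. A rough sketch of a stricter version, working on the outerHTML string, follows. The helper name parseAnchorString and the example.com URL are placeholders of mine, not from the answer.

```javascript
// Sketch: pull href and inner markup out of an anchor's outerHTML string.
// parseAnchorString is a hypothetical helper name.
function parseAnchorString(outer) {
    // [^"]* stops at the first closing quote instead of the last one.
    var hrefMatch = /\bhref="([^"]*)"/i.exec(outer);
    var textMatch = /<a\b[^>]*>([\s\S]*?)<\/a>/i.exec(outer);

    return {
        href: hrefMatch ? hrefMatch[1] : null,
        text: textMatch ? textMatch[1] : null
    };
}

// href is the FIRST attribute here, which is where the greedy pattern overshoots.
var sample = '<a href="https://example.com/buy" id="btn" class="button">Buy Now</a>';
var greedy = sample.replace(/.*href="(.*)".*/, '$1'); // also swallows id and class
var parsed = parseAnchorString(sample);               // href stops at its own quote
```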

#6


1  

For the benefit of searchers: I created something that will work with additional attributes in the anchor tag. For those not familiar with Regex, the dollar ($1 etc) values are the regex group matches.

var text = 'This is my <a target="_blank" href="www.google.co.uk">link</a> Text';
var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;
var output = text.replace(urlPattern, "$1___$2___$3___$4___$5___$6");
alert(output);

See working jsFiddle and regex101.

Alternatively, you can get info out of the groups like this:

var returnText = text.replace(urlPattern, function(fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor){
                    return "The bits you want e.g. linkText";
                });
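As a concrete use of that callback form, the sketch below collects the pieces into the array shape the question asked for, rather than building a replacement string. The links array is my own addition; the pattern and parameter names are taken from the snippet above.

```javascript
var text = 'This is my <a target="_blank" href="www.google.co.uk">link</a> Text';
var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;
var links = [];

text.replace(urlPattern, function (fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor) {
    // Opening tag + text + closing tag rebuilds the whole anchor.
    links.push([anchorContent + linkText + endAnchor, lnkUrl, linkText]);
    return fullText; // leave the text as it was; only the side effect matters
});

// links[0] is [whole anchor, URL, anchor text] for the single link in text.
```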
