Getting the page title from a scraped web page

Time: 2022-12-04 22:57:47
var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
http.get(urlOpts, function (response) {
response.on('data', function (chunk) {
var str=chunk.toString();
var re = new RegExp("(<\s*title[^>]*>(.+?)<\s*/\s*title)\>", "g")
console.log(str.match(re));
});

});

Output


user@dev ~ $ node app.js
[ 'node.js' ]
null
null


I only need to get the title.


2 Answers

#1


7  

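I would suggest using RegExp.prototype.exec instead of String.prototype.match: when the regular expression has the g flag, match returns only the full matches and drops the capture groups, whereas exec returns the groups as well. You can also define the regular expression with the literal syntax, and define it only once: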


var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
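// No g flag: a global regex keeps lastIndex between exec calls, which would carry over from one chunk to the next.
var re = /(<\s*title[^>]*>(.+?)<\s*\/\s*title)>/i;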
http.get(urlOpts, function (response) {
    response.on('data', function (chunk) {
        var str=chunk.toString();
        var match = re.exec(str);
        if (match && match[2]) {
          console.log(match[2]);
        }
    });    
});

The code also assumes that the title will be completely in one chunk, and not split between two chunks. It would probably be best to keep an aggregation of chunks, in case the title is split between chunks. You may also want to stop looking for the title once you've found it.
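A minimal sketch of that suggestion, assuming the page is text and the title may arrive split across chunks. The helper names createTitleFinder and fetchTitle are hypothetical, not part of Node's http module. The pattern here uses [\s\S]*? instead of .+? so that a title spanning line breaks still matches, which is a deliberate change from the regex above.

```javascript
var http = require('http');

// Matches the first <title>...</title>; [\s\S] lets the title span line breaks.
// No g flag, so exec always searches from the start of the string.
var TITLE_RE = /<\s*title[^>]*>([\s\S]*?)<\s*\/\s*title\s*>/i;

// Hypothetical helper: returns a function that takes one chunk at a time and
// returns the title once the accumulated text contains it, otherwise null.
function createTitleFinder() {
    var buffered = '';
    return function (chunk) {
        buffered += chunk.toString();
        var match = TITLE_RE.exec(buffered);
        return match ? match[1].trim() : null;
    };
}

// Hypothetical helper: fetches a page and calls callback(err, title) once.
// title is null when the response ends without a <title> element.
function fetchTitle(urlOpts, callback) {
    var done = false;
    function finish(err, title) {
        if (done) { return; }
        done = true;
        callback(err, title);
    }
    var request = http.get(urlOpts, function (response) {
        var findTitle = createTitleFinder();
        // Decode as UTF-8 text so multi-byte characters are not cut at chunk edges.
        response.setEncoding('utf8');
        response.on('data', function (chunk) {
            var title = findTitle(chunk);
            if (title !== null) {
                finish(null, title);
                request.destroy(); // stop downloading; the title is all we need
            }
        });
        response.on('end', function () {
            finish(null, null);
        });
    });
    request.on('error', function (err) { finish(err); });
}
```

It could be called as fetchTitle(urlOpts, function (err, title) { console.log(err || title); }) with the urlOpts object from the question. Re-running the regex over the whole buffer on every chunk is wasteful for large pages, but the title normally sits near the top of the document, so the search usually ends after the first chunk or two.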


#2


2  

Try this:


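var re = new RegExp("<title>(.*?)</title>", "i");
var match = str.match(re);
// match is null when this chunk contains no <title>, so guard before indexing
if (match) { console.log(match[1]); }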
