Parsing XML with cheerio in Node.js returns empty CDATA

Date: 2022-12-01 13:20:18

I am using cheerio in Node.js to parse some RSS feeds. I grab all the items and put them into an array. I am using 3 test feeds, and in all of them each "item" element has a "description" child element. In one of the feeds the whole "description" is wrapped in CDATA, and I can't get its value. Here is an abbreviated code snippet:

//Open the xml document with cheerio
var $ = cheerio.load(arrXmlDocs[i], { ignoreWhitespace: true, xmlMode: true });

//Loop through every item
$('item').each(function(i, xmlItem){

    //array to hold each item being converted into an array
    var tempArray = [];

    //Loop through each child of <item>
    $(xmlItem).children().each(function(j, child){
        //Use the child's element name as the key and its text as the value
        tempArray[child.name] = $(child).text();
    });

});

As expected, the two RSS feeds that don't have CDATA give me an array like this:

[
    [
        name: 'name of episode',
        description:'description of episode',
        pubdate: 'published date'
    ],
    [
        name: 'name of episode',
        description:'description of episode',
        pubdate: 'published date'
    ]
]

The feed with the CDATA description, however, is missing the description key and looks like this:

    [
        name: 'name of episode',
        pubdate: 'published date'
    ],

So my question is: why is cheerio not returning values wrapped in CDATA, and how can I make it return those values?

1 Solution

#1


6  

This is a known issue (related) with cheerio: in your case it cannot yet build a correct tree from XML that contains CDATA. I know this is a disappointing answer; the fix is a work in progress.

It is being worked on. In the meantime, you can remove the CDATA wrappers with a regular expression before loading the document:

arrXmlDocs[i] = arrXmlDocs[i].replace(/<!\[CDATA\[([\s\S]*?)\]\]>(?=\s*<)/gi, "$1");
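
To see what the expression does, here is a minimal sketch that applies it to a made-up feed string. The names `stripCdata` and `sampleFeed` are hypothetical and not part of cheerio; only the regular expression comes from this answer.

```javascript
// Hypothetical helper: unwraps CDATA sections so that their text
// becomes ordinary element content. Returns a new string.
function stripCdata(xml) {
    return xml.replace(/<!\[CDATA\[([\s\S]*?)\]\]>(?=\s*<)/gi, "$1");
}

// Made-up sample feed with one CDATA-wrapped description
var sampleFeed =
    "<rss><channel><item>" +
    "<name>name of episode</name>" +
    "<description><![CDATA[description of episode]]></description>" +
    "</item></channel></rss>";

var cleaned = stripCdata(sampleFeed);
console.log(cleaned);
```

Since `replace` returns a new string rather than changing the original, the cleaned result is what should be passed to `cheerio.load`. One caveat, assuming the feed uses CDATA to protect markup: if the wrapped text contains a raw `<` or `&`, the unwrapped document may no longer be well-formed XML.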

Here is a link to an example jsfiddle.

While this is not an ideal solution, it should suffice until they work this issue out.
