NASA RSS feed gives a SAX parsing error

Time: 2022-06-01 01:42:57

I am trying to write a Java program that reads the NASA RSS feed. The code works, but when it encounters the 's symbol (encoded as &#039;s in the feed), it does not read the entire line. For example: "A new NASA study finds the last remaining section of Antarctica&#039;s Larsen B Ice Shelf, which partially collapsed in 2002, is quickly weakening and likely to disintegrate completely before the end of the decade". In this line, the code reads nothing after "Antarctica". What is the problem with the code, and how can I fix it? Without the &#039;s symbol the code works fine. The link to the feed: "http://www.nasa.gov/rss/dyn/earth.rss"


package xmlparseprac;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {
    // Flags tracking which element the parser is currently inside.
    boolean mtitle = false;
    boolean mdescription = false;
    boolean mitem;

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        System.out.println("Starting...");
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
        System.out.println("Ending...");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        super.startElement(uri, localName, qName, attributes);
        if (qName.equalsIgnoreCase("item")) { mitem = true; }
        if (qName.equalsIgnoreCase("title")) { mtitle = true; }
        if (qName.equalsIgnoreCase("description")) { mdescription = true; }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        if (qName.equalsIgnoreCase("item")) { mitem = false; }
        if (qName.equalsIgnoreCase("title")) { mtitle = false; }
        if (qName.equalsIgnoreCase("description")) { mdescription = false; }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        super.characters(ch, start, length);
        // Prints the text passed to this call, then clears the flag,
        // so any later call for the same element prints nothing.
        if (mtitle && mitem) {
            String s = new String(ch, start, length);
            System.out.println("Title:" + s);
            mtitle = false;
        }
        if (mdescription && mitem) {
            String s = new String(ch, start, length);
            System.out.println("Description:" + s);
            mdescription = false;
        }
    }
}

1 Answer

#1



I finally found the answer to my question.


Link: "http://www.javaexperience.com/strip-invalid-characters-from-xml/"
Link: "https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html"

The Apache Commons Lang library has a class, StringEscapeUtils, with a method called unescapeHtml4. It replaces HTML-encoded entities such as &#039; with ' and other equivalent characters. Just convert the URL input stream to a string, apply unescapeHtml4 to that string, create a new input stream from the result, and call the parse function with that input stream as the parameter. Thanks @duffymo for informing me about the "magic characters".
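As a possible alternative to the pre-processing described above (a sketch, not the method used in this answer): the SAX API allows a parser to deliver the text of a single element through several characters() calls, and parsers commonly start a new chunk at a character reference such as &#039;. A handler that prints on the first call and then clears its flag would lose the rest of the text in that case. The sketch below buffers the text in a StringBuilder and emits it in endElement. The class name BufferingHandler, its parse helper, and the sample XML are hypothetical and not part of the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the Handler class from the question.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder(); // text of the current element
    private final List<String> lines = new ArrayList<>();   // collected output lines
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equalsIgnoreCase("item")) {
            inItem = true;
        }
        text.setLength(0); // a new element starts: discard anything buffered so far
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so only append here.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inItem && qName.equalsIgnoreCase("title")) {
            lines.add("Title:" + text.toString().trim());
        } else if (inItem && qName.equalsIgnoreCase("description")) {
            lines.add("Description:" + text.toString().trim());
        } else if (qName.equalsIgnoreCase("item")) {
            inItem = false;
        }
        text.setLength(0);
    }

    List<String> getLines() {
        return lines;
    }

    // Parses an XML string and returns the collected title/description lines.
    static List<String> parse(String xml) throws Exception {
        BufferingHandler handler = new BufferingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getLines();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document standing in for the real feed.
        String sample = "<rss><channel><title>Feed</title><item>"
                + "<title>Antarctica&#039;s Larsen B Ice Shelf</title>"
                + "<description>Weakening &amp; likely to disintegrate</description>"
                + "</item></channel></rss>";
        parse(sample).forEach(System.out::println);
    }
}
```

With this design the XML parser resolves the character references itself, so the feed does not need to be rewritten as a string before parsing.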

