Removing invalid characters from a String when parsing XML in Java

Time: 2022-10-29 20:20:11

I've been googling around and reading on SO, but nothing has worked. I have a problem with characters in an XML feed. I save the value of each tag in a String, but when &#13; occurs, it just stops. I only get the first 4-5 words in the tag or so.

So can anyone please help me with a method that can remove it? Or could it be that the text in the tags of the XML feed is too long for a String?

Thanks!

Sample code:

public void characters(char[] ch, int start, int length)
        throws SAXException {

    if (currentElement) {
        currentValue = new String(ch, start, length);
        currentElement = false;
    }

}

public void endElement(String uri, String localName, String qName)
        throws SAXException {

    currentElement = false;

    /** set value */ 
    if (localName.equalsIgnoreCase("title"))
        sitesList.setTitle(currentValue);
    else if (localName.equalsIgnoreCase("id"))
        sitesList.setId(currentValue);
    else if(localName.equalsIgnoreCase("description"))
        sitesList.setDescription(currentValue);
}

The text in the description tag is quite long, but I only get the first five words before the &#13; characters start coming.

1 solution

#1


1  

You're using a SAX parser to parse the XML string.

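The characters() method can be called multiple times while the parser reads a single XML element. This happens, for example, when it encounters a character reference, as in <desc>blabla bla &#39; bla bla la.</desc>. Each call delivers only one chunk of the element's text, so code that keeps just one chunk ends up with a truncated value.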
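
To see the chunking directly, here is a minimal sketch that records every chunk the parser hands to characters(). The class name ChunkDemo and the helper collectChunks are made up for this illustration. How many chunks you get depends on the parser implementation, so the sketch prints the count rather than assuming it; only the joined text is guaranteed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkDemo {

    /** Returns every chunk of text that the parser passed to characters(). */
    static List<String> collectChunks(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Record each call separately so the splitting becomes visible
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = collectChunks("<desc>blabla bla &#39; bla bla la.</desc>");
        // The number of calls is parser-dependent; the joined text is not
        System.out.println(chunks.size() + " call(s): " + chunks);
        System.out.println("Joined: " + String.join("", chunks));
    }
}
```

If the count printed is greater than one, a handler that keeps only the first chunk would lose everything after it, which matches the truncation described in the question.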

The solution is to use a StringBuilder: append the characters you receive in the characters() method, and reset the StringBuilder in the endElement() method once the value has been used:

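private class Handler extends DefaultHandler {

    // Accumulates the text of the current element across characters() calls
    private final StringBuilder temp_val;

    public Handler() {
        this.temp_val = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of assigning
        temp_val.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is complete, so temp_val now holds its full text
        System.out.println("Output: " + temp_val.toString());
        // ... Do your stuff
        temp_val.setLength(0); // Reset the StringBuilder
    }

}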

The above code works for me, given this XML file:

<?xml version="1.0" encoding="iso-8859-1" ?>
<test>This is some &#13; example-text.</test>

The output is:

Output: This is some
example-text.
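
Applied to the handler from the question, the same idea might look like the sketch below. It is an illustration under assumptions, not the asker's actual code: the class FeedDemo, the parse helper, the sample feed, and the Map standing in for sitesList are all hypothetical. The factory is made namespace-aware because a default JDK SAXParserFactory is not, and localName can then come back empty.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FeedDemo {

    /** Parses the feed and returns the full text of title, id and description. */
    static Map<String, String> parse(String xml) throws Exception {
        // Hypothetical stand-in for the sitesList object from the question
        Map<String, String> values = new LinkedHashMap<>();
        StringBuilder buffer = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length); // accumulate, never overwrite
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (localName.equalsIgnoreCase("title")
                        || localName.equalsIgnoreCase("id")
                        || localName.equalsIgnoreCase("description")) {
                    values.put(localName.toLowerCase(Locale.ROOT), buffer.toString());
                }
                buffer.setLength(0); // reset for the next element
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed containing the &#13; reference from the question
        String xml = "<item><id>42</id><title>Hello</title>"
                + "<description>First five words here, then &#13; the rest.</description></item>";
        System.out.println(parse(xml));
    }
}
```

With this approach the description keeps its full text, including the carriage return that &#13; stands for. If that character itself is unwanted, it can be stripped afterwards with value.replace("\r", "").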
