Problem reading an XML file with UTF-8 encoding in RStudio

Time: 2023-01-05 14:57:42

I want to read a .xml file which looks like:

<?xml version="1.0" encoding="UTF-8"?>
<province name="北京市" id="11">
  <city name="市辖区" id="110100000000">
    <county name="东城区" id="110101000000">
      <town name="珍珠泉乡" id="110229214000">
        <village name="珍珠泉乡社区居委会" id="110229214001" type="220"/>
        <village name="珍珠泉村委会" id="110229214200" type="210"/>
        <village name="称沟湾村委会" id="110229214201" type="220"/>
        <village name="庙梁村委会" id="110229214202" type="220"/>
        <village name="下水沟村委会" id="110229214203" type="220"/>
        <village name="上水沟村委会" id="110229214204" type="220"/>
        <village name="下花楼村委会" id="110229214205" type="220"/>
        <village name="八亩地村委会" id="110229214206" type="220"/>
        <village name="转山子村委会" id="110229214207" type="220"/>
        <village name="水泉子村委会" id="110229214208" type="220"/>
        <village name="双金草村委会" id="110229214209" type="220"/>
        <village name="小川村委会" id="110229214210" type="220"/>
        <village name="小铺村委会" id="110229214211" type="220"/>
        <village name="仓米道村委会" id="110229214212" type="220"/>
        <village name="南天门村委会" id="110229214213" type="220"/>
        <village name="桃条沟村委会" id="110229214214" type="220"/>
      </town>
    </county>
  </city>
</province>

I set the system locale to Simplified Chinese with Sys.setlocale("LC_ALL", locale="Chinese (Simplified)") and read the document with the XML package, specifying UTF-8 encoding: doc = xmlParse(files[i], encoding = "UTF-8", useInternalNodes = TRUE). But when I look at doc, the Chinese characters are not displayed properly:

<village id="110229214001" type="220" name="鐝嶇彔娉変埂绀惧尯灞呭浼?/>
        <village id="110229214200" type="210" name="鐝嶇彔娉夋潙濮斾細"/>
        <village id="110229214201" type="220" name="绉版矡婀炬潙濮斾細"/>
        <village id="110229214202" type="220" name="搴欐鏉戝浼?/>
        <village id="110229214203" type="220" name="涓嬫按娌熸潙濮斾細"/>
        <village id="110229214204" type="220" name="涓婃按娌熸潙濮斾細"/>
        <village id="110229214205" type="220" name="涓嬭姳妤兼潙濮斾細"/>
        <village id="110229214206" type="220" name="鍏憨鍦版潙濮斾細"/>
        <village id="110229214207" type="220" name="杞北瀛愭潙濮斾細"/>
        <village id="110229214208" type="220" name="姘存硥瀛愭潙濮斾細"/>
        <village id="110229214209" type="220" name="鍙岄噾鑽夋潙濮斾細"/>
        <village id="110229214210" type="220" name="灏忓窛鏉戝浼?/>
        <village id="110229214211" type="220" name="灏忛摵鏉戝浼?/>
        <village id="110229214212" type="220" name="浠撶背閬撴潙濮斾細"/>
        <village id="110229214213" type="220" name="鍗楀ぉ闂ㄦ潙濮斾細"/>
        <village id="110229214214" type="220" name="妗冩潯娌熸潙濮斾細"/>

I also tried setting the system locale to English_United States.1252, but the problem remains. Strangely, some functions applied to doc, for example xmlRoot(doc) or getNodeSet(doc,"//village")[1], display the Chinese characters correctly. Others do not: xmlAttrs(getNodeSet(doc,"//village")[[1]]) still returns the garbled text.
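
The garbled names above look consistent with UTF-8 bytes being displayed as if they were GBK (the native code page of a Simplified Chinese Windows locale). The sketch below reproduces that effect in base R, without the XML package, for the town name from the sample file. It assumes the platform's iconv supports GBK, and the variable names are illustrative only.

```r
# The town name from the sample file, written with \u escapes so that the
# script does not depend on the encoding of the source file itself.
name <- "\u73cd\u73e0\u6cc9\u4e61"

# Treat the UTF-8 bytes as GBK and convert the result to UTF-8 for display.
# This is the misreading, done on purpose.
garbled <- iconv(name, from = "GBK", to = "UTF-8")
print(garbled)

# Undo the misreading: encode back to GBK to recover the original bytes,
# then mark those bytes as UTF-8.
restored <- iconv(garbled, from = "UTF-8", to = "GBK")
Encoding(restored) <- "UTF-8"
print(identical(restored, name))
```

If the printed garbled string matches the start of the first garbled name above, the bytes in doc are intact and only their interpretation is wrong.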

2 solutions

#1


Try LINQ to XML (C#):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace ConsoleApplication49
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml"; 
        static void Main(string[] args)
        {
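            // StreamReader decodes the file as UTF-8 by default.
            StreamReader reader = new StreamReader(FILENAME);
            // Skip the first line, which holds the XML declaration.
            var z = reader.ReadLine();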

            XDocument doc = XDocument.Load(reader);

            var results = doc.Descendants("village").Select(x => new
            {
                name = (string)x.Attribute("name"),
                id = (long)x.Attribute("id"),
                type = (int)x.Attribute("type")
            }).ToList();

        }
    }
}

#2


This seems to be an encoding problem. My aim is to extract the village information from the XML file. After extracting it, I checked the encoding of the village name column, and it was reported as "unknown". So I added one command to set the encoding of that column to "UTF-8", and it works. My code is shown below.

But I still don't know why the encoding is unknown. I already specified encoding="UTF-8" at the very beginning, when I read the XML file with xmlParse(). Does anyone know why? Did I make a mistake when reading the XML file?

> village = as.data.frame(t(xmlSApply(doc["/province/city/county/town/village"],xmlAttrs)),stringsAsFactors=FALSE)
> View(village)
> Encoding(village[1,"name"])
[1] "unknown"
> Encoding(village[,"name"])="UTF-8"   #added this line and the display is fine now
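
For context, Encoding() reports "unknown" for any string that carries no encoding mark, and R then interprets its bytes in the native encoding of the current locale. Setting the mark with Encoding<- changes only that label, not the bytes. The sketch below generalizes the fix above to every character column of a data frame. mark_utf8 is a hypothetical helper, not part of the XML package, and rawToChar() stands in for the unmarked strings that xmlAttrs returned here.

```r
# Hypothetical helper (not part of the XML package): mark every character
# column of a data frame as UTF-8 without changing the underlying bytes.
mark_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "UTF-8"
    col
  })
  df
}

# Stand-in for the extracted attributes: the UTF-8 bytes of the town name
# with no encoding mark, which is what rawToChar() produces.
bytes <- as.raw(c(0xe7, 0x8f, 0x8d, 0xe7, 0x8f, 0xa0,
                  0xe6, 0xb3, 0x89, 0xe4, 0xb9, 0xa1))
village <- data.frame(name = rawToChar(bytes), id = "110229214000",
                      stringsAsFactors = FALSE)

Encoding(village$name)   # "unknown"
fixed <- mark_utf8(village)
Encoding(fixed$name)     # "UTF-8"
```

Pure ASCII values such as the id column are left as "unknown" by R even after marking, which is harmless because ASCII reads the same in every encoding involved here.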
