ANSI, Unicode, and UTF-8 Encodings, and Reading All Three File Types in C++

1. ANSI Encoding

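"ANSI" takes its name from the American National Standards Institute, but on Windows the term "ANSI encoding" refers to the system's default code page rather than to a single fixed standard; on a Simplified Chinese system that code page is GBK (code page 936). To support more languages, these code pages typically represent one character with two bytes, the first of which lies in the range 0x80~0xFF. For example, on a Chinese operating system the character '中' is stored as the two bytes [0xD6,0xD0]. Characters in the range 0x00~0x7F are still stored as one byte per character. This is the biggest and most obvious difference between ANSI encoding and Unicode (UTF-16) encoding. Take the string "A君是第131号": it occupies 12 bytes in ANSI encoding but 16 bytes in UTF-16. The four characters A, 1, 3, and 1 take one byte each in ANSI but two bytes each in UTF-16, while the four Chinese characters take two bytes each in both.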
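The byte counts above can be checked with a small sketch. It assumes a GBK-style code page in which every ASCII character takes one byte and every other character in the example takes two. The names ansiByteCount, utf16ByteCount, and kSample are made up for illustration, and the code only counts bytes; it does not perform a real code-page conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative only: counts bytes under a GBK-style "ANSI" code page,
// assuming ASCII takes 1 byte and every other BMP character takes 2.
std::size_t ansiByteCount(const std::u32string& text) {
    std::size_t bytes = 0;
    for (char32_t cp : text) {
        assert(cp <= 0xFFFF);  // this sketch handles BMP characters only
        bytes += (cp < 0x80) ? 1 : 2;
    }
    return bytes;
}

// Counts bytes in UTF-16: 2 per BMP character, 4 per surrogate pair.
std::size_t utf16ByteCount(const std::u32string& text) {
    std::size_t bytes = 0;
    for (char32_t cp : text) {
        bytes += (cp < 0x10000) ? 2 : 4;
    }
    return bytes;
}

// "A君是第131号", written with \u escapes so that the encoding of the
// source file itself does not matter.
const std::u32string kSample = U"A\u541B\u662F\u7B2C131\u53F7";
```

For kSample, ansiByteCount returns 12 and utf16ByteCount returns 16, matching the figures in the text.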

2. Unicode Encoding

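Unicode, whose character repertoire is shared with the Universal Character Set (UCS), is a character encoding scheme defined by an international organization to hold all of the world's scripts and symbols. Unicode maps characters to the numbers 0 through 0x10FFFF, so it can hold at most 1,114,112 characters; in other words, it has 1,114,112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that turn these numbers into the bytes a program actually stores.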
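To make the step from code point to stored data concrete, here is a minimal sketch of the UTF-16 encoding scheme. The function name encodeUtf16 is hypothetical and not taken from any library; the surrogate-pair arithmetic follows the standard UTF-16 definition.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Name and signature are illustrative, not from any library.
std::u16string encodeUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF);              // outside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not characters
    std::u16string out;
    if (cp < 0x10000) {
        // Code points in the BMP are stored as a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Higher code points are split into a high and a low surrogate,
        // each carrying 10 bits of (cp - 0x10000).
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

With this sketch, '中' (U+4E2D) becomes the single unit 0x4E2D, while U+1F600 becomes the pair 0xD83D 0xDE00.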

3. UTF-8

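UTF-8 is a variable-length character encoding for Unicode (Unicode itself is sometimes called the "universal code"). It was created by Ken Thompson, together with Rob Pike, in 1992 and is now standardized as RFC 3629. The original design used 1 to 6 bytes per character; RFC 3629 restricts it to 1 to 4 bytes, which is enough for every code point up to 0x10FFFF. Used on a web page, it allows Simplified Chinese, Traditional Chinese, and other languages (such as Japanese and Korean) to appear on the same page. Storing characters that ASCII can represent in a fixed two-byte Unicode form is inefficient: it takes twice the space of ASCII, and the high byte, always 0, carries no information. To solve this problem, a set of intermediate formats was introduced, called the Universal Transformation Format (UTF).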
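The variable-length scheme is easiest to see in code. The following is a minimal sketch of a UTF-8 encoder limited to the 1 to 4 byte forms allowed by RFC 3629; the function name encodeUtf8 is hypothetical and the sketch does not reject surrogate code points.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as 1 to 4 UTF-8 bytes.
// Name and signature are illustrative, not from any library.
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx: ASCII stays a single byte.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Under this sketch 'A' stays the single byte 0x41, while '中' (U+4E2D) becomes the three bytes 0xE4 0xB8 0xAD, so ASCII text costs no extra space.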

4. Code Implementation

// Requires the MFC headers (e.g. afxwin.h) plus the two standard headers below.
#include <string>
#include <vector>

// Reads a text file and returns its contents as an ANSI (CP_ACP) string.
// The encoding is detected from the byte order mark (BOM); a file without
// a BOM is treated as ANSI. Characters that the ANSI code page cannot
// represent are replaced by the system default character.
std::string COfficeControlTestToolDlg::ReadFile(CString strFilePath)
{
	std::string strContents;   // file contents, converted to ANSI
	CFile mFile;
	if(!mFile.Open(strFilePath,CFile::modeRead|CFile::typeBinary))
	{
		MessageBox(_T("Cannot open file: ")+strFilePath,_T("Error"),MB_ICONERROR|MB_OK);
		PostQuitMessage(0);
		return strContents;   // never read from a file that failed to open
	}

	m_isUnicode = FALSE;
	m_isUTF_8Code = FALSE;

	// Read the whole file into memory (assumes the size fits in a UINT).
	UINT FileSize = (UINT)mFile.GetLength();
	std::vector<char> buf(FileSize);
	UINT nRead = FileSize ? mFile.Read(&buf[0],FileSize) : 0;
	mFile.Close();
	buf.resize(nRead);

	// Inspect the BOM, checking that enough bytes were actually read.
	const unsigned char *head = nRead ? (const unsigned char*)&buf[0] : NULL;
	bool isBigEndian = false;
	if (nRead>=3 && head[0]==0xef && head[1]==0xbb && head[2]==0xbf)   // UTF-8 BOM
	{
		m_isUTF_8Code = TRUE;
	}
	else if (nRead>=2 && head[0]==0xff && head[1]==0xfe)   // UTF-16 little-endian BOM
	{
		m_isUnicode = TRUE;
	}
	else if (nRead>=2 && head[0]==0xfe && head[1]==0xff)   // UTF-16 big-endian BOM
	{
		m_isUnicode = TRUE;
		isBigEndian = true;
	}

	if (!m_isUTF_8Code && !m_isUnicode)   // ANSI file: return the bytes unchanged
	{
		strContents.assign(buf.begin(),buf.end());
		return strContents;
	}

	// Decode to UTF-16 first; Windows converts between code pages via UTF-16.
	std::wstring wide;
	if (m_isUTF_8Code)
	{
		int nBytes = (int)nRead-3;   // skip the 3-byte BOM 0xefbbbf
		if (nBytes>0)
		{
			int size = MultiByteToWideChar(CP_UTF8,0,&buf[3],nBytes,NULL,0);
			if (size>0)
			{
				wide.resize(size);
				MultiByteToWideChar(CP_UTF8,0,&buf[3],nBytes,&wide[0],size);
			}
		}
	}
	else   // UTF-16: assemble each code unit in the byte order given by the BOM
	{
		for (UINT i=2; i+1<nRead; i+=2)   // skip the 2-byte BOM
		{
			unsigned char b0 = (unsigned char)buf[i], b1 = (unsigned char)buf[i+1];
			wide += (wchar_t)(isBigEndian ? ((b0<<8)|b1) : ((b1<<8)|b0));
		}
	}

	// Convert the UTF-16 text to the ANSI code page.
	if (!wide.empty())
	{
		int size = WideCharToMultiByte(CP_ACP,0,wide.c_str(),(int)wide.size(),NULL,0,NULL,NULL);
		if (size>0)
		{
			strContents.resize(size);
			WideCharToMultiByte(CP_ACP,0,wide.c_str(),(int)wide.size(),&strContents[0],size,NULL,NULL);
		}
	}
	return strContents;
}
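The BOM check at the start of ReadFile can also be factored into a small, platform-independent helper, which makes it easier to test in isolation. This is only a sketch: the enum TextEncoding and the function detectBom are hypothetical names, and a file without a BOM is simply assumed to be ANSI, as ReadFile does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the sketch below.
enum class TextEncoding { Ansi, Utf8, Utf16LE, Utf16BE };

// Guesses the encoding of a buffer from its byte order mark (BOM).
// Buffers without a recognized BOM are reported as ANSI.
TextEncoding detectBom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return TextEncoding::Utf8;      // EF BB BF
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TextEncoding::Utf16LE;   // FF FE
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return TextEncoding::Utf16BE;   // FE FF
    return TextEncoding::Ansi;
}
```

Checking the length before touching the bytes keeps the helper safe for empty or one-byte files. Note that a BOM is optional, so a UTF-8 file saved without one would be reported as ANSI by this sketch.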
