Converting Between GBK and UTF-8 Encodings in Java

╰半夏微凉° 2022-01-07 06:05

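### Contents ###

*  Encoding of Chinese characters in Java source files
*  Converting files between UTF-8 and GBK
    *  Converting file encoding in Java
*  Converting byte arrays between encodings in Java
*  Detecting a file's encoding in Java
    *  Detecting UTF-8 files
    *  Detecting file (web page) encoding with the cpdetector library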

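Encoding conversion problems that come up in Java programming mostly involve converting among UTF-8, Unicode, and GBK.

Conversion is needed so often mainly because of Chinese text: when the encoder and the decoder do not use the same encoding, the result is garbled text (mojibake). The underlying reason is that the same Chinese character is represented by different byte values in the Unicode family of encodings and in the GBK family.

In Java, characters are stored in memory as Unicode: a `char` is a 16-bit UTF-16 code unit.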
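
As a small illustration of that last point (a sketch; the class name `CharDemo` and its helper are made up for this example), a `char` is just a number, and for the character U+4E00 that number is the same no matter which encoding the text was read from or will be written in:

```java
public class CharDemo {
    /** Returns the UTF-16 code unit held by a char as four hex digits. */
    static String codeUnitHex(char c) {
        return String.format("%04X", (int) c);
    }

    public static void main(String[] args) {
        // The escape stands for the CJK character U+4E00 and keeps this
        // source file pure ASCII, so it compiles under any -encoding.
        System.out.println(codeUnitHex('\u4e00')); // 4E00
        System.out.println(codeUnitHex('a'));      // 0061
    }
}
```

Encodings such as GBK and UTF-8 only come into play when these in-memory values are turned into bytes or read back from bytes.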

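# Encoding of Chinese Characters in Java Source Files #

**On Windows, the default encoding is the system code page, which is GBK on Simplified Chinese systems.**

When Java code is compiled from the command line without the `-encoding` option, `javac` decodes the source file with the default encoding, here GBK. If the source file is not actually GBK-encoded, compilation may fail with an error such as "unmappable character for encoding GBK".

**On Linux/Unix, the default encoding is usually UTF-8** (it follows the locale).

When Java code is compiled from the command line without the `-encoding` option, `javac` decodes the source file as UTF-8. If the source file is not actually UTF-8-encoded, compilation may fail with an error such as "unmappable character for encoding UTF-8".

**The fix in both cases is to pass `-encoding` with the source file's actual encoding. The source is then decoded correctly regardless of the operating system's default encoding.** (From JDK 18 on, the default charset is UTF-8 on every platform per JEP 400, so the mismatch mostly affects older JDKs.)

For example, on Windows, a UTF-8-encoded Test.java compiles with the commands below, and Chinese text prints correctly in the terminal.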

    javac Test.java -encoding utf-8
    java Test

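# Converting Files Between UTF-8 and GBK #

**UTF-8 and GBK data cannot be converted into each other directly. The bytes must first be decoded to Unicode and then encoded in the target encoding.**

**Official Unicode code charts: [http://www.unicode.org/charts/][http_www.unicode.org_charts]**

**Unicode code ranges and the scripts mapped to them: [https://www.cnblogs.com/csguo/p/7401874.html][https_www.cnblogs.com_csguo_p_7401874.html]**

**Chinese characters are covered by the CJK blocks of the Unicode charts (CJK stands for Chinese, Japanese, Korean).** [About CJK][CJK].

[ISO-8859-1 is backward compatible with ASCII][ISO-8859-1_ASCII]

**Note:** there is no fixed arithmetic relationship between Unicode and the GBK family of encodings. An accurate conversion has to go through lookup tables that map between Unicode code points and the GBK code table.

Most high-level programming languages provide an API for converting between Unicode and the GBK family. Where the standard library does not, as in C/C++, a third-party library can do the job; there is no need to spend time reinventing the wheel.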
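
The two-step route described above (decode to Unicode, then encode) can be sketched with the `java.nio.charset` decoder and encoder classes. This is only an illustration: the class name `StrictTranscoder` and the method `transcode` are made up here, and the choice to fail on bad input instead of substituting `?` is an assumption about what a caller would want.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictTranscoder {
    /**
     * Decodes src as 'from' into Unicode chars, then encodes those chars as 'to'.
     * Throws CharacterCodingException on malformed or unmappable input.
     */
    static byte[] transcode(byte[] src, Charset from, Charset to)
            throws CharacterCodingException {
        // Step 1: bytes in the source encoding -> Unicode chars
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(src));
        // Step 2: Unicode chars -> bytes in the target encoding
        ByteBuffer out = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] gbk = {(byte) 0xD2, (byte) 0xBB}; // U+4E00 encoded in GBK
        byte[] utf8 = transcode(gbk, Charset.forName("GBK"), Charset.forName("UTF-8"));
        for (byte b : utf8) {
            System.out.printf("%02x ", b); // e4 b8 80
        }
        System.out.println();
    }
}
```

The lookup tables mentioned above live inside the charset implementations, so the caller never handles them directly.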

## Converting File Encoding in Java ##

    // Requires: import java.io.*;

    /**
     * Re-encodes a text file from srcCharset to destCharset.
     *
     * @return the number of lines written (0 if the two charsets are the same)
     */
    public static int convertFileEncoding(String srcFilePath, String srcCharset,
            String destFilePath, String destCharset, boolean isDeleteSrc) throws IOException {

        if (srcFilePath == null || srcFilePath.length() == 0)
            throw new IllegalArgumentException("srcFilePath is empty.");
        if (destFilePath == null || destFilePath.length() == 0)
            throw new IllegalArgumentException("destFilePath is empty.");
        if (srcFilePath.equalsIgnoreCase(destFilePath))
            throw new IllegalArgumentException("srcFilePath is the same as destFilePath");

        if (srcCharset == null || srcCharset.length() == 0)
            throw new IllegalArgumentException("srcCharset is empty.");
        if (destCharset == null || destCharset.length() == 0)
            throw new IllegalArgumentException("destCharset is empty.");

        if (srcCharset.equalsIgnoreCase(destCharset)) // same encoding, nothing to convert
            return 0;

        File srcFile = new File(srcFilePath);
        int lines = 0;

        // try-with-resources closes every stream, flushing the writer first,
        // even when an exception is thrown.
        try (InputStream fis = new FileInputStream(srcFile);
             BufferedReader br = new BufferedReader(new InputStreamReader(fis, srcCharset));
             // second argument false: overwrite the destination instead of appending
             OutputStream fos = new FileOutputStream(destFilePath, false);
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos, destCharset))) {

            String str;
            // readLine() returns a line of any length but strips its terminator,
            // so '\n' is written back after each line. As a consequence "\r\n"
            // becomes "\n", and a last line without a terminator gains one.
            while ((str = br.readLine()) != null) {
                bw.write(str);   // write only the current line, not an accumulated buffer
                bw.write('\n');
                lines++;
            }
        }

        // Delete the source only after all streams are closed; an open file
        // usually cannot be deleted on Windows.
        if (isDeleteSrc) {
            if (srcFile.delete())
                System.out.println(srcFile.getAbsolutePath() + " file deleted.");
            else
                System.out.println(srcFile.getAbsolutePath() + " file delete failed.");
        }
        return lines;
    }
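
A line-based copy normalizes line terminators. Where the original terminators should survive, one alternative is to copy characters in fixed-size chunks instead. The sketch below assumes Java 7+ `java.nio.file`; the class name `FileTranscoder` and its `transcode` method are hypothetical, not part of the code above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileTranscoder {
    /** Copies src to dest, decoding as 'from' and encoding as 'to'; returns the chars copied. */
    static long transcode(Path src, Charset from, Path dest, Charset to) throws IOException {
        long total = 0;
        try (Reader in = Files.newBufferedReader(src, from);
             Writer out = Files.newBufferedWriter(dest, to)) {
            char[] buf = new char[8192];
            int n;
            // Chars are copied untouched, so "\r\n" stays "\r\n" and a missing
            // final newline stays missing.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.out.println("usage: java FileTranscoder <src> <srcCharset> <dest> <destCharset>");
            return;
        }
        long n = transcode(Paths.get(args[0]), Charset.forName(args[1]),
                           Paths.get(args[2]), Charset.forName(args[3]));
        System.out.println(n + " chars converted");
    }
}
```

The reader and writer returned by `Files` throw an `IOException` on malformed or unmappable input rather than substituting a replacement character, so a wrong source charset tends to surface as an error instead of silently corrupting the output.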

# Converting Byte Arrays Between Encodings in Java #

The conversions are implemented in the class `EncodingUtil`:

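    import java.io.UnsupportedEncodingException;

    public class EncodingUtil {

        /**
         * Round-trips the string src through destCharet: encodes it to bytes in
         * that charset and decodes the bytes back.
         *
         * A Java String is always held as Unicode (UTF-16) in memory, so it has no
         * source encoding of its own and srcCharset is not used. The result equals
         * src when every character can be represented in destCharet; characters
         * that cannot are replaced (typically by '?').
         *
         * UnsupportedEncodingException is thrown only when the charset name is not
         * supported, not when a character cannot be mapped.
         *
         * @param src        a correctly decoded string
         * @param srcCharset unused; kept so the signature stays the same
         * @param destCharet name of the target charset
         * @return the string as it survives encoding to destCharet
         * @throws UnsupportedEncodingException if destCharet is not a supported charset name
         */
        public static String convertEncoding_Str(String src, String srcCharset, String destCharet)
                throws UnsupportedEncodingException {
            byte[] bts = src.getBytes(destCharet);
            return new String(bts, destCharet);
        }

        /**
         * Converts the byte array src, encoded as srcCharset, into a byte array
         * encoded as destCharet.
         *
         * @param src        bytes in the source encoding
         * @param srcCharset name of the charset src is actually encoded in
         * @param destCharet name of the target charset
         * @return the same text encoded in destCharet
         * @throws UnsupportedEncodingException if either charset name is not supported
         */
        public static byte[] convertEncoding_ByteArr(byte[] src, String srcCharset, String destCharet)
                throws UnsupportedEncodingException {
            String s = new String(src, srcCharset); // decode to Unicode
            return s.getBytes(destCharet);          // encode in the target charset
        }

        /**
         * Formats byteArr as two-digit hexadecimal values separated by spaces.
         *
         * @param byteArr the bytes to format
         * @return for example "e4 b8 80"; an empty string for null or empty input
         */
        public static String byteToHex(byte... byteArr) {
            if (null == byteArr || byteArr.length == 0)
                return "";
            StringBuilder sb = new StringBuilder();
            for (byte b : byteArr) {
                // A byte is promoted to a signed int in expressions; masking with
                // 0xFF keeps the low 8 bits so negative bytes do not print as ffffffXX.
                String tmp = Integer.toHexString(b & 0xFF);
                if (tmp.length() < 2)
                    sb.append('0');
                sb.append(tmp).append(' ');
            }
            sb.deleteCharAt(sb.length() - 1); // delete last space
            return sb.toString();
        }
    }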

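**Note:** in the Java API, a correctly decoded string can be turned into a byte array in any encoding with `String.getBytes(String charsetName)`. When a string is constructed from a byte array with `new String(byte[], charsetName)`, however, the charset passed in must be the one the bytes were actually encoded with; otherwise the resulting string may not be the expected one.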
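
A minimal sketch of that rule follows (the class name `WrongCharsetDemo` and the helper `reinterpret` are made up for illustration). Decoding with the charset that produced the bytes restores the text, while decoding UTF-8 bytes as GBK does not:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String s = "\u4e00\u4e01"; // two CJK characters, written as escapes

        // Same charset on both sides: the text survives the round trip
        System.out.println(s.equals(reinterpret(s, gbk, gbk)));                    // true
        // UTF-8 bytes decoded as GBK: different characters come out
        System.out.println(s.equals(reinterpret(s, StandardCharsets.UTF_8, gbk))); // false
    }
}
```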

Test program (operating system: Ubuntu 16.04):

    import java.io.IOException;
    import java.nio.charset.Charset;

    public class Test {

        // 一
        // GB2312: D2BB  BIG5: A440  GBK: D2BB  GB18030: D2BB  Unicode: 4E00  UTF-8: E4B880
        // 丁
        // GB2312: B6A1  BIG5: A442  GBK: B6A1  GB18030: B6A1  Unicode: 4E01  UTF-8: E4B881
        // 丂
        // GB2312: none  BIG5: none  GBK: 8140  GB18030: 8140  Unicode: 4E02  UTF-8: E4B882
        // 七
        // GB2312: C6DF  BIG5: A443  GBK: C6DF  GB18030: C6DF  Unicode: 4E03  UTF-8: E4B883
        // \r\n : 0D 0A
        // 0-9  : 30~39
        // A-Z  : 41~5A   a-z : 61~7A

        // 一18丁a丂七40
        // GBK     : D2BB 3138 B6A1 61 8140 C6DF 3430
        // Unicode : 4E00 3138 4E01 61 4E02 4E03 3430
        // UTF-8   : E4B880 3138 E4B881 61 E4B882 E4B883 3430

        public static void main(String[] args) throws IOException {

            System.out.println("start");

            String s = "一18丁a丂七40";

            // UTF-8 on the Ubuntu test machine; typically GBK on Simplified Chinese Windows
            System.out.println("file.encoding: " + System.getProperty("file.encoding"));
            // default charset of the JVM
            System.out.println("default charset: " + Charset.defaultCharset());
            // language of the operating system user
            System.out.println("user.language: " + System.getProperty("user.language"));

            // encode with the platform default charset
            byte[] defaultCharsetArr = s.getBytes();
            showByteArr(defaultCharsetArr, "defaultCharsetArr");

            // "unicode" is UTF-16 with a leading BOM (FE FF): every character in
            // this string, Chinese or ASCII, takes 2 bytes, plus 2 bytes for the BOM
            byte[] unicodeArr = s.getBytes("unicode");
            showByteArr(unicodeArr, "unicodeArr");

            // a Chinese character takes 2 bytes in GBK and 3 bytes in UTF-8;
            // ASCII characters take 1 byte in both
            byte[] gbkArr = s.getBytes("gbk");
            showByteArr(gbkArr, "gbkArr");

            byte[] utf8Arr = s.getBytes("utf-8");
            showByteArr(utf8Arr, "utf8Arr");

            // ISO-8859-1 cannot represent Chinese: each unmappable character is
            // replaced by a single '?' (0x3F), so the original data is lost
            byte[] isoArr = s.getBytes("ISO-8859-1");
            showByteArr(isoArr, "isoArr");

            // conversions between encodings
            showByteArr(EncodingUtil.convertEncoding_ByteArr(gbkArr, "gbk", "utf-8"), "gbk to utf-8");
            showByteArr(EncodingUtil.convertEncoding_ByteArr(utf8Arr, "utf-8", "gbk"), "utf-8 to gbk");
            showByteArr(EncodingUtil.convertEncoding_ByteArr(utf8Arr, "utf-8", "unicode"), "utf-8 to unicode");

            System.out.println("end");
        }

        private static void showByteArr(byte[] arr, String msg) {
            System.out.println(msg);
            System.out.println("byte[] len=:" + arr.length + "\n" + EncodingUtil.byteToHex(arr));
        }
    }

Output:

    start
    file.encoding: UTF-8
    default charset: UTF-8
    user.language: en
    defaultCharsetArr
    byte[] len=:17
    e4 b8 80 31 38 e4 b8 81 61 e4 b8 82 e4 b8 83 34 30
    unicodeArr
    byte[] len=:20
    fe ff 4e 00 00 31 00 38 4e 01 00 61 4e 02 4e 03 00 34 00 30
    gbkArr
    byte[] len=:13
    d2 bb 31 38 b6 a1 61 81 40 c6 df 34 30
    utf8Arr
    byte[] len=:17
    e4 b8 80 31 38 e4 b8 81 61 e4 b8 82 e4 b8 83 34 30
    isoArr
    byte[] len=:9
    3f 31 38 3f 61 3f 3f 34 30
    gbk to utf-8
    byte[] len=:17
    e4 b8 80 31 38 e4 b8 81 61 e4 b8 82 e4 b8 83 34 30
    utf-8 to gbk
    byte[] len=:13
    d2 bb 31 38 b6 a1 61 81 40 c6 df 34 30
    utf-8 to unicode
    byte[] len=:20
    fe ff 4e 00 00 31 00 38 4e 01 00 61 4e 02 4e 03 00 34 00 30
    end

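# Detecting a File's Encoding in Java #

In practice, the question is usually whether a file is GBK or UTF-8.

## Detecting UTF-8 Files ##

A UTF-8 file that starts with a `BOM` is easy to identify.

**What the BOM is:**

UCS defines a character called `"ZERO WIDTH NO-BREAK SPACE"` whose code is FEFF. FFFE, by contrast, is not a character in UCS (it is reserved as a noncharacter), so it should never appear in transmitted text.

The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

If the receiver sees the bytes FE FF, the stream is big-endian; if it sees FF FE, the stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM (byte order mark).

**What the BOM is for**

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows it is UTF-8.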

    // Requires: java.io.FileInputStream, java.io.IOException, java.io.InputStream

    // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
    public static boolean isUTF8Encoding_BOM(String path) throws IOException {
        try (InputStream fis = new FileInputStream(path)) {
            // read() returns -1 at end of file, which never equals a BOM byte,
            // so files shorter than three bytes yield false.
            return fis.read() == 0xEF && fis.read() == 0xBB && fis.read() == 0xBF;
        }
    }
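
The BOM byte sequences quoted above can be reproduced by encoding the character U+FEFF in each encoding. The snippet is only a sketch; the class name `BomBytes` and its `hex` helper are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    /** Formats bytes as space-separated two-digit uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // ZERO WIDTH NO-BREAK SPACE
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
    }
}
```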

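Detecting UTF-8 without a BOM is harder. One option is the third-party library [cpdetector][]. The library is small, only about 500 KB. Because its algorithms rely on probability and statistics, the result is not 100% accurate. Code that uses it to determine the encoding of a text file is given in the next section.

Reference: [https://www.cnblogs.com/x\_wukong/p/3732955.html][https_www.cnblogs.com_x_wukong_p_3732955.html]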
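
Before reaching for a library, a cheaper heuristic is sometimes enough when the only candidates are GBK and UTF-8: check whether the bytes are well-formed UTF-8. This is a different technique from cpdetector's statistical detection, and the class and method names below (`Utf8Check`, `isValidUtf8`) are made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one byte sequence is not legal UTF-8
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0x81}; // U+4E01 in UTF-8
        byte[] gbk = {(byte) 0xB6, (byte) 0xA1};               // U+4E01 in GBK
        System.out.println(isValidUtf8(utf8)); // true
        System.out.println(isValidUtf8(gbk));  // false
    }
}
```

The check can be fooled. The GBK bytes D2 BB (U+4E00) happen to be a legal two-byte UTF-8 sequence as well, and pure ASCII text is valid in both encodings, so the result is more trustworthy on longer inputs and should be treated as a hint, not as proof.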

## Detecting File (Web Page) Encoding with the cpdetector Library ##

The [cpdetector][] open-source library:

    // Requires: java.io.File, java.io.IOException, and the cpdetector jar on the classpath
    public static String getFileEncoding(String path) throws IOException {

        /*------------------------------------------------------------------------
          The detector is a proxy: it delegates the detection work to instances of
          concrete detector implementations. cpdetector ships with several common
          implementations, such as ParsingDetector, JChardetFacade, ASCIIDetector
          and UnicodeDetector, which are registered through add().
          The proxy returns the first non-null result reported by any of them.
        --------------------------------------------------------------------------*/
        info.monitorenter.cpdetector.io.CodepageDetectorProxy detector =
                info.monitorenter.cpdetector.io.CodepageDetectorProxy.getInstance();

        /*-------------------------------------------------------------------------
          ParsingDetector checks the encoding of HTML, XML and similar files or
          character streams. The constructor argument controls whether details of
          the detection process are printed; false turns them off.
        ---------------------------------------------------------------------------*/
        detector.add(new info.monitorenter.cpdetector.io.ParsingDetector(false));

        /*--------------------------------------------------------------------------
          JChardetFacade wraps jchardet, the charset detector that originated at
          Mozilla. It can determine the encoding of most files, so this detector
          alone is usually enough for most projects. For extra assurance, more
          detectors can be added, such as ASCIIDetector and UnicodeDetector below.
        ---------------------------------------------------------------------------*/
        detector.add(info.monitorenter.cpdetector.io.JChardetFacade.getInstance());
        // ASCIIDetector detects ASCII
        detector.add(info.monitorenter.cpdetector.io.ASCIIDetector.getInstance());
        // UnicodeDetector detects the Unicode family of encodings
        detector.add(info.monitorenter.cpdetector.io.UnicodeDetector.getInstance());

        java.nio.charset.Charset charset = null;
        File f = new File(path);
        try {
            // File.toURL() is deprecated; going through toURI() escapes special characters
            charset = detector.detectCodepage(f.toURI().toURL());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // null means no detector could determine the encoding
        return charset != null ? charset.name() : null;
    }

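The detector can determine the encoding not only of files but of any input text stream, through the overload

`charset = detector.detectCodepage(InputStream ins, int test_byte_count)`

The byte count is chosen by the programmer: the more bytes examined, the more accurate the result, at the cost of more time. The count must not exceed the length of the stream.

A concrete use of encoding detection:  
Properties files (.properties) are a common way to store text in Java programs; the Struts framework, for instance, keeps a program's string resources in them. Their content looks like this:

    # comment
    name=value

The usual way to read a properties file is: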

    FileInputStream ios = new FileInputStream("properties file name");
    Properties prop = new Properties();
    prop.load(ios);
    ios.close();

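Reading a properties file with the `load` method of `java.util.Properties` is convenient, but if the file contains Chinese text, the values come back garbled. The cause is that `load(InputStream)` reads the file as a byte stream and decodes the bytes as ISO-8859-1, a single-byte superset of ASCII that cannot represent Chinese. An explicit conversion is therefore needed: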

    String value = prop.getProperty("property name");
    String encValue = new String(value.getBytes("iso-8859-1"), "actual encoding of the properties file");
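
Since Java 6, `Properties` also has a `load(Reader)` overload, which avoids the re-decoding step because the caller chooses the charset up front. A minimal sketch, in which the class name `PropertiesLoader` and its `load` helper are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class PropertiesLoader {
    /** Loads a properties file, decoding its bytes with the given charset. */
    static Properties load(String path, Charset charset) throws IOException {
        Properties prop = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(path));
             Reader reader = new InputStreamReader(in, charset)) {
            // load(Reader) takes already decoded chars, so no ISO-8859-1 step is involved
            prop.load(reader);
        }
        return prop;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java PropertiesLoader <file> <charset>");
            return;
        }
        Properties prop = load(args[0], Charset.forName(args[1]));
        for (String name : prop.stringPropertyNames()) {
            System.out.println(name + "=" + prop.getProperty(name));
        }
    }
}
```

The charset passed in could come from the detection step above, for example the name returned by `getFileEncoding`.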

[http_www.unicode.org_charts]: http://www.unicode.org/charts/
[https_www.cnblogs.com_csguo_p_7401874.html]: https://www.cnblogs.com/csguo/p/7401874.html
[CJK]: https://baike.baidu.com/item/CJK/10788027?fr=aladdin
[ISO-8859-1_ASCII]: https://baike.baidu.com/item/ISO-8859-1/7878872?fr=aladdin
[cpdetector]: http://cpdetector.sourceforge.net/
[https_www.cnblogs.com_x_wukong_p_3732955.html]: https://www.cnblogs.com/x_wukong/p/3732955.html