Precise Detection of Web Page Encoding with HttpClient - 白红宇

Published: 2019-05-25

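I have recently been using HttpClient to scrape web pages. The encoding of the pages is not known in advance (they are mostly Chinese-language sites), and HttpClient determines the encoding solely from the response's Content-Type header. Some servers do not send that information at all, in which case HttpClient falls back to ISO-8859-1. Searching online, I found that someone had written up how browsers automatically detect a page's encoding (a fellow from Beijing, no less), and I tracked down his Java implementation of the algorithm. To avoid making major changes to code I had already written, I extended HttpClient's GetMethod (the class I was already using) and changed its encoding detection. The complete code follows; it uses the chardet.jar library.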
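To make the fallback concrete, here is a minimal sketch using only the JDK. It is not HttpClient's actual code; the names `ContentTypeCharsetDemo` and `charsetFromContentType` are made up for illustration. It reads the `charset` parameter from a Content-Type value and defaults to ISO-8859-1 when the parameter is absent, and `main` shows why that default hurts for Chinese pages.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharsetDemo {

    // The default the article describes HttpClient falling back to.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or the default if none is declared. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=GBK")); // GBK
        System.out.println(charsetFromContentType("text/html"));              // ISO-8859-1

        // Two Chinese characters ("\u4e2d\u6587") encoded as UTF-8 take six bytes.
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        String wrong = new String(body, StandardCharsets.ISO_8859_1);
        String right = new String(body, StandardCharsets.UTF_8);
        System.out.println(wrong.length()); // 6: one Latin-1 character per byte
        System.out.println(right.length()); // 2: the original text
    }
}
```

Decoding with ISO-8859-1 never fails, because every byte maps to some character, which is why the wrong default silently produces garbled text instead of an error.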

The following code uses chardet.jar; the algorithm itself comes from the web:

package com.baseframework.support;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

/**
 * Detects the character encoding of a byte stream.
 *
 * @author sunyanan
 */
public class CharsetDetector {

    private static final CharsetDetector INSTANCE = new CharsetDetector();

    private CharsetDetector() {
    }

    public static CharsetDetector getInstance() {
        return INSTANCE;
    }

    /** Detects the charset using the Chinese language hint. */
    public String[] detectChineseCharset(InputStream in) throws IOException {
        return detect(in, nsPSMDetector.CHINESE);
    }

    /** Detects the charset with no language restriction. */
    public String[] detectAllCharset(InputStream in) throws IOException {
        return detect(in, nsPSMDetector.ALL);
    }

    private String[] detect(InputStream in, int lang) throws IOException {
        // Per-call holder, so the shared singleton keeps no state between calls.
        final String[] result = new String[1];

        // Initialize the detector with the language hint.
        nsDetector det = new nsDetector(lang);
        // Notify() is called when a matching charset is found.
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                result[0] = charset;
            }
        });

        BufferedInputStream imp = new BufferedInputStream(in);
        byte[] buf = new byte[1024];
        int len;
        boolean isAscii = true;
        try {
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Check whether the stream is still ASCII-only.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed the detector once non-ASCII bytes appear; stop when it is sure.
                if (!isAscii && det.DoIt(buf, len, false)) {
                    break;
                }
            }
        } finally {
            imp.close(); // also closes the underlying stream
        }
        det.DataEnd();

        if (isAscii) {
            return new String[] { "ASCII" };
        }
        if (result[0] != null) {
            return new String[] { result[0] };
        }
        // No definite match: return the detector's list of probable charsets.
        return det.getProbableCharsets();
    }
}
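For intuition about what "judging the encoding from the bytes" means, here is a much cruder sketch that needs only the JDK. It is not the algorithm chardet.jar uses and is no substitute for it; it simply checks for pure ASCII, then tries a strict UTF-8 decode, then a strict GBK decode. The class name `SimpleCharsetGuess` and its methods are hypothetical, and the final ISO-8859-1 fallback is my assumption, chosen to mirror HttpClient's default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SimpleCharsetGuess {

    /** Returns true if the bytes decode in the given charset without any malformed input. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Crude guess: ASCII, then UTF-8, then GBK, else ISO-8859-1. */
    static String guess(byte[] data) {
        boolean ascii = true;
        for (byte b : data) {
            if (b < 0) { // high bit set, so not 7-bit ASCII
                ascii = false;
                break;
            }
        }
        if (ascii) {
            return "ASCII";
        }
        // UTF-8 goes first: it is strict, so GBK bytes rarely pass as valid UTF-8,
        // whereas UTF-8 bytes often do pass as (wrong) GBK text.
        if (decodesCleanly(data, StandardCharsets.UTF_8)) {
            return "UTF-8";
        }
        // GBK is not guaranteed on every Java runtime, so check before using it.
        if (Charset.isSupported("GBK") && decodesCleanly(data, Charset.forName("GBK"))) {
            return "GBK";
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf8));
        System.out.println(guess("hello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A trial decode like this cannot tell apart encodings that accept the same bytes (GBK versus Big5, for example), which is the kind of case a statistical detector such as chardet.jar is meant to handle.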

Below is the extension of GetMethod:

package com.baseframework.httpcient;

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.baseframework.support.CharsetDetector;

/**
 * Extends the standard org.apache.commons.httpclient.methods.GetMethod, mainly to
 * override the parent class's charset detection. The detection itself is done by
 * an open-source jar.
 *
 * @author sunyanan
 */
public class GetMethodForCharset extends GetMethod {

    private static final Log log = LogFactory.getLog(GetMethodForCharset.class);

    public GetMethodForCharset() {
        super();
    }

    public GetMethodForCharset(String uri) {
        super(uri);
    }

    /**
     * The main change is the override of this method.
     */
    public String getResponseCharSet() {
        String charset = getContentCharSet(getResponseHeader("Content-Type"));
        // ISO-8859-1 is what we get by default, so only run detection in that case.
        if (charset.equalsIgnoreCase("ISO-8859-1")) {
            try {
                // getResponseBody() buffers the body, so detection does not consume
                // the response stream and the body can still be read afterwards.
                byte[] body = getResponseBody();
                if (body != null) {
                    String[] cs = CharsetDetector.getInstance()
                            .detectAllCharset(new ByteArrayInputStream(body));
                    if (cs != null && cs.length > 0) {
                        charset = cs[0];
                    }
                }
            } catch (IOException e) {
                log.warn("charset detection failed, keeping " + charset, e);
            }
        }
        log.debug("charset used: " + charset);
        return charset;
    }
}
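One thing worth guarding against: the override takes the first entry of the detector's result as the charset name without checking it. As far as I can tell, chardet.jar can return a placeholder such as "nomatch" when it finds nothing, and some names it returns may not be installed on a given Java runtime. A hedged sketch of a safer pick, using only the JDK, follows. The class `CharsetCandidates` and the method `firstSupported` are hypothetical helpers, not part of HttpClient or chardet.jar.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetCandidates {

    /**
     * Returns the first candidate name this JVM can actually decode with,
     * or the fallback if none of the candidates is usable.
     */
    static String firstSupported(String[] candidates, String fallback) {
        if (candidates != null) {
            for (String name : candidates) {
                if (name == null || name.isEmpty()) {
                    continue;
                }
                try {
                    if (Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalCharsetNameException e) {
                    // Not even a syntactically valid charset name: skip it.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String[] detected = { "nomatch", "UTF-8" };
        System.out.println(firstSupported(detected, "ISO-8859-1")); // UTF-8
    }
}
```

In the override, this would replace the direct use of `cs[0]`, with the charset taken from the Content-Type header passed as the fallback.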

Reposted from: http://ivini.baihongyu.com/
