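Crawlers sometimes run into pages encoded as GBK/GB2312. If the response bytes are decoded as UTF-8 (Node's default when converting a Buffer to a string), every Chinese character on such a page comes out garbled. The fix is to collect the raw bytes and decode them with iconv-lite. For example, http://1212.ip138.com/ic.asp is GB2312-encoded, so the scraped data contains garbled Chinese unless it is transcoded. The steps are: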
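import http from 'http'
import iconv from 'iconv-lite' // third-party module

const url = 'http://1212.ip138.com/ic.asp' // the page reports the public IP of the machine making the request

http.get(url, res => {
  // Collect the raw Buffer chunks; do not convert them to strings yet
  const arrBuf = []
  let bufLength = 0
  res.on('data', chunk => {
    arrBuf.push(chunk)
    bufLength += chunk.length
  })
  .on('end', () => {
    // Join all chunks first, then decode once, so a multi-byte character
    // split across two chunks is not corrupted
    const chunkAll = Buffer.concat(arrBuf, bufLength)
    const strJson = iconv.decode(chunkAll, 'gb2312') // Chinese characters are now decoded correctly
    const startIndex = strJson.indexOf('省')
    const endIndex = strJson.indexOf('市')
    const city = strJson.substring(startIndex + 1, endIndex) // city name (text between '省' and '市')
    console.log(strJson, city) // both print as readable Chinese
  })
}).on('error', err => console.error(err)) // network errors are reported instead of crashing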
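The snippet above hard-codes 'gb2312'. When the encoding is not known in advance, one option is to read it from the response. The sketch below is only an illustration: `detectCharset` is a hypothetical helper (it is not part of iconv-lite), and it assumes the page declares its charset either in the Content-Type header or in a `<meta>` tag within the first 1024 bytes, falling back to UTF-8 otherwise.

```javascript
// Hypothetical helper (not part of iconv-lite): guess the charset of an HTML response.
// contentType: value of the Content-Type response header (may be undefined)
// body: Buffer holding the raw, undecoded response bytes
function detectCharset(contentType, body) {
  // 1. Prefer the charset parameter of the Content-Type header, if present.
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '')
  if (fromHeader) return fromHeader[1].toLowerCase()

  // 2. Otherwise scan the first bytes for a <meta> charset declaration.
  //    latin1 maps every byte to exactly one character, so ASCII markup survives.
  const head = body.subarray(0, 1024).toString('latin1')
  const fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head)
  if (fromMeta) return fromMeta[1].toLowerCase()

  // 3. Fall back to UTF-8 (an assumption; adjust for the sites being crawled).
  return 'utf-8'
}
```

In the `end` handler, the result could then replace the hard-coded name, for example `iconv.decode(chunkAll, detectCharset(res.headers['content-type'], chunkAll))`. Checking the detected name with `iconv.encodingExists` before decoding guards against pages that declare an unsupported or misspelled charset.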