Fetching web pages encoded in gbk/gb2312 #18

@lensh

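Crawlers sometimes encounter web pages encoded in gbk/gb2312. After such a page is scraped, all of the Chinese text in it is garbled. The solution is to transcode the response with iconv-lite. For example, the page http://1212.ip138.com/ic.asp is gb2312-encoded, so the scraped data comes back as garbled Chinese. The transcoding process is as follows: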

import http from 'http'
import iconv from 'iconv-lite' // third-party module

const url = 'http://1212.ip138.com/ic.asp' // the result will be the server's IP address
http.get(url, res => {
  const arrBuf = []
  let bufLength = 0
  res.on('data', chunk => {
    // Collect the raw Buffers; do not convert them to strings yet
    arrBuf.push(chunk)
    bufLength += chunk.length
  }).on('end', () => {
    const chunkAll = Buffer.concat(arrBuf, bufLength)
    const strJson = iconv.decode(chunkAll, 'gb2312') // Chinese characters are no longer garbled
    const startIndex = strJson.indexOf('省') // '省' = province
    const endIndex = strJson.indexOf('市') // '市' = city
    const city = strJson.substring(startIndex + 1, endIndex) // city name
    console.log(strJson, city) // both print as readable Chinese
  })
})
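
The example above hard-codes 'gb2312'. As a rough sketch of how the encoding might be chosen per page instead, the helper below looks for a charset in the Content-Type response header and then in a `<meta>` tag near the start of the body. `detectCharset` is a hypothetical name, not part of iconv-lite, and the 1024-byte scan window and the 'utf-8' fallback are assumptions.

```javascript
// Hypothetical helper (not part of iconv-lite): guess a page's charset.
// contentType: value of the Content-Type response header, may be undefined.
// body: a Buffer holding the raw, still undecoded response body.
function detectCharset(contentType, body) {
  // 1. Prefer the charset parameter of the Content-Type header.
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '')
  if (fromHeader) return fromHeader[1].toLowerCase()

  // 2. Otherwise scan the start of the body for a <meta> charset declaration.
  //    latin1 maps every byte to one character, so ASCII markup stays readable
  //    even though the page's real encoding is not known yet.
  const head = body.subarray(0, 1024).toString('latin1')
  const fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head)
  if (fromMeta) return fromMeta[1].toLowerCase()

  // 3. Assumed default when nothing is declared.
  return 'utf-8'
}
```

Inside the "end" handler, the result could be used as `iconv.decode(chunkAll, detectCharset(res.headers['content-type'], chunkAll))`. Checking the name with `iconv.encodingExists` first would guard against charset labels that iconv-lite does not recognize.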
