Zhihu User Information Crawler (Crawling at Scale)

2016-04-10 15:53:54

This blog is published under a Creative Commons license: attribution, non-commercial use, share-alike. Reposts of its articles must follow the same BY-NC-SA terms.

This time we're finally crawling Zhihu. Zhihu is a very popular community these days, and thanks to the professionalism of its users it stands on its own in the self-media era. The Zhihu crawlers I wrote before only fetched top answers or images.

If all you want is answers or images, no credentials are needed.

session


First, we need to understand what a session is: it is the mechanism that lets the client and the server hold a conversation.

We should also know that the browser sends cookies along with each request. Here is the request we send:

(screenshot: the request headers sent by the browser, including the Cookie field)

The server can read that cookie value on its side, which is why you can no longer log in once you disable cookies in the browser.

A session works like this: when you perform the login action, the server generates two special values, one stored in your cookie and one kept on the server, like a key and a lock. When your cookie comes in, the server checks whether it holds the matching lock. That guarantees your login is unique and gives the two sides a proper conversation.
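A toy sketch of that key-and-lock idea (this only illustrates the mechanism, it is not Zhihu's actual implementation, and all names here are made up):

import uuid

server_side_sessions={}            # the "locks", kept on the server

def login(username):
    session_id=uuid.uuid4().hex    # the "key", sent back to the browser in a Set-Cookie header
    server_side_sessions[session_id]=username
    return session_id

def handle_request(cookie_session_id):
    # the server only trusts a cookie whose key matches a lock it issued
    user=server_side_sessions.get(cookie_session_id)
    if user:
        return "hello, %s" % user
    return "please log in"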

Strategy


We want Zhihu user data, but there are bound to be plenty of spam accounts. So the strategy this time is: for every user we crawl, take the people they follow as the next crawl targets. That alone filters out a large share of throwaway accounts.
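Concretely, the crawl order this implies is a breadth-first walk over the followee graph. A rough sketch (fetch_followees and max_users are placeholders I made up; later in the post the in-memory queue and set are replaced by redis structures):

from collections import deque

def crawl_followee_graph(seed_url,fetch_followees,max_users=1000):
    seen=set([seed_url])
    queue=deque([seed_url])
    while queue and len(seen)<max_users:
        url=queue.popleft()
        for followee_url in fetch_followees(url):   # urls of the people this user follows
            if followee_url not in seen:
                seen.add(followee_url)
                queue.append(followee_url)
    return seen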

In fact, when I processed the data afterwards, the quality turned out to be quite high.

Getting started


This time we're using requests, because it is so pleasant to work with. For how to use requests, read its documentation; it is excellently written.

A single statement like this is enough to send a request that carries cookies (if this looks unfamiliar, have a look at the crawler tutorials I wrote earlier):

requests.get(added_followee_url,cookies=self.cookies,headers=self.header,verify=False)

Note that, to look less like a crawler, we should also set headers such as User-Agent.

Since this time we only need the crawling part, we could of course use requests to maintain a session and log in properly, but let's be lazy for now and reuse a cookie grabbed straight from the browser:

(screenshot: the cookie values as shown in the browser's developer tools)
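For reference, the non-lazy route would be to let requests.Session manage the cookies by logging in. Something roughly like the sketch below, where the login URL and form fields are placeholders rather than Zhihu's real ones:

import requests

s=requests.Session()
s.headers.update({"User-Agent":"Mozilla/5.0 ..."})
# placeholder endpoint and form fields -- not the real Zhihu login API
s.post("https://www.zhihu.com/login/email",
       data={"email":"you@example.com","password":"***","_xsrf":"..."})
r=s.get("https://www.zhihu.com/people/some_user/followees")   # cookies are reused automatically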

Good. Our class can then be written like this ( crawler.py ):

#crawler.py -- imports needed by the methods below
import requests
from lxml import html

class Zhihu_Crawler():

    '''
    basic crawler

    '''

    def __init__(self,url,option="print_data_out"):
        '''
        initialize the crawler

        '''

        self.option=option
        self.url=url
        self.header={}
        self.header["User-Agent"]="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0"
#        self.header["Host"]="www.zhihu.com"
        self.header["Referer"]="www.zhihu.com"


        #cookie
        self.cookies={"z_c0":'"QUZDQUp3czV3QWtYQUFBQVlRSlZUZkx3TVZmaFBMWlp2em04Ym1PN01BMldtRktscHRMOVVBPT0=|1460298735|34e9183179a0555057f1cfcc2c8f63660a2f4fc5"',
                "unlock_ticket":'QUZDQUp3czV3QWtYQUFBQVlRSlZUZnBxQ2xmSWNXX3NuVXo3SVJleUM5Uy1BLUpEdXJEcEpBPT0',
                "login":'"ZjliNTRhNzViMDQ2NDMzY2FmZTczNjNjZDA4N2U0NGU=|1460298735|b1048ba322e44c391aa15306198503eab8b28f26"',
                "n_c":"1",
                "q_c1":"a15d5ad71c734d5b9ab4b1eddacea368|1460298703000|1460298703000",
                "l_cap_id":'"YjMzMGNjMTUxMWIzNGZiMWI2OWI2ZGI1ZDM5NTAzZTQ=|1460298703|dd2d5dec11620d64a65ea057bd852f10124a283f"',
                "d_c0":'"AJAAgETzqAmPTgxl_8gbpkFvESCkSwIZMoU=|1458736882"',
                "cap_id":'"MmI2OTJiNWVkZGFmNGNmZDk0NDY2YTBlODI1ZjgyMWQ=|1460298703|b185f46c6887417393049379e47d961708cfdac7"'}

I'd say the code is fairly self-explanatory.

Next, let's implement the fetching:

def send_request(self):
    '''
    send a request to get HTML source

    '''
    added_followee_url=self.url+"/followees"
    try:
        r=requests.get(added_followee_url,cookies=self.cookies,headers=self.header,verify=False)
    except requests.exceptions.RequestException:
        #on failure, push the url back onto the redis task queue
        #(re_crawl_url lives with the queue code shown further down)
        re_crawl_url(self.url)
        return

    content=r.text


    if r.status_code==200:
        self.parse_user_profile(content)

Good, we now have a copy of the HTML source. Next comes parsing; what we mainly want this time is the user's profile information.

For parsing we'll use lxml. It is backed by C libraries (libxml2/libxslt), so extracting data from the HTML source with XPath is very fast.

安装:

ubuntu/debian:

sudo apt-get install libxml2-dev libxslt1-dev

mac:

brew install libxml2

接着:

pip install lxml

I won't say much about XPath here; I've written about it before, and anyone who wants to learn it can look at my earlier scrapy crawler series.
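If you just need a refresher, here is a minimal lxml example showing the calls used below (the HTML snippet is made up):

from lxml import html

tree=html.fromstring("<div><a class='name'>Alice</a></div>")
print tree.xpath("//a[@class='name']/text()")    # ['Alice'] -- xpath() always returns a list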

def process_xpath_source(self,source):
    #xpath() always returns a list; take the first match or fall back to an empty string
    if source:
        return source[0]
    else:
        return ''

def parse_user_profile(self,html_source):
    '''
    parse the user's profile to mongo
    '''

    #initialize variables

    self.user_name=''
    self.user_gender=''
    self.user_location=''
    self.user_followees=''
    self.user_followers=''
    self.user_be_agreed=''
    self.user_be_thanked=''
    self.user_education_school=''
    self.user_education_subject=''
    self.user_employment=''
    self.user_employment_extra=''
    self.user_info=''
    self.user_intro=''

    tree=html.fromstring(html_source)

    #parse the html via lxml
    self.user_name=self.process_xpath_source(tree.xpath("//a[@class='name']/text()"))
    self.user_location=self.process_xpath_source(tree.xpath("//span[@class='location item']/@title"))
    self.user_gender=self.process_xpath_source(tree.xpath("//span[@class='item gender']/i/@class"))
    if "female" in self.user_gender and self.user_gender:
        self.user_gender="female"
    else:
        self.user_gender="male"
    self.user_employment=self.process_xpath_source(tree.xpath("//span[@class='employment item']/@title"))
    self.user_employment_extra=self.process_xpath_source(tree.xpath("//span[@class='position item']/@title"))
    self.user_education_school=self.process_xpath_source(tree.xpath("//span[@class='education item']/@title"))
    self.user_education_subject=self.process_xpath_source(tree.xpath("//span[@class='education-extra item']/@title"))
    try:
        self.user_followees=tree.xpath("//div[@class='zu-main-sidebar']//strong")[0].text
        self.user_followers=tree.xpath("//div[@class='zu-main-sidebar']//strong")[1].text
    except IndexError:
        #sidebar counts missing (empty or blocked profile) -- skip this user
        return

    self.user_be_agreed=self.process_xpath_source(tree.xpath("//span[@class='zm-profile-header-user-agree']/strong/text()"))
    self.user_be_thanked=self.process_xpath_source(tree.xpath("//span[@class='zm-profile-header-user-thanks']/strong/text()"))
    self.user_info=self.process_xpath_source(tree.xpath("//span[@class='bio']/@title"))
    self.user_intro=self.process_xpath_source(tree.xpath("//span[@class='content']/text()"))

That is the parsing routine we need; the fields above are exactly the pieces of profile information we're going to collect.

Next I just want to print the results out to see how the crawl went, so add:

    self.print_data_out()

and then write our printing method:

def print_data_out(self):
    '''
    print out the user data
    '''

    print "*"*60
    print "Username: %s\n" % self.user_name
    print "Gender: %s\n" % self.user_gender
    print "Location: %s\n" % self.user_location
    print "Upvotes received: %s\n" % self.user_be_agreed
    print "Thanks received: %s\n" % self.user_be_thanked
    print "Followers: %s\n" % self.user_followers
    print "Following: %s\n" % self.user_followees
    print "Employment: %s/%s" % (self.user_employment,self.user_employment_extra)
    print "Education: %s/%s" % (self.user_education_school,self.user_education_subject)
    print "Bio: %s" % self.user_info
    print "*"*60

Good. Give it a try and we get something like:

************************************************************
Username: Mingo鸣哥

Gender: female

Location: Hong Kong

Upvotes received: 59960

Thanks received: 14474

Followers: 39055

Following: 806

Employment: Journalist/
Education: 香港中文大学 (Chinese University of Hong Kong)/New Media
************************************************************

That's the profile we crawled for a user in Hong Kong, hehe.

Next, obviously we can't just print everything to the screen; what we want is the data itself, so we need a database (and God said, let there be light).

The database we'll use is mongodb, a NoSQL database, accessed through mongoengine; I've been using ORMs for a long time, so mongoengine feels the most comfortable.

安装mongodb:

ubuntu/debian:

sudo apt-get install mongodb

mac:

brew install mongodb

安装mongoengine:

pip install mongoengine

Good, next comes our program ( db.py ):

It is very simple: just turn the fields we printed above into an ORM model:

# -*- coding: utf-8 -*-
#encoding:utf-8

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import mongoengine

mongoengine.connect('test_my_zhihu_data')

class Zhihu_User_Profile(mongoengine.Document):
    user_name=mongoengine.StringField()
    user_be_agreed=mongoengine.StringField()
    user_be_thanked=mongoengine.StringField()
    user_followees=mongoengine.StringField()
    user_followers=mongoengine.StringField()
    user_education_school=mongoengine.StringField()
    user_education_subject=mongoengine.StringField()
    user_employment=mongoengine.StringField()
    user_employment_extra=mongoengine.StringField()
    user_location=mongoengine.StringField()
    user_gender=mongoengine.StringField()
    user_info=mongoengine.StringField()
    user_intro=mongoengine.StringField()
    user_url=mongoengine.StringField()

Then add the following to our crawler class:

def store_data_to_mongo(self):
    '''
    store the data in mongo
    (this needs `from db import Zhihu_User_Profile` at the top of crawler.py)
    '''
    new_profile=Zhihu_User_Profile(
        user_name=self.user_name,
        user_be_agreed=self.user_be_agreed,
        user_be_thanked=self.user_be_thanked,
        user_followees=self.user_followees,
        user_followers=self.user_followers,
        user_education_school=self.user_education_school,
        user_education_subject=self.user_education_subject,
        user_employment=self.user_employment,
        user_employment_extra=self.user_employment_extra,
        user_location=self.user_location,
        user_gender=self.user_gender,
        user_info=self.user_info,
        user_intro=self.user_intro,
        user_url=self.url
    )
    new_profile.save()
    print "saved:%s \n" % self.user_name

Good, with that we can dispatch between storing and printing.
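Earlier we hard-coded a self.print_data_out() call at the end of parse_user_profile; one way to actually honor the option argument is something like this (the original post does not show this wiring explicitly, so treat it as a sketch):

    #at the end of parse_user_profile, pick the output based on self.option
    if "mongo" in self.option:
        self.store_data_to_mongo()
    else:
        self.print_data_out()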

Now let's look at the queue. This time I chose redis as my task queue. I won't go into redis itself; we mainly use its data structures:

The queue

import redis
import multiprocessing
from multiprocessing.dummy import Pool

red_queue="test_the_url_queue"
red_crawled_set="test_url_has_crawled"

process_pool=Pool(multiprocessing.cpu_count()*2)


#connect to redis server
red=redis.Redis(host='localhost', port=6379, db=1)


def re_crawl_url(url):
    #push a failed url back onto the task queue so it gets retried
    red.lpush(red_queue,url)

def check_url(url):
    #only queue urls we have never seen before (sadd returns 0 for an existing member)
    if red.sadd(red_crawled_set,url):
        red.lpush(red_queue,url)
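If the redis calls are unfamiliar, this is everything the block above relies on (a rough interactive sketch using the red connection defined above):

red.lpush(red_queue,"http://www.zhihu.com/people/foo")
print red.lpop(red_queue)                                         # the url we just pushed (lpush+lpop is last-in-first-out)
print red.sadd(red_crawled_set,"http://www.zhihu.com/people/foo") # 1 -> first time seen, so check_url would queue it
print red.sadd(red_crawled_set,"http://www.zhihu.com/people/foo") # 0 -> already crawled, so it is skipped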

So we also add the following to our crawler class (it goes inside parse_user_profile, since it uses tree):

    #find the follower's url
    url_list=tree.xpath("//h2[@class='zm-list-content-title']/a/@href")
    for target_url in url_list:
        target_url=target_url.replace("https","http")
        check_url(target_url)

This is how we harvest new urls from the followees of every user we crawl.

Good, with that the crawler itself is essentially finished. What remains is choosing the execution strategy that gives the best throughput.

First I ran a test to compare the efficiency of multiprocessing against coroutines (gevent). The script below fetches 500 Baidu Tieba pages:

# -*- coding: utf-8 -*-
#encoding:utf-8

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import gevent.monkey
gevent.monkey.patch_socket()
import time

import gevent
from gevent.pool import Pool as GeventPool
from multiprocessing.dummy import Pool
import requests

def create_mission(url):
    n=h(url)
    n.print_out()

class h():
    def __init__(self,n):
        self.n=n
    def print_out(self):
        try:
            requests.get(self.n)
        except requests.exceptions.ConnectionError as e:
            print "died:%s\n" % self.n
            return

url_list=[]
for i in range(1,500):
    url_list.append("http://tieba.baidu.com/p/2781190586?pn="+str(i))

#thread pool (multiprocessing.dummy) with 2 workers
pool=Pool(2)
start=time.time()
pool.map_async(create_mission,url_list)
pool.close()
pool.join()
print "multiprocessing used ",str(time.time()-start)

#plain gevent: spawn one greenlet per url
start=time.time()
jobs=[]
for url in url_list:
    jobs.append(gevent.spawn(create_mission,url))
gevent.joinall(jobs)
print "use gevent used ",time.time()-start

#gevent pool (the pool size is a guess; the original post does not show how it was created)
gevent_pool=GeventPool(20)
start=time.time()
gevent_pool.map(create_mission,url_list)
print "use gevent_pool used ",str(time.time()-start)

#plain sequential run, for reference
start=time.time()
for url in url_list:
    create_mission(url)
print "use nothing ,it cost:",str(time.time()-start)

The results:

multiprocessing used  225.918653965
use gevent used  14.7605810165
use gevent_pool used  24.5680589676

So gevent's throughput is off the charts, but gevent's own Pool performs poorly here, so we won't use it. What do we do, then?

Below is our engine program ( engine.py ); the redis part is included in it as well:

We wrap the gevent part into a gevent_worker, spawn a batch of greenlets (50 in the code below), and then run several such workers side by side in a pool (the code uses multiprocessing.dummy, i.e. a thread pool):

# -*- coding: utf-8 -*-
#encoding:utf-8

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import gevent.monkey
gevent.monkey.patch_all()

import gevent
import redis
import crawler
import time
from multiprocessing.dummy import Pool
import multiprocessing


red_queue="test_the_url_queue"
red_crawled_set="test_url_has_crawled"

process_pool=Pool(multiprocessing.cpu_count()*2)



#connect to redis server
red=redis.Redis(host='localhost', port=6379, db=1)


def re_crawl_url(url):
    red.lpush(red_queue,url)

def check_url(url):
    if red.sadd(red_crawled_set,url):
        red.lpush(red_queue,url)

#wrap the class method
def create_new_slave(url,option):
    new_slave=crawler.Zhihu_Crawler(url,option)
    new_slave.send_request()
    return "ok"

def gevent_worker(option):
    while True:
        url=red.lpop(red_queue)
        if not url:
            break
        create_new_slave(url,option)

def process_worker(option):
    jobs=[]
    for i in range(50):
        jobs.append(gevent.spawn(gevent_worker,option))
    gevent.joinall(jobs)



if __name__=="__main__":

    '''
    start the crawler

    '''

    start=time.time()
    count=0

    #choose the running way of using database or not

    try:
        option=sys.argv[1]
    except:
        option=''
    if "mongo" not in option:
        option="print_data_out"

    #the start page

    red.lpush(red_queue,"https://www.zhihu.com/people/gaoming623")
    url=red.lpop(red_queue)
    create_new_slave(url,option=option)
    for i in range(50):
        gevent_worker(option=option)

    #hand one option per pool worker (mapping the bare string would iterate over its characters)
    process_pool.map_async(process_worker,[option]*(multiprocessing.cpu_count()*2))
    process_pool.close()
    process_pool.join()


    print "crawler has crawled %d people ,it cost %s" % (count,time.time()-start)

Good, that's our program finished.
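Based on the option handling in the __main__ block of engine.py, it can be run like this, assuming redis (and mongodb, for the second form) are running locally:

python engine.py          # print-only mode
python engine.py mongo    # any first argument containing "mongo" stores profiles into mongodb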

I've open-sourced this program on GitHub; the repository is:

https://github.com/salamer/Zhihu_Crawler

