03-为链接爬虫添加缓存支持

选择背景色：黄橙洋红淡粉水蓝草绿白色选择字体：宋体黑体微软雅黑楷体选择字体大小：小中大特恢复默认

[toc]

3.2　为链接爬虫添加缓存支持

要想支持缓存，我们需要修改第1章中编写的 download 函数，使其在URL下载之前进行缓存检查。另外，我们还需要把限速功能移至函数内部，只有在真正发生下载时才会触发限速，而在加载缓存时不会触发。为了避免每次下载都要传入多个参数，我们借此机会将 download 函数重构为一个类，这样只需在构造方法中设置参数，就能在后续下载时多次复用。下面是支持了缓存功能的代码实现。

from chp1.throttle import Throttle
from random import choice
import requests
class Downloader:
    def __init__(self, delay=5, user_agent='wswp', proxies=None, cache={}):
       self.throttle = Throttle(delay)
       self.user_agent = user_agent
       self.proxies = proxies
       self.num_retries = None # we will set this per request
       self.cache = cache
    def __call__(self, url, num_retries=2):
        self.num_retries = num_retries
        try:
             result = self.cache[url]
             print('Loaded from cache:', url)
        except KeyError:
             result = None
        if result and self.num_retries and 500 <= result['code'] < 600:
            # server error so ignore result from cache
            # and re-download
            result = None
        if result is None:
            # result was not loaded from cache
         # so still need to download
         self.throttle.wait(url)
         proxies = choice(self.proxies) if self.proxies else None
         headers = {'User-Agent': self.user_agent}
         result = self.download(url, headers, proxies)
         if self.cache:
             # save result to cache
             self.cache[url] = result
    return result['html']
def download(self, url, headers, proxies, num_retries):
    ...
    return {'html': html, 'code': resp.status_code }

下载类的完整源码可以在本书源码文件的 `chp3` 文件夹中找到，其名为 `downloader.py` 。

前面代码中的 Download 类有一个比较有意思的部分，那就是 __call__ 特殊方法，在该方法中我们实现了下载前检查缓存的功能。该方法首先会检查 URL 之前是否已经放入缓存中。默认情况下，缓存是一个 Python 字典。如果 URL 已经被缓存，则检查之前的下载中是否遇到了服务器端错误。最后，如果也没有发生过服务器端错误，则表明该缓存结果可用。如果上述检查中的任何一项失败，都需要正常下载该 URL，然后将得到的结果添加到缓存中。

这里的 download 方法和之前的 download 函数基本一样，只是现在返回了HTTP状态码，以便在缓存中存储错误码。此外，这里不再调用自身并检测 num_retries ，而是先减少 self.num_retries ，如果还有重试次数剩余的话，则递归使用 self.download 。当然，如果你只需要一个简单的下载功能，而不需要限速或缓存的话，可以直接调用该方法，这样就不会通过 call 方法调用了。

而对于 cache 类，我们可以通过调用 result = cache[url] 从 cache 中加载数据，并通过 cache[url] = result 向 cache 中保存结果，这是一个来自Python内置字典数据类型的便捷接口。为了支持该接口，我们的 cache 类需要定义 __getitem__() 和 __setitem__() 这两个特殊的类方法。

此外，为了支持缓存功能，链接爬虫的代码也需要进行一些微调，包括添加 cache 参数、移除限速以及将 download 函数替换为新的类等，如下面的代码所示。

def link_crawler(..., num_retries=2, cache={}):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}
    rp = get_robots(seed_url)
    D = Downloader(delay=delay, user_agent=user_agent, proxies=proxies,
cache=cache)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                continue
            html = D(url, num_retries=num_retries)
            if not html:
                continue
            ...

你会发现 num_retries 现在链接到了我们的调用中。这样我们就可以基于每个URL使用请求重试次数了。如果我们简单地使用相同的重试次数，而不重设 self.num_retries 的值，一旦某个页面出现 500 错误，就会用尽重试次数。

你可以在本书源码中的 chp3 文件夹中再次查看完整代码，其名为 advanced_link_crawler.py 。现在，这个网络爬虫的基本架构已经准备好了，下面就要开始构建实际的缓存了。

Previous 上一篇本节目录 Next 下一篇

03-为链接爬虫添加缓存支持

3.2 为链接爬虫添加缓存支持

3.2　为链接爬虫添加缓存支持