08-清理过期数据

选择背景色：黄橙洋红淡粉水蓝草绿白色选择字体：宋体黑体微软雅黑楷体选择字体大小：小中大特恢复默认

[toc]

3.3.4　清理过期数据

当前版本的磁盘缓存使用键值对的形式在磁盘上保存缓存，未来无论何时请求都会返回结果。对于缓存网页而言，该功能可能不太理想，因为网页内容随时都有可能发生变化，存储在缓存中的数据存在过期风险。本节中，我们将为缓存数据添加过期时间，以便爬虫知道何时需要下载网页的最新版本。在缓存网页时支持存储时间戳的功能也很简单。

下面的代码为该功能的实现。

from datetime import datetime, timedelta
class DiskCache:
    def __init__(..., expires=timedelta(days=30)):
        ...
        self.expires = expires
## in __getitem___ for DiskCache class
with open(path, mode) as fp:
    if self.compress:
        data = zlib.decompress(fp.read()).decode(self.encoding)
        data = json.loads(data)
    else:
        data = json.load(fp)
    exp_date = data.get('expires')
    if exp_date and datetime.strptime(exp_date,
                                      '%Y-%m-%dT%H:%M:%S') <=
datetime.utcnow():
        print('Cache expired!', exp_date)
        raise KeyError(url + ' has expired.')
    return data
## in __setitem___ for DiskCache class
result['expires'] = (datetime.utcnow() +
self.expires).isoformat(timespec='seconds')

在构造方法中，我们使用 timedelta 对象将默认过期时间设置为30天。然后，在 __set__ 方法中，把过期时间戳作为键保存到结果字典中；而在 __get__ 方法中，对比当前UTC时间和缓存时间，检查是否过期。为了测试过期时间功能，我们可以将其缩短为5秒，如下所示。

>>> cache = DiskCache(expires=timedelta(seconds=5))
>>> url = 'http://example.python-scraping.com'
>>> result = {'html': '...'}
>>> cache[url] = result
>>> cache[url]
{'html': '...'}
>>> import time; time.sleep(5)
>>> cache[url]
Traceback (most recent call last):
...
KeyError: 'http://example.python-scraping.com has expired'

和预期一样，缓存结果最初是可用的，经过5秒的睡眠之后，再次调用同一URL，则会抛出 KeyError 异常，也就是说缓存下载失效了。

Previous 上一篇本节目录 Next 下一篇

08-清理过期数据

3.3.4 清理过期数据

3.3.4　清理过期数据