07-实现多线程爬虫

选择背景色：黄橙洋红淡粉水蓝草绿白色选择字体：宋体黑体微软雅黑楷体选择字体大小：小中大特恢复默认

[toc]

4.4.1　实现多线程爬虫

幸运的是，在Python中实现多线程编程相对来说比较简单。我们可以保留与第1章开发的链接爬虫类似的队列结构，只是改为在多个线程中启动爬虫循环，从而并行下载这些链接。下面的代码是修改后链接爬虫的起始部分，这里把 crawl 循环移到了函数内部。

import time
import threading
...
SLEEP_TIME = 1
def threaded_crawler(..., max_threads=10, scraper_callback=None):
    ...
    def process_queue():
        while crawl_queue:
            ...

下面是 threaded_crawler 函数的剩余部分，这里在多个线程中启动了 process_queue ，并等待其完成。

threads = []
    while threads or crawl_queue:
        # the crawl is still active
        for thread in threads:
            if not thread.is_alive():
                # remove the stopped threads
                threads.remove(thread)
        while len(threads) < max_threads and crawl_queue:
            # can start some more threads
            thread = threading.Thread(target=process_queue)
            # set daemon so main thread can exit when receives ctrl-c
            thread.setDaemon(True)
            thread.start()
            threads.append(thread)
        # all threads have been processed # sleep temporarily so CPU can
focus execution elsewhere
        for thread in threads:
            thread.join()
        time.sleep(SLEEP_TIME))

当有URL可爬取时，上面代码中的循环会不断创建线程，直到达到线程池的最大值。在爬取过程中，如果当前队列中没有更多可以爬取的URL时，线程会提前停止。假设我们有2个线程以及2个待下载的URL。当第一个线程完成下载时，待爬取队列为空，因此该线程退出。第二个线程稍后也完成了下载，但又发现了另一个待下载的URL。此时 thread 循环注意到还有URL需要下载，并且线程数未达到最大值，因此它又会创建一个新的下载线程。

后续我们可能还需要为该多线程爬虫添加解析。为此，我们可以使用返回的HTML为函数回调添加一段代码。我们可能希望从该逻辑或抽取中获取更多链接，因此我们还需要在后边的 for 循环中扩展我们解析的链接。

html = D(url, num_retries=num_retries)
if not html:
    continue
if scraper_callback:
    links = scraper_callback(url, html) or []
else:
    links = []
# filter for links matching our regular expression
for link in get_links(html) + links:
    ...

完整代码可以在本书源码文件中chp4文件夹中的 threaded_crawler.py 中查看。要想公平测试，还需要清洗你的 RedisCache ，或者使用一个不同的默认数据库。如果你已经安装了 redis-cli ，则使用命令行可以很容易地实现该需求。

$ redis-cli
127.0.0.1:6379> FLUSHALL
OK
127.0.0.1:6379>

如果想要退出，可以使用通用的程序退出方式（通常为Ctrl + C或cmd + C）。现在，让我们测试一下该多线程版本链接爬虫的性能，命令如下所示。

$ python code/chp4/threaded_crawler.py
...
Total time: 361.50403571128845s

当你查看该爬虫的 __main__ 区域时，会注意到你可以很方便地向脚本传递参数，包括 max_threads 和 url_pattern 。在前面的例子中，我们使用了默认的 max_threads=5 以及 url_pattern='$^' 。

由于我们使用了5个线程，因此下载速度几乎是串行版本的5倍。同样，你的结果很可能依赖于网络运营商，或是你的脚本运行的服务器。在4.5节会对多线程性能进行更进一步的分析。

Previous 上一篇本节目录 Next 下一篇

07-实现多线程爬虫

4.4.1 实现多线程爬虫

4.4.1　实现多线程爬虫