Version: 3.6

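Adding more links

In the previous lesson, you built a very simple crawler that downloads the HTML of a single page, reads its title, and prints it to the console. This is the original source code: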

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
    },
});

await crawler.run(['https://crawlee.dev']);

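In this lesson, you will take the example from the previous section and improve on it. You will add more links to the queue so that the crawler keeps going: it finds new links, adds them to the RequestQueue, and repeats the process.

How crawling works

The process is simple:

Find new links on the page.
Keep only the links that point to the same domain, in this case crawlee.dev.
Enqueue (add) them to the RequestQueue.
Visit the newly enqueued links.
Repeat the process.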
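
The steps above can be sketched without any crawling library. The following is a minimal illustration, not how Crawlee is implemented: the `pages` object is a made-up in-memory stand-in for a website, and a plain array and `Set` play the role of the RequestQueue.

```javascript
// Hypothetical in-memory "website": each URL maps to the hrefs found on that page.
const pages = {
    'https://crawlee.dev/': ['/docs', 'https://github.com/apify/crawlee', '/docs'],
    'https://crawlee.dev/docs': ['/', '/api'],
    'https://crawlee.dev/api': [],
};

function crawl(startUrl, maxRequests = 20) {
    const start = new URL(startUrl);
    const queue = [start.href];   // stand-in for the RequestQueue
    const seen = new Set(queue);  // every URL that was ever enqueued
    const visited = [];

    while (queue.length > 0 && visited.length < maxRequests) {
        const url = queue.shift();
        visited.push(url);                            // visit the page
        for (const href of pages[url] ?? []) {        // find new links
            const absolute = new URL(href, url);      // resolve relative links
            if (absolute.hostname !== start.hostname) continue; // same hostname only
            if (seen.has(absolute.href)) continue;    // skip duplicates
            seen.add(absolute.href);
            queue.push(absolute.href);                // enqueue, then repeat
        }
    }
    return visited;
}

console.log(crawl('https://crawlee.dev'));
```

The GitHub link is dropped by the hostname check, and the repeated `/docs` link is enqueued only once.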

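In the following paragraphs, you will learn about the enqueueLinks function, which reduces this repeated crawling process to a single function call. For comparison, the second code tab shows an equivalent solution written without enqueueLinks.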

tip

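The enqueueLinks function is context aware. This means it reads information about the currently crawled page from the context, so you don't need to provide any arguments explicitly. It finds the links using the Cheerio function $ and automatically adds them to the RequestQueue of the running crawler.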

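Limit your crawls with `maxRequestsPerCrawl`

When you are testing your code, or when the website you crawl has millions of potential links, it is very useful to set a maximum number of pages to crawl. The option is called maxRequestsPerCrawl, it is available in all crawlers, and you can set it like this: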

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 20,
    // ...
});

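This means that no new requests will be started after the 20th request has finished. Because requests run concurrently, the actual number of processed requests may be slightly higher, since in most cases requests that are already running cannot be forcefully aborted.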

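Finding new links

There are many ways to find links to follow when crawling the web. For our purposes, we will look for <a> elements that contain the href attribute, because that is what you need in most cases. For example: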

<a href="https://crawlee.dev/docs/introduction"
    >This is a link to Crawlee introduction</a
>

Since this is the most common case, it is also the default for enqueueLinks.

with enqueueLinks
without enqueueLinks

src/main.mjs
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Let's limit our crawls to make our
    // tests shorter and safer.
    maxRequestsPerCrawl: 20,
    // enqueueLinks is an argument of the requestHandler
    async requestHandler({ $, request, enqueueLinks }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
        // The enqueueLinks function is context aware,
        // so it does not require any parameters.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);

src/main.mjs
import { CheerioCrawler } from 'crawlee';
import { URL } from 'node:url';

const crawler = new CheerioCrawler({
    // Let's limit our crawls to make our
    // tests shorter and safer.
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, $ }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);

        // Without enqueueLinks, we first have to extract all
        // the URLs from the page with Cheerio.
        const links = $('a[href]')
            .map((_, el) => $(el).attr('href'))
            .get();

        // Then we need to resolve relative URLs,
        // otherwise they would be unusable for crawling.
        const absoluteUrls = links.map(
            (link) => new URL(link, request.loadedUrl).href,
        );

        // Finally, we have to add the URLs to the queue
        await crawler.addRequests(absoluteUrls);
    },
});

await crawler.run(['https://crawlee.dev']);

If you need to override the default selector that enqueueLinks uses to find elements, you can use the selector argument.

await enqueueLinks({
    selector: 'div.has-link',
});

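Filtering links to the same domain

Websites typically contain many links that take users away from the original page. This is normal, but when crawling a website, we usually want to crawl only that site and don't want our crawler to wander off to Google, Facebook, Twitter, and other places. We therefore need to filter out the off-domain links and keep only the ones that point to the same domain.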

with enqueueLinks
without enqueueLinks

src/main.mjs
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 20,
    async requestHandler({ $, request, enqueueLinks }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
        // The default behavior of enqueueLinks is to stay on the same hostname,
        // so it does not require any parameters.
        // This will ensure the subdomain stays the same.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);

src/main.mjs
import { CheerioCrawler } from 'crawlee';
import { URL } from 'node:url';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, $ }) {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);

        const links = $('a[href]')
            .map((_, el) => $(el).attr('href'))
            .get();

        // Besides resolving the URLs, we now also need to
        // grab their hostname for filtering.
        const { hostname } = new URL(request.loadedUrl);
        const absoluteUrls = links.map(
            (link) => new URL(link, request.loadedUrl),
        );

        // We use the hostname to filter links that point
        // to a different domain, even subdomain.
        const sameHostnameLinks = absoluteUrls
            .filter((url) => url.hostname === hostname)
            .map((url) => ({ url: url.href }));

        // Finally, we have to add the URLs to the queue
        await crawler.addRequests(sameHostnameLinks);
    },
});

await crawler.run(['https://crawlee.dev']);

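The default behavior of enqueueLinks is to stay on the same hostname, which does not include subdomains. To include subdomains in your crawl, use the strategy argument.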

await enqueueLinks({
    strategy: 'same-domain',
});

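When you run the code, you will see the crawler log the title of the first page, then the enqueueing messages showing the links, followed by the title of the first enqueued page, and so on.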
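
To make the difference between staying on the same hostname and staying on the same domain more concrete, here is a rough sketch in plain JavaScript. It is only an illustration: the helper names are hypothetical, and `naiveDomain` assumes that the registrable domain is simply the last two labels of the hostname. That assumption fails for suffixes such as `co.uk`, where a public suffix list is needed, so Crawlee's own matching may behave differently.

```javascript
// Same hostname: 'crawlee.dev' and 'blog.crawlee.dev' count as different sites.
function sameHostname(a, b) {
    return new URL(a).hostname === new URL(b).hostname;
}

// ASSUMPTION: the domain is the last two hostname labels (wrong for e.g. 'co.uk').
function naiveDomain(url) {
    return new URL(url).hostname.split('.').slice(-2).join('.');
}

// Same domain: subdomains of the same domain are accepted as well.
function sameDomain(a, b) {
    return naiveDomain(a) === naiveDomain(b);
}

const base = 'https://crawlee.dev/docs';
const subdomainUrl = 'https://blog.crawlee.dev/'; // hypothetical subdomain URL

console.log(sameHostname(base, subdomainUrl));
console.log(sameDomain(base, subdomainUrl));
```

With these helpers, the subdomain URL fails the hostname comparison but passes the domain comparison.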

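Skipping duplicate URLs

Skipping duplicate URLs is critical, because visiting the same page multiple times would lead to duplicate results. The RequestQueue handles this automatically by deduplicating requests using their uniqueKey. The uniqueKey is generated automatically from the request's URL by lowercasing it, lexically ordering its query parameters, removing fragments, and making a few other tweaks, so that the queue contains only unique URLs.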
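
The normalization described above can be approximated with the standard URL API. This is a simplified sketch of the idea, not Crawlee's actual uniqueKey algorithm; the function name `approximateUniqueKey` is made up, and the "other tweaks" are not reproduced.

```javascript
// Approximates the described normalization: drop the fragment,
// sort the query parameters, and lowercase the result.
function approximateUniqueKey(rawUrl) {
    const url = new URL(rawUrl.trim());
    url.hash = '';            // remove the fragment
    url.searchParams.sort();  // order query parameters by name
    return url.href.toLowerCase();
}

const first = approximateUniqueKey('https://crawlee.dev/Docs?b=2&a=1#intro');
const second = approximateUniqueKey('https://crawlee.dev/docs?a=1&b=2');

// Both variants collapse to the same key, so only one would be kept.
console.log(first === second);
```

A queue that stores these keys in a `Set` would treat both URLs as one request.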

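Advanced filtering arguments

While the defaults for enqueueLinks are usually exactly what you need, it also gives you fine-grained control over which URLs are enqueued. One way, which we already mentioned, is the EnqueueStrategy. You can use the All strategy if you want to follow every single link regardless of its domain, or you can use the SameDomain strategy to enqueue links that target the same domain name.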

await enqueueLinks({
    strategy: 'all', // wander the internet
});

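Filter URLs with patterns

For even more control, you can use globs, regexps, and pseudoUrls to filter the URLs. Each of these arguments is always an Array, but the contents can take various forms. See the reference documentation for more information about them and other options.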

caution

If you provide one of these options, the default same-hostname strategy will not be applied unless it is explicitly set in the options.

await enqueueLinks({
    globs: ['http?(s)://apify.com/*/*'],
});
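
To see what such a pattern keeps and drops, the sketch below applies a regular expression to a list of URLs in plain JavaScript. The regular expression is an assumed rough counterpart of the glob shown above, so glob matching may differ in edge cases. The candidate URLs are made up for illustration, and `filterByPattern` is a hypothetical helper, not a Crawlee function.

```javascript
// Assumed rough counterpart of the glob 'http?(s)://apify.com/*/*':
// http or https, host apify.com, exactly two non-empty path segments.
const pattern = /^https?:\/\/apify\.com\/[^/]+\/[^/]+$/;

function filterByPattern(urls, regexp) {
    return urls.filter((url) => regexp.test(url));
}

// Made-up candidate URLs.
const candidates = [
    'https://apify.com/store/web-scraper',
    'http://apify.com/about/team',
    'https://apify.com/pricing',
    'https://crawlee.dev/docs/introduction',
];

const kept = filterByPattern(candidates, pattern);
console.log(kept);
```

Only the first two candidates pass: the third has a single path segment and the fourth is on a different host. A RegExp like this one can be passed to enqueueLinks through the regexps option, for example `regexps: [pattern]`.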

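Transform requests

To have absolute control, there is transformRequestFunction. Just before a new Request is constructed and enqueued to the RequestQueue, this function can be used to skip it or to modify its contents, such as userData, payload, or, most importantly, uniqueKey. This is useful when you need to enqueue multiple requests that share the same URL but differ in method or payload. Another use case is dynamically updating or creating userData.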

await enqueueLinks({
    globs: ['http?(s)://apify.com/*/*'],
    transformRequestFunction(req) {
        // Ignore all links ending with `.pdf`.
        if (req.url.endsWith('.pdf')) return false;
        return req;
    },
});
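
The other use cases mentioned above, setting uniqueKey and userData, could look roughly like the following sketch. It is written as a standalone function so it can be tried on plain objects. The key format `METHOD:url` and the `DETAIL` label are illustrative assumptions, not Crawlee conventions.

```javascript
// A transform function that skips PDFs, makes the key depend on the HTTP
// method, and attaches a label to userData.
function transformRequest(req) {
    if (req.url.endsWith('.pdf')) return false; // skip this request

    // ASSUMPTION: requests without an explicit method are treated as GET.
    const method = req.method ?? 'GET';
    // Illustrative key format: same URL, different method => different key.
    req.uniqueKey = `${method}:${req.url}`;
    // Keep any existing userData and add a hypothetical label.
    req.userData = { ...req.userData, label: 'DETAIL' };
    return req;
}

console.log(transformRequest({ url: 'https://apify.com/store', method: 'POST' }));
console.log(transformRequest({ url: 'https://apify.com/file.pdf' }));
```

Such a function would be passed as `await enqueueLinks({ transformRequestFunction: transformRequest })`. Note that a hand-written uniqueKey like this one bypasses the automatic URL normalization, so any normalization you still want has to be done in the function itself.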

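And that's all there is to enqueueLinks()! It is just one example of Crawlee's powerful helper functions. They are all designed to make your life easier, so you can focus on getting your data while leaving the mundane crawling management to the tools.

In the next lesson, you will start a project that scrapes a real website and learn some more Crawlee tricks along the way.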
