Version: 3.6

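Crawling the example store

To crawl the whole example store and find all the data, you first need to visit every page that lists products: go through all the available categories and then all the product detail pages.

Crawling the listing pages

In previous lessons, you used the enqueueLinks() function like this: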

await enqueueLinks();

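That was useful in the earlier scenario, but now you need something different. Instead of finding every <a href=".."> element whose link points to the same hostname, you need to find only the specific elements that take the crawler to the next page of results. Otherwise, the crawler would visit many other pages you are not interested in. With DevTools and another enqueueLinks() parameter, this becomes fairly easy.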

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        // Wait for the category cards to render,
        // otherwise enqueueLinks wouldn't enqueue anything.
        await page.waitForSelector('.collection-block-item');

        // Add links to the queue, but only from
        // elements matching the provided selector.
        await enqueueLinks({
            selector: '.collection-block-item',
            label: 'CATEGORY',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

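This code should look familiar to you. It is a very simple requestHandler that prints the URL currently being processed to the console and adds more links to the crawling queue. But it also contains a few new and interesting additions. Let's break them down.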

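The `selector` parameter of `enqueueLinks()`

When you used enqueueLinks() before, you did not provide a selector parameter. That was fine, because you wanted the default value, a, which finds all <a> elements. Now you need to be more specific. The categories page contains many <a> links, but you are only interested in the ones that lead to the category listings. Using DevTools, you found that the .collection-block-item selector picks exactly those links: it matches every element whose class attribute contains collection-block-item.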
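To make the difference between the default selector and a specific one concrete, here is a minimal, dependency-free sketch. It is a simplified model and not Crawlee's implementation: the elements array, the selectLinks() and toSameHostnameUrls() helpers, and the /collections/category-a style paths are all hypothetical stand-ins for the DOM and for what enqueueLinks() does internally.

```javascript
// Simplified model of link selection; NOT Crawlee's actual implementation.
// Each object stands in for a DOM element with a class list and an href.
// The paths below are made up for illustration.
const pageUrl = 'https://warehouse-theme-metal.myshopify.com/collections';

const elements = [
    { tag: 'a', classes: ['collection-block-item'], href: '/collections/category-a' },
    { tag: 'a', classes: ['collection-block-item'], href: '/collections/category-b' },
    { tag: 'a', classes: ['footer-link'], href: '/pages/about' },
    { tag: 'a', classes: [], href: 'https://example.com/external' },
];

// With a class name, behave like a '.class' selector;
// without one, fall back to every <a> element (the default 'a' selector).
function selectLinks(allElements, className) {
    return allElements.filter((el) => (
        className ? el.classes.includes(className) : el.tag === 'a'
    ));
}

// Resolve each href against the page URL and keep only same-hostname targets.
function toSameHostnameUrls(selected, baseUrl) {
    const { hostname } = new URL(baseUrl);
    return selected
        .map((el) => new URL(el.href, baseUrl))
        .filter((url) => url.hostname === hostname)
        .map((url) => url.href);
}

const allLinks = toSameHostnameUrls(selectLinks(elements), pageUrl);
const categoryLinks = toSameHostnameUrls(
    selectLinks(elements, 'collection-block-item'),
    pageUrl,
);

console.log(allLinks.length); // 3 (the external link is dropped)
console.log(categoryLinks.length); // 2 (only the category cards remain)
```

In this model the default selector would still send the crawler to the hypothetical /pages/about page, while the class selector narrows the result to the two category cards.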

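The `label` parameter of `enqueueLinks()`

You will see label used often throughout Crawlee, because it is a convenient way to tag a Request instance so that it can be quickly identified later. You can access it as request.label, and it is a string. You can name your requests any way you like. Here, the label CATEGORY marks the enqueued links as pages that represent a product category. The enqueueLinks() function adds this label to every request before it enqueues the request to the RequestQueue. Why this is so useful becomes clear in the detail-page handler below.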
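The idea of tagging requests and reading the tag back later can be sketched without Crawlee at all. The following is only an illustration under stated assumptions: labelRequests() and describeRequest() are hypothetical helpers, the example.com URLs are placeholders, and the plain objects merely imitate the url and label fields of a Request.

```javascript
// Simplified stand-in for label tagging; the helper names are hypothetical.

// Attach the same label to every URL, mirroring how enqueueLinks({ label })
// is described as labelling all the requests it enqueues.
function labelRequests(urls, label) {
    return urls.map((url) => ({ url, label }));
}

// Read the label back later to decide what kind of page a request points to.
function describeRequest(request) {
    switch (request.label) {
        case 'CATEGORY':
            return `category listing: ${request.url}`;
        case 'DETAIL':
            return `product detail: ${request.url}`;
        default:
            // No label at all: treat it as the start page.
            return `unlabeled start page: ${request.url}`;
    }
}

const requests = labelRequests(
    [
        'https://example.com/collections/category-a',
        'https://example.com/collections/category-b',
    ],
    'CATEGORY',
);

const descriptions = requests.map(describeRequest);
console.log(descriptions);
```

Because the label travels with the request, the code that processes a page does not need to inspect the URL to figure out what kind of page it is handling.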

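Crawling the detail pages

In a similar fashion, you need to collect the URLs of all the product detail pages, because only from there can you scrape all the data you need. The following code only repeats concepts you already know, applied to another set of links: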

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            // We're not doing anything with the details yet.
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

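The crawling code is now complete. When you run it, you will see the crawler visit all the category URLs and all the product detail URLs.

This concludes the crawling lesson, because you have taught the crawler to visit every page it needs. Next, let's continue with scraping the data.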
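To see how the three branches of the handler cooperate, the crawl order can be traced on a tiny in-memory site. This is a rough simulation rather than Crawlee code: the site object, its URLs, and the crawl() function are hypothetical, and the Set is a simplified stand-in for the request queue skipping URLs it has already seen.

```javascript
// Hypothetical in-memory site: each URL maps to the links that the
// corresponding selector would match on that page.
const site = {
    '/collections': { categories: ['/collections/a', '/collections/b'] },
    '/collections/a': {
        products: ['/products/1', '/products/2'],
        next: '/collections/a?page=2',
    },
    '/collections/a?page=2': { products: ['/products/3'] },
    '/collections/b': { products: ['/products/2', '/products/4'] },
};

function crawl(startUrl) {
    const queue = [{ url: startUrl }]; // the start request has no label
    const seen = new Set([startUrl]);
    const visited = [];

    // Enqueue URLs with a label, skipping anything already seen.
    const enqueue = (urls, label) => {
        for (const url of urls) {
            if (seen.has(url)) continue;
            seen.add(url);
            queue.push({ url, label });
        }
    };

    while (queue.length > 0) {
        const request = queue.shift();
        const page = site[request.url] ?? {};
        visited.push(`${request.label ?? 'START'} ${request.url}`);

        if (request.label === 'DETAIL') {
            // Nothing to do with the details yet.
        } else if (request.label === 'CATEGORY') {
            // Enqueue the products, then the next results page if there is one.
            enqueue(page.products ?? [], 'DETAIL');
            if (page.next) enqueue([page.next], 'CATEGORY');
        } else {
            // Start page: enqueue all the category pages.
            enqueue(page.categories ?? [], 'CATEGORY');
        }
    }
    return visited;
}

const visited = crawl('/collections');
console.log(visited.join('\n'));
```

In this model, /products/2 appears in two categories but is visited only once, and the second results page of category a is reached through the same CATEGORY branch as the first.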
