Version: 3.6

Refactoring

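It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and the other things reliable software should have. The good news is that error handling is mostly done by Crawlee itself, so there's no need to worry on that front unless you need some custom magic.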

info

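If you've got to this point and are wondering where all the anti-blocking, bot-protection-avoiding stealth features are, you're right, we haven't shown you any. But that's the point! They are used automatically with the default configuration.

That does not mean the default configuration can handle everything, but it should get you far enough. If you want to learn more, browse the Avoid getting blocked, Proxy management and Session management guides.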

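Anyhow, to promote good coding practices, let's look at how you can use a Router to better structure your crawler code.

Routing

In the following code, we've made several changes:

Split the code into multiple files.
Replaced console.log with the Crawlee logger for nicer, colourful logs.
Added a Router to make our routing cleaner, without if clauses.

In our main.mjs file, we place the general structure of the crawler: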

src/main.mjs
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

Then in a separate routes.mjs file:

src/routes.mjs
import { createPlaywrightRouter, Dataset } from 'crawlee';

// createPlaywrightRouter() is only a helper to get better
// intellisense and typings. You can use Router.create() too.
export const router = createPlaywrightRouter();

// This replaces the request.label === DETAIL branch of the if clause.
router.addHandler('DETAIL', async ({ request, page, log }) => {
    log.debug(`Extracting data: ${request.url}`);
    const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
    const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

    const title = await page.locator('.product-meta h1').textContent();
    const sku = await page
        .locator('span.product-meta__sku-number')
        .textContent();

    const priceElement = page
        .locator('span.price')
        .filter({
            hasText: '$',
        })
        .first();

    const currentPriceString = await priceElement.textContent();
    const rawPrice = currentPriceString.split('$')[1];
    const price = Number(rawPrice.replaceAll(',', ''));

    const inStockElement = page
        .locator('span.product-form__inventory')
        .filter({
            hasText: 'In stock',
        })
        .first();

    const inStock = (await inStockElement.count()) > 0;

    const results = {
        url: request.url,
        manufacturer,
        title,
        sku,
        currentPrice: price,
        availableInStock: inStock,
    };

    log.debug(`Saving data: ${request.url}`);
    await Dataset.pushData(results);
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks, request, log }) => {
    log.debug(`Enqueueing pagination for: ${request.url}`);
    // We are now on a category page. We can use this to paginate through and enqueue all products,
    // as well as any subsequent pages we find

    await page.waitForSelector('.product-item > a');
    await enqueueLinks({
        selector: '.product-item > a',
        label: 'DETAIL', // <= note the different label
    });

    // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
    const nextButton = await page.$('a.pagination__next');
    if (nextButton) {
        await enqueueLinks({
            selector: 'a.pagination__next',
            label: 'CATEGORY', // <= note the same label
        });
    }
});

// This is a fallback route which will handle the start URL,
// because that request has no label.
router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => {
    log.debug(`Enqueueing categories from page: ${request.url}`);
    // This means we're on the start page, with no label.
    // On this page, we just want to enqueue all the category pages.

    await page.waitForSelector('.collection-block-item');
    await enqueueLinks({
        selector: '.collection-block-item',
        label: 'CATEGORY',
    });
});

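Let's describe the changes in a bit more detail. We hope that in the end you'll agree that this structure makes the crawler more readable and manageable.

Splitting your code into multiple files

There's no reason not to split your code into multiple files and keep your logic separate. Less code in a single file means less code you need to think about at any one time, and that's a good thing. We would most likely go even further and split the routes themselves into separate files.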

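Using Crawlee `log` instead of `console.log`

We won't go to great lengths here to talk about the log object from Crawlee, because you can read all about it in the documentation, but there's one thing we need to stress: log levels.

Crawlee log has multiple log levels, such as log.debug, log.info or log.warning. This not only makes your log more readable, it also lets you selectively turn off some levels, either by calling the log.setLevel() function or by setting the CRAWLEE_LOG_LEVEL environment variable. Thanks to this, you can add plenty of debug logs to your crawler: they don't pollute your log when they're not needed, but they're ready to help when you run into issues.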
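
To make the idea of log levels more concrete, here is a minimal sketch of level-based filtering in plain JavaScript. It is not Crawlee's implementation: createLogger, the LEVELS table and its numeric values, and the lines array are hypothetical names made up for this illustration. In a real crawler you would keep using the log object imported from crawlee, as in main.mjs.

```javascript
// Hypothetical level table (an assumption for this sketch, not Crawlee's
// actual values): a higher number means a more verbose level.
const LEVELS = { ERROR: 1, WARNING: 2, INFO: 3, DEBUG: 4 };

function createLogger(initialLevel = LEVELS.INFO) {
    let currentLevel = initialLevel;
    const lines = []; // everything that was actually written, kept for inspection

    const write = (level, name, message) => {
        // Drop messages that are more verbose than the current level.
        if (level > currentLevel) return;
        const line = `${name}: ${message}`;
        lines.push(line);
        console.log(line);
    };

    return {
        lines,
        setLevel: (level) => { currentLevel = level; },
        debug: (message) => write(LEVELS.DEBUG, 'DEBUG', message),
        info: (message) => write(LEVELS.INFO, 'INFO', message),
        warning: (message) => write(LEVELS.WARNING, 'WARN', message),
    };
}

const logger = createLogger();
logger.debug('Extracting data'); // dropped, because the default level is INFO
logger.info('Crawler started'); // written
logger.setLevel(LEVELS.DEBUG);
logger.debug('Saving data'); // written now that DEBUG is enabled
```

The debug calls stay in the code the whole time. Only the configured level decides whether they show up, which is why leaving them in place costs nothing during normal runs.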

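Using a router to structure your crawling

At first, a simple if / else statement that selects different logic based on the crawled page might seem more readable. But trust us, it gets hard to follow once you handle more than two kinds of pages, and it definitely starts to fall apart when the logic for each page spans tens or hundreds of lines of code.

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand-line requestHandler() where everything interacts with everything and variables can be used anywhere is not pretty, and it's a pain to debug. That's why we prefer to separate routes into their own files.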
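
As a rough illustration of why label-based routing scales better than a chain of if clauses, the sketch below dispatches each request to a handler by its label. It is a simplified stand-in written in plain JavaScript, not Crawlee's Router: createMiniRouter, demo and the example.com URLs are hypothetical, and the handlers only record which branch ran.

```javascript
// Hypothetical mini-router: maps a request label to a handler function.
function createMiniRouter() {
    const handlers = new Map();
    let defaultHandler = null;

    // The router is itself a function, so it can be passed
    // wherever a single request handler is expected.
    const router = async (context) => {
        const handler = handlers.get(context.request.label) ?? defaultHandler;
        if (!handler) {
            throw new Error(`No handler for label: ${context.request.label}`);
        }
        return handler(context);
    };
    router.addHandler = (label, handler) => { handlers.set(label, handler); };
    router.addDefaultHandler = (handler) => { defaultHandler = handler; };
    return router;
}

async function demo() {
    const visited = [];
    const router = createMiniRouter();

    // Each page type gets its own small handler instead of an if branch.
    router.addHandler('DETAIL', async ({ request }) => {
        visited.push(`detail:${request.url}`);
    });
    router.addHandler('CATEGORY', async ({ request }) => {
        visited.push(`category:${request.url}`);
    });
    // Requests without a matching label end up here.
    router.addDefaultHandler(async ({ request }) => {
        visited.push(`start:${request.url}`);
    });

    await router({ request: { url: 'https://example.com/collections' } });
    await router({ request: { url: 'https://example.com/collections/tv', label: 'CATEGORY' } });
    await router({ request: { url: 'https://example.com/products/x', label: 'DETAIL' } });
    return visited;
}

demo().then((visited) => console.log(visited));
```

Adding a new page type means registering one more handler. None of the existing handlers need to change, and each one only sees the variables it declares itself.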

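Next steps

In the next and final lesson, we will show you how to deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a Dockerfile ready, and the next section will show you how to deploy it to the Apify platform with ease.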
