Version: 3.6

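Saving data

A data extraction job is not complete until the data is saved for later use and processing. You have reached the final and most difficult part of this tutorial, so pay very close attention!

First, add a new import to the top of the file: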

import { PlaywrightCrawler, Dataset } from 'crawlee';

Then, replace the console.log(results) call with:

await Dataset.pushData(results);

And that's it. Unlike earlier, we are being serious this time. That really is all, and you're done. The final code looks like this:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = await page.locator('.product-meta h1').textContent();
            const sku = await page.locator('span.product-meta__sku-number').textContent();

            const priceElement = page
                .locator('span.price')
                .filter({
                    hasText: '$',
                })
                .first();

            const currentPriceString = await priceElement.textContent();
            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElement = page
                .locator('span.product-form__inventory')
                .filter({
                    hasText: 'In stock',
                })
                .first();

            const inStock = (await inStockElement.count()) > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await Dataset.pushData(results);
        } else if (request.label === 'CATEGORY') {
            // We are now on a category page. We can use this to paginate through and enqueue all products,
            // as well as any subsequent pages we find

            await page.waitForSelector('.product-item > a');
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            // Now we need to find the "Next" button and enqueue the next page of results (if it exists)
            const nextButton = await page.$('a.pagination__next');
            if (nextButton) {
                await enqueueLinks({
                    selector: 'a.pagination__next',
                    label: 'CATEGORY', // <= note the same label
                });
            }
        } else {
            // This means we're on the start page, with no label.
            // On this page, we just want to enqueue all the category pages.

            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);

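What is `Dataset.pushData()`?

Dataset.pushData() is a function that saves data to the default Dataset. A Dataset is a storage designed to hold data in a table-like format. Each call to Dataset.pushData() creates a new row in the table, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your disk, but other storage systems can be plugged into Crawlee as well.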
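To make the table analogy concrete, here is a minimal sketch in plain JavaScript that does not use Crawlee at all. It shows how a list of pushed objects could be viewed as column titles plus rows. The toTable helper and the sample records are hypothetical and exist only for illustration; this is not how Crawlee stores data internally.

```javascript
// Hypothetical helper (not part of Crawlee): views an array of records as a table.
// Column titles are the property names, in the order they are first seen.
function toTable(records) {
    const columns = [];
    for (const record of records) {
        for (const key of Object.keys(record)) {
            if (!columns.includes(key)) columns.push(key);
        }
    }

    // Each record becomes one row; properties a record lacks are filled with null.
    const rows = records.map((record) => columns.map((column) => record[column] ?? null));

    return { columns, rows };
}

// Sample records shaped like the `results` object from the crawler above (made-up values).
const sampleRecords = [
    { url: 'https://example.com/products/a', title: 'Product A', currentPrice: 10 },
    { url: 'https://example.com/products/b', title: 'Product B', availableInStock: true },
];

const table = toTable(sampleRecords);
console.log(table.columns);
console.log(table.rows);
```

Records do not have to share the same set of properties. In this sketch, a property missing from a record simply shows up as an empty (null) cell in that row.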

info

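Each time you start Crawlee, a default Dataset is created automatically, so there is no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details, see the Result storage guide and the Dataset.open() function.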

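Finding saved data

Unless you changed the configuration that Crawlee uses locally (which would suggest that you know what you are doing and don't need this tutorial anyway), you will find your data in the storage directory that Crawlee creates in the working directory of the running script:

{PROJECT_FOLDER}/storage/datasets/default/

This folder holds all your saved data in numbered files, in the order they were pushed into the dataset. Each file represents one invocation of Dataset.pushData(), or one table row.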
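As a quick way to inspect the results, the sketch below reads the numbered JSON files back into a single array using only Node.js built-in modules. It assumes the default local layout described above and file names made of digits followed by .json; the readDatasetDir helper is hypothetical and is not part of Crawlee's API.

```javascript
import fs from 'node:fs';
import path from 'node:path';

// Assumption: the default local storage location, relative to the working directory.
const DEFAULT_DATASET_DIR = path.join('storage', 'datasets', 'default');

// Hypothetical helper (not part of Crawlee): reads every numbered JSON file
// in a dataset directory and returns the parsed records in file-name order.
function readDatasetDir(dir = DEFAULT_DATASET_DIR) {
    return fs
        .readdirSync(dir)
        // Keep only numbered record files and skip anything else in the folder.
        .filter((name) => /^\d+\.json$/.test(name))
        // Lexicographic sort matches push order as long as the names are zero-padded.
        .sort()
        .map((name) => JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8')));
}

// Usage: report how many records were saved, if the crawler has already run.
if (fs.existsSync(DEFAULT_DATASET_DIR)) {
    const records = readDatasetDir();
    console.log(`Loaded ${records.length} records`);
}
```

Reading the files directly is handy for a quick check. For anything more involved, the storage APIs covered in the Result storage guide are likely the better fit.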

tip

If you would like to store your data in a single big file instead of many small ones, see the Result storage guide for key-value stores.

In the next lesson, we will show you some improvements you can add to your crawler code to make it more readable and easier to maintain.
