Version: Next

Quick Start

With this short tutorial you can start scraping with Crawlee in a minute or two. To learn in depth how Crawlee works, read the Introduction, a comprehensive step-by-step guide to creating your first crawler.

Choose your crawler

Crawlee comes with three main crawler classes: CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler. All of them share the same interface, so you can switch between them flexibly.

CheerioCrawler

This is a plain HTTP crawler. It parses HTML using the Cheerio library and crawls the web using the specialized got-scraping HTTP client, which masks itself as a browser. It's very fast and efficient, but it can't handle JavaScript rendering.

PuppeteerCrawler

This crawler uses a headless browser controlled by the Puppeteer library. It can render JavaScript, so it handles pages that a plain HTTP crawler can't, at the cost of speed. Puppeteer is the de-facto standard for automating headless Chrome and Chromium-based browsers.

PlaywrightCrawler

Playwright is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit and other browsers. If you're not already familiar with Puppeteer and you need a headless browser, go with Playwright.

Before you start

Crawlee requires Node.js 16 or higher.

Installation with Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler

After the installation is complete you can start the crawler like this:

cd my-crawler && npm start

Manual installation

You can add Crawlee to any Node.js project by running:

npm install crawlee

Crawling

Run the following example to perform a recursive crawl of the Crawlee website using the selected crawler.

Don't forget about module imports

To run the example, add a "type": "module" clause to your package.json, or copy the code into a file with an .mjs suffix. This enables import statements in Node.js. See the Node.js docs for more information.
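A minimal package.json with the clause in place might look like this (the project name and dependency version shown here are placeholders, not prescribed values):

```json
{
    "name": "my-crawler",
    "type": "module",
    "dependencies": {
        "crawlee": "^3.0.0"
    }
}
```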

import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler crawls the web using HTTP requests
// and parses HTML using the Cheerio library.
const crawler = new CheerioCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

When you run the example, you will see Crawlee automating the data extraction process in your terminal.

INFO  CheerioCrawler: Starting the crawl
INFO CheerioCrawler: Title of https://crawlee.dev/ is 'Crawlee · Build reliable crawlers. Fast. | Crawlee'
INFO CheerioCrawler: Title of https://crawlee.dev/docs/examples is 'Examples | Crawlee'
INFO CheerioCrawler: Title of https://crawlee.dev/docs/quick-start is 'Quick Start | Crawlee'
INFO CheerioCrawler: Title of https://crawlee.dev/docs/guides is 'Guides | Crawlee'

Running headful browsers

Browsers controlled by Puppeteer and Playwright run headless (without a visible window). You can switch to headful by adding the headless: false option to the crawler's constructor. This is useful during development, when you want to see what's going on in the browser.

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
    // When you turn off headless mode, the crawler
    // will run with a visible browser window.
    headless: false,

    // Let's limit our crawls to make our tests shorter and safer.
    maxRequestsPerCrawl: 50,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

When you run the example code, you'll see an automated browser blaze through the Crawlee website.

note

For this demonstration we've slowed the crawler down, but rest assured, it's blazing fast in real-world usage.

An image showing Crawlee scraping the Crawlee website using Puppeteer/Playwright and Chromium

Results

Crawlee stores data to the ./storage directory in your current working directory. The results of your crawl will be available under ./storage/datasets/default/*.json as JSON files.

./storage/datasets/default/000000001.json
{
    "url": "https://crawlee.dev/",
    "title": "Crawlee · The scalable web crawling, scraping and automation library for JavaScript/Node.js | Crawlee"
}
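Because the dataset is just JSON files on disk, you can post-process the results with plain Node.js; no Crawlee import is needed to read them back. This is only a sketch and assumes the crawl above has already populated the default dataset at its default location:

```javascript
// Read back the items Crawlee stored in the default dataset.
// If the directory doesn't exist yet, we end up with an empty list.
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';

const dir = './storage/datasets/default';
let items = [];
try {
    const files = (await readdir(dir)).filter((f) => f.endsWith('.json'));
    items = await Promise.all(
        files.map(async (f) => JSON.parse(await readFile(join(dir, f), 'utf8'))),
    );
} catch {
    // No dataset yet -- run the crawler first.
}

for (const { title, url } of items) {
    console.log(`${title} -> ${url}`);
}
```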
tip

You can override the storage directory by setting the CRAWLEE_STORAGE_DIR environment variable.
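For example, to keep a run's output separate from the default ./storage, you could export the variable before starting the crawler (the ./run-output path here is just an illustration):

```shell
# Point Crawlee at a custom storage directory for this run.
export CRAWLEE_STORAGE_DIR=./run-output
# ...then start the crawler as usual, e.g. with `npm start`.
```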

Examples and further reading

You can find more examples showcasing various features of Crawlee in the Examples section of the documentation. To better understand Crawlee and its components you should read the Introduction step-by-step guide.

Related links