Official | Verified | developer tools | Safety 3/5

crawlee-web-scraper

Resilient web scraper with bot-detection evasion using the Crawlee library. Use when web_fetch is blocked by rate limits or bot detection. Supports single URLs, bulk file input, and automatic fallback from requests to Crawlee on 403/429 responses.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/bryantegomoh/crawlee-web-scraper
What This Skill Does

crawlee-web-scraper is a resilient scraping utility for OpenClaw agents that targets sites protected by anti-bot defenses. Standard HTTP fetchers are often blocked by Cloudflare, reCAPTCHA, or simple rate limits; this skill uses the Crawlee library to mimic legitimate browser behavior. It provides two interfaces: a command-line scraper for high-volume tasks, and a drop-in library helper that automatically upgrades standard requests to Crawlee sessions when it detects a 403, 429, or 503 response. It supports full HTML capture, automated text extraction, and batch processing of URLs listed in text files, and it returns structured JSON output ready for further agent processing.
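The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration of the pattern, not the skill's actual code: `fetch_with_fallback`, `FetchResult`, and the two fetcher callables are hypothetical names, and each fetcher is assumed to return a `(status, html)` pair.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Status codes the description above lists as fallback triggers.
FALLBACK_STATUSES = {403, 429, 503}

# Assumed fetcher shape: takes a URL, returns (status_code, html).
Fetcher = Callable[[str], Tuple[int, str]]


@dataclass
class FetchResult:
    url: str
    status: int
    html: str
    via: str  # "primary" or "fallback", i.e. which fetcher produced the result


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher) -> FetchResult:
    """Try the lightweight fetcher first; escalate only on a blocking status."""
    status, html = primary(url)
    if status not in FALLBACK_STATUSES:
        return FetchResult(url, status, html, via="primary")
    # The cheap request was rejected, so hand the URL to the heavier fetcher.
    status, html = fallback(url)
    return FetchResult(url, status, html, via="fallback")
```

In practice, `primary` could wrap `requests.get` and return `(response.status_code, response.text)`, while `fallback` would wrap a Crawlee crawler. Passing both in as callables keeps the expensive path out of the way until a request is actually rejected.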

Installation

To integrate this skill into your environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/bryantegomoh/crawlee-web-scraper

Ensure you have the required Python dependencies installed globally or in your agent's virtual environment:

pip install crawlee requests

Use Cases

  • Bypassing Bot Protection: Use this when a target website relies on Cloudflare, DataDome, or a similar provider that blocks standard requests.
  • Bulk Data Collection: Efficiently scrape lists of URLs from a file without worrying about aggressive rate-limiting causing your agent to stall.
  • Resilient Pipelines: Integrate into existing workflows that start with a lightweight request and automatically fall back to a robust scraper if the target server rejects the initial connection.
  • Clean Data Extraction: Quickly strip boilerplate HTML to get straight to the readable text content for LLM ingestion.
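The bulk-input and clean-text use cases above can be illustrated with a small standard-library sketch. It is not the skill's own extraction logic: the helper names, the `#` comment convention in the URL file, and the `{"url", "text"}` record shape are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the readable text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def read_url_list(path):
    """Return URLs from a text file, one per line (assumed: '#' starts a comment)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


def to_json_records(pages):
    """Turn a mapping of URL -> raw HTML into a JSON array of url/text records."""
    records = [{"url": url, "text": html_to_text(html)} for url, html in pages.items()]
    return json.dumps(records, indent=2)
```

A file such as the `tech_sites.txt` mentioned in the example prompts would be passed to `read_url_list`, each page fetched, and the results handed to `to_json_records` before writing them to disk.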

Example Prompts

  1. "Use crawlee-web-scraper to fetch the latest tech news from these 50 URLs listed in tech_sites.txt and save the clean text to news_data.json."
  2. "I'm getting a 403 Forbidden error when trying to access the documentation page at https://target-site.com/api. Can you switch to the crawlee-web-scraper to bypass this check?"
  3. "Scrape the content of https://example.com/pricing and extract only the main body text so I can summarize their subscription tiers."

Tips & Limitations

  • Performance: Crawlee is resource-intensive compared to basic requests. Only use it when standard fetches fail or are likely to fail.
  • Rate Limiting: Even with bot evasion, respect robots.txt and avoid hitting servers with excessive concurrency that could be perceived as a DDoS attack.
  • Execution Time: Because this often spins up a browser instance or simulates complex handshakes, individual requests may take significantly longer than standard HTTP GET calls.
  • Memory: When scraping large bulk lists, ensure your system has enough memory, as concurrent browser instances can consume significant RAM.
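The rate-limiting and memory tips above amount to two guards: check robots.txt before fetching, and cap how many fetches run at once. The sketch below shows one way to do both with the standard library. `allowed_by_robots` and `fetch_all` are hypothetical helpers, `fetch_one` stands in for whatever coroutine performs the actual scrape, and the default cap of 3 is an arbitrary illustration, not a recommendation from the skill.

```python
import asyncio
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


async def fetch_all(urls, fetch_one, max_concurrency=3):
    """Run fetch_one over urls with at most max_concurrency requests in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        # Wait for a free slot so concurrent browser instances stay bounded.
        async with gate:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Lowering `max_concurrency` trades throughput for a smaller memory footprint and a lighter load on the target server.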

Metadata

Stars: 4190
Views: 0
Updated: 2026-04-18

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-bryantegomoh-crawlee-web-scraper": {
      "enabled": true,
      "auto_update": true
    }
  }
}
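If clawhub.json already lists other plugins, pasting the snippet over the file would drop them. The sketch below merges the entry instead. It assumes clawhub.json is plain JSON with a top-level "plugins" object, as the snippet above suggests; `enable_plugin` is a hypothetical helper, not part of any ClawHub tooling.

```python
import json

# Plugin key copied from the configuration snippet above.
PLUGIN_ID = "official-bryantegomoh-crawlee-web-scraper"


def enable_plugin(config_text):
    """Return config JSON with the plugin entry added, keeping existing entries."""
    config = json.loads(config_text) if config_text.strip() else {}
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    return json.dumps(config, indent=2)
```

Read the current file contents, pass them to `enable_plugin`, and write the returned string back to clawhub.json.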

Tags (AI)

#web-scraping #automation #bot-evasion #data-extraction #crawlee
Safety Score: 3/5

Flags: network-access, file-write, file-read