crawlee-web-scraper
Resilient web scraper with bot-detection evasion using the Crawlee library. Use when web_fetch is blocked by rate limits or bot detection. Supports single URLs, bulk file input, and automatic fallback from requests to Crawlee on 403/429 responses.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/bryantegomoh/crawlee-web-scraper
What This Skill Does
The crawlee-web-scraper is a powerful, resilient utility designed specifically for OpenClaw agents to bypass modern web anti-bot defenses. While standard HTTP fetchers are easily blocked by Cloudflare, reCAPTCHA, or simple rate limits, this skill leverages the Crawlee library to mimic legitimate human browser behavior. It provides two primary interfaces: a direct command-line scraper for high-volume tasks and a drop-in library helper that automatically upgrades standard requests to Crawlee sessions when 403, 429, or 503 errors are detected. It supports full HTML capture, automated text extraction, and batch processing via text files, returning clean, structured JSON output ready for further agent processing.
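The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual helper: the names `fetch_with_fallback`, `requests_fetcher`, and `BLOCKED_STATUSES` are hypothetical, and the Crawlee side is left as a caller-supplied function because its interface is not documented here.

```python
from typing import Callable, Tuple

# Status codes treated as "blocked"; taken from the description above.
BLOCKED_STATUSES = {403, 429, 503}

# A fetcher takes a URL and returns (status_code, body_text).
Fetcher = Callable[[str], Tuple[int, str]]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher) -> dict:
    """Try the lightweight fetcher first; escalate only if it looks blocked."""
    status, body = primary(url)
    used = "primary"
    if status in BLOCKED_STATUSES:
        # Hypothetical escalation point: a Crawlee-backed fetcher would be
        # passed in as `fallback` and take over from here.
        status, body = fallback(url)
        used = "fallback"
    return {"url": url, "status": status, "fetched_with": used, "html": body}


def requests_fetcher(url: str) -> Tuple[int, str]:
    """Plain HTTP GET via requests (imported lazily so the sketch loads without it)."""
    import requests

    response = requests.get(url, timeout=30)
    return response.status_code, response.text
```

In use, `requests_fetcher` would be passed as `primary` and a Crawlee-based function with the same `(status, body)` return shape as `fallback`. Keeping both behind one signature means the expensive path only runs when the cheap one is rejected.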
Installation
To integrate this skill into your environment, run the following command in your terminal:
clawhub install openclaw/skills/skills/bryantegomoh/crawlee-web-scraper
Ensure you have the required Python dependencies installed globally or in your agent's virtual environment:
pip install crawlee requests
Use Cases
- Bypassing Bot Protection: Use this when a target website uses Cloudflare, Datadome, or similar providers that block standard requests.
- Bulk Data Collection: Efficiently scrape lists of URLs from a file without worrying about aggressive rate-limiting causing your agent to stall.
- Resilient Pipelines: Integrate into existing workflows where you want to start with a lightweight request but automatically fall back to a robust scraper if the target server rejects the initial connection.
- Clean Data Extraction: Quickly strip boilerplate HTML to get straight to the readable text content for LLM ingestion.
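The bulk-input and clean-text use cases above might look something like the sketch below. It uses only the Python standard library and is not the skill's own extractor. The one-URL-per-line file format with `#` comments, the function names, and the `{"url", "text"}` record shape are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from typing import List


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style/noscript contents."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Strip tags and return the readable text of an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def parse_url_list(content: str) -> List[str]:
    """One URL per line; blank lines and '#' comments are ignored (assumed format)."""
    stripped = (line.strip() for line in content.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def load_urls(path: str) -> List[str]:
    """Read a URL list such as tech_sites.txt from disk."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))


def to_record(url: str, html: str) -> str:
    """Serialize one result as a JSON object (hypothetical output shape)."""
    return json.dumps({"url": url, "text": html_to_text(html)})
```

A real extractor would likely also drop navigation and footer boilerplate; this version only removes markup and non-visible elements.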
Example Prompts
- "Use crawlee-web-scraper to fetch the latest tech news from these 50 URLs listed in tech_sites.txt and save the clean text to news_data.json."
- "I'm getting a 403 Forbidden error when trying to access the documentation page at https://target-site.com/api. Can you switch to the crawlee-web-scraper to bypass this check?"
- "Scrape the content of https://example.com/pricing and extract only the main body text so I can summarize their subscription tiers."
Tips & Limitations
- Performance: Crawlee is resource-intensive compared to basic requests. Only use it when standard fetches fail or are likely to fail.
- Rate Limiting: Even with bot evasion, respect robots.txt and avoid hitting servers with excessive concurrency that could be perceived as a DDoS attack.
- Execution Time: Because this often spins up a browser instance or simulates complex handshakes, individual requests may take significantly longer than standard HTTP GET calls.
- Memory: When scraping large bulk lists, ensure your system has enough memory, as concurrent browser instances can consume significant RAM.
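The rate-limiting and memory tips above can be approximated with a robots.txt check plus a cap on concurrent fetches. The sketch below uses the standard library's `urllib.robotparser` and `asyncio`. The function names and the default limit of 3 are illustrative assumptions, and nothing here reflects how the skill itself schedules requests.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


async def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,  # assumed default; tune to the target site and available RAM
) -> List[str]:
    """Run `fetch` over all URLs with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        # The semaphore blocks here once the cap is reached.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return list(await asyncio.gather(*(guarded(u) for u in urls)))
```

Capping concurrency bounds both the load on the target server and the number of browser instances held in memory at once.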
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-bryantegomoh-crawlee-web-scraper": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags(AI)
Flags: network-access, file-write, file-read
Related Skills
gateway-watchdog
Production-grade bash watchdog for the OpenClaw gateway. Runs via launchd every 5 minutes. Handles boot grace periods, progressive retry with backoff, port-level fallback checks, stale PID detection, and restart cooldowns — preventing restart loops while keeping the gateway reliably alive.
dronemobile
Control vehicles via DroneMobile (Firstech/Compustar remote start systems). Use when the user asks to start their car, stop the engine, lock/unlock doors, open the trunk, check battery voltage, or get vehicle status. Triggers on phrases like "start my car", "remote start", "lock my car", "unlock the car", "check battery", "open trunk", "stop the engine", "vehicle status". Requires DRONEMOBILE_EMAIL and DRONEMOBILE_PASSWORD environment variables. Optionally DRONEMOBILE_DEVICE_KEY for multi-vehicle accounts.
content-security-filter
Prompt injection and malware detection filter for external content. Scans text, files, or URLs for 20+ attack patterns including instruction overrides, credential exfiltration, persona hijacking, encoded payloads, fake system messages, and invisible character injection. Returns JSON with risk level and sanitized text.