web-scraper
Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction
Why use this skill?
Master web scraping with OpenClaw. Utilize a 5-stage pipeline for clean, automated content extraction, metadata processing, and LLM-ready entity analysis.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/guifav/web-scraper
What This Skill Does
The web-scraper skill for OpenClaw is a data engineering tool for extracting content from modern web pages. It runs a 5-stage pipeline with a cascade fallback: it starts with lightweight HTTP requests and escalates to headless browser automation or LLM-driven comprehension only when a cheaper stage is not enough. This avoids wasted resources and reduces the risk of IP-based blocking.
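The cascade fallback idea can be illustrated with a minimal Python sketch. This is not the skill's actual code: the names `cascade_extract`, `looks_usable`, the 200-character threshold, and the three stand-in strategies are all hypothetical, and the strategies return canned strings instead of touching the network.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A strategy takes a URL and returns extracted text, or None on failure.
Strategy = Callable[[str], Optional[str]]


@dataclass
class ExtractionResult:
    strategy: str  # name of the strategy that produced the text
    text: str


def looks_usable(text: Optional[str], min_chars: int = 200) -> bool:
    """Hypothetical quality gate: is there enough text to keep?"""
    return text is not None and len(text.strip()) >= min_chars


def cascade_extract(
    url: str,
    strategies: list[tuple[str, Strategy]],
    min_chars: int = 200,
) -> Optional[ExtractionResult]:
    """Try strategies from cheapest to most expensive; stop at the first usable result."""
    for name, strategy in strategies:
        try:
            text = strategy(url)
        except Exception:
            text = None  # treat any error as "escalate to the next stage"
        if looks_usable(text, min_chars):
            return ExtractionResult(name, text.strip())
    return None


# Stand-in strategies (assumptions for illustration only).
def fetch_plain_http(url: str) -> Optional[str]:
    return "Loading..."  # e.g. a JavaScript shell page with no real content


def render_headless(url: str) -> Optional[str]:
    return "Rendered article body. " * 20


def ask_llm(url: str) -> Optional[str]:
    return "LLM-derived summary. " * 20


STRATEGIES = [
    ("plain_http", fetch_plain_http),
    ("headless_browser", render_headless),
    ("llm_comprehension", ask_llm),
]

result = cascade_extract("https://example.com/blog", STRATEGIES)
print(result.strategy if result else "no strategy succeeded")  # prints "headless_browser"
```

In this sketch the plain HTTP stage returns too little text, so the cascade escalates once and stops at the headless browser stage without ever calling the most expensive strategy.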
The tool is built to handle content cleaning, boilerplate removal, and structured metadata extraction. It distinguishes between raw HTML and semantically meaningful text, ensuring that any subsequent AI entity extraction is performed on high-quality, sanitized data rather than noisy DOM elements.
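To make the difference between raw HTML and meaningful text concrete, here is a minimal boilerplate-removal sketch using only Python's standard library. It is an assumption-laden illustration, not the skill's implementation: the `SKIP_TAGS` set, the `MainTextExtractor` class, and the `strip_boilerplate` helper are hypothetical names, and real pages usually need more robust heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any boilerplate tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one skipped tag
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Funding round announced</h1>
<p>Acme raised $5M in seed funding.</p></article>
<footer>Copyright Example Inc.</footer>
<script>trackPageView();</script>
</body></html>
"""

print(strip_boilerplate(sample))
```

Only the heading and paragraph inside the article survive; the navigation links, footer, and script are dropped before any entity extraction would run.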
Installation
To integrate this capability into your OpenClaw environment, execute the following command in your terminal:
clawhub install openclaw/skills/skills/guifav/web-scraper
Ensure you have Python 3.10+ and the necessary system dependencies for Playwright, as the tool uses headless browser rendering for JavaScript-heavy targets.
Use Cases
- Competitive Intelligence: Monitoring product pricing, feature updates, and press releases across multiple competitor domains.
- Market Research: Aggregating news articles from various sources and using LLMs to extract key entities, sentiment, and trends into a structured JSON database.
- Content Migration: Extracting clean text from legacy documentation sites or blogs for migration into knowledge bases or LLM fine-tuning datasets.
- Lead Generation: Identifying contact information and business metadata from directory listings while maintaining compliance with local access protocols.
Example Prompts
- "Scrape the latest blog posts from https://example.com/blog and save the extracted titles, authors, and publish dates into a structured JSON file."
- "Go to https://techcrunch.com, identify the main article content, remove all navigation and footer boilerplate, and extract all mentioned company names and funding amounts."
- "Perform a bulk crawl of the documentation pages listed in this text file, normalize the content to Markdown, and ensure all internal links are preserved in the metadata."
Tips & Limitations
- Planning First: Always allow the agent to execute its Planning Protocol. Skipping this leads to inefficient crawling and increased risk of being rate-limited.
- Resource Management: For large-scale crawls, monitor your local disk space and memory usage, as headless browser sessions consume significantly more RAM than simple HTTP clients.
- Credential Security: The skill is designed to reference OPENROUTER_API_KEY only within template code. Never hardcode credentials into your scraping scripts. The skill will never read your .env files directly.
- Ethical Crawling: Always check robots.txt before initiating a scrape. If a site explicitly forbids bot access, the agent's safety protocols will prevent the action to ensure ethical compliance.
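A robots.txt check like the one recommended above can be sketched with Python's standard `urllib.robotparser` module. The helper name `is_allowed`, the user agent string `my-bot`, and the example rules are hypothetical; how the skill itself performs this check is not documented here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules (an assumption for illustration, not from any real site).
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "my-bot", "https://example.com/blog"))
print(is_allowed(EXAMPLE_ROBOTS, "my-bot", "https://example.com/private/data"))
```

The sketch parses robots.txt text that is passed in, which keeps it testable offline. Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the file before `can_fetch()` is called.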
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-guifav-web-scraper": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: network-access, file-write, external-api, code-execution
Related Skills
supabase-ops
Manages Supabase migrations, types generation, RLS policies, and edge functions
interop-forge
Integration architect for multi-app monorepos — shared contracts, API-first design with OpenAPI, cross-app auth, auto-generated SDKs, and full MCP server scaffolding per app
cloudflare-guard
Configures and manages Cloudflare DNS, caching, security rules, rate limiting, and Workers
stack-scaffold
Scaffolds a full-stack project with Next.js App Router, Supabase, Firebase Auth, Vercel, and Cloudflare
gcp-fullstack
Full-stack super agent for projects on Google Cloud Platform with GitHub and Cloudflare — covers scaffolding, compute, database, auth, deploy, CDN, and security