Official Verified developer tools Safety 3/5

web-scraper

Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction

Why use this skill?

Scrape the web from OpenClaw with a 5-stage pipeline that covers automated content extraction, metadata processing, and LLM-based entity extraction.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/guifav/web-scraper

What This Skill Does

The web-scraper skill for OpenClaw extracts content from modern websites through a 5-stage pipeline. It uses a cascade fallback: it starts with lightweight HTTP requests and escalates to headless browser automation or LLM-driven comprehension only when a cheaper strategy is not enough. This saves resources and reduces the risk of IP-based blocking.
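
The cascade fallback described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the names run_cascade, CascadeResult, and min_chars are hypothetical, and each strategy is assumed to be a plain function that takes a URL and returns extracted text or None.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical strategy signature: takes a URL, returns extracted text,
# or None when the strategy could not extract anything useful.
Strategy = Callable[[str], Optional[str]]


@dataclass
class CascadeResult:
    strategy: str        # name of the strategy that produced the text
    text: str            # extracted content
    attempts: list[str]  # strategies tried, in order


def run_cascade(
    url: str,
    strategies: list[tuple[str, Strategy]],
    min_chars: int = 200,  # assumed quality threshold, not from the skill
) -> Optional[CascadeResult]:
    """Try strategies from cheapest to most expensive; stop at the first good result."""
    attempts: list[str] = []
    for name, strategy in strategies:
        attempts.append(name)
        try:
            text = strategy(url)
        except Exception:
            # A failing strategy is not fatal; fall through to the next one.
            continue
        if text and len(text.strip()) >= min_chars:
            return CascadeResult(strategy=name, text=text, attempts=attempts)
    return None  # every strategy failed or returned too little text
```

Ordering the list from cheapest to most expensive (for example plain HTTP, then a headless browser, then an LLM pass) is what keeps the costly strategies from running unless the earlier ones come back empty.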

The skill also handles content cleaning, boilerplate removal, and structured metadata extraction. It separates semantically meaningful text from raw HTML, so that any subsequent LLM entity extraction runs on clean text rather than noisy DOM elements.
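
As a rough illustration of boilerplate removal, the sketch below drops the contents of tags that usually hold navigation or scripts and keeps the remaining text. It uses only Python's standard html.parser module; the SKIP_TAGS set and the strip_boilerplate name are assumptions made for this example and say nothing about how the skill itself cleans pages.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the skill's list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "noscript"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any boilerplate tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many boilerplate tags we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    """Return the visible text of an HTML document without boilerplate sections."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A tag-based filter like this is deliberately crude. Real pages often need density or readability heuristics on top of it, which is where the heavier stages of a pipeline would come in.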

Installation

To add the skill to your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/guifav/web-scraper

Make sure you have Python 3.10+ and the system dependencies for Playwright installed, as the skill uses headless browser rendering for JavaScript-heavy targets.

Use Cases

Competitive Intelligence: Monitoring product pricing, feature updates, and press releases across multiple competitor domains.
Market Research: Aggregating news articles from various sources and using LLMs to extract key entities, sentiment, and trends into a structured JSON database.
Content Migration: Extracting clean text from legacy documentation sites or blogs for migration into knowledge bases or LLM fine-tuning datasets.
Lead Generation: Identifying contact information and business metadata from directory listings while complying with each site's access rules.
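
Several of the use cases above end in structured records. The sketch below shows one way to pull a title, author, and publish date out of a page's head section and serialize the result as JSON. The field names, the META_KEYS mapping, and the extract_metadata and to_json helpers are hypothetical choices for this example. The skill's real output schema is not documented here.

```python
import json
from html.parser import HTMLParser

# Meta tag keys mapped to output fields; which keys to trust is an assumption.
META_KEYS = {
    "author": "author",
    "article:published_time": "published",
    "og:title": "title",
}


class MetadataParser(HTMLParser):
    """Fills a small record from <title> and <meta> tags; the first value found wins."""

    def __init__(self):
        super().__init__()
        self.record = {"title": None, "author": None, "published": None}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            field = META_KEYS.get(key)
            content = attrs.get("content")
            if field and content and not self.record[field]:
                self.record[field] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and not self.record["title"] and data.strip():
            self.record["title"] = data.strip()


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.record


def to_json(records: list) -> str:
    """Serialize records as pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Fields that are missing from the page stay as None (null in the JSON), so downstream code can tell "not found" apart from an empty string.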

Example Prompts

"Scrape the latest blog posts from https://example.com/blog and save the extracted titles, authors, and publish dates into a structured JSON file."
"Go to https://techcrunch.com, identify the main article content, remove all navigation and footer boilerplate, and extract all mentioned company names and funding amounts."
"Perform a bulk crawl of the documentation pages listed in this text file, normalize the content to Markdown, and ensure all internal links are preserved in the metadata."

Tips & Limitations

Planning First: Always allow the agent to execute its Planning Protocol. Skipping this leads to inefficient crawling and increased risk of being rate-limited.
Resource Management: For large-scale crawls, monitor your local disk space and memory usage, as headless browser sessions consume significantly more RAM than simple HTTP clients.
Credential Security: The skill is designed to reference OPENROUTER_API_KEY only within template code. Never hardcode credentials into your scraping scripts. The skill will never read your .env files directly.
Ethical Crawling: Always check robots.txt before initiating a scrape. If a site explicitly forbids bot access, the agent's safety protocols will prevent the action to ensure ethical compliance.
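
For the robots.txt check mentioned above, Python's standard library already provides a parser. The sketch below is a minimal example of such a check, independent of the agent's own safety protocols; the helper names robots_url, is_allowed, and fetch_and_check are made up for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Check page_url against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


def fetch_and_check(page_url: str, user_agent: str = "*") -> bool:
    """Download the site's robots.txt (network request) and check page_url."""
    parser = RobotFileParser(robots_url(page_url))
    parser.read()
    return parser.can_fetch(user_agent, page_url)
```

Running a check like this before the first request to each domain, and caching the parsed rules, avoids downloading robots.txt again for every page.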

Metadata

Author: @guifav

Stars: 2387

Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-guifav-web-scraper": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags(AI)

#web-scraping #automation #data-extraction #python #ai-agent

Safety Score: 3/5

Flags: network-access, file-write, external-api, code-execution
