data-scraper
Web page data collection and structured text extraction
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/mupengi-bot/data-scraper
Web Data Scraper: extract structured data from web pages using curl and parsing. Lightweight, with no browser required. Supports HTML-to-text conversion, table extraction, price monitoring, and batch scraping.
When to Use
- Extract text content from web pages (articles, blogs, docs)
- Scrape product prices, reviews, or listings
- Monitor pages for changes (price drops, new content)
- Batch-collect data from multiple URLs
- Convert HTML tables to structured formats (JSON/CSV)
Quick Start
# Extract readable text from URL
data-scraper fetch "https://example.com/article"
# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"
# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600
Extraction Modes
Text Mode (default)
Fetches the page and extracts its readable content, stripping HTML tags, scripts, and styles. Similar to a browser's reader mode.
data-scraper fetch URL
# Output: clean markdown text
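To illustrate the idea behind text mode, here is a minimal Python sketch that strips tags and drops the contents of script and style elements using only the standard library. It is not the tool's own implementation: the names `TextExtractor` and `html_to_text` are hypothetical, and it produces plain text rather than markdown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Hypothetical helper: return one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A real reader-mode extractor would also drop navigation and boilerplate blocks; this sketch only removes markup.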
Selector Mode
Target specific CSS selectors for precise extraction.
data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
Table Mode
Extract HTML tables into structured formats.
data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
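The header-to-value mapping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: `TableParser` and `table_to_rows` are hypothetical names, the first row is assumed to be the header, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_rows(html, index=0):
    """Return the table at `index` as a list of {header: value} dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.tables[index]
    return [dict(zip(header, row)) for row in body]
```

Passing the result to `json.dumps` would give a JSON array of row objects like the one the comment above describes.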
Link Mode
Extract all links from a page with optional filtering.
data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
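One way link extraction with a glob filter could work is sketched below in Python. The names `LinkParser` and `extract_links` are hypothetical, and treating the `--filter` value as a shell-style glob matched against the full absolute URL is an assumption.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, pattern="*"):
    """Resolve links against base_url, then keep those matching the glob."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    absolute = [urljoin(base_url, href) for href in parser.hrefs]
    return [url for url in absolute if fnmatchcase(url, pattern)]
```

Resolving with `urljoin` before filtering means relative links such as `/docs/a.pdf` are matched in their absolute form.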
Batch Scraping
# Scrape multiple URLs
data-scraper batch urls.txt --output results/
# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/
urls.txt format:
https://site1.com/page
https://site2.com/page
https://site3.com/page
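The batch flow (read a URL list, wait between requests, write one result per URL) can be sketched in Python as below. This is only an illustration: `read_url_list` and `batch_scrape` are hypothetical names, the `fetch` callable stands in for whatever actually downloads the page, and the numbered output filenames are an assumption, not the tool's documented layout.

```python
import time
from pathlib import Path


def read_url_list(text):
    """One URL per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def batch_scrape(urls, fetch, output_dir, delay_ms=0, sleep=time.sleep):
    """Fetch each URL in order and write the result to output_dir.

    fetch(url) must return the extracted text as a string.
    delay_ms is the pause between consecutive requests, in milliseconds.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls):
        if i and delay_ms:
            sleep(delay_ms / 1000)  # no pause before the first request
        path = out / f"{i:04d}.txt"
        path.write_text(fetch(url), encoding="utf-8")
        written.append(path)
    return written
```

Injecting `fetch` and `sleep` keeps the loop testable without network access or real waiting.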
Change Monitoring
# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600
# Compare with previous snapshot
data-scraper diff URL
Snapshots are stored in data-scraper/snapshots/ with timestamps. Alerts are sent via notification-hub when changes are detected.
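As a rough sketch of how timestamped snapshots and a diff between them might work, the Python below saves each capture to a file named after its timestamp and compares the two most recent ones with `difflib`. The function names and the `<timestamp>.txt` naming scheme are assumptions made for illustration; the tool's actual snapshot format is not documented here.

```python
import difflib
import time
from pathlib import Path


def save_snapshot(directory, text, timestamp=None):
    """Write text to <directory>/<timestamp>.txt and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = int(time.time()) if timestamp is None else timestamp
    path = directory / f"{stamp}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def diff_latest(directory):
    """Return unified-diff lines between the two newest snapshots.

    An empty list means there are fewer than two snapshots or no change.
    """
    snaps = sorted(Path(directory).glob("*.txt"), key=lambda p: int(p.stem))
    if len(snaps) < 2:
        return []
    old, new = (p.read_text(encoding="utf-8").splitlines() for p in snaps[-2:])
    return list(difflib.unified_diff(old, new, "previous", "current", lineterm=""))
```

A watcher could call `save_snapshot` once per interval and raise an alert whenever `diff_latest` returns a non-empty list.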
Output Formats
| Format | Flag | Use Case |
|---|---|---|
| Text | --format text | Reading, summarization |
| JSON | --format json | Data processing |
| CSV | --format csv | Spreadsheets |
| Markdown | --format md | Documentation |
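To show what the formats in the table above look like for the same data, here is a small Python sketch that renders a list of row dictionaries as JSON, CSV, or a markdown table. The `render` function is hypothetical and only mirrors the format names; it says nothing about how the tool serializes its output internally.

```python
import csv
import io
import json


def render(rows, fmt="json"):
    """Render a list of {header: value} dicts in the requested format."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt not in ("csv", "md"):
        raise ValueError(f"unsupported format: {fmt}")
    if not rows:
        return ""
    headers = list(rows[0])
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=headers, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    # Markdown: header row, separator row, then one line per data row.
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)
```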
Headers & Auth
# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"
# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"
# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
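For readers who want to see what these three options amount to at the HTTP level, the following Python sketch builds an equivalent request with `urllib.request`. The `build_request` helper is hypothetical; it only constructs the request object and does not send it.

```python
import urllib.request


def build_request(url, headers=None, cookie=None, user_agent=None):
    """Build a GET request carrying custom headers, a cookie, and a User-Agent."""
    req = urllib.request.Request(url)
    for name, value in (headers or {}).items():
        req.add_header(name, value)
    if cookie:
        req.add_header("Cookie", cookie)
    if user_agent:
        req.add_header("User-Agent", user_agent)
    return req
```

Calling `urllib.request.urlopen(req)` on the result would perform the actual fetch. Tokens and session cookies should come from the environment or a secrets store rather than being hard-coded.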
Rate Limiting & Ethics
- Default: 1 request per second per domain
- Respects robots.txt when the --polite flag is set
- Configurable delay between requests
- Stops on 429 (Too Many Requests) and backs off
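The policies listed above can be sketched in Python as three small pieces: a per-domain rate limiter, a robots.txt check, and a retry loop that backs off on 429. All names here are hypothetical, and the exponential backoff schedule is an assumption, since the source only says the tool backs off.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DomainRateLimiter:
    """Allow at most one request per min_interval seconds per domain."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}

    def wait(self, url):
        domain = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(domain)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[domain] = now


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); on 429 wait base_delay * 2**attempt.

    The doubling schedule is an assumed policy, not the tool's documented one.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 429:
            break
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

The clock and sleep functions are injectable so the timing logic can be exercised without real delays.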
Error Handling
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-mupengi-bot-data-scraper": {
      "enabled": true,
      "auto_update": true
    }
  }
}