Official Verified developer tools Safety 4/5

html2md

Convert HTML pages to clean, agent-friendly markdown using Readability + Turndown. Strips navigation, ads, footers, cookie banners, social CTAs. Supports URL fetch, local files, stdin, token budgeting, and output flags. Ideal for research tasks, content extraction, and web scraping in agent workflows.

Why use this skill?

Optimize web content for AI agents with html2md. Strip ads, navs, and clutter to get clean, token-optimized markdown for better research and extraction tasks.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/saikatkumardey/html2md

What This Skill Does

html2md converts complex HTML into clean, agent-readable markdown. It uses Mozilla’s Readability engine to isolate the main content and the Turndown library to convert it to markdown, stripping clutter such as navigation bars, sidebars, advertisements, cookie banners, and social media calls to action. Agents therefore receive mostly the relevant content, which saves context-window space and leaves less irrelevant text to misread. The tool accepts URLs, local files, and standard input, and offers token budgeting so that output fits within a model's context limit.
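To make the idea of token budgeting concrete, here is a minimal sketch of one way a heading-preserving budget could work. It is not html2md's actual algorithm: the function name `budget_markdown` is hypothetical, tokens are approximated as whitespace-separated words, and any line starting with `#` is treated as a heading (so `#` comments inside code fences would be misclassified).

```python
def budget_markdown(markdown: str, max_tokens: int) -> str:
    """Keep every heading; keep body lines only while the budget allows.

    Assumption for this sketch: one token == one whitespace-separated word.
    Real tokenizers count differently.
    """
    def cost(line: str) -> int:
        return len(line.split())

    lines = markdown.splitlines()
    # Reserve budget for all headings first so they survive truncation.
    heading_cost = sum(cost(line) for line in lines if line.startswith("#"))
    remaining = max(0, max_tokens - heading_cost)

    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
        elif cost(line) <= remaining:
            # Blank lines cost nothing, so they are always kept.
            kept.append(line)
            remaining -= cost(line)
        else:
            # Budget exhausted: drop this and all later non-empty body lines.
            remaining = 0
    return "\n".join(kept)
```

With a small budget the document outline survives while paragraphs are dropped, which is the behavior the sketch is meant to illustrate.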

Installation

To install the html2md skill in your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/saikatkumardey/html2md

The skill requires Node.js version 22 or higher. After installation, navigate to the skill directory, run npm install to resolve dependencies, then run npm link to make the html2md command available globally across your agent workflows.
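An agent or setup script may want to confirm these prerequisites before calling the tool. The following is a minimal sketch of such a check; the function names `parse_node_major` and `preflight` are hypothetical and not part of html2md, and the only facts taken from the text are the Node.js 22 minimum and the `html2md` command name.

```python
import re
import shutil
import subprocess
from typing import List

MIN_NODE_MAJOR = 22  # minimum Node.js major version stated in the install notes


def parse_node_major(version_text: str) -> int:
    """Extract the major version from `node --version` output such as 'v22.3.0'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {version_text!r}")
    return int(match.group(1))


def preflight() -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    node = shutil.which("node")
    if node is None:
        problems.append("node not found on PATH")
    else:
        result = subprocess.run(
            [node, "--version"], capture_output=True, text=True, check=False
        )
        try:
            major = parse_node_major(result.stdout)
            if major < MIN_NODE_MAJOR:
                problems.append(f"Node.js {major} found, {MIN_NODE_MAJOR}+ required")
        except ValueError as exc:
            problems.append(str(exc))
    if shutil.which("html2md") is None:
        problems.append("html2md not found on PATH (has `npm link` been run?)")
    return problems
```

Calling `preflight()` at agent startup and logging the returned list gives an early, readable failure instead of a missing-command error mid-task.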

Use Cases

This skill suits research-heavy AI workflows. Use html2md to ingest long-form articles, documentation pages, or blog posts for summarization, entity extraction, or RAG (Retrieval-Augmented Generation) indexing. It also fits cron-job-based agents that monitor websites for updates, because every run yields consistently cleaned output. Developers can use the --json output flag to feed the extracted text and token metadata directly into programmatic pipelines.
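As an illustration of pipeline integration, the sketch below wraps the CLI from Python. It uses only the `--max-tokens` and `--json` flags mentioned on this page; that the two can be combined is an assumption. The helper names `build_command` and `run_html2md` and the `timeout_s` parameter are hypothetical. The page does not document the JSON schema, so the parsed object is returned as-is.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_command(
    source: str, max_tokens: Optional[int] = None, as_json: bool = False
) -> List[str]:
    """Build the argv list for html2md; `source` is a URL or a local file path."""
    cmd = ["html2md", source]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    if as_json:
        cmd.append("--json")
    return cmd


def run_html2md(
    source: str, max_tokens: Optional[int] = None, timeout_s: float = 60.0
) -> Any:
    """Run html2md with --json and return the parsed output.

    Raises RuntimeError with the tool's stderr text on a non-zero exit code.
    The structure of the returned object is not specified here (assumption:
    the tool prints a single JSON document on stdout).
    """
    proc = subprocess.run(
        build_command(source, max_tokens, as_json=True),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            f"html2md exited with code {proc.returncode}: {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

Passing the arguments as a list rather than a shell string avoids shell interpretation of characters in the URL.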

Example Prompts

  1. "html2md https://paulgraham.com/greatwork.html --max-tokens 2000 - extract the core thesis of this essay into a clean markdown format for my notes."
  2. "Take this local file, page.html, run html2md on it, and strip all links so I only get the raw text content."
  3. "Fetch the documentation from https://docs.openclaw.com, convert it to markdown, and provide me with the JSON output so I can analyze the token count."

Tips & Limitations

  • Token Budgeting: Use the --max-tokens flag on very large documents to avoid exceeding model context windows. The tool keeps headings and truncates body text.
  • Readability Limits: If a website relies on non-standard layouts (like complex data tables), Readability might return less content than expected; the tool falls back to the raw body to mitigate this.
  • Network Security: This tool makes direct network requests. Provide only trusted URLs to avoid unintended SSRF exposure in your agent architecture.
  • Error Handling: The tool exits with code 1 and writes a message to stderr on timeouts, bad URLs, or file access errors, so automated scripts can detect failures reliably.
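One way to act on the network security tip is to check URLs against an allowlist before handing them to the tool. The sketch below is an assumption about how a caller might do this and is not a feature of html2md; the name `is_url_allowed` is hypothetical. It accepts only http and https URLs whose host exactly matches an approved entry, which also rejects literal IP addresses and localhost unless they are listed.

```python
from typing import AbstractSet
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_url_allowed(url: str, allowed_hosts: AbstractSet[str]) -> bool:
    """Return True only for http(s) URLs whose host is in the allowlist.

    Hosts are compared exactly (lowercased), so subdomains and look-alike
    domains must be listed explicitly.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname  # already lowercased; excludes userinfo and port
    return host is not None and host in allowed_hosts
```

An exact-match allowlist is stricter than blocking private IP ranges, and it does not depend on DNS resolution. It does not cover redirects that the fetcher itself may follow.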

Metadata

Stars: 1133
Views: 23
Updated: 2026-02-18

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-saikatkumardey-html2md": {
      "enabled": true,
      "auto_update": true
    }
  }
}
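If clawhub.json already contains other plugins, pasting the snippet by hand risks overwriting them. The following sketch merges the entry programmatically instead. The helper names `enable_plugin` and `update_file` are hypothetical, and the file layout is assumed to match the snippet above (a top-level "plugins" object).

```python
import json
from pathlib import Path

PLUGIN_ID = "official-saikatkumardey-html2md"  # key taken from the snippet above


def enable_plugin(config: dict, plugin_id: str = PLUGIN_ID) -> dict:
    """Return a copy of `config` with the plugin entry added or updated.

    Other plugins and top-level keys are preserved; the input is not mutated.
    """
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    entry = dict(plugins.get(plugin_id, {}))
    entry.update({"enabled": True, "auto_update": True})
    plugins[plugin_id] = entry
    merged["plugins"] = plugins
    return merged


def update_file(path: Path) -> None:
    """Read the config at `path` (if present), merge the entry, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(enable_plugin(config), indent=2) + "\n", encoding="utf-8")
```

Calling `update_file(Path("clawhub.json"))` from the project directory would apply the change in place.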

Tags (AI)

#markdown #web-scraping #content-processing #ai-agent #data-extraction
Safety Score: 4/5

Flags: network-access, file-read