Crawl4ai

Overview

Crawl4AI is an open-source, asynchronous web crawling and scraping framework for Python. It renders pages in a headless browser, so it can handle JavaScript-driven content, and it can convert a page to clean Markdown or extract structured data from it using CSS-based or LLM-based extraction strategies.

When to Use This Skill

Use when Codex needs to:

Extract structured data from web pages (products, articles, forms, tables, etc.)
Scrape websites with dynamic content or complex JavaScript
Clean and normalize extracted data from various HTML structures
Work with APIs or web services that return HTML
Fetch pages that browser-side code cannot read because of CORS restrictions (a server-side crawl is not subject to them)
Process web content at scale with reliability

Trigger phrases:

"Extract data from this website"
"Scrape this page for [specific data]"
"Parse this HTML"
"Get data from [URL]"
"Extract structured information from [website]"
"Scrape [website] for [data type]"
"Web scrape [URL]"

Quick Start

Basic Usage

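from crawl4ai import AsyncWebCrawler, BrowserConfig

async def scrape_page(url):
    # Assumes a recent Crawl4AI release, where browser options
    # such as headless mode are set on BrowserConfig
    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise RuntimeError(result.error_message)
        # Markdown rendering and sanitized HTML of the page
        return result.markdown, result.cleaned_html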
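
Because scrape_page is a coroutine, it has to run inside an event loop. The following is a minimal sketch of calling any such coroutine from synchronous code with a time limit; the helper name run_scrape and the 60-second default are illustrative assumptions, not part of Crawl4AI.

```python
import asyncio


def run_scrape(scrape, url, timeout=60.0):
    """Run an async scrape function from synchronous code.

    `scrape` is any coroutine function taking a URL (for example the
    scrape_page function above). Raises asyncio.TimeoutError if the
    crawl takes longer than `timeout` seconds.
    """
    return asyncio.run(asyncio.wait_for(scrape(url), timeout=timeout))
```

A call would look like `markdown, html = run_scrape(scrape_page, "https://example.com")`, where the URL is a placeholder.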

Extracting Structured Data

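import json
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Assumed page layout; adjust the selectors to the target site
PRODUCT_SCHEMA = {
    'name': 'products',
    'baseSelector': '.product',
    'fields': [
        {'name': 'name', 'selector': '.name', 'type': 'text'},
        {'name': 'price', 'selector': '.price', 'type': 'text'},
        {'name': 'url', 'selector': 'a', 'type': 'attribute', 'attribute': 'href'},
    ],
}

async def extract_products(url):
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(PRODUCT_SCHEMA),
        screenshot=True,
        cache_mode=CacheMode.BYPASS,  # always fetch a fresh copy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # extracted_content is a JSON string (or None), not a list
        items = json.loads(result.extracted_content or '[]')
        # Keep only the product fields of interest
        return [
            {'name': i.get('name'), 'price': i.get('price'), 'url': i.get('url')}
            for i in items
        ]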
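
In current Crawl4AI releases, result.extracted_content is a JSON string (or None when no extraction strategy produced output), so it has to be decoded before the product fields can be read. The following is a minimal, library-independent sketch of that decoding step; the function name parse_products and the rule of skipping records without a name are assumptions made for illustration.

```python
import json


def parse_products(extracted_content):
    """Turn an extraction result (JSON text or None) into product dicts."""
    if not extracted_content:
        return []
    try:
        items = json.loads(extracted_content)
    except json.JSONDecodeError:
        return []  # not valid JSON: treat as "nothing extracted"
    if isinstance(items, dict):
        items = [items]  # a single record instead of a list
    if not isinstance(items, list):
        return []
    products = []
    for item in items:
        # Skip anything that is not a record with a non-empty name
        if not isinstance(item, dict) or not item.get("name"):
            continue
        products.append({
            "name": str(item["name"]).strip(),
            "price": item.get("price"),
            "url": item.get("url"),
        })
    return products
```

Keeping this step separate from the crawl makes it easy to test against saved JSON without launching a browser.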

Common Tasks

Web Scraping Basics

Scenario: User wants to scrape a website for all article titles.

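import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Assumed markup: each article title sits in an <h2> inside an <article>
TITLE_SCHEMA = {
    'name': 'articles',
    'baseSelector': 'article',
    'fields': [{'name': 'name', 'selector': 'h2', 'type': 'text'}],
}

async def scrape_articles(url):
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(TITLE_SCHEMA),
        verbose=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # extracted_content is a JSON string, or None if nothing matched
        articles = json.loads(result.extracted_content or '[]')
        # Prefer the 'name' field, falling back to 'text'
        titles = [item.get('name', item.get('text', '')) for item in articles]
        return titles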

Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"
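
Scraped titles often contain stray whitespace, blanks, and duplicates. The following is a minimal sketch of a cleanup step that uses the same 'name' then 'text' field order as the example above. The function name pick_titles is hypothetical, and unlike the expression above it also falls back to 'text' when 'name' is present but empty.

```python
def pick_titles(items):
    """Return cleaned, de-duplicated titles in first-seen order."""
    seen = set()
    titles = []
    for item in items:
        if not isinstance(item, dict):
            continue  # ignore malformed records
        raw = item.get("name") or item.get("text") or ""
        # Collapse runs of whitespace and trim both ends
        title = " ".join(str(raw).split())
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```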

Dynamic Content Handling

Scenario: Website loads data via JavaScript.

from crawl4ai import AsyncWebCrawler
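
For pages that load data via JavaScript, a common approach is to wait until a known element appears before reading the result. The following is a minimal sketch, assuming a recent Crawl4AI release in which CrawlerRunConfig accepts a wait_for condition of the form "css:<selector>". The function names and the ready_selector parameter are hypothetical.

```python
def css_wait_condition(selector: str) -> str:
    """Build a wait_for value that waits until `selector` matches an element."""
    return f"css:{selector.strip()}"


async def scrape_dynamic(url: str, ready_selector: str):
    # Imported lazily so the helper above is usable without crawl4ai installed
    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

    config = CrawlerRunConfig(
        wait_for=css_wait_condition(ready_selector),  # block until content renders
        cache_mode=CacheMode.BYPASS,                  # skip any cached copy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        # Return Markdown on success, None if the crawl failed
        return result.markdown if result.success else None
```

A call such as `asyncio.run(scrape_dynamic("https://example.com/feed", ".feed-item"))` would wait for the first `.feed-item` element; both the URL and the selector are placeholders.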


Install via CLI (Recommended)