RAG Pipeline Starter
Skill by abhinas90
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/abhinas90/rag-pipeline-starter
Production-grade RAG pipeline setup with chunking strategies, embedding benchmarks, and retrieval tuning for 50K-500K row datasets.
Overview
This skill provides a complete toolkit for building and optimizing RAG (Retrieval-Augmented Generation) pipelines. It analyzes your data, recommends optimal chunking strategies, benchmarks embedding models, and helps tune retrieval parameters for maximum accuracy.
When to Use
- Building a new RAG system from scratch
- Optimizing an existing RAG pipeline's retrieval quality
- Choosing the right embedding model for your domain
- Processing large document collections (50K-500K rows)
- Balancing speed vs. accuracy for your use case
Scripts
chunking_analyzer.py
Analyzes documents and recommends optimal chunking strategies based on content structure.
Usage:
# Assess data and get strategy recommendation
python chunking_analyzer.py --assess ./data
# Apply chunking strategy to documents
python chunking_analyzer.py --strategy recursive --input ./data/doc.txt --output ./chunks/ --chunk-size 500 --overlap 50
Options:
--assess <dir> - Analyze documents and recommend strategy
--strategy <name> - Chunking strategy: fixed, semantic, recursive, hierarchical
--input <path> - Input file or directory
--output <dir> - Output directory for chunks
--chunk-size <int> - Chunk size (default: 500)
--overlap <int> - Overlap between chunks (default: 50)
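To make the --chunk-size and --overlap options concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the implementation inside chunking_analyzer.py: the function name chunk_fixed is hypothetical, and it assumes sizes are measured in characters, which the options above do not specify.

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters. Sizes are assumed to be
    in characters; the real script may count tokens or words instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1200))
    for i, chunk in enumerate(chunk_fixed(sample)):
        print(i, len(chunk))
```

With the defaults, each window advances by 450 characters, so the last 50 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from either side.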
embedding_benchmark.py
Tests multiple embedding models on your data to find the best fit for your domain.
Usage:
python embedding_benchmark.py --data ./chunks/ --domain finance --output results.json
Options:
--embeddings <models> - Embedding models to test (space-separated)
--data <dir> - Directory with chunked text files (required)
--domain <name> - Domain name for context-specific recommendations
--output <file> - Output file for results (JSON)
Supported Embeddings:
- sentence-transformers/all-MiniLM-L6-v2 (384 dims, fast, free)
- sentence-transformers/all-mpnet-base-v2 (768 dims, medium, free)
- openai/text-embedding-ada-002 (1536 dims, fast, paid)
- cohere/embed-english-v3.0 (1024 dims, fast, paid)
- bm25 (sparse, fast, free)
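The bm25 entry differs from the others in the list: it is a sparse lexical scorer, not a learned dense embedding. The sketch below shows Okapi BM25 scoring under stated assumptions: whitespace tokenization, the commonly used defaults k1=1.5 and b=0.75, and a non-negative IDF variant. The function names are hypothetical, and none of this is taken from embedding_benchmark.py.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "interest rate rises",
        "bond yield falls",
        "rate cut expected by the bank",
    ]
    print(bm25_scores("rate cut", corpus))
```

A lexical baseline like this is a useful reference point in a benchmark: if a dense model cannot beat it on your domain's queries, the added cost of the dense model may not be justified.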
retrieval_tuner.py
Optimizes retrieval parameters (top-k, similarity threshold) for your specific use case.
Usage:
python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json --output tuning_results.json
Options:
--index <dir> - Vector store index directory
--queries <file> - JSON file with test queries and expected results
--output <file> - Output file for tuning results
--top-k-range <min> <max> - Range of top-k values to test (default: 1 20)
--threshold-range <min> <max> <step> - Similarity threshold range
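The kind of sweep these options describe can be sketched as a grid search over top-k and similarity threshold. This is an illustration, not the logic of retrieval_tuner.py. The query structure (a "results" list of (doc_id, similarity) pairs sorted by descending similarity, plus an "expected" set of relevant IDs), the inclusive top-k range, and the choice of mean F1 as the objective are all assumptions.

```python
def evaluate(results, relevant, top_k, threshold):
    """Return (precision, recall) for one query.

    results: list of (doc_id, similarity), assumed sorted by descending similarity.
    relevant: set of doc_ids that should be retrieved.
    """
    kept = [doc_id for doc_id, score in results[:top_k] if score >= threshold]
    hits = sum(1 for doc_id in kept if doc_id in relevant)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def tune(queries, top_k_range=(1, 20), thresholds=(0.0, 0.25, 0.5, 0.75)):
    """Grid-search top_k and threshold, maximizing mean F1 across queries."""
    if not queries:
        raise ValueError("at least one test query is required")
    best = None
    for top_k in range(top_k_range[0], top_k_range[1] + 1):  # inclusive range (assumed)
        for threshold in thresholds:
            f1_values = []
            for query in queries:
                p, r = evaluate(query["results"], query["expected"], top_k, threshold)
                f1_values.append(2 * p * r / (p + r) if p + r else 0.0)
            mean_f1 = sum(f1_values) / len(f1_values)
            # Strict comparison keeps the smallest top_k / threshold on ties.
            if best is None or mean_f1 > best["f1"]:
                best = {"top_k": top_k, "threshold": threshold, "f1": mean_f1}
    return best


if __name__ == "__main__":
    test_queries = [
        {"results": [("a", 0.9), ("b", 0.8), ("c", 0.3)], "expected": {"a", "b"}},
    ]
    print(tune(test_queries))
```

F1 is only one possible objective. If missed documents are costlier than noisy context in your use case, optimizing recall at a fixed top-k may be a better fit.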
vector_store_manager.py
Manages vector store operations: create, update, search, and maintain indexes.
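No usage is documented for this script, so the following is only a conceptual sketch of the create, update, and search operations it names, using a hypothetical in-memory class with brute-force cosine similarity. The class name, method names, and storage layout are assumptions and do not reflect the interface of vector_store_manager.py.

```python
import math


class InMemoryVectorStore:
    """Toy vector store: brute-force cosine similarity over unit-normalized vectors."""

    def __init__(self, dim: int):
        self.dim = dim
        self._vectors: dict[str, list[float]] = {}

    def _normalize(self, vector: list[float]) -> list[float]:
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query with a zero vector")
        return [x / norm for x in vector]

    def upsert(self, doc_id: str, vector: list[float]) -> None:
        # Insert a new vector or overwrite the existing one for doc_id.
        self._vectors[doc_id] = self._normalize(vector)

    def delete(self, doc_id: str) -> None:
        self._vectors.pop(doc_id, None)

    def search(self, query: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        q = self._normalize(query)
        # Dot product of unit vectors equals cosine similarity.
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = InMemoryVectorStore(dim=2)
    store.upsert("a", [1.0, 0.0])
    store.upsert("b", [0.0, 1.0])
    store.upsert("c", [1.0, 1.0])
    print(store.search([1.0, 0.0], top_k=2))
```

Brute-force search is linear in the number of stored vectors. At the 50K-500K row scale this skill targets, a real index would more likely rely on an approximate nearest-neighbor structure.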
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-abhinas90-rag-pipeline-starter": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Related Skills
Multi-Agent Deployment Skill for OpenClaw
Deploy a production-ready multi-agent fleet in OpenClaw. Includes step-by-step setup guide, workspace templates, and Python automation scripts for agent creation, routing config, memory sync, and cloud deployment — based on a real working 4-agent production setup.
Claude Code Mastery
Complete guide to mastering Claude Code CLI — installation to production workflows
Claude Code Memory Kit
Stop Claude Code from repeating mistakes — enforce guardrails, preserve context, maintain consistency across sessions