data-cleaning-annotation-workflow
Complete workflow for taking time series datasets (Energy, Manufacturing, Climate) from Kaggle to the Data Annotation platform (data.smlcrm.com). Includes downloading, cleaning with pandas, uploading RAW with metadata, configuring columns (Time/Target/Covariate/Group), setting units (kWh, kVarh, tCO2, ratio, seconds), and assigning groups by selecting all variables and applying all group tags. Use when finding Kaggle datasets, cleaning data for ML, uploading with metadata, configuring types/units, assigning groups to all variables, or running the complete pipeline to CLEAN status.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/deyashmukh/data-cleaning-annotation-workflow
Simulacrum Data Annotation Workflow
Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).
What This Skill Does
This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:
- Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
- Download: Get CSV files via browser or Kaggle CLI
- Clean: Run Python/pandas script to handle missing values, duplicates, formatting
- Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
- Configure Headers: Set column types (Time, Target, Covariate, Group) and units
- Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
- Upload Cleaned: Final upload → CLEAN status
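Before configuring headers and groups in the platform UI, it can help to draft the column plan offline. The following is a minimal sketch, not part of the skill or the platform API: `ColumnPlan`, `validate_plan`, and `variables_for_grouping` are hypothetical helpers, the column names are illustrative, and the rule that a plan needs at least one Time and one Target column is an assumption.

```python
from dataclasses import dataclass

# Column types named in the workflow.
ROLES = {"Time", "Target", "Covariate", "Group"}


@dataclass(frozen=True)
class ColumnPlan:
    """One planned column: its name, its type, and an optional unit."""
    name: str
    role: str
    unit: str = ""


def validate_plan(plan):
    """Return a list of human-readable problems; an empty list means the plan looks consistent."""
    problems = [
        f"{col.name}: unknown role {col.role!r}"
        for col in plan
        if col.role not in ROLES
    ]
    # Assumption: a time series dataset needs at least one Time and one Target column.
    if not any(col.role == "Time" for col in plan):
        problems.append("no Time column")
    if not any(col.role == "Target" for col in plan):
        problems.append("no Target column")
    return problems


def variables_for_grouping(plan):
    """Names of all variables (targets and covariates) that receive the group tags."""
    return [col.name for col in plan if col.role in {"Target", "Covariate"}]


# Illustrative column names only; substitute the names from your own CSV.
plan = [
    ColumnPlan("date", "Time"),
    ColumnPlan("usage_kwh", "Target", "kWh"),
    ColumnPlan("reactive_power_kvarh", "Covariate", "kVarh"),
    ColumnPlan("co2_tco2", "Covariate", "tCO2"),
    ColumnPlan("load_type", "Group"),
]
print(validate_plan(plan))
print(variables_for_grouping(plan))
```

Keeping the plan as data makes it easy to check that every target and covariate is selected before the group tags are applied.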
Supported Domains
- Energy: Power consumption, utilities, renewable energy, grid data
- Manufacturing: Industrial processes, steel production, emissions, equipment data
- Climate: CO2 emissions, environmental monitoring, weather correlation data
Quick Start
For the full pipeline from Kaggle to annotated dataset:
1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload cleaned dataset → CLEAN status
Workflow Steps
Step 1: Find and Download Dataset
From Kaggle (Browser Method):
- Navigate to kaggle.com/datasets
- Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
- Review data description, file list, and preview
- Click "Download" button
- Extract CSV file from downloaded zip
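The extraction step can also be scripted. Below is a minimal sketch using only the Python standard library; `extract_csvs` is a hypothetical helper name and the archive filename in the usage comment is a placeholder, neither comes from the skill itself.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir="."):
    """Extract every .csv member of a zip archive and return the extracted paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                # extract() preserves the member's relative path under out_dir.
                extracted.append(Path(archive.extract(member, path=out)))
    return extracted


# Usage (placeholder filename):
# paths = extract_csvs("archive.zip", "data/raw")
```

Non-CSV members such as license or README files are skipped, so the output directory holds only files that can go to the cleaning step.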
Alternative: Kaggle CLI
# Install if needed: pip install kaggle
# Configure: save your API token to ~/.kaggle/kaggle.json, then verify with: kaggle competitions list
scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
Step 2: Clean the Dataset
Always run the cleaning script before upload:
python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]
What the script does:
- Strips whitespace from column names
- Removes duplicate rows
- Fills missing numeric values with median
- Fills missing categorical values with mode or 'Unknown'
- Converts timestamp columns to datetime format
- Outputs column summary for metadata configuration
Output:
- Cleaned CSV file ready for upload
- Column summary printed to console (save this for metadata config)
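The steps listed above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the contents of scripts/clean_dataset.py: the function names `clean_frame` and `column_summary` are hypothetical, and the `time_columns` parameter is an assumption, since the source does not say how the script detects timestamp columns.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def clean_frame(df, time_columns=()):
    """Apply the listed cleaning steps to a copy of df and return it."""
    df = df.copy()
    # Strip whitespace from column names.
    df.columns = [str(c).strip() for c in df.columns]
    # Remove duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    for col in df.columns:
        if col in time_columns:
            # Assumption: the caller names the timestamp columns explicitly.
            df[col] = pd.to_datetime(df[col], errors="coerce")
        elif is_numeric_dtype(df[col]) and not is_bool_dtype(df[col]):
            # Fill missing numeric values with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Fill missing categorical values with the mode, or 'Unknown' if there is none.
            mode = df[col].mode(dropna=True)
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df


def column_summary(df):
    """Per-column dtype, missing count, and distinct count, for metadata configuration."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Usage (placeholder filenames and column name):
# cleaned = clean_frame(pd.read_csv("input.csv"), time_columns=["date"])
# cleaned.to_csv("output.csv", index=False)
# print(column_summary(cleaned))
```

Median and mode imputation keep the row count intact, which suits evenly spaced time series, but they can flatten real gaps; check the summary's missing counts on the raw file before trusting the filled values.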
Step 3: Upload Raw Dataset to Platform
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-deyashmukh-data-cleaning-annotation-workflow": {
      "enabled": true,
      "auto_update": true
    }
  }
}