Simulacrum Data Annotation Workflow

End-to-end workflow for preparing and annotating time series datasets on the Data Annotation platform (data.smlcrm.com).

What This Skill Does

This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:

Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
Download: Get CSV files via browser or Kaggle CLI
Clean: Run Python/pandas script to handle missing values, duplicates, formatting
Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
Configure Headers: Set column types (Time, Target, Covariate, Group) and units
Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
Upload Cleaned: Final upload → CLEAN status

Supported Domains

Energy: Power consumption, utilities, renewable energy, grid data
Manufacturing: Industrial processes, steel production, emissions, equipment data
Climate: CO2 emissions, environmental monitoring, weather correlation data

Quick Start

For the full pipeline from Kaggle to annotated dataset:

1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload cleaned dataset → CLEAN status

Workflow Steps

Step 1: Find and Download Dataset

From Kaggle (Browser Method):

Navigate to kaggle.com/datasets
Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
Review data description, file list, and preview
Click "Download" button
Extract CSV file from downloaded zip
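The extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_csvs` and its arguments are illustrative and not part of this skill's scripts.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir):
    """Extract every .csv member of a zip archive and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                # ZipFile.extract returns the path of the file it wrote
                extracted.append(Path(archive.extract(member, out_dir)))
    return extracted
```

Calling `extract_csvs("archive.zip", "data")` (both paths are placeholders) would leave any non-CSV files in the archive untouched.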

Alternative: Kaggle CLI

# Install if needed: pip install kaggle
# Configure: place your Kaggle API token at ~/.kaggle/kaggle.json
# Verify the setup: kaggle competitions list

scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
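The contents of scripts/download_kaggle.sh are not shown here. As a rough equivalent, the sketch below calls the Kaggle CLI's `kaggle datasets download` command from Python. The function names are hypothetical, and it assumes the `kaggle` executable is installed and configured.

```python
import subprocess


def build_download_command(dataset, output_dir="."):
    """Build the Kaggle CLI call that downloads and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", output_dir, "--unzip"]


def download_dataset(dataset, output_dir="."):
    """Run the download; assumes the kaggle CLI is on PATH and has a valid token."""
    # check=True raises CalledProcessError if the CLI exits with a non-zero status
    subprocess.run(build_download_command(dataset, output_dir), check=True)
```

For example, `download_dataset("csafrit2/steel-industry-energy-consumption", "data")` would fetch the dataset from the example above into a `data` directory.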

Step 2: Clean the Dataset

Always run the cleaning script before upload:

python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]

What the script does:

Strips whitespace from column names
Removes duplicate rows
Fills missing numeric values with the column median
Fills missing categorical values with the column mode, or 'Unknown'
Converts timestamp columns to datetime format
Outputs column summary for metadata configuration
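The steps above can be approximated in pandas as follows. This is a sketch under stated assumptions, not the actual source of scripts/clean_dataset.py. The function name, the `time_columns` parameter, and the choice to fall back to 'Unknown' only when a column has no mode are all illustrative.

```python
import pandas as pd


def clean_dataframe(df, time_columns=()):
    """Apply the listed cleaning steps to a copy of df (illustrative sketch)."""
    df = df.copy()
    # Strip whitespace from column names
    df.columns = [str(c).strip() for c in df.columns]
    # Remove duplicate rows
    df = df.drop_duplicates().reset_index(drop=True)
    # Convert timestamp columns; unparseable values become NaT
    for col in time_columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in df.columns:
        if col in time_columns:
            continue
        is_number = pd.api.types.is_numeric_dtype(df[col])
        is_bool = pd.api.types.is_bool_dtype(df[col])
        if is_number and not is_bool:
            # Fill missing numeric values with the column median
            df[col] = df[col].fillna(df[col].median())
        else:
            # Fill missing categorical values with the mode, else 'Unknown'
            mode = df[col].mode(dropna=True)
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df
```

Note that `time_columns` must use the stripped column names, since the header cleanup runs first.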

Output:

Cleaned CSV file ready for upload
Column summary printed to console (save this for metadata config)
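The exact layout of the script's column summary is not documented here. A minimal sketch of a comparable summary, assuming a hypothetical helper `column_summary` that reports dtype, missing count, and unique count per column, could look like this:

```python
import pandas as pd


def column_summary(df):
    """Return one row per column: name, dtype, missing count, unique count."""
    return pd.DataFrame({
        "column": list(df.columns),
        "dtype": [str(t) for t in df.dtypes],
        "missing": df.isna().sum().to_numpy(),
        # nunique ignores missing values by default
        "unique": df.nunique().to_numpy(),
    })
```

Printing `column_summary(df).to_string(index=False)` gives a compact table that can be kept alongside the cleaned CSV for the metadata configuration step.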

data-cleaning-annotation-workflow

Install via CLI (Recommended)