Tidy Excel Data

Overview

Use this skill to turn messy spreadsheet data into a clean, auditable workbook. Favor a reversible workflow: inspect first, preserve raw data when useful, clean deterministically, and report every meaningful assumption.

Workflow

Confirm the user's goal and output format when it is not obvious.
Inspect workbook structure: sheet names, dimensions, merged cells, header rows, blank rows or columns, data types, duplicate candidates, and obvious ID/date/amount fields.
Choose the safest cleaning level:
- Light cleanup: trim text, remove empty rows/columns, normalize headers, preserve all records.
- Standard cleanup: light cleanup plus duplicate removal and common type coercion.
- Custom cleanup: apply user-provided rules for columns, filters, joins, pivots, or sheet splitting.
Use scripts/tidy_excel.py for repeatable workbook cleanup when it fits the task.
For complex business rules, write a short task-specific script or notebook-like Python snippet, then save the cleaned workbook and a QA summary.
Verify the output by reopening the generated workbook or reading key sheets back with pandas/openpyxl.

Quick Start

Run the bundled script for common cleanup:

python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx

Useful options:

python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx --dedupe --coerce-types
python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx --sheet Orders --summary qa_summary.csv
python scripts/tidy_excel.py input.csv --output cleaned.xlsx --header-row 2

The script writes cleaned sheets to an Excel workbook and adds a _tidy_summary sheet with row counts, column counts, duplicate counts, missing-value counts, and inferred column types.

Cleaning Rules

Apply these defaults unless the user gives different instructions:

Preserve the original file; write a new workbook instead of overwriting.
Normalize headers to lowercase snake_case; keep names unique with numeric suffixes.
Remove fully empty rows and columns.
Trim leading/trailing whitespace in text cells and collapse repeated internal whitespace.
Treat common empty markers (n/a, na, null, none, -, empty strings) as missing values.
Do not delete partial rows unless the user asks for filtering.
Deduplicate only when requested or when rows are exact duplicates and the user asked for "clean/organize".
Keep leading-zero identifiers as text unless the user explicitly wants numeric conversion.
Avoid destructive date or currency conversion when formats are ambiguous.

Data Quality Checks

Use references/quality-checks.md when deciding what to inspect or report. Include a short QA note in the final response covering:

workbook and sheet names processed
rows/columns before and after cleanup
duplicate rows removed, if any
columns with high missing rates
assumptions or conversions applied
output file path

Custom Tasks

For merges, pivots, category mapping, multi-row headers, or domain-specific rules, inspect the workbook first and adapt the workflow. Prefer pandas for tabular transformations and openpyxl when preserving workbook formatting, formulas, or sheet-level structure matters.