
Environment & Path Setup
For reproducibility and portability, establish a consistent project root and data directory structure.
from pathlib import Path
# Project root: two levels above the current working directory
# (parents[0] is the immediate parent, parents[1] its parent)
project_dir = Path.cwd().parents[1]
# Structured data directories
data_dir = project_dir / 'data'
raw_dir = data_dir / 'raw'
Why: A fixed, relative path structure ensures your workflow runs identically across machines, CI pipelines, and environments without hardcoded absolute paths.
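To make the setup above self-bootstrapping, the data directories can be created idempotently at startup. A minimal sketch; here a temporary directory stands in for the real project root so the snippet is runnable anywhere:

```python
from pathlib import Path
import tempfile

# A temporary directory stands in for the actual project root.
project_dir = Path(tempfile.mkdtemp())
raw_dir = project_dir / 'data' / 'raw'

# parents=True creates intermediate directories as needed;
# exist_ok=True makes repeated runs a no-op instead of an error.
raw_dir.mkdir(parents=True, exist_ok=True)
```

In a real project you would use the `parents[1]`-based `project_dir` from above instead of a temp directory.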
⸻
Data Ingestion
Leverage pandas’ flexible I/O to load raw data into a DataFrame.
import pandas as pd
df = pd.read_csv(raw_dir / 'titanic.csv')
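Ingestion is also where schema problems are cheapest to catch. A self-contained sketch using `io.StringIO` in place of the CSV file (the column names mirror the real Titanic dataset; the rows are illustrative); passing an explicit `dtype` surfaces schema drift at load time rather than downstream:

```python
import io
import pandas as pd

# io.StringIO stands in for raw_dir / 'titanic.csv'.
csv_text = """PassengerId,Sex,Age
1,male,22
2,female,
3,female,26
"""

# Explicit dtypes fail fast if a column arrives with unexpected values.
df = pd.read_csv(io.StringIO(csv_text), dtype={'PassengerId': 'int64'})
```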
⸻
Data Quality Checks
Before any transformation, profile the dataset:
df.head(5) # Preview first 5 rows
df.info() # Structure: row counts, dtypes, null counts
df.dtypes # Column data types
This step defines your cleaning strategy based on data shape, completeness, and types.
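Beyond the structural overview, per-column null counts and numeric summaries quantify exactly what the cleaning steps below must handle. A sketch on a small illustrative frame:

```python
import io
import pandas as pd

# Illustrative data; in the workflow this is the ingested Titanic frame.
df = pd.read_csv(io.StringIO("Sex,Age\nmale,22\nfemale,\nfemale,26\n"))

# Per-column null counts complement df.info()'s row-level summary.
na_counts = df.isna().sum()

# describe() surfaces count, range, and spread for numeric columns.
age_summary = df['Age'].describe()
```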
⸻
Missing Data Strategy
1. Row-level NA removal
Drop rows with too many missing values; dropna's thresh parameter is the minimum number of non-NA values a row must contain to be kept:
threshold = len(df.columns)//2
df.dropna(thresh=threshold, inplace=True)
2. Imputation
For Age, the median is preferred over the mean because it is robust to skew and outliers:
df['Age'] = df['Age'].fillna(df['Age'].median())
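The two steps above can be traced end-to-end on a toy frame (columns chosen for illustration) to verify which rows survive and what the imputed value is:

```python
import numpy as np
import pandas as pd

# Toy frame: one row is entirely missing, one is missing only Age.
df = pd.DataFrame({
    'Age':  [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, np.nan, 8.05, 13.0],
    'Sex':  ['male', np.nan, 'female', 'female'],
})

# Step 1: keep rows with at least half the columns non-null (3 // 2 = 1),
# which drops only the all-NaN row.
threshold = len(df.columns) // 2
df = df.dropna(thresh=threshold)

# Step 2: fill the remaining Age gap with the median of observed ages.
df['Age'] = df['Age'].fillna(df['Age'].median())
```

The row missing only Age is kept and imputed; the fully empty row is dropped.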
⸻
Categorical Encoding
Map human-readable categories to numeric codes for model compatibility.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})  # values absent from the mapping become NaN
For multi-category features, consider pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
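For a multi-category column, a one-hot sketch with pd.get_dummies (Embarked is a real Titanic column; the values here are illustrative); drop_first=True removes one redundant dummy to avoid collinearity in linear models:

```python
import pandas as pd

# Illustrative multi-category feature.
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per remaining category; the first
# (alphabetically, 'C') is dropped to avoid the dummy-variable trap.
encoded = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
```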
⸻
Full Example
from pathlib import Path
import pandas as pd
project_dir = Path.cwd().parents[1]
raw_dir = project_dir / 'data' / 'raw'
df = pd.read_csv(raw_dir / 'titanic.csv')
print(df.head(5))
df.info()  # info() prints directly and returns None, so do not wrap it in print()
threshold = len(df.columns) // 2
df.dropna(thresh=threshold, inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
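A few cheap assertions after the pipeline catch regressions early. A sketch on a hypothetical already-cleaned frame, checking the invariants the steps above are meant to guarantee:

```python
import pandas as pd

# Stand-in for the cleaned Titanic frame produced by the pipeline.
df = pd.DataFrame({'Sex': [0, 1, 1], 'Age': [22.0, 38.0, 26.0]})

# Post-pipeline invariants: no missing ages, Sex fully encoded as 0/1.
assert df['Age'].notna().all()
assert df['Sex'].isin([0, 1]).all()
```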
⸻
Best Practices & Notes
• Check NA distribution early: df.isna().sum()
• Always define NA thresholds in relative terms (e.g., percentage of columns), not magic numbers.
• For categorical encoding, keep mapping logic centralized for consistency across pipelines.
• Either reassign (df = df.dropna(...)) or pass inplace=True; forgetting both silently discards the transformed result. Explicit reassignment is generally the more readable choice.
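The relative-threshold advice above can be written directly in terms of a tolerated missing fraction; a sketch with an illustrative 60% non-null requirement:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has 3 of 4 values, row 1 has only 1 of 4.
df = pd.DataFrame({
    'a': [1, np.nan],
    'b': [np.nan, np.nan],
    'c': [3, 4],
    'd': [5, np.nan],
})

# Relative threshold: require at least 60% of columns to be non-null,
# instead of a hardcoded count.
min_non_null = int(0.6 * len(df.columns))  # 2 of 4 columns
cleaned = df.dropna(thresh=min_non_null)
```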
⸻
By following this workflow, you ensure your Titanic dataset is clean, consistent, and ready for downstream analytics or modeling—with practices that scale to production environments.