Production-Grade Data Cleaning Workflow for the Titanic Dataset (pandas, 2025)


Environment & Path Setup

For reproducibility and portability, establish a consistent project root and data directory structure.

    from pathlib import Path

    # Project root: two levels above the current working directory
    # (parents[0] is the immediate parent, parents[1] its parent)
    project_dir = Path.cwd().parents[1]

    # Structured data directories
    data_dir = project_dir / 'data'
    raw_dir = data_dir / 'raw'

Why: A fixed, relative path structure ensures your workflow runs identically across machines, CI pipelines, and environments without hardcoded absolute paths.
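If the directory tree may not exist yet, it can be created idempotently. A minimal sketch (using the current working directory as the root so it runs anywhere; the pipeline above uses `Path.cwd().parents[1]` instead):

```python
from pathlib import Path

# Sketch: current working directory stands in for the project root here
project_dir = Path.cwd()
raw_dir = project_dir / 'data' / 'raw'

# Idempotent: creates missing parents, no error if the tree already exists
raw_dir.mkdir(parents=True, exist_ok=True)
print(raw_dir.is_dir())  # True
```

`mkdir(parents=True, exist_ok=True)` is safe to call on every run, which suits CI pipelines where the data directory may or may not be pre-provisioned.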

Data Ingestion

Leverage pandas’ flexible I/O to load raw data into a DataFrame.

    import pandas as pd

    df = pd.read_csv(raw_dir / 'titanic.csv')
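`read_csv` can also narrow the load up front via `usecols` and `dtype`. A sketch using a small inline sample in place of the real file (the column names assume the standard Kaggle Titanic schema):

```python
import io
import pandas as pd

# Inline stand-in for data/raw/titanic.csv (illustrative rows only)
csv_text = """Survived,Pclass,Sex,Age,Fare
0,3,male,22,7.25
1,1,female,38,71.2833
"""

# Restrict columns and downcast integer codes at read time
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Survived', 'Pclass', 'Sex', 'Age'],
    dtype={'Survived': 'int8', 'Pclass': 'int8'},
)
print(df.dtypes)
```

Declaring dtypes at ingestion catches malformed values immediately rather than letting them surface later as silent object-typed columns.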

Data Quality Checks

Before any transformation, profile the dataset:

    df.head(5)      # Preview first 5 rows
    df.info()       # Structure: row counts, dtypes, null counts
    df.dtypes       # Column data types

This step defines your cleaning strategy based on data shape, completeness, and types.
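A per-column missing-value profile is a useful companion to `df.info()`. A self-contained sketch (the inline frame stands in for the raw file):

```python
import io
import pandas as pd

# Inline stand-in for the raw Titanic data (illustrative rows only)
df = pd.read_csv(io.StringIO(
    "PassengerId,Sex,Age,Cabin\n"
    "1,male,22,\n"
    "2,female,,C85\n"
    "3,female,26,\n"
))

# Per-column NA counts and percentages, sorted worst-first
na_profile = (
    df.isna().sum()
      .to_frame('n_missing')
      .assign(pct_missing=lambda t: 100 * t['n_missing'] / len(df))
      .sort_values('pct_missing', ascending=False)
)
print(na_profile)
```

Sorting worst-first makes it obvious which columns need an imputation strategy and which are candidates for dropping outright.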

Missing Data Strategy

1.	Row-level NA removal

Drop rows that retain fewer than half of their columns as non-missing values. Note that `thresh` is the minimum number of non-NA values a row needs to be *kept*, not a count of missing values:

    threshold = len(df.columns) // 2
    df.dropna(thresh=threshold, inplace=True)

2.	Imputation

For Age, median is preferred over mean to reduce skew impact:

    df['Age'] = df['Age'].fillna(df['Age'].median())
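A finer-grained alternative (a sketch, not part of the pipeline above) imputes `Age` with subgroup medians, for example per passenger class, so the fill value reflects each subgroup's distribution:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative)
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Age':    [40.0, None, 20.0, 30.0, None],
})

# Fill each missing Age with the median of that row's Pclass group
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
print(df['Age'].tolist())  # [40.0, 40.0, 20.0, 30.0, 25.0]
```

`transform('median')` broadcasts the group median back to each row, so `fillna` can align it positionally.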

Categorical Encoding

Map human-readable categories to numeric codes for model compatibility.

    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

For multi-category features, consider pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
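For instance, `pd.get_dummies()` one-hot encodes a multi-category column like `Embarked`; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame with a multi-category column (Embarked port codes)
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One-hot encode; drop_first avoids perfectly collinear indicator columns
dummies = pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True)
print(dummies.columns.tolist())  # ['Embarked_Q', 'Embarked_S']
```

Prefer `OneHotEncoder` inside an sklearn `Pipeline` when the encoding must be fitted on training data and reapplied to unseen data with the same category set.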

Full Example

    from pathlib import Path
    import pandas as pd

    project_dir = Path.cwd().parents[1]
    raw_dir = project_dir / 'data' / 'raw'

    df = pd.read_csv(raw_dir / 'titanic.csv')

    print(df.head(5))
    df.info()   # info() prints its summary directly; print() would add a stray None

    threshold = len(df.columns) // 2
    df.dropna(thresh=threshold, inplace=True)

    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

Best Practices & Notes

•	Check NA distribution early: df.isna().sum()
•	Always define NA thresholds in relative terms (e.g., percentage of columns), not magic numbers.
•	For categorical encoding, keep mapping logic centralized for consistency across pipelines.
•	Reassign transformed results (e.g. df = df.dropna(...)) rather than relying on inplace=True; forgetting to store the result silently discards the transformation, and modern pandas is moving away from in-place operations.
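The relative-threshold advice above can be sketched as follows; the toy frame and the 60% tolerance are illustrative:

```python
import math
import pandas as pd

# Toy frame: row 0 complete, row 1 two-thirds complete, row 2 all missing
df = pd.DataFrame({
    'a': [1.0, None, None],
    'b': [1.0, 2.0, None],
    'c': [1.0, 2.0, None],
})

# Express the tolerance as a fraction of the columns, not a magic number
keep_fraction = 0.6
min_non_na = math.ceil(keep_fraction * len(df.columns))  # 2 of 3 here

cleaned = df.dropna(thresh=min_non_na)
print(len(cleaned))  # 2: the all-missing row is dropped
```

Because the threshold is derived from `len(df.columns)`, the rule keeps working if columns are later added or removed upstream.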

By following this workflow, you ensure your Titanic dataset is clean, consistent, and ready for downstream analytics or modeling—with practices that scale to production environments.

