
Environment & Path Setup
For reproducibility and portability, establish a consistent project root and data directory structure.
from pathlib import Path
# Project root: two levels above the current working directory
# (parents[0] is the immediate parent, parents[1] its parent)
project_dir = Path.cwd().parents[1]
# Structured data directories
data_dir = project_dir / 'data'
raw_dir = data_dir / 'raw'
Why: A fixed, relative path structure ensures your workflow runs identically across machines, CI pipelines, and environments without hardcoded absolute paths.
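To make the setup above self-bootstrapping, the data directories can be created idempotently at startup. A minimal sketch; here a temporary directory stands in for the real project root so the snippet is runnable anywhere:

```python
from pathlib import Path
import tempfile

# A temporary directory stands in for the actual project root.
project_dir = Path(tempfile.mkdtemp())
raw_dir = project_dir / 'data' / 'raw'

# parents=True creates intermediate directories as needed;
# exist_ok=True makes repeated runs a no-op instead of an error.
raw_dir.mkdir(parents=True, exist_ok=True)
```

In a real project you would use the `parents[1]`-based `project_dir` from above instead of a temp directory.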
⸻
Data Ingestion
Leverage pandas’ flexible I/O to load raw data into a DataFrame.
import pandas as pd
df = pd.read_csv(raw_dir / 'titanic.csv')
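Ingestion is also where schema problems are cheapest to catch. A self-contained sketch using `io.StringIO` in place of the CSV file (the column names mirror the real Titanic dataset; the rows are illustrative); passing an explicit `dtype` surfaces schema drift at load time rather than downstream:

```python
import io
import pandas as pd

# io.StringIO stands in for raw_dir / 'titanic.csv'.
csv_text = """PassengerId,Sex,Age
1,male,22
2,female,
3,female,26
"""

# Explicit dtypes fail fast if a column arrives with unexpected values.
df = pd.read_csv(io.StringIO(csv_text), dtype={'PassengerId': 'int64'})
```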
⸻
Data Quality Checks
Before any transformation, profile the dataset:
df.head(5) # Preview first 5 rows
df.info() # Structure: row counts, dtypes, null counts
df.dtypes # Column data types
This step defines your cleaning strategy based on data shape, completeness, and types.
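Beyond the structural overview, per-column null counts and numeric summaries quantify exactly what the cleaning steps below must handle. A sketch on a small illustrative frame:

```python
import io
import pandas as pd

# Illustrative data; in the workflow this is the ingested Titanic frame.
df = pd.read_csv(io.StringIO("Sex,Age\nmale,22\nfemale,\nfemale,26\n"))

# Per-column null counts complement df.info()'s row-level summary.
na_counts = df.isna().sum()

# describe() surfaces count, range, and spread for numeric columns.
age_summary = df['Age'].describe()
```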
⸻
Missing Data Strategy
1. Row-level NA removal
Drop rows with too many missing values; dropna's thresh parameter is the minimum number of non-NA values a row must contain to be kept:
threshold = len(df.columns)//2
df.dropna(thresh=threshold, inplace=True)
2. Imputation
For Age, the median is preferred over the mean because it is robust to skew and outliers:
df['Age'] = df['Age'].fillna(df['Age'].median())
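The two steps above can be traced end-to-end on a toy frame (columns chosen for illustration) to verify which rows survive and what the imputed value is:

```python
import numpy as np
import pandas as pd

# Toy frame: one row is entirely missing, one is missing only Age.
df = pd.DataFrame({
    'Age':  [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, np.nan, 8.05, 13.0],
    'Sex':  ['male', np.nan, 'female', 'female'],
})

# Step 1: keep rows with at least half the columns non-null (3 // 2 = 1),
# which drops only the all-NaN row.
threshold = len(df.columns) // 2
df = df.dropna(thresh=threshold)

# Step 2: fill the remaining Age gap with the median of observed ages.
df['Age'] = df['Age'].fillna(df['Age'].median())
```

The row missing only Age is kept and imputed; the fully empty row is dropped.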
⸻
Categorical Encoding
Map human-readable categories to numeric codes for model compatibility.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})  # values absent from the mapping become NaN
For multi-category features, consider pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
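For a multi-category column, a one-hot sketch with pd.get_dummies (Embarked is a real Titanic column; the values here are illustrative); drop_first=True removes one redundant dummy to avoid collinearity in linear models:

```python
import pandas as pd

# Illustrative multi-category feature.
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per remaining category; the first
# (alphabetically, 'C') is dropped to avoid the dummy-variable trap.
encoded = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
```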
⸻
Full Example
from pathlib import Path
import pandas as pd
project_dir = Path.cwd().parents[1]
raw_dir = project_dir / 'data' / 'raw'
df = pd.read_csv(raw_dir / 'titanic.csv')
print(df.head(5))
df.info()  # info() prints directly and returns None, so do not wrap it in print()
threshold = len(df.columns) // 2
df.dropna(thresh=threshold, inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
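A few cheap assertions after the pipeline catch regressions early. A sketch on a hypothetical already-cleaned frame, checking the invariants the steps above are meant to guarantee:

```python
import pandas as pd

# Stand-in for the cleaned Titanic frame produced by the pipeline.
df = pd.DataFrame({'Sex': [0, 1, 1], 'Age': [22.0, 38.0, 26.0]})

# Post-pipeline invariants: no missing ages, Sex fully encoded as 0/1.
assert df['Age'].notna().all()
assert df['Sex'].isin([0, 1]).all()
```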
⸻
Best Practices & Notes
• Check NA distribution early: df.isna().sum()
• Always define NA thresholds in relative terms (e.g., percentage of columns), not magic numbers.
• For categorical encoding, keep mapping logic centralized for consistency across pipelines.
• Either reassign (df = df.dropna(...)) or pass inplace=True; forgetting both silently discards the transformed result. Explicit reassignment is generally the more readable choice.
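The relative-threshold advice above can be written directly in terms of a tolerated missing fraction; a sketch with an illustrative 60% non-null requirement:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has 3 of 4 values, row 1 has only 1 of 4.
df = pd.DataFrame({
    'a': [1, np.nan],
    'b': [np.nan, np.nan],
    'c': [3, 4],
    'd': [5, np.nan],
})

# Relative threshold: require at least 60% of columns to be non-null,
# instead of a hardcoded count.
min_non_null = int(0.6 * len(df.columns))  # 2 of 4 columns
cleaned = df.dropna(thresh=min_non_null)
```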
⸻
By following this workflow, you ensure your Titanic dataset is clean, consistent, and ready for downstream analytics or modeling—with practices that scale to production environments.