April 2, 2026 · Colin Jaffe · 3 min read

Preparing Titanic Dataset: Splitting and Scaling Techniques

Master essential preprocessing for machine learning models

Kaggle Competition Context

The Titanic dataset comes pre-split from Kaggle with separate training and test files, eliminating the need for manual train-test splitting that's typically required in machine learning projects.

Data Preparation Workflow

1. Feature Selection: Select relevant columns based on domain knowledge and exploratory data analysis results.

2. Target Variable Isolation: Separate the Survived column as the target variable for supervised learning.

3. Feature Scaling: Apply standardization to numerical features with different scales.

4. Model Preparation: Prepare the scaled data for training machine learning algorithms.

Selected Features for Model Training

Passenger Class

Categorical variable indicating the ticket class and social status of passengers.

Embarkation Port

Port where passengers boarded the ship, providing geographical context.

Demographics

Gender and age information crucial for survival prediction patterns.

Fare Amount

Ticket price reflecting economic status and cabin location on the ship.

Family Relationships

Number of siblings, spouses, parents, and children traveling together.

Why Scale Age and Fare

Age and fare operate on completely different scales and ranges. Without scaling, the model might incorrectly interpret the numerical magnitude differences as meaningful relationships between these features.
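To make this concrete, here is a small sketch of how unscaled magnitudes distort distance-based comparisons. The passenger values and the mean/standard-deviation figures below are illustrative assumptions, not the real Titanic statistics:

```python
import numpy as np

# Two hypothetical passengers: columns are [age in years, fare in currency units]
p1 = np.array([22.0, 7.25])
p2 = np.array([38.0, 71.28])

# Unscaled Euclidean distance is dominated by the fare difference
raw_dist = np.linalg.norm(p1 - p2)  # ≈ 66.0, almost entirely the fare gap

# Standardize each feature with assumed dataset statistics
# (the mean/std values here are illustrative, not the real figures)
means = np.array([29.7, 32.2])
stds = np.array([14.5, 49.7])
z1, z2 = (p1 - means) / stds, (p2 - means) / stds

# After z-scoring, both features contribute comparable amounts
scaled_dist = np.linalg.norm(z1 - z2)  # ≈ 1.7
```

Before scaling, the fare gap (about 64 units) swamps the age gap (16 units) in the distance; after scaling, the two differences are of similar size.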

Before vs After Scaling

| Feature | Before Scaling | After Scaling |
|---|---|---|
| Age range | 0-80 years | Centered around 0 |
| Fare range | 0-500+ currency units | Centered around 0 |
| Mean value | Original means | Zero mean |
| Scale | Original units | Standard deviation units |
Recommended: Standard scaling ensures both features contribute equally to model training
We're going to use a somewhat fancy pandas trick called, fittingly, fancy indexing. Fancy indexing lets us select and modify multiple columns at the same time, making the scaling operation more concise and readable.
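As a minimal illustration of the trick, here is a toy DataFrame (with made-up values) where two columns are selected and modified in one step via `.loc` with a list of column labels:

```python
import pandas as pd

# Toy DataFrame standing in for the Titanic features (values assumed)
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex":  ["male", "female", "female"],
})

# .loc with a list of column labels selects (and can assign back to)
# several columns across all rows in a single operation
df.loc[:, ["Age", "Fare"]] = df.loc[:, ["Age", "Fare"]] * 2

print(df[["Age", "Fare"]])  # Age and Fare doubled; Sex untouched
```

The same pattern is what lets us apply a scaler to just the continuous columns later on.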
Common Mistake Alert

Always remember to execute your code cells after writing them. Forgetting to run the standard scaler initialization is a frequent oversight that can cause confusion during data preprocessing.


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Now we'll proceed with data splitting, though our approach differs slightly from typical machine learning workflows. Since Kaggle has already provided separate training and test datasets, we're working exclusively with pre-designated training data—eliminating the need for our usual train-test-validation split.

This pre-split structure is common in competitive data science environments and mirrors real-world scenarios where test data remains sequestered until final model evaluation. It's a practice that prevents data leakage and ensures more robust model validation.

Based on our exploratory data analysis and domain expertise, we'll define our feature matrix X_train using the most predictive variables from our Titanic dataset: passenger class (Pclass), embarkation port, sex, age, fare, number of siblings/spouses aboard, and number of parents/children aboard. These features represent a carefully curated selection that balances predictive power with data quality—each chosen for its statistical significance and logical relationship to survival outcomes.

Our target variable y_train consists of the 'Survived' column, a binary label where 1 indicates survival and 0 indicates death. This straightforward labeling makes our supervised learning task clearly defined: predict passenger survival based on demographic and ticket information.
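The feature/target split described above might be sketched as follows. A small toy DataFrame (with made-up rows) stands in for Kaggle's train.csv here; the column names follow the Titanic schema, and the target is written as the conventional lowercase y_train:

```python
import pandas as pd

# Toy stand-in for Kaggle's train.csv (column names follow the
# Titanic schema; the row values are made up for illustration)
train = pd.DataFrame({
    "Pclass":   [3, 1, 3],
    "Sex":      ["male", "female", "female"],
    "Age":      [22.0, 38.0, 26.0],
    "Fare":     [7.25, 71.28, 7.92],
    "SibSp":    [1, 1, 0],
    "Parch":    [0, 0, 0],
    "Embarked": ["S", "C", "S"],
    "Survived": [0, 1, 1],
})

# Feature matrix: the columns chosen from EDA and domain knowledge
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "Embarked"]
X_train = train[features]

# Target: the binary Survived column (1 = survived, 0 = did not)
y_train = train["Survived"]
```

Keeping the feature list in a named variable makes it easy to apply the identical selection to the test file later.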

Before feeding our data into any machine learning algorithm, we must address the critical issue of feature scaling. Age values range from infants (less than 1 year) to elderly passengers (80+ years), while fare values span from nearly free passage to luxury suite prices exceeding several hundred dollars. Without proper scaling, algorithms might incorrectly weight fare as more important simply because its numerical values are larger—a classic case of letting measurement units drive model decisions rather than actual predictive relationships.

We'll employ StandardScaler, scikit-learn's standard tool for feature normalization, which transforms each feature to have a mean of zero and a standard deviation of one. One caveat: tree-based models like the random forest we'll use split on thresholds and are largely insensitive to feature scale, so scaling is not strictly required for them. Z-score normalization matters most for distance- and gradient-based algorithms (k-nearest neighbors, SVMs, logistic regression, neural networks), and applying it here keeps our preprocessing pipeline consistent if we compare such models later, preventing any single feature from dominating due to scale rather than significance.

Using pandas' powerful .loc indexer (the "fancy indexing" trick mentioned earlier, a style of label-list indexing borrowed from NumPy), we can selectively scale only the continuous variables (Age and Fare) while leaving categorical variables unchanged. The syntax X_train.loc[:, ['Age', 'Fare']] targets those columns across all rows, so the result of fit_transform can be assigned back in a single, readable operation.
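A sketch of that scaling step, using a toy feature matrix with assumed values in place of the real training data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (values assumed; real code would use the Age
# and Fare columns of the Kaggle training set)
X_train = pd.DataFrame({
    "Age":    [22.0, 38.0, 26.0, 35.0],
    "Fare":   [7.25, 71.28, 7.92, 53.10],
    "Pclass": [3, 1, 3, 1],
})

scaler = StandardScaler()

# Fit on the continuous columns and write the z-scores back in place;
# the categorical Pclass column is left untouched
X_train.loc[:, ["Age", "Fare"]] = scaler.fit_transform(X_train[["Age", "Fare"]])

print(X_train[["Age", "Fare"]].mean())  # ~0 for both columns
print(scaler.mean_)                     # the original column means
```

After `fit_transform`, the scaler object retains the training means and standard deviations (`scaler.mean_`, `scaler.scale_`), which is exactly what you would reuse (via `transform`) on the test file to avoid leakage.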

After transformation, both age and fare variables are centered around zero with unit variance, placing them on equivalent scales for our machine learning algorithm. This preprocessing step is crucial for model performance and interpretability—a fundamental practice that separates professional data science work from amateur attempts.

With our features properly scaled and our data scientifically prepared, we're ready to implement our chosen algorithm: the random forest classifier, a robust ensemble method that excels at handling mixed data types and providing reliable predictions even with limited feature engineering.
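The next step might look like the following minimal sketch. The feature matrix here is a made-up, already-encoded stand-in (Sex mapped to 0/1, Age and Fare z-scored); the real workflow would use the X_train and y_train prepared above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy numeric feature matrix and labels (values assumed; Sex is
# presumed already encoded 0/1, Age and Fare already standardized)
X_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex":    [0, 1, 1, 1, 0, 0],
    "Age":    [-0.5, 0.6, -0.2, 0.4, 1.1, -0.9],
    "Fare":   [-0.6, 1.2, -0.5, 0.9, -0.1, -0.4],
})
y_train = pd.Series([0, 1, 1, 1, 0, 0])

# Fit an ensemble of decision trees on the prepared data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions on the training rows; for a real Kaggle submission you
# would apply the same preprocessing to test.csv before predicting
preds = model.predict(X_train)
```

Note that scikit-learn estimators require fully numeric input, so categorical columns such as Sex and Embarked must be encoded before this step.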

Key Takeaways

1. Kaggle competitions often provide pre-split datasets, eliminating the need for manual train-test splitting.
2. Feature selection should be based on domain knowledge and exploratory data analysis results.
3. Target variable separation is essential for supervised learning model preparation.
4. Standard scaling is crucial when features operate on different numerical scales, like age and fare.
5. Fancy indexing in pandas enables efficient simultaneous operations on multiple columns.
6. StandardScaler centers data around zero mean and scales by standard deviation.
7. Common preprocessing mistakes include forgetting to execute code cells and improper feature scaling.
8. Proper data preprocessing sets the foundation for effective machine learning model training.
