Colin Jaffe/3 min read

Preparing Titanic Dataset: Splitting and Scaling Techniques

Data Prep Pipeline

1. Handle Missing Values: Impute Age with median, drop or fill Cabin, encode Embarked.

2. Encode Categoricals: One-hot encode Sex and Embarked. Use pd.get_dummies().

3. Train/Test Split: from sklearn.model_selection import train_test_split (80/20 typical).

4. Scale Features: StandardScaler fits on train, transforms both train and test.
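The four steps above can be sketched end to end roughly like this. It's a minimal sketch, not the lesson's notebook: the DataFrame is a tiny made-up stand-in for the Titanic file, and filling Embarked with its most frequent value is an assumption on our part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Titanic training file; every value here is made up.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male", "female",
            "male", "male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, None, 54.0, 2.0, 14.0, 40.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 11.13, 8.46, 51.86, 21.08, 30.07, 27.72],
    "Embarked": ["S", "C", "S", "S", None, "Q", "S", "S", "C", "S"],
})

# 1. Handle missing values: median for Age, most frequent port for Embarked
#    (the choice of "most frequent" is an assumption for this sketch).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# 2. One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["Sex", "Embarked"])

# 3. 80/20 train/test split.
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4. Fit the scaler on the training rows only, then transform both sets.
num_cols = ["Age", "Fare"]
X_train = X_train.copy()  # work on copies so pandas doesn't warn about views
X_test = X_test.copy()
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting on the training rows only keeps information about the test rows from leaking into the scaling.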

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Next, we're going to split our data. We don't need to split it quite as much as we normally do, because in this case the test data is in another file.

Kaggle has already split it into training and test data for us, so everything we've got here is training data. All we need to do is separate the features from the labels.

Based on our domain knowledge and the data analysis we've done, we're going to say X_train is our Titanic data with the following columns: passenger class (Pclass), Embarked, Sex, Age, Fare, siblings and spouses (SibSp), and parents and children (Parch). Let's take a look at X_train and see if it's what we think it is. It's good.

And we're going to split up Y as well; Y is just the Survived column from the original. And there we go.

Survived or perished. Our answers, our labels. Okay.
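As a rough sketch of that feature and label selection: the DataFrame name titanic and its values below are hypothetical, and the column labels assume Kaggle's original headers (the lesson's notebook may have renamed them).

```python
import pandas as pd

# Hypothetical stand-in for the loaded Kaggle training file; values are made up.
titanic = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", "S", "S"],
})

# Features chosen in the lesson (column labels assumed to match Kaggle's).
feature_cols = ["Pclass", "Embarked", "Sex", "Age", "Fare", "SibSp", "Parch"]

# .copy() gives an independent frame, so later in-place scaling is safe.
X_train = titanic[feature_cols].copy()

# The labels: 1 for survived, 0 for perished.
y_train = titanic["Survived"]

print(X_train.head())
print(y_train.value_counts())
```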

Now, we're going to want to scale age and fare, because each of them varies quite a lot and they're on different scales.

We don't want the model to think that a fare being twice as large as an age has any meaning. So, to help the model with that, we're going to center each column on a mean of zero and divide it by its standard deviation.
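To make "center on zero and divide by the standard deviation" concrete, here is a small sketch that does the arithmetic by hand and compares it with StandardScaler. The ages are made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages, shaped as a single column because scikit-learn expects 2-D input.
ages = np.array([[22.0], [38.0], [26.0], [35.0], [54.0]])

# By hand: subtract the mean, then divide by the standard deviation.
manual = (ages - ages.mean()) / ages.std()

# The same thing with scikit-learn.
scaled = StandardScaler().fit_transform(ages)

# Both routes should give the same numbers.
same = np.allclose(manual, scaled)
print(same)
```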

We're going to use our typical tool for that, which is StandardScaler. We're going to say sc = StandardScaler(). And we don't really need to, as it says here, deal with the y data; it's already zeros and ones.

We're going to use a somewhat fancy pandas trick called fancy indexing. That's actually the community's term for it. It lets us scale age and fare at the same time: we assign to X_train.loc, for all rows and the age and fare columns, the scaler's fit_transform of that same X_train.loc selection.

And then we'll take a look at X_train. Did I forget to run this? That's exactly what I did.
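Written out, that assignment might look like the sketch below. The tiny X_train and the capitalized column labels are assumptions for illustration; the variable name sc follows the lesson.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for X_train; float columns so scaled values fit the dtype.
X_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00],
})

sc = StandardScaler()
cols = ["Age", "Fare"]

# All rows (:), just these two columns, replaced by their scaled versions.
X_train.loc[:, cols] = sc.fit_transform(X_train.loc[:, cols])

print(X_train)
```

Each column is standardized with its own mean and standard deviation, and Pclass is left untouched because it isn't in the selection.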

That's a very common mistake I make, and one that other people make as well. But I definitely make it a lot.

All right. Great. So, we can see that age and fare are now on the same scale.

And they are now both centered around zero as a mean and scaled by standard deviation. Yeah. All right.
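If you'd rather check that than eyeball it, one option is to summarize the scaled columns. This is a sketch on made-up values with assumed column labels. Note that pandas' std() uses the sample formula by default, so it reads slightly above 1 on a small frame, while StandardScaler divides by the population standard deviation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["Age", "Fare"]

# Made-up values standing in for the real training columns.
X_train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00],
})
X_train.loc[:, cols] = StandardScaler().fit_transform(X_train.loc[:, cols])

# Means should be (numerically) zero for both columns.
summary = X_train[cols].agg(["mean", "std"])
print(summary.round(3))

# With ddof=0 (population formula) the standard deviation is 1.
population_std = X_train[cols].std(ddof=0)
print(population_std.round(3))
```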

Now that we've scaled everything, it's time to start talking about the model we're going to use, random forest classifiers.