Skip to main content
Colin Jaffe/3 min read

Label Encoding, Scaling, and Model Compatibility

Encoding & Scaling Tools

Label Encoder

Maps categories to integers (0, 1, 2). Watch for ordinal implication.

One-Hot Encoder

Binary columns per category — avoids implying order.

Standard Scaler

Mean 0, std 1. Best for normally-distributed features.

MinMax Scaler

Scales to [0, 1]. Useful when bounds matter (e.g., neural nets).

Master Machine Learning at Noble Desktop

Noble Desktop's Python Machine Learning Bootcamp covers scikit-learn, Keras, neural networks, and applied ML.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Preprocess the test dataset by selecting columns, label encoding categorical features, and scaling numerical features. Watch this tutorial to learn the key concepts and techniques.

Let's label, encode, and scale our columns. So, let's first pick out which columns we want in this final version of X_test. We want the columns Pclass, Embarked, Sex, and Fare.

I want to make sure I'm consistent with the other one because we're going to be running it through the same model. Siblings, spouses, and parents with children—let me make sure I haven't missed one. If you have a different value, make sure to adjust accordingly.

If you have different values or columns, the model won't be able to work with it the same way. All right, now we're going to set X_test['Embarked'] to equal the le.fit_transform version of X_test['Embarked']. Let's run this code.

I want to make sure I'm doing this right. Yeah, this is an interesting Pandas problem where we're trying to work on a copy of a slice of a data frame. More properly, we should be using.loc here, which is not hard to do.

We just say all rows in the 'Embarked' column. Okay, and we'll do this. Let me double-check that.

No warning this time. And let's do the same thing for Sex. Those are the two columns that need label encoding.

And Age and Fare are the columns that need scaling. We'll use StandardScaler's fit_transform on those same values. This way we can do it all in one nice line.

And there we go. Yep, that's all good. Now let's just check our X_test because we just wrote a lot of code.

Did it work? No. I'm afraid I've made some kind of dreadful error. I often make errors like this.

Invalid key error. Let's investigate this. Let's see if we can figure out what I did wrong here.

Oh, I missed the.loc method here. I was saying I got all the.locs, but nope, I missed this one here. Yep.

All right, so I still have an issue. Age is not an index.

Oh, yep, I just forgot to include Age initially. All right. I'm going to reset things by running all previous code blocks.

Make sure I'm running this from the beginning. And let's try this again. There we go.

Some intricate code, but we got there in the end. I missed including age initially, and I missed a method call earlier, but we got it. All right.

Let's take a look at this. Age and Fare appear to be scaled. Embarked and Sex appear to be label encoded.

I think in the next step, we're ready to get everything set to submit to Kaggle.