Colin Jaffe/2 min read

Data Processing with LabelEncoder for Categorical Variables

Build a Classification Model

1. Load and Inspect Data: pd.read_csv, check shape, dtypes, missing values.

2. Split and Scale: train_test_split, StandardScaler fit on train only.

3. Fit and Predict: model.fit(X_train, y_train); model.predict(X_test).

4. Evaluate: classification_report, confusion_matrix, beyond just accuracy.


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Apply LabelEncoder to convert categorical variables 'sex' and 'embarked' into numeric form. Watch this tutorial to learn the key concepts and techniques.

We're going to use LabelEncoder now for the first time. LabelEncoder is another encoding approach, related to one-hot encoding: it represents each category with a number instead of a word.

We do this because it's much easier for the computer to work with a category as a number, like 0, 1, or 2 (or passenger class 1, 2, or 3), than as a word. For sex, it understands 0 or 1 a lot better than male and female. The same applies to embarked. Embarked is another feature we'll need to label-encode because its values are S, Q, or C, and we want those to be 0, 1, or 2. So it's a somewhat simpler version of one-hot encoding.
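To make the comparison with one-hot encoding concrete, here is a minimal sketch on a tiny made-up embarked column. The four sample values are an assumption for illustration, not rows from the lesson's dataset, and pd.get_dummies is used here as one common way to one-hot encode.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of an embarked column (not the real Titanic rows).
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")

# Label encoding: a single integer column; classes are numbered in sorted order.
label_codes = LabelEncoder().fit_transform(embarked)
print(label_codes.tolist())  # [2, 0, 1, 2]

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(embarked, prefix="embarked")
print(list(one_hot.columns))  # ['embarked_C', 'embarked_Q', 'embarked_S']
```

Label encoding keeps the data to one column per feature, while one-hot encoding adds a column for every category.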

Let's use label encoding for this. To do that, we'll instantiate a LabelEncoder. We'll say le (that's a common name for this) equals LabelEncoder(), calling the class to get an actual LabelEncoder instance.

And just as I was saying, there are two categories, sex and embarked, that we'll label-encode. We're going to do this one at a time for now. We'll explore ways to speed this up later.
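As a rough sketch of what instantiating and using the encoder looks like, the block below creates le and runs fit and transform as two separate steps on a small made-up list. The sample values are assumptions for illustration; fit_transform simply combines the two calls.

```python
from sklearn.preprocessing import LabelEncoder

# Calling the class returns a new, unfitted encoder instance.
le = LabelEncoder()

# Hypothetical values standing in for a categorical column.
values = ["male", "female", "female", "male"]

# fit learns the distinct categories and stores them sorted in classes_.
le.fit(values)
print(le.classes_.tolist())  # ['female', 'male']

# transform maps each value to its index in classes_.
codes = le.transform(values)
print(codes.tolist())  # [1, 0, 0, 1]
```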

Let's start with sex. We'll say titanicData.sex equals le.fit_transform applied to titanicData.sex. And now, if I just check out that series… Looks like I have an error here.

Interesting. Oh, I forgot to run this code block. There we go.

Let's try that again. There we go. Now we can see that males and females have been encoded as ones and zeros.
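The step just described might look like the following sketch. The small DataFrame is a hypothetical stand-in for titanicData, which the lesson loads earlier, and bracket indexing is used here in place of the attribute-style titanicData.sex; both refer to the same column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the lesson's DataFrame (made-up rows).
titanicData = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

le = LabelEncoder()

# Replace the text column with its integer codes.
titanicData["sex"] = le.fit_transform(titanicData["sex"])

# Check the encoded series.
print(titanicData["sex"].tolist())  # [1, 0, 0, 1]
```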

Okay, let's do the same thing for embarked. titanicData.embarked equals le.fit_transform applied to titanicData.embarked.

Then, we'll take a look at the entire titanicData and see both of these at once. There we go. Sex has been encoded as one for male and zero for female.

And embarked is now encoded as zero, one, or two, depending on the port they embarked from. Great. Our next step is to begin manipulating this data further.
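To see which number corresponds to which category, one option is to read the encoder's classes_ attribute right after each fit_transform call. This is a sketch on made-up rows, assuming the columns have no missing values; the mapping dictionaries are illustrative helpers, not part of the lesson's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the lesson's DataFrame (made-up rows).
titanicData = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

le = LabelEncoder()

# Encode sex, then record the label-to-code mapping before reusing le.
titanicData["sex"] = le.fit_transform(titanicData["sex"])
sex_mapping = {label: code for code, label in enumerate(le.classes_.tolist())}

# Encoding embarked with the same le refits it, replacing the stored classes.
titanicData["embarked"] = le.fit_transform(titanicData["embarked"])
embarked_mapping = {label: code for code, label in enumerate(le.classes_.tolist())}

print(sex_mapping)           # {'female': 0, 'male': 1}
print(embarked_mapping)      # {'C': 0, 'Q': 1, 'S': 2}
print(le.classes_.tolist())  # ['C', 'Q', 'S']
```

Because le is refit on each column, it only remembers the most recent one. If the codes need to be converted back to labels later, keeping a separate encoder or a saved mapping per column is one way to handle that.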

We'll split it into X and y and see what we can do next.