Colin Jaffe/3 min read

One-Hot Encoding for Categorical Data in Machine Learning

One-Hot Encoding

What It Does

Converts a categorical variable into binary columns, one per category.

Why It Matters

Most ML algorithms expect numeric inputs; one-hot avoids implying order.

pd.get_dummies()

Pandas one-liner that does the encoding for you.

Drop First

drop_first=True avoids multicollinearity in regression models.
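As a rough sketch of what drop_first changes, the toy Series below (made-up values, not the lesson's data) is encoded both ways. With drop_first=True, pandas omits the column for the first category, and a row in that category shows up as all zeros in the remaining columns.

```python
import pandas as pd

# Toy data for illustration only; these values are assumptions.
salary = pd.Series(["low", "medium", "high", "low"], name="salary")

# Full encoding: one column per category.
full = pd.get_dummies(salary, dtype=int)

# Reduced encoding: the first category's column ("high") is dropped.
reduced = pd.get_dummies(salary, dtype=int, drop_first=True)

print(full.columns.tolist())     # ['high', 'low', 'medium']
print(reduced.columns.tolist())  # ['low', 'medium']
print(reduced)                   # the "high" row is 0 in both remaining columns
```

The dropped column carries no extra information, since it can always be worked out from the others. That redundancy is what causes multicollinearity in regression models.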

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Convert categorical salary data into numerical format using Pandas' one-hot encoding. Watch this tutorial to learn the key concepts and techniques.

Let's take a look at how we can convert these values into numbers for the computer to be able to model. If you remember our linear regression, we ended up with sets of arrays that contained just numbers. We can do a similar thing here, except that we are going to want to convert them into zeros and ones.

That's because our high, low, and medium aren't really measurements. We can't meaningfully average them. There's a mean salary behind the data, but we're not sure exactly what these categories represent numerically.

Instead, they are just going to be zeros and ones: a one for low, a one for medium, or a one for high. Now, if a single column held a one for every category, everything would be one and we couldn't tell the categories apart. So instead, we're going to have three separate columns.

Each row will have a low column, a medium column, and a high column. Each of these columns will simply hold a one if that row belongs to that category (say, high) or a zero if it doesn't. So every row will have a one in exactly one of the low, medium, or high columns, and zeros in the others.
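To make the three-column idea concrete, here is a minimal sketch that builds the columns by hand. The four salary values are made up for illustration, and this is not how the lesson does it.

```python
import pandas as pd

# Made-up example values; the real data has many more rows.
salary = pd.Series(["low", "medium", "high", "medium"], name="salary")

# One column per category: 1 where the row is in that category, else 0.
manual = pd.DataFrame({
    "low": (salary == "low").astype(int),
    "medium": (salary == "medium").astype(int),
    "high": (salary == "high").astype(int),
})

print(manual)

# Every row has exactly one 1 across the three columns.
print(manual.sum(axis=1).tolist())  # [1, 1, 1, 1]
```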

That way, the computer will just look at zeros and ones and say, "Okay, there's a one here in this column." Again, it doesn't know what these columns represent, but it can find patterns: perhaps rows with a one in this column tended to stay and rows with a zero tended to leave, or the other way around. Either way, the columns give it information it can hopefully base predictions on, in a format it understands.


We'll use a technique called one-hot encoding that takes categorical data (which category you're in) and converts it into ones and zeros. And we'll use the Pandas get_dummies function to return a new DataFrame with these new columns. "Get dummies" is a historical name: columns like these are known in statistics as "dummy variables."

But that's not how we think of it. We think of this as one-hot encoding. So here's how we're going to do that.

We're going to say salary_OHE, for "one-hot encoding," is the DataFrame we get when we run pandas.get_dummies on a column. In this case, it's the salary column of our HR data, which again holds strings: low, medium, or high. The second thing we pass it is the data type, which is int.

Note that this is not the string "int"; it's the Python built-in int, passed as the dtype so that the new columns hold integers. Let's look at that salary one-hot encoding DataFrame.
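The call described above might look like the sketch below. The name hr_data and the five rows are assumptions standing in for the lesson's real HR dataset; salary_OHE and the int data type come from the lesson.

```python
import pandas as pd

# Toy stand-in for the lesson's HR data; the name hr_data and
# these rows are assumptions for illustration.
hr_data = pd.DataFrame({"salary": ["low", "medium", "medium", "high", "low"]})

# One-hot encode the salary column; dtype=int asks for 0/1 integers.
salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

print(salary_OHE)
print(salary_OHE.columns.tolist())  # ['high', 'low', 'medium']
```

Without the dtype argument, recent pandas versions (2.0 and later) return True/False columns instead of ones and zeros, which is why passing int is useful here.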


All right, it's showing us the first five and the last five rows. It has the same number of rows as the original, so we got one encoded row for each original row. And we can see this first one was a low salary.

These two were medium salaries. These two were low salaries. And our last five were all low salaries.

So that's what we've got here.

What we need to do now is append that DataFrame to our original DataFrame so that we have high, low, and medium to work with. We'll do that next.
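One common way to attach the new columns is pd.concat with axis=1, sketched below. The lesson may use a different method in its next step; the name hr_data, the satisfaction_level column, and all values here are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the HR data; column names and values
# other than "salary" are assumptions for illustration.
hr_data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
})

salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

# axis=1 places the DataFrames side by side, matching rows by index.
combined = pd.concat([hr_data, salary_OHE], axis=1)

print(combined.columns.tolist())
# ['satisfaction_level', 'salary', 'high', 'low', 'medium']
```

Because get_dummies keeps the original row index, the encoded rows line up with the rows they came from.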