April 2, 2026 · Colin Jaffe · 3 min read

Logistic Regression with Data Scaling and Preparation

Master logistic regression through proper data preparation techniques

From Linear to Logistic

This tutorial demonstrates the key differences between linear and logistic regression implementation, focusing on data preparation and scaling techniques essential for binary classification tasks.

Linear vs Logistic Regression Implementation

| Feature | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output type | Continuous values | Binary (0/1) |
| Model declaration | LinearRegression() | LogisticRegression() |
| Data scaling | Optional | Highly recommended |
| Use case | Prediction | Classification |
Recommended: Use logistic regression for binary classification problems like employee retention prediction

Selected Features for Model Training

Categorical Variables

Low, medium, and high categories representing different levels of key factors affecting employee decisions.

Satisfaction Level

Numerical measure of employee satisfaction, a critical predictor of retention behavior.

Average Monthly Hours

Work intensity metric ranging from 157 to nearly 300 hours, showing significant variation across employees.

Promotions Count

Number of promotions received in the last five years, typically ranging from 0 to 2.

Data Preparation Workflow

1

Feature Selection

Choose relevant columns including categorical variables (low, medium, high), satisfaction level, average monthly hours, and promotion count

2

Target Variable Setup

Define y as the binary label column representing whether an employee left (1) or stayed (0)

3

Train-Test Split

Split data into training and testing sets with 20% reserved for testing using train_test_split

4

Data Scaling

Apply StandardScaler to normalize features around the mean, crucial for handling different scales
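The four steps above can be sketched end to end. The frame below is a toy stand-in for the HR dataset; the column names (the low/medium/high dummies, satisfaction_level, average_monthly_hours, promotions_last_5years, left) follow the lesson's conventions but the values are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the HR dataset (values are illustrative)
df = pd.DataFrame({
    "low":    [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
    "medium": [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "high":   [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37,
                           0.41, 0.92, 0.30, 0.55, 0.89],
    "average_monthly_hours": [157, 262, 272, 223, 159,
                              160, 299, 155, 180, 240],
    "promotions_last_5years": [0, 0, 0, 0, 0, 0, 1, 0, 0, 2],
    "left": [1, 0, 1, 0, 1, 1, 0, 1, 0, 0],
})

# Steps 1-2: feature matrix and binary target (1 = left, 0 = stayed)
X = df[["low", "medium", "high", "satisfaction_level",
        "average_monthly_hours", "promotions_last_5years"]]
y = df["left"]

# Step 3: 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 4: standardize using statistics learned from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Note that the split happens before scaling, so the scaler never sees test-set statistics.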

Feature Scale Comparison Before Scaling

Average monthly hours: ~250 · Satisfaction level: ~1 · Promotions (5 years): ~2
Why Data Scaling Matters

Features like average monthly hours (157-300 range) and promotions (0-2 range) operate on vastly different scales. Without scaling, the model may be biased toward features with larger numerical ranges.
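A quick numeric illustration of this point, using made-up values on the two scales in question: hours vary by tens, promotions by single digits, but after standardization both features have identical spread.

```python
import numpy as np

# Illustrative values, not taken from the dataset
hours = np.array([157.0, 200.0, 262.0, 299.0])
promotions = np.array([0.0, 0.0, 1.0, 2.0])

def standardize(x):
    """Center at mean 0 with unit standard deviation."""
    return (x - x.mean()) / x.std()

print(hours.std(), promotions.std())  # raw spreads differ by orders of magnitude
print(standardize(hours).std(), standardize(promotions).std())  # both 1.0
```

After standardization neither feature dominates by sheer magnitude, which is exactly what the model needs.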

Model Implementation Process

1

Initialize Standard Scaler

Create StandardScaler instance to normalize feature values around the mean

2

Scale Training Data

Transform X_train using fit_transform to learn scaling parameters and apply them

3

Scale Test Data

Transform X_test using the same scaling parameters learned from training data

4

Create Logistic Regression Model

Initialize LogisticRegression model instead of LinearRegression for binary classification

5

Train the Model

Use model.fit() with scaled X_train and y_train to learn classification patterns
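The five steps condense to a few lines. The synthetic arrays below stand in for the employee features; one column deliberately has a much larger scale to mimic average_monthly_hours.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: column 2 mimics the large hours scale
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4)) * [1, 1, 50, 1] + [0, 0, 200, 0]
X_test = rng.normal(size=(20, 4)) * [1, 1, 50, 1] + [0, 0, 200, 0]
y_train = (X_train[:, 0] > 0).astype(int)  # synthetic binary label

scaler = StandardScaler()                        # step 1: initialize scaler
X_train_scaled = scaler.fit_transform(X_train)   # step 2: learn and apply
X_test_scaled = scaler.transform(X_test)         # step 3: reuse train statistics
model = LogisticRegression()                     # step 4: logistic, not linear
model.fit(X_train_scaled, y_train)               # step 5: train the classifier

print(model.predict(X_test_scaled)[:5])
```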

StandardScaler for Logistic Regression

Pros
Ensures all features contribute equally to the model
Prevents features with larger scales from dominating
Improves model convergence and training speed
Essential when features have different units and ranges
Cons
Adds extra preprocessing step to workflow
Must remember to scale new data using same parameters
Can obscure original feature interpretability


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

With our data properly formatted and initial analysis complete, we can now leverage our domain expertise to make strategic decisions about feature selection. This iterative process—one that data scientists should continuously refine—involves evaluating which columns provide predictive value, experimenting with different feature combinations, preprocessing data to handle outliers, and applying the analytical techniques we'll explore throughout this series. However, our primary focus here is demonstrating the key distinctions between logistic regression and linear regression in practice.

For our feature matrix X, we'll select columns that our exploratory analysis suggests have strong predictive power: the engineered categorical variables low, medium, and high, along with satisfaction_level and average_monthly_hours (preserving the original dataset's column naming convention). We'll also include the number of promotions received in the past five years, which often serves as a strong indicator of employee engagement and career trajectory.

Our target variable y represents our binary outcome: whether an employee left the organization (1) or remained (0). This binary classification problem is precisely where logistic regression excels, as it models the probability of class membership rather than predicting continuous values like linear regression.
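Assembling X and y might look like the sketch below; df is a small stand-in for the HR frame, with the column names assumed from the lesson.

```python
import pandas as pd

# Stand-in for the HR dataset with dummy columns already engineered
df = pd.DataFrame({
    "low": [1, 0, 0, 1], "medium": [0, 1, 0, 0], "high": [0, 0, 1, 0],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "average_monthly_hours": [157, 262, 272, 223],
    "promotions_last_5years": [0, 1, 0, 2],
    "left": [1, 0, 1, 0],  # 1 = employee left, 0 = stayed
})

X = df[["low", "medium", "high", "satisfaction_level",
        "average_monthly_hours", "promotions_last_5years"]]
y = df["left"]
print(X.shape, y.value_counts().to_dict())
```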

Next, we'll partition our data into training and testing sets using the standard 80/20 split. This approach ensures we have sufficient data for model training while reserving an unbiased holdout set for performance evaluation. The train_test_split function returns a tuple that we unpack into X_train, X_test, y_train, and y_test—a fundamental practice in supervised learning workflows.
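The split itself is one call; the arrays here are placeholders for the X and y built above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (50 rows) and binary labels
X = np.arange(100).reshape(50, 2)
y = np.tile([0, 1], 25)

# test_size=0.2 reserves 20% of rows as the holdout set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 40 10
```

Fixing random_state makes the split reproducible across runs.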

Feature scaling becomes critical at this stage due to the disparate scales across our numerical variables. Consider the contrast: average_monthly_hours ranges from approximately 150 to nearly 300, while promotions_last_5years typically contains small integers (0, 1, or 2). Without standardization, features with larger magnitudes would disproportionately influence the model's decision boundary. StandardScaler addresses this by centering each feature around its mean with unit variance, ensuring all variables contribute proportionally to the learning process.
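Concretely, StandardScaler learns each feature's mean and standard deviation from the training set alone and reuses them on the test set; the small arrays below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: hours and promotion counts
X_train = np.array([[157.0, 0], [200.0, 1], [262.0, 0], [299.0, 2]])
X_test = np.array([[180.0, 1]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn stats, then transform
X_test_scaled = scaler.transform(X_test)        # reuse train stats only

print(X_train_scaled.mean(axis=0))  # each column centered at ~0
print(scaler.mean_)                 # per-feature means learned from train
```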

The model instantiation mirrors our previous linear regression approach, with one crucial distinction: we're now using LogisticRegression instead of LinearRegression. This seemingly simple change fundamentally alters the underlying mathematics—logistic regression uses the sigmoid function to map any real-valued input to a probability between 0 and 1, making it ideal for binary classification tasks.
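The sigmoid itself is a one-liner, and evaluating it at a few points shows how raw scores become probabilities:

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5: the decision boundary
print(sigmoid(4))   # close to 1: confidently class 1
print(sigmoid(-4))  # close to 0: confidently class 0
```

Symmetry around zero means sigmoid(z) + sigmoid(-z) = 1, so the two class probabilities always sum to one.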

Finally, we train our model using the fit method on our scaled training data. This process involves the algorithm iteratively adjusting coefficients to minimize the logistic loss function, learning patterns that distinguish between employees likely to leave versus those likely to stay. With training complete, we're ready to evaluate our model's predictive performance on the holdout test set.
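Training and a first holdout check fit in a few lines; the synthetic, roughly separable data below stands in for the scaled employee features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data whose label depends linearly on the first two features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in "left" label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X_train), y_train)

# Evaluate on the holdout set, scaled with training-set statistics
accuracy = model.score(scaler.transform(X_test), y_test)
print(round(accuracy, 3))
```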

Key Takeaways

1. Logistic regression implementation differs from linear regression primarily in the model declaration, while data preparation steps remain largely similar
2. Feature scaling using StandardScaler is crucial when working with variables of different scales, such as monthly hours (150-300 range) versus promotions (0-2 range)
3. The train-test split should be performed before scaling to prevent data leakage and ensure proper model evaluation
4. Feature selection based on domain knowledge is an iterative process that benefits from continuous refinement and experimentation
5. Proper data scaling ensures all features contribute equally to the logistic regression model, preventing bias toward larger-scale variables
6. The workflow follows a systematic approach: feature selection, target setup, train-test split, scaling, model initialization, and training
7. Binary classification problems require careful consideration of feature preparation, as the model learns patterns to distinguish between two classes
8. Consistent scaling of both training and test data using the same StandardScaler parameters is essential for accurate model evaluation
