Colin Jaffe/2 min read

Refining Data: Removing Outliers for Improved Model Training

ML Project Workflow

1. Define the Problem: What outcome are you predicting and why?
2. Prepare the Data: Clean, normalize, encode categoricals, split into train/test.
3. Train Models: Start simple; logistic regression baselines often surprise.
4. Evaluate & Iterate: Confusion matrix, ROC, F1; pick metrics that match the problem.


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Remove outliers from car sales data and retrain the model. Watch this tutorial to learn the key concepts and techniques.

Let's go back a step and, instead of looking at X, look at car sales, our overall dataset. That has all of our values, and again, still 153 rows. This is the data before we split it off into X and Y, and we start here because we want to keep only the rows where engine size and price in thousands are closer to the norm.

We'll remove those outliers. We're at 153 rows. We'll set car sales equal to only the rows of car sales where the price in thousands column is less than or equal to 80.
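Here is a minimal sketch of that filter, assuming the data lives in a pandas DataFrame named `car_sales` with a price column called `Price_in_thousands`. Both names are assumptions, and the six rows below are a made-up stand-in for the real 153-row dataset.

```python
import pandas as pd

# Toy stand-in for the lesson's data; column names and values are assumptions.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 82.6, 42.0, 85.5, 19.9],
    "Engine_size": [1.8, 3.2, 5.0, 3.5, 5.5, 8.0],
})

rows_before = len(car_sales)

# Keep only the rows whose price is at most 80 (thousand).
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]

rows_removed = rows_before - len(car_sales)
print(f"{rows_before} rows before, {len(car_sales)} after, {rows_removed} removed")
```

Note that a comparison like this is False for missing values, so any row with a missing price would be dropped as well.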

All right, that cut out two rows, the two outliers where the price was greater than 80, which leaves 151. Let's do one more filter, though it may only catch rows we've already removed.

We might not actually see fewer rows. Let's see. For car sales, let's also keep only the rows where the engine size column is less than or equal to seven.
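The second filter can be sketched the same way. This assumes an engine column named `Engine_size` (an assumed name) and reuses made-up rows, so the counts are illustrative only. It also checks that applying the two filters one after the other gives the same result as a single combined condition.

```python
import pandas as pd

# Toy stand-in for the lesson's data; column names and values are assumptions.
original = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 82.6, 42.0, 85.5, 19.9],
    "Engine_size": [1.8, 3.2, 5.0, 3.5, 5.5, 8.0],
})

# First filter: price at most 80.
car_sales = original[original["Price_in_thousands"] <= 80]
rows_after_price = len(car_sales)

# Second filter: engine size at most 7, applied to the already-filtered data.
car_sales = car_sales[car_sales["Engine_size"] <= 7]
rows_after_engine = len(car_sales)

# The same result in one step, combining both conditions with &.
combined = original[
    (original["Price_in_thousands"] <= 80) & (original["Engine_size"] <= 7)
]
same_result = combined.equals(car_sales)

print(rows_after_price, rows_after_engine, same_result)
```

Comparing the row count before and after the second filter shows whether it caught anything new or only rows the price filter had already removed.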

Yeah, that removed one more row. You can see down here that the count changed from 151 to 150. All right, so we've removed three outlier rows in total.

Now our next step is to use that filtered data to redeclare our X and Y, split them into training and testing sets, retrain our model, and see how it compares. Let's take a look.
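As a rough sketch of that next step, the block below redeclares X and y from filtered data, splits them, and retrains. It rests on several assumptions that the lesson does not confirm here: the model is taken to be a scikit-learn `LinearRegression`, the only feature is `Engine_size`, the target is `Price_in_thousands`, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): price rises with engine size plus noise.
rng = np.random.default_rng(0)
engine = rng.uniform(1.5, 6.0, size=60)
price = 8 * engine + rng.normal(0, 3, size=60)
car_sales = pd.DataFrame({"Engine_size": engine, "Price_in_thousands": price})

# Add two artificial outliers: one huge engine, one very high price.
car_sales.loc[len(car_sales)] = [8.0, 20.0]
car_sales.loc[len(car_sales)] = [5.0, 85.0]

# Apply both outlier filters from the lesson.
filtered = car_sales[
    (car_sales["Price_in_thousands"] <= 80) & (car_sales["Engine_size"] <= 7)
]

# Redeclare X and y from the filtered data (feature choice is an assumption).
X = filtered[["Engine_size"]]
y = filtered["Price_in_thousands"]

# Split, retrain, and score on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on test data: {r2:.3f}")
```

Fixing `random_state` keeps the split reproducible, which makes a before-and-after comparison of the scores fairer.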