April 2, 2026 · Colin Jaffe · 2 min read

Refining Data: Removing Outliers for Improved Model Training

Enhance Model Performance Through Strategic Data Filtering

Why Outlier Removal Matters

Outliers can significantly skew machine learning models, leading to poor generalization and inaccurate predictions on new data.

Dataset Overview

- Initial data rows: 153
- Final filtered rows: 150
- Outliers removed: 3

Outlier Removal Process

1. Filter by Price Threshold — remove cars with a price greater than 80 thousand, eliminating 2 high-priced outliers from the dataset.

2. Filter by Engine Size — remove cars with an engine size greater than 7, eliminating 1 additional outlier with an unusually large engine.

3. Redeclare Variables — use the filtered data to recreate the X and Y variables for improved model training.

Data Reduction Through Filtering

- Original dataset: 153 rows
- After price filter: 151 rows
- After engine size filter: 150 rows

Before vs After Outlier Removal

| Feature | Before Filtering | After Filtering |
| --- | --- | --- |
| Total rows | 153 | 150 |
| Max price threshold | No limit | ≤ 80k |
| Max engine size | No limit | ≤ 7.0 |
| Data quality | Contains outliers | Normalized range |
The filtered dataset should produce better model training results, since its distribution is more representative of typical vehicles.

Next Steps for Model Training


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Let's step back from our feature matrix X and examine the complete dataset—our car sales data containing all variables across 153 rows. Before we split the data into features and targets, we need to address a critical preprocessing step: identifying and removing statistical outliers that could skew our model's performance.

Our focus centers on two key variables where extreme values appear most problematic: engine size and price. Both metrics show data points that fall well outside the normal distribution, potentially compromising our model's ability to generalize effectively to new data.
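Before filtering, it helps to confirm where the extreme values sit. A minimal sketch of this inspection step, using a toy stand-in DataFrame (the column names `price_in_thousands` and `engine_size` are assumptions; the course dataset's actual names may differ):

```python
import pandas as pd

# Toy stand-in for the car sales dataset; column names are hypothetical.
car_sales = pd.DataFrame({
    "price_in_thousands": [21.5, 28.4, 42.0, 85.5, 92.1, 18.2, 33.0],
    "engine_size": [1.8, 3.2, 3.5, 8.0, 5.7, 1.6, 2.4],
})

# Summary statistics make extreme values easy to spot: compare each
# column's max against its 75th percentile.
print(car_sales[["price_in_thousands", "engine_size"]].describe())
```

A boxplot of each column (e.g. via `car_sales.boxplot()`) would show the same outliers visually.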

We'll begin our outlier removal process by filtering the price dimension. Restricting the car sales dataset to records where the price (recorded in thousands) is less than or equal to 80 — that is, $80,000 — immediately improves our data quality.

This initial filter eliminates two clear outliers—luxury vehicles priced significantly above the $80,000 threshold that would otherwise distort our model's understanding of the price-feature relationship. Our dataset now contains 151 rows of more representative data points.
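In pandas, this price filter is a single boolean-indexing expression. A minimal sketch with a toy DataFrame (the column name `price_in_thousands` is an assumption):

```python
import pandas as pd

# Toy stand-in data; two rows sit above the 80 (thousand) threshold.
car_sales = pd.DataFrame({
    "price_in_thousands": [21.5, 28.4, 42.0, 85.5, 92.1, 18.2],
    "engine_size": [1.8, 3.2, 3.5, 8.0, 5.7, 1.6],
})

# Keep only rows priced at or below 80 thousand dollars.
car_sales = car_sales[car_sales["price_in_thousands"] <= 80]
print(len(car_sales))  # 4 rows remain: the two high-priced outliers are gone
```

The same pattern applied to the course's 153-row dataset drops it to 151 rows.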

Next, we'll apply a complementary filter to address engine size anomalies. While some overlap may exist between price and engine size outliers, this secondary filter ensures we capture any remaining edge cases that could impact model performance.

By restricting our dataset to vehicles with engine sizes of seven liters or below, we remove one additional outlier. Notice how our row count decreases from 151 to 150—confirming that this filter caught a data point missed by our price-based criteria alone.
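The engine-size filter follows the same boolean-indexing pattern, applied to the already price-filtered frame. A sketch with toy data (column names again assumed):

```python
import pandas as pd

# Toy stand-in for the price-filtered data: one row has an oversized engine
# but an unremarkable price, so the price filter alone would have missed it.
car_sales = pd.DataFrame({
    "price_in_thousands": [21.5, 28.4, 42.0, 76.0, 18.2],
    "engine_size": [1.8, 3.2, 7.4, 5.7, 1.6],
})

before = len(car_sales)
# Keep only vehicles with an engine size of 7 liters or below.
car_sales = car_sales[car_sales["engine_size"] <= 7]
print(before - len(car_sales))  # 1 row removed by this second filter
```

Checking the row count before and after each filter, as done here, is how you confirm which filter caught which outlier.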

With our cleaned dataset of 150 records, we've successfully removed statistical anomalies while preserving the underlying patterns our model needs to learn. This preprocessing step is crucial for building robust, generalizable machine learning models that perform consistently on real-world data.

Now we're ready to leverage this refined dataset for the next phase of our analysis. We'll redeclare our feature matrix X and target variable Y using the filtered data, split them into training and testing sets, retrain our model, and evaluate the performance improvements gained through proper outlier management.
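This redeclare-split-retrain sequence can be sketched as follows. The feature and target column names are placeholders, and a plain scikit-learn `LinearRegression` stands in for whatever model the course uses:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the filtered dataset; columns are hypothetical.
car_sales = pd.DataFrame({
    "engine_size":        [1.8, 2.0, 2.4, 3.0, 3.5, 4.0, 5.0, 5.7],
    "horsepower":         [120, 140, 150, 200, 230, 260, 300, 340],
    "price_in_thousands": [18.0, 21.0, 24.0, 31.0, 36.0, 42.0, 55.0, 62.0],
})

# Redeclare the feature matrix X and target y from the filtered data.
X = car_sales[["engine_size", "horsepower"]]
y = car_sales["price_in_thousands"]

# Split, retrain, and evaluate on held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Comparing this test score against the score of a model trained on the unfiltered data is how you quantify the benefit of the outlier removal.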

Key Takeaways

1. Outlier removal is essential for improving machine learning model performance and preventing skewed predictions.
2. The dataset was reduced from 153 to 150 rows by filtering extreme values in price and engine size.
3. Price filtering with a threshold of 80 thousand removed 2 outliers from the car sales data.
4. Engine size filtering with a threshold of 7.0 removed 1 additional outlier from the dataset.
5. Sequential filtering allows targeted removal of different types of outliers.
6. After outlier removal, the X and Y variables must be redeclared to reflect the cleaned dataset.
7. The filtered data provides a more normalized distribution for better model training results.
8. Comparing models trained on the original vs. filtered data will validate the outlier removal strategy.
