Colin Jaffe/3 min read

Domain Knowledge and Data Analysis in Model Training

Why Domain Knowledge Wins

An ML model is only as good as its features. Domain knowledge tells you which features matter, how to engineer derived signals, and which data quirks to clean. Even strong algorithms underperform with bad features — and even simple algorithms thrive when feature engineering reflects real-world insight.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Apply domain knowledge and data analysis to select relevant features for modeling car prices. Watch this tutorial to learn the key concepts and techniques.

So as we're thinking through what to train our model on and what data is important, there are two ways we typically decide. One is data analysis, and two is domain knowledge.

We're going to do both of these. We'll come to data analysis a little later, but let's start with domain knowledge. Domain knowledge means knowledge about that particular area of the world.

In this case, what do we know about cars? And I'll admit, I don't know a lot about cars. That's fine, but I do know more than the computer knows about cars. The computer, or rather the model, is just going to know numbers.

It doesn't know what a car is, or that a particular column is actually meaningless, so it might see value there. It might see patterns that aren't really there, or that are there but aren't predictive.


Maybe cars that sell on an odd-numbered day of the month, like the first, third, fifth, or seventh, sell for more in our data, and the model picks up on that pattern. But we know that's not going to be predictive at all. It's not meaningful, it doesn't make any sense, and if we ran it on more and more data, we'd see that it leads to incorrect predictions.

Our domain knowledge is what we know about cars. What can we, as humans with a bit of understanding, bring to the table to inform our model?

When you're thinking about what data needs to be considered, domain knowledge is a great beginning, but it is subjective. Maybe there is something significant about odd- and even-numbered days of the month.

That seems to make no sense, but lots of things in life don't make any sense. When you analyze them objectively, as the computer does, without any subjective bias saying, "No, that can't possibly be a pattern," you can find patterns that might actually be meaningful. So it's important to keep that in mind as we're trying to decide what could be valuable and what could not. All right, let's do that a little bit.
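To make the odd-day example above concrete, here is a minimal sketch using NumPy. It is not from the lesson: the prices and sale days are randomly generated stand-ins, and the helper name `odd_even_price_gap` is hypothetical. Because price and day are generated independently, any gap between odd and even days is pure chance. A small sample can show a gap of any size, while a very large sample should show a gap close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def odd_even_price_gap(n_cars):
    """Mean price on odd days minus mean price on even days."""
    # Assumption: made-up data. Price and sale day are drawn independently,
    # so by construction the day of the month has no real effect on price.
    price = rng.uniform(10.0, 60.0, size=n_cars)   # price in thousands
    day = rng.integers(1, 29, size=n_cars)         # days 1 through 28
    odd = day % 2 == 1
    return price[odd].mean() - price[~odd].mean()

small_gap = odd_even_price_gap(30)
large_gap = odd_even_price_gap(200_000)

print(f"gap with 30 cars:      {small_gap:+.2f}")
print(f"gap with 200,000 cars: {large_gap:+.2f}")
```

A model trained only on the small sample could treat whatever gap shows up there as a real signal, which is the kind of pattern domain knowledge helps us question.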

Let's filter our data down to five columns to start working with. We'll take our car sales data and make a new data frame out of certain columns: sales in thousands, fuel efficiency, horsepower, and engine size, plus our target value, which is price in thousands. Let's take a look at car sales now.
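The column selection described above might look something like this in pandas. This is a sketch under assumptions: the column names (`Sales_in_thousands`, `Fuel_efficiency`, and so on) are guesses based on how the columns are described, and the three rows are made up so the example runs on its own.

```python
import pandas as pd

# Assumption: made-up rows standing in for the lesson's car sales data.
car_sales = pd.DataFrame({
    "Manufacturer": ["A", "B", "C"],
    "Sales_in_thousands": [16.9, 39.4, 14.1],
    "Fuel_efficiency": [28.0, 25.0, 26.0],
    "Horsepower": [140.0, 225.0, 210.0],
    "Engine_size": [1.8, 3.2, 3.5],
    "Price_in_thousands": [21.5, 28.4, 42.0],
})

# Assumption: these names are illustrative, not confirmed by the lesson.
feature_columns = ["Sales_in_thousands", "Fuel_efficiency", "Horsepower", "Engine_size"]
target_column = "Price_in_thousands"

# Indexing with a list of names keeps every row and only the listed columns.
car_sales = car_sales[feature_columns + [target_column]]

print(car_sales.shape)  # (3, 5): same rows, five columns
print(car_sales.head())
```

Selecting by a list of names leaves the row count unchanged and only narrows the columns.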


So here they are now: the same cars, the same 157 rows. We haven't lost any cars, but we've decreased how much data is here. That's because we've decided that the inputs we think are important are these four, and this one is our target. Again, we used some domain knowledge to think about the problem: maybe the better the fuel efficiency, the more expensive the car; maybe the higher the horsepower or the bigger the engine size, the more expensive the car.

We're going to take a look and do some data analysis to see if our domain knowledge answer of what seems important is right or not.
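One common way to run that kind of check is to look at how strongly each input correlates with the target. The lesson does not say at this point which analysis it uses, so treat the following as one possible sketch, not the author's method. The column names are assumed, and the six rows are invented purely so the code runs; they say nothing about the real dataset.

```python
import pandas as pd

# Assumption: invented values for illustration only.
car_sales = pd.DataFrame({
    "Sales_in_thousands": [45.0, 60.2, 16.9, 39.4, 14.1, 8.6],
    "Fuel_efficiency": [33.0, 30.0, 28.0, 25.0, 26.0, 21.0],
    "Horsepower": [100.0, 130.0, 140.0, 225.0, 210.0, 300.0],
    "Engine_size": [1.6, 2.0, 1.8, 3.2, 3.5, 4.6],
    "Price_in_thousands": [12.0, 18.5, 21.5, 28.4, 42.0, 62.0],
})

target_column = "Price_in_thousands"

# Pearson correlation of every column with the target, minus the target itself.
correlations = car_sales.corr()[target_column].drop(target_column)

# Values near +1 or -1 suggest a strong linear relationship; values near 0 suggest little.
print(correlations.sort_values(ascending=False))
```

Correlation only measures linear relationships, so a low value does not prove a column is useless. It is a first look that can confirm or challenge a domain-knowledge guess.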