Colin Jaffe/2 min read

Visualizing and Interpreting Data

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Check the data for null values and confirm none exist, examine random samples visually, and count how many people left versus stayed to see the class imbalance.

Let's take a couple of quick looks at our data. First, we can check for any null values. The way we do that is to take HRData, check for NA (not available) values, and sum them up.
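The null check described above can be sketched as follows. This is a minimal illustration, not the lesson's actual notebook: the variable name `hr_data`, the `hours_per_week` column, and the tiny dataset are assumptions, and one missing value is planted on purpose so the output shows what a non-zero count looks like. Only the "left" column name comes from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's HR data. Only the "left" column
# name comes from the transcript; "hours_per_week" is invented.
hr_data = pd.DataFrame({
    "hours_per_week": [40.0, 52.0, np.nan, 38.0],  # one planted missing value
    "left": [1, 0, 0, 1],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
null_counts = hr_data.isna().sum()
print(null_counts)

# Summing again gives one total for the whole table; 0 would mean no nulls.
total_nulls = int(null_counts.sum())
print(total_nulls)
```

On a dataset with no missing values, every per-column count and the total would be zero.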

If we run that, there are none, which we already knew because we prepared this material. There are also a lot of rows in this dataset: if you didn't notice before, almost 15,000.

That's a lot of data without a single NA value, so it's fantastic data. We don't have to go through the steps we did last time of removing rows that don't have the data we actually want.

And we have a huge number of rows, which is a big advantage for the accuracy of our model: more data generally helps a model train better. Let's take a look at visualizing our data.

All right, we can look at how many people left and how many stayed. One way to do that is to look at some random values. Here are 10 random values, and this time most of them stayed and one left: a 1 in the "left" column means they left.

If I run that cell again, we're looking at another random sample.
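Drawing random rows like this can be sketched with pandas' `sample` method. The data below is a small hypothetical stand-in (the `hr_data` name and the 20 rows are assumptions); only the "left" column and its 0/1 meaning come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in: 20 rows with a 0/1 "left" column (1 = left).
hr_data = pd.DataFrame({"left": [0] * 15 + [1] * 5})

# Each call draws a different 10 rows, which is why rerunning the cell
# shows a new mix of people who left and stayed.
sample = hr_data["left"].sample(10)
print(sample)

# Passing random_state fixes the draw so it can be repeated exactly.
repeatable = hr_data["left"].sample(10, random_state=0)
print(repeatable)
```

Leaving `random_state` unset matches the behavior in the lesson, where every rerun gives a fresh sample.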

Now two out of 10 left. Now one out of 10, now four out of 10. These are just quick visual checks, but they give a very different perspective from looking only at the first five and last five rows, where it seems like they all left.

So here's how we get the actual answer to how many left and how many stayed: we call value_counts() on the "left" column of HRData.

What we get is 11,428 stayed (their "left" value was zero), and 3,571 left. Clearly, the majority of people stayed. All right, we'll dive into our data and perform a bit of data analysis next.
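The count above can be reproduced in a short sketch. The lesson's file isn't available here, so this rebuilds a "left" column from the two totals quoted above; the `hr_data` name and the construction are assumptions. The `normalize=True` call is an addition that turns the counts into proportions, which makes the imbalance easier to read.

```python
import pandas as pd

# Rebuild a "left" column from the totals quoted in the lesson:
# 11,428 zeros (stayed) and 3,571 ones (left).
hr_data = pd.DataFrame({"left": [0] * 11428 + [1] * 3571})

# value_counts() tallies how many rows hold each distinct value.
counts = hr_data["left"].value_counts()
print(counts)

# normalize=True returns each value's share of the rows instead of a count.
shares = hr_data["left"].value_counts(normalize=True)
print(shares.round(3))
```

With these totals, about 76% of the rows are people who stayed and about 24% are people who left.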