April 2, 2026 · Colin Jaffe · 4 min read

Train-Test Split for Predictive Modeling in Python

Essential Guide to Proper Data Splitting Techniques

Standard Train-Test Split Ratios

80%
Training Data Percentage
20%
Testing Data Percentage
4 arrays
Data Splits Created

Core Data Components

Features (X)

Input variables containing car characteristics like fuel efficiency, horsepower, and engine size. These are the predictors your model will use.

Target (y)

Output variable representing car prices in thousands. This is what your model will learn to predict based on the features.

Standard Naming Convention

Always use X_train, X_test, y_train, y_test as variable names. These are industry standards that other programmers expect to see. Using different names creates confusion and goes against best practices.

Data Split Visualization

Training Data: 80%
Testing Data: 20%

Train-Test Split Process

1

Prepare Full Dataset

Start with 100% of your features (X) and 100% of your target variable (y), already separated from the original dataset.

2

Apply Split Function

Use train_test_split from scikit-learn to randomly divide both X and y into training and testing portions.

3

Create Four Arrays

Generate X_train, X_test, y_train, and y_test, maintaining the relationship between corresponding rows.

4

Verify Split Integrity

Confirm that training and testing arrays have matching row counts and maintain data alignment.
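The four steps above can be sketched end to end in Python. The column names and values below are illustrative stand-ins for the car dataset described in this lesson:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative dataset: car characteristics (features) and price in thousands (target)
data = pd.DataFrame({
    "fuel_efficiency": [30, 25, 22, 35, 28, 24, 31, 27, 26, 33],
    "horsepower":      [120, 180, 220, 95, 150, 200, 110, 160, 170, 100],
    "engine_size":     [1.6, 2.5, 3.0, 1.2, 2.0, 2.8, 1.4, 2.2, 2.4, 1.3],
    "price":           [22, 31, 38, 18, 27, 35, 20, 29, 30, 19],
})

# Step 1: prepare the full dataset as features (X) and target (y)
X = data.drop(columns="price")
y = data["price"]

# Steps 2-3: apply the split function, creating four arrays (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: verify split integrity (matching row counts, aligned indices)
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)
print(len(X_train), len(X_test))  # 8 2
```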

Training vs Testing Data

Feature    | Training Data       | Testing Data
Purpose    | Model Learning      | Model Evaluation
Percentage | 80%                 | 20%
Usage      | Pattern Recognition | Performance Validation
Exposure   | Seen by Model       | Unseen by Model
Recommended: Keep testing data completely separate from training to ensure unbiased model evaluation.
Random Shuffling Benefits

The train_test_split function automatically shuffles data before splitting, ensuring random distribution and preventing bias from ordered datasets. This maintains data relationships while creating representative samples.
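A quick sketch of this shuffling behavior, using a tiny ordered list in place of real data (the random_state value is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

numbers = list(range(10))           # an ordered toy "dataset": 0..9
labels = [n % 2 for n in numbers]   # toy target aligned row-for-row

# shuffle=True is the default; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    numbers, labels, test_size=0.2, random_state=0
)
print(X_train)  # rows drawn in shuffled, non-sequential order

# With shuffle=False the original order is kept and the test set
# is simply the last 20% of the data:
a_train, a_test, b_train, b_test = train_test_split(
    numbers, labels, test_size=0.2, shuffle=False
)
print(a_test)  # [8, 9]
```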

Using train_test_split Function

Pros
Automatically handles random shuffling of data
Maintains row alignment between features and targets
Simple one-line implementation
Industry standard approach with consistent naming
Configurable test size parameter
Cons
Returns multiple values requiring careful unpacking
Order of returned values must be memorized
Mixing up variable assignments can cause training errors

Example Dataset Split Results

122
Training Rows
31
Testing Rows
153
Total Dataset Size

One line of beautiful code
The train_test_split function demonstrates the power of well-designed libraries - complex data splitting operations reduced to a single, elegant function call that handles randomization, alignment, and validation automatically.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

We've partitioned our dataset into X (our input features) and y (our target variable: price in thousands). With this foundation established, we need to address a critical aspect of machine learning: creating separate training and testing datasets. This separation is fundamental to building models that generalize well to unseen data.

We'll designate X_train for the 80% of data used to train our model, and y_train for the corresponding target values the model will learn to predict. Similarly, X_test and y_test represent our holdout data for evaluation. These naming conventions aren't arbitrary; they're industry standards that have evolved over decades of machine learning practice. Adhering to them ensures your code is immediately readable to other data scientists and stays consistent across teams and projects.

Deviating from these established naming patterns creates unnecessary confusion and signals inexperience to collaborators. When you see X_train, y_train, X_test, and y_test in any machine learning codebase, their purpose is instantly clear. This semantic clarity is invaluable in professional environments, where code readability directly impacts productivity and maintainability.

Currently, we have our complete dataset: 100% of our features (X) and 100% of our targets (y). The next step involves strategically dividing this data, and scikit-learn's train_test_split function provides an elegant solution. It has become the de facto standard for data splitting across the machine learning community, handling both partitioning and randomization seamlessly.

Here's our complete dataset structure: we've already separated it into features (X), our car characteristics like fuel efficiency, horsepower, and engine size, and our target variable (y), which represents price. This clean separation forms the foundation for supervised learning, where we'll establish relationships between input features and desired outputs.


The splitting process divides our features into X_train (approximately 80%) and X_test (20%), while simultaneously partitioning the target variable y into corresponding y_train and y_test segments. This creates a powerful training-validation framework: we show the model the X_train data alongside the y_train targets, allowing it to learn patterns and relationships. We then evaluate the model's performance by using X_test to predict the y_test values, providing an unbiased assessment of real-world performance.
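This fit-on-train, score-on-test pattern can be sketched as follows. The synthetic data and the LinearRegression model here are illustrative stand-ins for the lesson's car-price data and whatever model you ultimately choose:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the 153-row car dataset
X, y = make_regression(n_samples=153, n_features=3, noise=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)          # the model sees only the training portion

score = model.score(X_test, y_test)  # R^2 on data the model has never seen
print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}, test R^2: {score:.3f}")
```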

The implementation itself demonstrates the elegance of modern machine learning libraries. While this code may appear more complex than previous examples, it's remarkably straightforward once you understand the underlying mechanics. The train_test_split function returns a tuple that we unpack directly into our four variables.

Here's the essential implementation: we call train_test_split from scikit-learn, passing our X and y datasets along with the desired test_size parameter. The standard practice is test_size=0.2, allocating 20% of the data for testing while reserving 80% for training. This 80-20 split is a reliable default across most machine learning applications, providing sufficient training data while keeping enough test samples for trustworthy evaluation.

Return order is crucial here: train_test_split returns values in a specific sequence that must be unpacked correctly. The function returns X_train, X_test, y_train, y_test in that exact order. Misaligning these assignments would mix your features and targets, producing a model that tries to predict features from targets, a fundamental error that would render your entire analysis meaningless.
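A minimal sketch of that unpacking order, using small NumPy arrays in place of real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 rows of features
y = np.arange(10)                 # 10 corresponding target values

# The return order is fixed: feature splits first, then target splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape)  # (8, 2)
print(X_test.shape)   # (2, 2)
print(y_train.shape)  # (8,)
print(y_test.shape)   # (2,)
```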


After splitting, our data dimensions should align perfectly: X_train contains 122 rows, matching y_train's 122 rows. Similarly, X_test and y_test both contain 31 rows, representing our 20% test allocation. These matching dimensions confirm the split executed correctly and preserve the essential correspondence between features and their targets.
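These dimension checks can be automated with a few assertions. The random data below simply stands in for a 153-row dataset like the one in this lesson:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(153, 3))  # stand-in for 153 rows of car features
y = rng.normal(size=153)       # stand-in target values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Row counts must line up between features and targets
assert X_train.shape[0] == y_train.shape[0] == 122
assert X_test.shape[0] == y_test.shape[0] == 31
print(X_train.shape, X_test.shape)  # (122, 3) (31, 3)
```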

Examining our training data reveals another crucial aspect: the row indices are no longer sequential. This occurs because train_test_split automatically shuffles the data during partitioning, preventing any ordering bias that might exist in the original dataset. This randomization is essential for robust model training, ensuring the model learns from diverse examples rather than potentially biased sequential patterns.

Notice how our X_train might begin with rows 90, 134, and 46, completely shuffled from the original order. Critically, y_train carries the same shuffled indices, preserving the relationship between each feature set and its corresponding target value. Row 90's fuel efficiency, horsepower, and engine size still correctly correspond to row 90's price, maintaining data integrity throughout the randomization process.
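A small pandas example makes this index alignment visible. The two-column DataFrame here is a hypothetical stand-in for the car data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical two-column dataset: one feature, one target
df = pd.DataFrame({"horsepower": range(100, 110), "price": range(20, 30)})
X = df[["horsepower"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# The indices are shuffled, but identical between X_train and y_train
print(list(X_train.index))
print(list(y_train.index))
assert list(X_train.index) == list(y_train.index)
```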

This train-test split represents one of machine learning's most elegant solutions—a single line of code that handles data partitioning, randomization, and maintains relational integrity simultaneously. It's this kind of sophisticated simplicity that makes modern machine learning accessible while maintaining the rigorous standards necessary for reliable predictive modeling.


Key Takeaways

1. Always use standard naming conventions: X_train, X_test, y_train, y_test for consistency and clarity
2. The train_test_split function from scikit-learn automatically shuffles data while maintaining row relationships
3. Standard practice is to use 80% of data for training and 20% for testing model performance
4. The function returns four arrays in a specific order that must be unpacked correctly to avoid data misalignment
5. Training data teaches the model patterns, while testing data evaluates performance on unseen examples
6. Random shuffling prevents bias from ordered datasets and ensures representative sample distribution
7. Verify split integrity by checking that corresponding arrays have matching row counts
8. Never mix up variable assignments, as this will result in training the model with incorrect data relationships
