Colin Jaffe/2 min read

Splitting Data into Training and Testing Sets for Modeling

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Split the iris dataset into training and testing sets for features and targets. Watch this tutorial to learn the key concepts and techniques.

Let's take a look at splitting our data into X and Y training and testing sets. The dataframe looks ready for use: we have our four features (sepal length, sepal width, petal length, and petal width), plus the target and species columns to help us better interpret the data.

All right, so X represents the features, the inputs to our model, and that means that X should be the iris dataframe restricted to the columns sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). It's very easy to make a typo there, so let's see if we got that right.
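The feature selection described above might look something like the sketch below. It assumes the lesson's `iris` dataframe resembles scikit-learn's bundled copy of the dataset, loaded here with `load_iris(as_frame=True)`; the `feature_columns` name is just for illustration and is not from the lesson.

```python
from sklearn.datasets import load_iris

# Assumption: the lesson's `iris` dataframe looks like scikit-learn's bundled copy.
iris = load_iris(as_frame=True).frame

# Illustrative helper name; the column labels must match the dataframe exactly.
feature_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Selecting with a list of names returns a dataframe with just those columns.
# A misspelled name raises a KeyError, so a typo shows up right away.
X = iris[feature_columns]

print(X.shape)   # 150 rows, 4 feature columns
print(X.head())
```

Printing the shape and the first few rows is a quick way to confirm that all four columns made it in.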

It looks like we nailed it. Okay, those are our inputs. Let's get our target, our Y, and that's very simple.

Y is the target column of the iris dataframe, and there it is: 150 rows of zeros, ones, and twos, each representing a species. Now let's use train_test_split as we did previously.
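As a rough sketch of that step, again assuming the dataframe comes from scikit-learn's `load_iris(as_frame=True)` and that the label column is named `target`:

```python
from sklearn.datasets import load_iris

# Assumption: same dataframe as before, loaded from scikit-learn's bundled copy.
bunch = load_iris(as_frame=True)
iris = bunch.frame

# The target is a single column, so selecting it gives a pandas Series.
Y = iris["target"]

print(len(Y))             # 150 rows
print(Y.value_counts())   # how many rows carry each label (0, 1, 2)

# The integer labels index into the species names stored with the dataset.
print(list(bunch.target_names))
```

The `target_names` lookup is only there to show which species each integer stands for; the model itself trains on the integers.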

We want X_train, X_test, Y_train, and Y_test to be the result of calling train_test_split on our X data and our Y data with a test size of 0.2 (20%). Let's take a look at the test data. Here are the targets for our test data, and you can see that the rows have now been randomly selected and shuffled: 30 samples, since that is 20% of 150.
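Here is a minimal sketch of the split, under the same assumption that the data comes from scikit-learn's bundled iris dataset. The `random_state=42` argument is not mentioned in the lesson; it is added here only so the shuffle is repeatable from run to run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: features and target come from scikit-learn's bundled iris data.
iris = load_iris(as_frame=True).frame
X = iris.drop(columns="target")   # the four measurement columns
Y = iris["target"]                # integer species labels

# test_size=0.2 holds out 20% of the rows for testing.
# random_state is an added assumption so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))   # 120 training rows, 30 testing rows
print(Y_test)                      # shuffled labels with their original row indices
```

Note the order of the four return values: both X pieces come first, then both Y pieces. Swapping them on the left-hand side is an easy mistake to make.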

And here are the corresponding test features, also 30 samples. These are the inputs that go with those answers. All right, next up we'll create our model, train it, and get it working.
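One way to convince yourself that the test inputs really do go with the test answers is to compare their row indices. This is a sketch under the same assumptions as before (scikit-learn's bundled iris data, and a `random_state` that the lesson does not specify); the `rows_line_up` name is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: same data and split settings as in the earlier sketch.
iris = load_iris(as_frame=True).frame
X = iris.drop(columns="target")
Y = iris["target"]

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(X_test.shape)   # (30, 4): 30 samples, 4 features each

# train_test_split picks the same rows from X and Y, so the pandas indices
# of the inputs and the answers should match row for row.
rows_line_up = X_test.index.equals(Y_test.index)
print(rows_line_up)
```

If the indices ever failed to match, the model would be scored against the wrong answers, so this is a cheap sanity check before training.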