April 2, 2026 · Colin Jaffe · 2 min read

Splitting Data into Training and Testing Sets for Modeling

Master data preparation for machine learning models

Iris Dataset Overview

150
Total samples
4
Feature columns
3
Species classes

Dataset Features

Sepal Length

Measurement of sepal length in centimeters. One of the four key morphological features used for species classification.

Sepal Width

Measurement of sepal width in centimeters. Provides dimensional context for flower structure analysis.

Petal Length

Measurement of petal length in centimeters. Often the most distinguishing feature between iris species.

Petal Width

Measurement of petal width in centimeters. Completes the dimensional profile for accurate classification.

Data Preparation Workflow

1

Define Feature Matrix (X)

Extract the four feature columns from the iris dataframe: sepal length, sepal width, petal length, and petal width

2

Define Target Vector (Y)

Extract the target column containing species labels encoded as 0, 1, and 2

3

Apply Train-Test Split

Use train_test_split with 80% training data and 20% testing data for model validation
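The three steps above can be sketched in Python with scikit-learn's bundled iris dataset (a minimal sketch; the variable names and `random_state` value are illustrative, not prescribed by the lesson):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame  # 150 rows: four measurement columns plus 'target'

# Step 1: feature matrix X -- the four measurement columns
X = df[["sepal length (cm)", "sepal width (cm)",
        "petal length (cm)", "petal width (cm)"]]

# Step 2: target vector Y -- species labels encoded as 0, 1, 2
Y = df["target"]

# Step 3: 80/20 train-test split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

The four resulting objects map directly onto the split distribution shown next: 120 training rows and 30 testing rows.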

Training vs Testing Split Distribution

Training Data: 80%
Testing Data: 20%
Test Size Selection

The 0.2 test size (20%) provides 30 samples for testing from the 150 total samples. This ratio ensures sufficient training data while maintaining adequate test samples for reliable validation.

Split Results

120
Training samples
30
Testing samples
4
Features per sample

Data Validation Checklist

X_test contains the inputs that go with the Y_test answers — the critical matched relationship between feature data (X_test) and target labels (Y_test) in the testing set
Ready for Model Training

With X_train, X_test, Y_train, and Y_test properly configured, the data is now prepared for creating, training, and evaluating machine learning models.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Let's examine the critical process of preparing our data into X and Y training and testing sets—a foundational step that determines the success of any machine learning pipeline. Our dataset is properly structured with four essential features: sepal length, sepal width, petal length, and petal width, alongside our target variable for species classification. This clean separation between features and targets will enable robust model training and evaluation.

Our feature matrix X contains the four measurement variables that serve as inputs for training our classification model. Specifically, X encompasses the iris dataframe columns: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). These continuous variables provide the dimensional characteristics that distinguish between iris species. Let's verify our feature selection is accurate—column name typos are surprisingly common and can derail an entire analysis pipeline.
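One way to guard against the column-name typos mentioned above is to check the expected names against the dataframe before selecting them (a minimal sketch; the `expected` list and variable names are assumptions for illustration):

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

# Verify the feature columns exist exactly as spelled before selecting them
expected = ["sepal length (cm)", "sepal width (cm)",
            "petal length (cm)", "petal width (cm)"]
missing = [c for c in expected if c not in df.columns]
assert not missing, f"Missing or misspelled columns: {missing}"

X = df[expected]
print(X.shape)  # (150, 4)
```

A failed assertion here surfaces a typo immediately, rather than letting it propagate as a `KeyError` deeper in the pipeline.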

Perfect—our feature matrix is correctly configured. These four measurements represent the input space our model will learn from. Now we need to establish our target variable Y, which serves as the ground truth for our supervised learning approach.

Our target variable Y consists of the iris dataframe's target column, containing 150 categorical labels encoded as integers: zeros, ones, and twos representing the three iris species (setosa, versicolor, and virginica respectively). This numeric encoding is essential for most machine learning algorithms, which require numerical inputs rather than text labels. The balanced distribution across all three classes makes this an ideal dataset for classification tasks.
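The encoding and the balanced class distribution can both be confirmed directly (a short sketch; variable names are illustrative):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
Y = iris.frame["target"]

# Labels are encoded as integers 0, 1, 2
print(sorted(Y.unique().tolist()))            # [0, 1, 2]
# Balanced classes: 50 samples of each species
print(Y.value_counts().sort_index().tolist())  # [50, 50, 50]
# Mapping the codes back to species names
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']
```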

Next, we'll implement the train-test split methodology—a crucial validation technique that prevents overfitting and provides realistic performance estimates. We'll allocate our data into X_train, X_test, Y_train, and Y_test using an 80-20 split, reserving 20% of our data for final model evaluation. This test_size of 0.2 represents current best practices for datasets of this scale, ensuring sufficient training data while maintaining a meaningful test set.

Examining our test set reveals 30 randomly sampled observations (20% of 150 total samples), with targets distributed across all three species. This random sampling ensures our test set represents the full population distribution, providing unbiased performance metrics. The corresponding feature data maintains the same sample alignment, creating matched input-output pairs essential for accurate model evaluation. With our data properly partitioned, we're positioned to build, train, and validate a robust classification model.
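The sample alignment described above can be verified after the split: because `train_test_split` shuffles X and Y together, the test features and test labels share the same row index (a minimal sketch; the `random_state` value is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

# Matched input-output pairs: each test row's features line up
# with its label via a shared index
assert (X_test.index == Y_test.index).all()
print(len(X_train), len(X_test))  # 120 30
```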

Key Takeaways

1. Feature matrix X contains four numerical columns representing iris flower measurements
2. Target vector Y contains encoded species labels as integers 0, 1, and 2 for 150 samples
3. Train-test split with 0.2 test size creates 120 training and 30 testing samples
4. Random assignment in testing ensures unbiased model evaluation across all species
5. Proper column name validation prevents common typos that cause data preparation errors
6. Feature-target alignment is critical for maintaining data integrity during the split process
7. The 80-20 split ratio provides sufficient training data while preserving adequate test samples
8. Prepared datasets enable the next phase of model creation and training workflows
