Splitting Data into Training and Testing Sets for Modeling
Master data preparation for machine learning models
Iris Dataset Overview
Dataset Features
Sepal Length
Measurement of sepal length in centimeters. One of the four key morphological features used for species classification.
Sepal Width
Measurement of sepal width in centimeters. Provides dimensional context for flower structure analysis.
Petal Length
Measurement of petal length in centimeters. Often the most distinguishing feature between iris species.
Petal Width
Measurement of petal width in centimeters. Completes the dimensional profile for accurate classification.
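As a rough illustration of the four features described above, the sketch below loads a copy of the iris data and lists its columns. It assumes scikit-learn's bundled dataset and its column names (for example `sepal length (cm)`); the lesson's own dataframe may use different names, and `iris_df` is an illustrative variable name.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the lesson's dataframe.
iris = load_iris(as_frame=True)
iris_df = iris.frame  # four feature columns plus a "target" column

# The four measurements, all recorded in centimeters.
feature_columns = list(iris.feature_names)

print(feature_columns)
print(iris_df.shape)
```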
Data Preparation Workflow
Define Feature Matrix (X)
Extract the four feature columns from the iris dataframe: sepal length, sepal width, petal length, and petal width
Define Target Vector (Y)
Extract the target column containing species labels encoded as 0, 1, and 2
Apply Train-Test Split
Call train_test_split with a test size of 0.2, which keeps 80% of the rows for training and holds out 20% for testing the model
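The three steps above can be sketched as follows. This is a minimal example rather than the lesson's exact code: it assumes scikit-learn's copy of the iris data, and the `random_state=42` seed is an added assumption that only makes the shuffle repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the lesson's dataframe.
iris_df = load_iris(as_frame=True).frame

# Step 1: feature matrix X holds every column except the label.
feature_columns = [col for col in iris_df.columns if col != "target"]
X = iris_df[feature_columns]

# Step 2: target vector Y holds the species labels (0, 1, 2).
Y = iris_df["target"]

# Step 3: hold out 20% of the rows for testing.
# random_state is an assumption added here for reproducibility.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```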
Training vs Testing Split Distribution
A test size of 0.2 (20%) sets aside 30 of the 150 samples for testing and leaves 120 for training. This ratio keeps most of the data available for fitting while holding back enough samples for a meaningful check of model performance.
Split Results
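One way to confirm the split is to print the shape of each piece. The sketch below assumes scikit-learn's iris data loaded as arrays; the seed is an illustrative assumption and does not affect the sizes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data, returned as NumPy arrays.
X, Y = load_iris(return_X_y=True)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42  # seed is an illustrative assumption
)

# Collect the shapes so they can be inspected or compared.
shapes = {
    "X_train": X_train.shape,
    "X_test": X_test.shape,
    "Y_train": Y_train.shape,
    "Y_test": Y_test.shape,
}
for name, shape in shapes.items():
    print(name, shape)
```

With 150 rows and a 0.2 test size, the training pieces have 120 rows and the testing pieces have 30.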
Data Validation Checklist
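Verify column names against the dataframe: prevents errors from typos in column references
Confirm the target column has no missing values: ensures a complete dataset without missing labels
Shuffle the rows before splitting: random sampling prevents bias in model evaluation
Keep X and Y aligned row for row: ensures inputs correspond to the correct outputs
In other words, each row of X_train pairs with the label at the same position in Y_train, and the same holds for X_test and Y_test.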
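The checklist above could be turned into a few quick assertions, as in the sketch below. It assumes scikit-learn's iris column names and a pandas dataframe; the `expected_columns` list and the seed are illustrative assumptions rather than part of the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the lesson's dataframe.
iris_df = load_iris(as_frame=True).frame

# Hypothetical list of the columns we expect to reference.
expected_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "target",
]

# Check 1: every referenced column exists (catches typos in names).
missing = [col for col in expected_columns if col not in iris_df.columns]
assert not missing, f"Missing columns: {missing}"

# Check 2: no sample is missing its label.
assert iris_df["target"].notna().all()

X = iris_df[expected_columns[:-1]]
Y = iris_df["target"]

# Check 3: train_test_split shuffles by default (shuffle=True).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42  # seed is an illustrative assumption
)

# Check 4: features and labels still line up row for row.
aligned = X_train.index.equals(Y_train.index) and X_test.index.equals(Y_test.index)
assert aligned
```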
With X_train, X_test, Y_train, and Y_test properly configured, the data is now prepared for creating, training, and evaluating machine learning models.
Key Takeaways