Colin Jaffe/2 min read

Understanding Dataset Structure in Machine Learning

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Unpack digit data into training and testing datasets and verify array shapes and types. Watch this tutorial to learn the key concepts and techniques.

So having understood the shape of our data, we can now unpack it, check that our understanding was right, and get some variables to work with. All right, so we're going to unpack our digits data into testing data and training data. Those are two tuples.

Remember, it's a tuple of tuples. And testing data has X_test and Y_test. And training data has... sorry, testing data has X... yeah, I think I've got this backwards.

Let's make sure we get it right. Training data first, then testing data. Okay, yes, yes, yes.

So that's why something felt wrong when I was saying that out loud. I realized it. So our training data here is going to be X_train and Y_train, and our testing data should be our X_test and Y_test.
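To make the ordering concrete, here is a minimal sketch of unpacking a tuple of tuples. The `digits_data` value is a hypothetical stand-in that uses placeholder strings instead of real arrays, so it only illustrates the order (training pair first, testing pair second), not the lesson's actual loading code.

```python
# Hypothetical stand-in for the loaded digits data: a tuple of two tuples.
# The strings are placeholders for the real arrays.
digits_data = (("X_train", "Y_train"), ("X_test", "Y_test"))

# Unpack the outer tuple. Order matters: training data first, then testing data.
training_data, testing_data = digits_data

print(training_data)  # ('X_train', 'Y_train')
print(testing_data)   # ('X_test', 'Y_test')
```

If the two names on the left were swapped, Python would not complain. The variables would just hold the wrong halves, which is why it is worth checking shapes afterward.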

But let's see. Let's say training images, and let's spell training right, and training labels equals unpacking our training data. And the same thing for testing.

Testing images and testing labels equals our testing data. And I named them the same things that I named them before, so we've already got this printing. I'll print out the shape and the type of each of these: training images and labels, testing images and labels.
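The step just described might look roughly like the sketch below. It assumes the images and labels are NumPy arrays, and it builds tiny zero-filled stand-ins (the sample counts 5 and 3 are arbitrary) so it runs without downloading anything. The variable names follow the ones said aloud in the lesson, but the exact spelling is an assumption.

```python
import numpy as np

# Stand-in data (assumption): tiny zero-filled arrays with the same layout
# as the digits data, i.e. (images, labels) pairs of 28 x 28 images.
training_data = (np.zeros((5, 28, 28), dtype=np.uint8), np.zeros(5, dtype=np.uint8))
testing_data = (np.zeros((3, 28, 28), dtype=np.uint8), np.zeros(3, dtype=np.uint8))

# Unpack each inner tuple into images and labels.
training_images, training_labels = training_data
testing_images, testing_labels = testing_data

# Print the shape and the type of each array.
arrays = {
    "training_images": training_images,
    "training_labels": training_labels,
    "testing_images": testing_images,
    "testing_labels": testing_labels,
}
for name, array in arrays.items():
    print(name, array.shape, type(array))
```

With the real dataset, only the first number in each shape should change. The images stay three-dimensional and the labels stay one-dimensional.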

Let's run this. All right, so yep, this matches what we thought. Training images are 60,000 28 × 28 arrays.

Training labels are an array of 60,000 digits. Testing images are 10,000 28 × 28 arrays, and testing labels are 10,000 single values, in this case digits 0–9.
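Instead of reading the printed shapes by eye, the same check can be written as assertions. This is only a sketch: `check_split` is a hypothetical helper, not part of the lesson or of any library, and the demo arrays are small stand-ins for the real data.

```python
import numpy as np

def check_split(images, labels, expected_count):
    """Hypothetical helper: assert that one split has the layout described above."""
    # Images: one 28 x 28 array per sample.
    assert images.shape == (expected_count, 28, 28), images.shape
    # Labels: one single value per sample.
    assert labels.shape == (expected_count,), labels.shape
    # Each label should be a digit from 0 to 9.
    assert labels.min() >= 0 and labels.max() <= 9, "labels must be digits 0-9"
    return True

# Stand-in data (assumption): 100 blank images with random digit labels.
rng = np.random.default_rng(0)
demo_images = np.zeros((100, 28, 28), dtype=np.uint8)
demo_labels = rng.integers(0, 10, size=100, dtype=np.uint8)

print(check_split(demo_images, demo_labels, expected_count=100))  # True
```

On the real arrays, the call would pass the training or testing images and labels along with the sample count you expect for that split.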

All right, next we'll take a look at why these are 28 × 28 pixels. Oh, I spoiled it.

Why are these 28 × 28 arrays? They're pixels. Spoiler alert. Let's take a look at that next.