February 9, 2025 (Updated April 19, 2026) / Colin Jaffe / 3 min read

Data for Readability: Enhancing Index and Column Clarity

DataFrame Readability Checklist

Index named to indicate what each row represents.

Column names lowercase, snake_case, no spaces.

Date columns parsed as datetime, not strings.

Numeric columns with appropriate dtype (int vs float).

Use df.style.format() for clean number/percent display.
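
As a rough illustration of the checklist, here is a minimal sketch on a made-up DataFrame. The column names and values (Employee ID, Hire Date, and so on) are assumptions invented for this example, not data from the lesson.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "Employee ID": [101, 102, 103],
        "Hire Date": ["2021-03-01", "2022-07-15", "2023-01-09"],
        "Projects": [3.0, 5.0, 2.0],
        "Satisfaction": [0.38, 0.80, 0.11],
    }
)

# Column names: lowercase snake_case, no spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Dates: parse strings into real datetimes.
df["hire_date"] = pd.to_datetime(df["hire_date"])

# Numeric dtypes: whole-number counts stored as floats become ints.
df["projects"] = df["projects"].astype(int)

# Index: name it for what each row represents.
df = df.set_index("employee_id")

print(df.dtypes)
```

In a notebook, something like df.style.format({"satisfaction": "{:.0%}"}) could then show that column as percentages without changing the stored values. It is left out of the block above because the Styler needs the optional jinja2 package.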

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Rename the index and rearrange columns to improve readability, then identify salary as a factor influencing employee retention. Watch this tutorial to learn the key concepts and techniques.

Okay, we're going to rename some things and move things around to make this more human-readable. Zeros and ones are what the computer needs for stayed versus left, but they're not very human-readable.

And the salary columns (high, low, medium) aren't in order; they come out alphabetically rather than from low to high. So let's see what we can do there. First, we can rename our index.

Instead of zero and one, we can take the left-versus-salary crosstab and set its .index equal to a list of the row names: "Stayed," which I will hopefully spell correctly, and "Left." And now we can look at that again.

And there, now it's "Stayed" and "Left." Much easier to read. Okay, our next step is a little trickier code-wise: not hard, exactly, but definitely a little more complex.
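
Here is a minimal sketch of that index rename. The variable name left_vs_salary and the tiny DataFrame are assumptions made up for illustration; the lesson's real data and names may differ.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's data: 0 = stayed, 1 = left.
df = pd.DataFrame(
    {
        "left": [0, 0, 1, 0, 1, 0, 0, 1],
        "salary": ["low", "medium", "low", "high", "medium", "low", "high", "low"],
    }
)

# Rows come out as 0 and 1; columns come out alphabetically: high, low, medium.
left_vs_salary = pd.crosstab(df["left"], df["salary"])

# Replace the 0/1 row labels with readable names (order must match 0, then 1).
left_vs_salary.index = ["Stayed", "Left"]

print(left_vs_salary)
```

The list has to contain exactly one label per row, in the same order as the existing rows, or pandas raises a length-mismatch error.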

First, we need to remove and save the high column. If we want to move high over to the right, it's not simply a matter of moving it over. We actually have to take it off and then add it back on at the end.

So here's how we'll do that. We'll say high_column (though we can name it whatever we want) is what we get when we run .pop("high") on the crosstab, removing that column. Now don't run this yet.

If you run this partway through, you will end up losing your high column. You could go back and rerun all your earlier cells to get it back, but let's avoid having to do that at all. So let's run this only once the cell is finished, so that in a single run we remove the column and put it right back on.

Now we say insert: insert a column at index two, meaning after zero and one, as the third column. We'll call it "high" again. And its values will be that high_column that we just made.
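
The pop-then-insert move can be sketched like this. The crosstab below is built by hand with made-up counts, and the variable name crosstab is an assumption for illustration.

```python
import pandas as pd

# Hypothetical counts, not the lesson's real numbers.
crosstab = pd.DataFrame(
    {"high": [9, 1], "low": [7, 3], "medium": [8, 2]},
    index=["Stayed", "Left"],
)

# pop() removes the column from the DataFrame and returns it as a Series.
high_column = crosstab.pop("high")

# After the pop, only "low" (position 0) and "medium" (position 1) remain,
# so position 2 places "high" last.
crosstab.insert(2, "high", high_column)

print(list(crosstab.columns))
```

Keeping both lines in one cell means the column is never missing between runs. A non-mutating alternative would be to select the columns in the desired order, for example crosstab[["low", "medium", "high"]].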

And now we can run this and we should have "high" over on the right. All right, now this is much easier to read and we can see that there's a relationship here between salary (low, medium, and high) and "Stayed" or "Left." Now, again, we're not graphing this for a more visual audience.

This might be good to make a nice chart of. But for now, we can just see as numbers people, as data people, that among low-salary folks, about 70% stayed. For medium, it's higher: about the same number of medium-salary folks stayed, but only about two-thirds as many left.

That makes it about 80-20 for medium-salary folks. Only 20% of them left, versus 30% of low-salary folks. And it's vastly different for high-salary folks.

There are fewer of them, so the sample size is smaller, but it's still fairly large. Without precisely calculating, that's over 90% who stayed, around 92–93%.
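
To calculate those shares precisely rather than eyeballing them, one option is to divide each column by its total. The counts below are hypothetical, chosen only so the arithmetic lands near the percentages described; they are not the lesson's actual numbers.

```python
import pandas as pd

# Hypothetical counts chosen for easy arithmetic; not the lesson's real numbers.
crosstab = pd.DataFrame(
    {"low": [700, 300], "medium": [720, 180], "high": [184, 16]},
    index=["Stayed", "Left"],
)

# Divide each column by its own total so every salary band sums to 1.
shares = crosstab / crosstab.sum()

# Express as percentages rounded to one decimal place.
percent = (shares * 100).round(1)
print(percent)
```

When building the table from raw data, pd.crosstab also accepts normalize="columns", which returns these per-column shares directly.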

So a lot more of them stayed. Salary seems like a significant feature to train the model on, and it might be worth examining further.

Now, the computer can't look at "Stayed" and "Left." It can't look at low, medium, and high, because those are words, and they don't mean anything to the computer as words. We need to give it numbers.

That's what computers understand. So in the next step, we'll convert these values to numbers.
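
As a preview of what such a conversion might look like, here is one possible sketch using an ordinal mapping. This is an assumption about the approach, not necessarily the technique the course uses next (one-hot encoding is another common choice), and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample of the salary column.
salary = pd.Series(["low", "high", "medium", "low"], name="salary")

# Assumed ordinal encoding: low < medium < high.
salary_order = {"low": 0, "medium": 1, "high": 2}
salary_encoded = salary.map(salary_order)

print(salary_encoded.tolist())
```

An ordinal mapping like this keeps the low-to-high ordering, which matters if the model should treat "high" as greater than "medium."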