Colin Jaffe/4 min read

Filling Missing Age and Embarked Data for Titanic Analysis

Missing Data Workflow

1. Identify Missing: df.isnull().sum() shows the missing count per column.

2. Imputation Strategy: mean or median for Age; mode for Embarked.

3. Apply Fix: df['Age'] = df['Age'].fillna(df['Age'].median())

4. Verify: re-run isnull().sum() to confirm zeros.
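The four workflow steps can be sketched end to end on a tiny made-up table. This is only an illustration: the DataFrame below is a hypothetical stand-in, not the real Titanic data, and it uses the median for Age where the lesson itself uses gender-based means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic table.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# 1. Identify: count the missing values in each column.
missing_before = df.isnull().sum()

# 2 and 3. Choose a strategy and apply it: median for the numeric
# column, mode (most common value) for the categorical column.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# 4. Verify: every count should now be zero.
missing_after = df.isnull().sum()
print(missing_before, missing_after, sep="\n")
```

Assigning the result of fillna back to the column, rather than calling it with inplace=True on a column selection, avoids the chained-assignment warnings that newer pandas versions raise.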


This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Fill missing age values with gender-based means and replace missing embarked entries with the mode value 'S'. Watch this tutorial to learn the key concepts and techniques.

We're going to fill our age values now. We're going to fill them in with the mean age. This is maybe not the perfect way to do it, but it gives us real values that we can use, even if about 20% of these ages will be generic.

They'll at least let us work with the rest of the data. We're going to calculate the mean age for women. This is just some math, some leveraging of our DataFrame knowledge and so on.

First, let's make a women's DataFrame that will be the Titanic data where the sex value is female. So that means it'll be just a filtered version where it'll only be those rows. Now we'll say, okay, great.

I want to get the ages of those. We'll say ages of women is the women DataFrame's age column. And now that we've got just a series of numbers, we can say, okay, the mean women's age is the ages of women dot mean, rounded to one decimal place.

And we can check that out. Women's mean age is 27.9. And we'll do the same thing for men. And in fact, I'm going to do a little bit of copy and paste here and change women to men.

Okay. I think that did it. Let's run this again.

Yep. Men's mean age should be 30.7. All right. Now we're going to fill it in.
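The mean-age calculation just described might look roughly like this. The variable names follow the narration, but the exact spellings, the column names Sex and Age, and the small sample table are assumptions made for illustration, so the numbers here differ from the 27.9 and 30.7 in the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real lesson uses the full Titanic data.
titanic_data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Age": [20.0, 30.0, 35.0, np.nan, 41.0],
})

# Filter to the rows for women, then pull out just the Age column.
women_df = titanic_data[titanic_data["Sex"] == "female"]
ages_of_women = women_df["Age"]

# mean() skips NaN by default; round to one decimal place.
mean_woman_age = round(ages_of_women.mean(), 1)

# Same steps for men.
men_df = titanic_data[titanic_data["Sex"] == "male"]
ages_of_men = men_df["Age"]
mean_man_age = round(ages_of_men.mean(), 1)

print(mean_woman_age, mean_man_age)
```

Because mean() ignores missing values, the passengers with no recorded age do not distort the averages that will later be used to fill them in.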

We're going to use apply to fill these ages into any empty spots, any NA values. So I'm going to define a function. It should be called maybe fill mean age by gender.

We'll take in a passenger. This is a function that's just in charge of one row at a time, and then we'll apply this to all the rows.


We'll have an if. We'll say if it's not true that there's no age. This is a little bit of, you know, computer double-negative kind of thing.

But basically this is saying if there is already an age value. In that case, we want to return that passenger's age. Right? There already was an age value.

So that's what we want. Now, if we get to the elif, then that means that they don't have an age already. We'll say, okay, if their sex is male, then return mean man age.

Else, they must be a woman, so return mean woman age. So it's just a little function that's in charge of one passenger, returning the right value for them. If they already have an age, return their age.

If they're a man, return mean man age. Otherwise, return mean woman age. So what we need pandas to do for us is to apply this function to every row.

We'll say Titanic data at age is now equal to the version where we apply it to every single one. And the axis we want to apply it to is columns, which means pandas hands our function one row, one passenger, at a time. And let's just take a look at Titanic data.
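Here is one way the function and the apply call could be written. It is a sketch under assumptions: the function name follows the narration, the column names Sex and Age and the four sample rows are hypothetical, and the two means are hard-coded to the values quoted in the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with two missing ages.
titanic_data = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [40.0, np.nan, np.nan, 19.0],
})

# Means quoted in the lesson, hard-coded here for the example.
mean_man_age = 30.7
mean_woman_age = 27.9


def fill_mean_age_by_gender(passenger):
    """Return the age to store for one passenger (one row)."""
    if not pd.isna(passenger["Age"]):
        # There already is an age value: keep it.
        return passenger["Age"]
    elif passenger["Sex"] == "male":
        return mean_man_age
    else:
        return mean_woman_age


# axis="columns" (the same as axis=1) passes each row to the function.
titanic_data["Age"] = titanic_data.apply(
    fill_mean_age_by_gender, axis="columns"
)
print(titanic_data)
```

A row-wise apply is easy to read but slow on large tables; on a dataset the size of the Titanic one, the difference is not noticeable.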

Okay. 27.9, 30.7. I don't see any of those values here. Did it work? Well, we could check a couple of things.

First, we could check, are there still any NA values for age? No. So that's good. And now, let's try to see if we can find any rows where they are a man and have that men's mean age.


We'll say men_at_mean_age equals Titanic data where Titanic data at age is mean man age and they're a man. And then let's just look at that.

All right. Looks like they've got 30.7. And there are 124 that got fixed. All right.
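The two checks described above could be sketched like this. The name men_at_mean_age comes from the narration; the column names and the sample rows, which stand in for data that has already been filled, are assumptions.

```python
import pandas as pd

# Hypothetical rows representing the data after the fill step.
titanic_data = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [30.7, 27.9, 22.0, 30.7],
})
mean_man_age = 30.7

# Check 1: no missing ages should remain.
remaining_missing = titanic_data["Age"].isna().sum()

# Check 2: rows where the passenger is a man and has the men's mean age.
# Each condition needs its own parentheses when combined with &.
men_at_mean_age = titanic_data[
    (titanic_data["Age"] == mean_man_age)
    & (titanic_data["Sex"] == "male")
]

print(remaining_missing, len(men_at_mean_age))
```

The length of the filtered DataFrame is the count of filled rows, which is the figure the lesson reports as 124 for the full dataset.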

That looks pretty good. We can check the women too. But I'll let you know right now.

It worked. All right. Next, we're going to take care of the embarked values, of which there are two NAs.

So the way we're going to do that is we're just going to fill in those two NA values with 'S'. That's because that's the mode; it's the most common value. So let's say Titanic data at embarked equals Titanic data at embarked, where we fill NA with 'S'. And let's just check the isna sum.

All right. The only ones left are cabins. And again, we're not going to worry about those.
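The Embarked fill and the final check might look something like this. The column names Embarked and Cabin and the sample rows are assumptions for illustration; computing the mode is shown only to confirm that 'S' is the most common value before filling with it.

```python
import pandas as pd

# Hypothetical sample rows with two missing Embarked values.
titanic_data = pd.DataFrame({
    "Embarked": ["S", None, "C", "S", None, "Q"],
    "Cabin": ["C85", None, None, "E46", None, None],
})

# mode() returns the most common value(s); take the first one.
most_common_port = titanic_data["Embarked"].mode()[0]

# Fill the missing entries with 'S', as in the lesson.
titanic_data["Embarked"] = titanic_data["Embarked"].fillna("S")

# Count what is still missing in each column.
missing_counts = titanic_data.isna().sum()
print(most_common_port)
print(missing_counts)
```

After this step only the Cabin column still reports missing values, which matches what the lesson leaves unhandled.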

So next, we are going to do some data analysis and try to figure out which of these features seems important enough to train our model on.