Colin Jaffe/3 min read

Analyzing Titanic Data: Combining Class and Gender for Insights

Titanic Insights

Class Mattered

First class survival rate ~62%, third class ~24%.

Gender Mattered More

Female survival ~74%; male ~19% — biggest single factor.

Combined Effect

First-class women had highest survival; third-class men lowest.

Feature Engineering

Combining class × gender often outperforms either alone in models.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Combine passenger class and gender into a categorical feature to analyze Titanic survival rates. Watch this tutorial to learn the key concepts and techniques.

We're going to do a little bit of fancy pandas DataFrame work to create a new column, p_class_sex. It will be a combination column that combines each passenger's class (first, second, or third) with their gender. So first we'll define a list of the possible values: first class female, first class male, second class female, second class male, third class female, and third class male.

Then we're going to make each row's value a combination of the p-class value and the sex value. The way we do that is by assigning to titanic_data['p_class_sex']. It's a new column, and it will be p-class plus an underscore plus sex.

There's only one more thing we need to do: p-class is a number (1, 2, or 3), while titanic_data['sex'] is a string. To convert p-class to a string so it can be concatenated with the underscore and with the value of titanic_data['sex'], we're going to use astype(str).
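As a minimal sketch of the concatenation step just described: the rows below are made up for illustration, and the column names pclass and sex are assumptions about how titanic_data is laid out, not something the lesson confirms.

```python
import pandas as pd

# Made-up stand-in rows, not real passenger records. The column names
# "pclass" and "sex" are assumptions about how titanic_data is laid out.
titanic_data = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
})

# pclass is an integer column, and integers cannot be concatenated with
# strings directly, so cast it with astype(str) first.
titanic_data["p_class_sex"] = (
    titanic_data["pclass"].astype(str) + "_" + titanic_data["sex"]
)

print(titanic_data["p_class_sex"].tolist())
# ['3_male', '1_female', '3_female', '2_male']
```

The concatenation works element by element, so every row gets its own class_sex label without an explicit loop.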

And then our last step to make this work is to make it a categorical column. That means it can take only a fixed set of possible values. We're going to say that titanic_data['p_class_sex'] is now a pandas Categorical built from titanic_data['p_class_sex'].

And the categories, in order, are the list we defined up above. Then we can take a look at the series titanic_data['p_class_sex']. There we go.
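The categorical conversion might look roughly like this. The label strings (class number, underscore, sex) and the list name p_class_sex_values are assumptions based on the description above, and the four rows are invented for the example.

```python
import pandas as pd

# Assumed labels, in the order the lesson lists them:
# class number, underscore, sex.
p_class_sex_values = [
    "1_female", "1_male", "2_female", "2_male", "3_female", "3_male",
]

# Made-up stand-in for the combined string column.
titanic_data = pd.DataFrame({
    "p_class_sex": ["3_male", "1_female", "3_female", "2_male"],
})

# Replace the plain strings with a Categorical restricted to the list above.
titanic_data["p_class_sex"] = pd.Categorical(
    titanic_data["p_class_sex"], categories=p_class_sex_values
)

# The categories keep the order of the list, and each row stores an
# integer code that indexes into that list.
print(titanic_data["p_class_sex"].cat.categories.tolist())
print(titanic_data["p_class_sex"].cat.codes.tolist())
# [5, 0, 4, 3]
```

Fixing the category order up front means plots and summaries list the groups in the same order every time, including groups that happen to have no rows.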

We've got the head and the tail of the series (third class male, first class female, third class female, etc.), all 891 rows. Great.

This is going to be really helpful, because now we can graph it, see whether the new column is valuable, and observe how these three columns (survived, passenger class, and sex) interact. So here our axis is a Seaborn count plot where x is the survived column and the hue is p_class_sex, our new column.
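A sketch of that count plot, passing titanic_data as the data argument. It uses a handful of invented rows, the column name survived is an assumption, and the Agg backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import pandas as pd
import seaborn as sns

# Assumed category labels, matching the combined column described above.
categories = ["1_female", "1_male", "2_female", "2_male", "3_female", "3_male"]

# Made-up rows for illustration only; "survived" is an assumed column name.
titanic_data = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1],
    "p_class_sex": pd.Categorical(
        ["3_male", "1_female", "3_female", "2_male", "3_female", "2_female"],
        categories=categories,
    ),
})

# One group of bars per value of survived, one bar colour per class/sex group.
ax = sns.countplot(x="survived", hue="p_class_sex", data=titanic_data)
ax.set_title("Passenger counts by survival and class/sex")
```

In a notebook the figure renders on its own; in a script, ax.figure.savefig(...) or matplotlib's plt.show() would display it.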

The data is titanic_data. And now we can see how each group did. Third-class males did very poorly; barely any of them survived. Second-class males also did very poorly.

And if you look at the females: among first-class females, only three perished and ninety-one survived. Among second-class females, six perished and seventy survived. It's only when you get to third class that the gender advantage evens out: seventy-two perished and seventy-two survived.

So gender was maybe not as decisive by the time you get down to third-class passengers; the advantage of being a woman didn't fully counteract the disadvantage of traveling third class. So yeah, we're seeing quite a lot of good data analysis here.
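The perished and survived counts read off the chart can also be tabulated directly. This is a sketch on invented rows (not the real passenger data), with survived assumed to be coded 0 for perished and 1 for survived.

```python
import pandas as pd

# Made-up rows for illustration; not the real Titanic counts.
# "survived" is assumed to be 0 (perished) or 1 (survived).
titanic_data = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1, 0],
    "p_class_sex": [
        "1_female", "1_female", "3_male", "3_male", "3_female", "3_female",
    ],
})

# Rows: class/sex group. Columns: 0 = perished, 1 = survived.
counts = pd.crosstab(titanic_data["p_class_sex"], titanic_data["survived"])

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rates = titanic_data.groupby("p_class_sex")["survived"].mean()

print(counts)
print(rates)
```

The table gives the same information as the bar heights, while the rates make groups of different sizes easier to compare.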

Our next step is to start putting this into data that the computer can read for modeling. Then we'll dive into a random forest classifier and see how it can help us analyze all this data.