Colin Jaffe/2 min read

Extracting Pagination Data: Navigating Web Elements

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Extract the maximum page number from the pagination element using BeautifulSoup. Watch this tutorial to learn the key concepts and techniques.

Let's take a look at how we could do this step by step. The first step is a little exploration to figure out how we can hook into this element, so let's inspect it.

It's an <li> tag. That's the name of the tag, just like the <p>, <a>, and <h3> tags we've been working with. It also has a class attribute to identify it: a class of "current." The <li> next to it has the class of "next"; that's not the one we want.

The one whose class equals "current" has the text we want in it: "Page 1 of 50."

Now that we know it's an <li> with a class of "current," we can say pagination_element equals soup.find. We use find because we just want one element: the <li> with the class of "current." All right, that should do that.
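As a rough sketch of that lookup, the snippet below runs soup.find against a small inline HTML string. The markup is a hypothetical stand-in for the real page (the "pager" class and the link target are assumptions made only for illustration), and the variable name pagination_element follows the wording above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's pagination markup, not the real page source.
html = """
<ul class="pager">
  <li class="current">Page 1 of 50</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
# The keyword is class_ because `class` is a reserved word in Python.
pagination_element = soup.find("li", class_="current")

print(pagination_element)
```

Because find returns None when nothing matches, it can be worth checking the result before using it on a page whose structure you have not inspected.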

Now I want the text in it. I'm going to break this up: it could potentially be done all in one line or two, but let's do it in three. The .text of the pagination element is the text that's in it, "Page 1 of 50."

And now for our bonus, which we definitely want to do: we ultimately want the maximum number of pages. We start from that pagination content, but I want it split into words, and that's what .split will do. It takes a string and makes it into a list of words, so from "Page 1 of 50" we get the words "Page," "1," "of," and "50."

Now that it's a list of words, I want the last word in that list, so after I split it, give me index -1.
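The three steps might look like the sketch below. The one-line HTML string is a hypothetical stand-in for the real page, with extra whitespace added on purpose to show that .split() with no arguments ignores it. The names pagination_content, pagination_words, and last_word are illustrative choices, not necessarily the ones used in the video.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element stand-in for the pagination markup.
soup = BeautifulSoup('<li class="current">  Page 1 of 50  </li>', "html.parser")

pagination_element = soup.find("li", class_="current")  # step 1: the element
pagination_content = pagination_element.text            # step 2: the text inside it
pagination_words = pagination_content.split()           # step 3: split on whitespace

# Index -1 is the last item in a list.
last_word = pagination_words[-1]

print(pagination_words)  # ['Page', '1', 'of', '50']
print(last_word)         # 50 (still a string at this point)
```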

And there it is. Ooh, it is the string "50." We should probably convert it to the integer version by wrapping all of that in int().

And there we go. It's not a string now. Leaving it as a string could have caused problems later, for example when we count up to the last page.
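A minimal sketch of the conversion, using a plain string in place of the scraped text so it runs without any HTML. The name max_page is an assumption for illustration. The try block shows one way a leftover string could cause trouble later: range() refuses a string argument.

```python
pagination_content = "Page 1 of 50"

# Split into words, take the last one, and convert it to an integer.
max_page = int(pagination_content.split()[-1])
print(max_page, type(max_page))

# What could go wrong with the string version: range() only accepts integers.
try:
    range(1, "50")
except TypeError as error:
    string_problem = str(error)
    print("Using the string fails:", string_problem)
```

Note that int() raises a ValueError if the last word is not a number, so this relies on the text always ending in the page count.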

Okay, our next step is to take this and do a very complex and beautiful loop to hit up every single page on this site and get all that beautiful data.
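As a preview of where the integer becomes useful, here is a hedged sketch that only builds one URL per page and makes no requests. The URL template is entirely hypothetical; the real site's address pattern is not given here and would need to be checked in the browser first.

```python
max_page = 50  # the integer extracted from "Page 1 of 50"

# Hypothetical URL pattern, used only for illustration.
URL_TEMPLATE = "https://example.com/catalogue/page-{}.html"

# range() stops before its end value, so add 1 to include the last page.
page_urls = [URL_TEMPLATE.format(number) for number in range(1, max_page + 1)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```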