April 2, 2026 · Colin Jaffe · 5 min read

Web Scraping: Extracting Non-Truncated Titles and Prices with Python

Master Python web scraping for complete data extraction

Web Scraping Fundamentals

200: the HTTP status code that indicates a successful request
2: key data types to extract (titles and prices)
3: main HTML tags used (`<h3>`, `<a>`, and `<p>`)
Required Libraries

This tutorial uses Beautiful Soup for HTML parsing and the requests library for HTTP operations. Both are essential for effective web scraping in Python.

Basic Web Scraping Workflow

1. Make HTTP Request: use `requests.get()` to fetch the webpage and verify that the status code is 200.

2. Parse HTML Content: create a Beautiful Soup object from the response content using the HTML parser.

3. Locate Target Elements: find the specific HTML tags containing the data you need to extract.

4. Extract and Process Data: retrieve the text content and clean it for your specific use case.
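The four steps above can be sketched end to end. To keep the example runnable without a network connection, it parses a small inline HTML snippet modeled on the book listing page; in a real run, the HTML would come from `requests.get(url)` after confirming `response.status_code == 200`.

```python
from bs4 import BeautifulSoup

# Step 1 (simulated): in practice this HTML would be response.content
# from requests.get(url), checked for status code 200 first.
sample_html = """
<html><body>
  <h3><a title="A Light in the Attic" href="#">A Light in ...</a></h3>
  <p class="price_color">£51.77</p>
</body></html>
"""

# Step 2: parse the HTML content
soup = BeautifulSoup(sample_html, 'html.parser')

# Step 3: locate the target element
title_tag = soup.find('h3')

# Step 4: extract and process the data
title = title_tag.find('a').get_text()
print(title)
```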

Loop vs List Comprehension Approaches

Feature             | Traditional Loop         | List Comprehension
--------------------|--------------------------|---------------------------
Readability         | More verbose             | Concise
Performance         | Standard                 | Slightly faster
Complexity handling | Better for complex logic | Best for simple operations
Code length         | Multiple lines           | Single line

Recommended: use list comprehensions for simple operations and loops for complex logic.
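As a concrete illustration (using a hypothetical word list rather than scraped data), both forms produce identical results:

```python
words = ['alpha', 'beta', 'gamma']

# Traditional loop: more verbose, but easy to extend with complex logic
upper_loop = []
for word in words:
    upper_loop.append(word.upper())

# List comprehension: the same transformation in a single line
upper_comp = [word.upper() for word in words]

print(upper_loop == upper_comp)  # both give ['ALPHA', 'BETA', 'GAMMA']
```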

HTML Element Targeting Strategies

By Tag Name

Use find_all('h3') to locate all H3 tags. Simple but may capture unintended elements if the page structure is complex.

By Class Attribute

Target specific elements using class names like 'price_color'. More precise than tag-only selection for styled content.

Nested Element Search

Find A tags within H3 elements using tag.find('a'). Essential for extracting specific content from complex structures.

Text vs Title Attribute Extraction

Pros
Title attributes contain complete, non-truncated content
Provides full information even when display text is shortened
Better for data analysis requiring complete text
Avoids ellipsis and truncation issues
Cons
Not all HTML elements have title attributes
Requires knowledge of the specific attribute structure
May not match exactly what users see on the page
Additional step in the extraction process
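The trade-off is easy to see side by side. In this hypothetical anchor, the display text is truncated while the `title` attribute holds the full string:

```python
from bs4 import BeautifulSoup

# Display text is shortened with an ellipsis; the title attribute is complete
html = '<h3><a title="The Grand Design" href="#">The Grand ...</a></h3>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

display_text = link.get_text()  # truncated: what the user sees on the page
full_title = link['title']      # complete, non-truncated content

print(display_text)
print(full_title)
```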

Price Data Cleaning Process

1. Extract Raw Price Strings: use Beautiful Soup to find all paragraph tags with the class `price_color` and get their text content.

2. Remove Currency Symbols: apply `.strip('£')` to each price string to remove the pound symbol (note that `strip()` with no argument only removes whitespace).

3. Convert to Numeric Format: use the `float()` function to convert the cleaned strings into numerical values for calculations and analysis.

4. Validate Results: print and verify that the prices are now proper floating-point numbers without currency symbols.
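Steps 2 through 4 can be sketched on a few sample price strings (step 1 is assumed to have produced them):

```python
# Raw price strings as they come out of Beautiful Soup
raw_prices = ['£51.77', '£53.74', '£50.10']

# Steps 2-3: strip the currency symbol, then convert to float
prices = [float(p.strip('£')) for p in raw_prices]

# Step 4: validate the results
print(prices)
assert all(isinstance(p, float) for p in prices)
```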

Data Quality Verification

Best Practice for Data Processing

Always clean and validate your scraped data immediately after extraction. Converting prices to numerical format enables proper sorting, calculations, and integration with data analysis frameworks like pandas.
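Once the prices are numeric, the operations mentioned above become trivial. This sketch uses hypothetical values and sticks to built-in functions (the same list would drop straight into a pandas Series or DataFrame):

```python
prices = [51.77, 53.74, 50.10, 47.82]

# Numeric prices unlock sorting and aggregate calculations
cheapest = min(prices)
average = sum(prices) / len(prices)

print(sorted(prices))
print(cheapest, round(average, 2))
```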

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Let's examine several approaches to solving this web scraping challenge. We'll begin by making a request and storing the response: `response = requests.get(url)`. Before proceeding, it's crucial to validate that the request succeeded by confirming the status code is 200; if it is not, we should stop rather than try to parse the page.

If the status code indicates failure, we should handle this gracefully with an appropriate error message: "Error: book data not available for scraping." This defensive programming approach prevents downstream errors and provides clear feedback when resources are inaccessible.
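One way to express this check is a small guard function (a hypothetical helper, exercised here with stand-in objects instead of a live `requests.get()` call):

```python
from types import SimpleNamespace

def response_ok(response):
    """Return True if the page was fetched successfully; otherwise report it."""
    if response.status_code != 200:
        print("Error: book data not available for scraping.")
        return False
    return True

# Stand-ins for a real requests.Response, so this runs without a network
assert response_ok(SimpleNamespace(status_code=200)) is True
assert response_ok(SimpleNamespace(status_code=404)) is False
```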

Now let's extract the book titles from the page structure. Upon inspecting the HTML, we can see that each title is contained within an anchor tag (`<a>`) nested inside an `<h3>` tag. This means we need to find all `<h3>` tags first, then extract the text from their child anchor elements.

Our strategy involves two steps: first, identify all `<h3>` elements (which we'll call `title_tags` for clarity), then iterate through each to extract the anchor tag content. However, before we can parse the HTML structure, we need to initialize our BeautifulSoup parser object.

Let's create our soup object: `soup = BeautifulSoup(response.content, 'html.parser')`. Now we can proceed with finding our target elements: `title_tags = soup.find_all('h3')`. This gives us a collection of all heading elements containing our book titles.

With our title tags identified, we can extract the actual title text from each anchor element. While this could be accomplished with a traditional for loop, a list comprehension offers a more pythonic and readable solution for this straightforward transformation.

Here's our initial approach using a loop structure: we create an empty `titles` list, then iterate through each tag in `title_tags`. For each tag, we locate its child anchor element using `tag.find('a')` and extract the text content. However, we want the raw text content, not any HTML attributes, so we'll use the `.get_text()` method.
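Spelled out, that loop looks like this (using a small inline HTML sample with placeholder titles in place of the live page):

```python
from bs4 import BeautifulSoup

html = """
<h3><a title="Book One" href="#">Book O...</a></h3>
<h3><a title="Book Two" href="#">Book T...</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')
title_tags = soup.find_all('h3')

titles = []
for tag in title_tags:
    anchor = tag.find('a')            # locate the child anchor element
    titles.append(anchor.get_text())  # raw text content, not attributes

print(titles)
```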

Let's refactor this into a more elegant list comprehension: `titles = [tag.find('a').get_text() for tag in title_tags]`. This single line accomplishes the same task as our loop while remaining highly readable. The comprehension clearly states our intent: "Create a list where, for every tag in title_tags, we find the anchor element and extract its text content."

This level of conciseness is ideal for list comprehensions. Any more complex logic would warrant returning to a traditional loop for better maintainability. The key is balancing brevity with clarity—a principle that becomes increasingly important in production web scraping applications.

Next, let's tackle price extraction. Examining the page structure, we find that prices are contained within paragraph (`<p>`) elements with the CSS class `price_color`. We can target these elements specifically using BeautifulSoup's attribute-based search functionality.

We'll use another list comprehension to extract prices: `prices = [p.get_text() for p in soup.find_all('p', class_='price_color')]`. Note how we pass the class name as a parameter to `find_all()`—BeautifulSoup handles the CSS class selection seamlessly.

This approach demonstrates the flexibility of our extraction strategy. We've used a pre-defined variable (`title_tags`) for titles but incorporated the element search directly into the list comprehension for prices. Both approaches are valid; choose based on code readability and whether you'll reuse the intermediate results.

Now let's address the bonus challenges that will make our scraped data more useful for analysis and storage.

First, we need to handle title truncation. When we print our current titles, you'll notice they're cut off with ellipses (...). This truncation occurs because the visible text is shortened for display purposes, but the complete title is preserved in the anchor tag's `title` attribute.

The solution is straightforward: instead of extracting the visible text with `.get_text()`, we'll access the `title` attribute: `titles = [tag.find('a')['title'] for tag in title_tags]`. This simple change provides us with the complete, untruncated book titles—essential for accurate data analysis and user presentation.

The price formatting challenge requires more involved string manipulation. Currently, our prices are strings containing currency symbols (£), which prevents numerical operations like sorting, averaging, or mathematical comparisons. We need to convert these to clean floating-point numbers.

This transformation requires two steps: removing the currency symbol and converting to a numerical data type. We can't simply apply `.strip()` to the entire list—it must be applied to each individual string element. This calls for another list comprehension that combines string cleaning with type conversion.

Here's our approach: `prices = [float(price.strip('£')) for price in prices]`. The `.strip('£')` method removes the pound symbol from each price string, and `float()` converts the cleaned string to a numerical value. If you're unsure of the exact currency symbol, you can copy it directly from the scraped data—a practical technique when dealing with various Unicode characters.

After applying this transformation, our prices become true numerical values suitable for mathematical operations, data analysis, and storage in structured formats like pandas DataFrames. You might notice that trailing zeros disappear (22.6 instead of 22.60); this is simply how Python displays floats and doesn't affect numerical accuracy.
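If two decimal places matter for display, one option is to keep the values as floats for math and format only at output time, for example:

```python
prices = [22.6, 51.77]

# Store floats for calculations; format to two decimals only when displaying
labels = [f"£{p:.2f}" for p in prices]
print(labels)  # ['£22.60', '£51.77']
```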

With both title and price data properly formatted—complete titles as clean strings and prices as numerical values—we've created a robust dataset ready for further analysis, visualization, or integration into larger data processing pipelines. This clean, structured approach to web scraping ensures our extracted data meets professional standards for reliability and usability.

Key Takeaways

1. Always verify HTTP status code 200 before attempting to parse webpage content, to ensure successful data retrieval.
2. Use Beautiful Soup's `find_all()` method with specific tag names and class attributes for precise element targeting.
3. Extract complete titles from the HTML `title` attribute rather than the truncated display text.
4. List comprehensions provide cleaner, more concise code than traditional loops for simple extraction operations.
5. Clean extracted price data by removing currency symbols before converting strings to numerical float values.
6. Nested element searching (finding `<a>` tags within `<h3>` tags) enables extraction from complex HTML document structures.
7. The `strip()` method is essential for removing unwanted characters, such as currency symbols, from scraped text.
8. Convert cleaned price strings to the float data type to enable mathematical operations and data analysis integration.
