Intro to HTML & Web Scraping
Master HTML fundamentals and web scraping techniques
If you can see it on a website, it can be scraped, mined, and put into a dataframe
HTML Node Types
Document Node
The root node that contains the entire HTML document. This is the starting point for all web scraping operations.
Element Nodes
HTML elements like div, p, and title tags that structure the content and contain the data you want to extract.
Text Nodes
The actual text content inside HTML elements. This is typically what web scrapers target for data extraction.
All HTML elements follow the same pattern: an opening tag, content, and a closing tag. The tag name in the opening and closing tags must match exactly for valid HTML.
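The three node types above can be seen in a minimal sketch with Beautiful Soup (the sample HTML string is illustrative):

```python
from bs4 import BeautifulSoup

# <p>Hello</p> shows the pattern: opening tag, content, matching closing tag.
html = "<html><head><title>My Page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")  # soup is the document node
print(soup.title)         # an element node: <title>My Page</title>
print(soup.p.string)      # a text node: 'Hello'
```

The parsed `soup` object is the document node, each tag is an element node, and `.string` reaches the text node inside an element, which is typically what a scraper extracts.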
Common HTML Elements
Title Tags
Defines the page title shown in the browser tab; headings within the page body use <h1> through <h6> tags instead. Example: <title>Content</title>
Paragraph Tags
Contains paragraph text content. Example: <p>Text content</p>
Strong Tags
Marks text as important; browsers typically render it in bold. Example: <strong>Bold text</strong>
Elements can be both parents and children depending on your reference point. This hierarchical structure is crucial for navigating HTML during web scraping.
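This parent/child relationship can be navigated directly in Beautiful Soup. A small sketch with sample markup:

```python
from bs4 import BeautifulSoup

# Sample markup: <strong> is a child of <p>, which is a child of <div>.
html = """
<div>
  <p>First paragraph with <strong>bold text</strong>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

strong = soup.strong
print(strong.parent.name)         # 'p'   -- <strong>'s parent
print(strong.parent.parent.name)  # 'div' -- <p> is itself a child of <div>
```

The same `<p>` element is a parent (of `<strong>`) and a child (of `<div>`), depending on where you stand in the tree.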
HTML Attributes
href Attribute
Used in anchor tags to specify link destinations. Essential for extracting URLs from web pages.
id Attribute
Provides unique identifiers for elements, making them easier to target during web scraping operations.
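Both attributes can be read once an element is located; the id makes the lookup precise. A minimal sketch (the id value and URL are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a id="main-link" href="https://example.com/article">Read more</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", id="main-link")  # target the element by its unique id
print(link["href"])     # the link destination: https://example.com/article
print(link.get_text())  # the link text: Read more
```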
Installation Process
Install Beautiful Soup
Run 'pip install beautifulsoup4' in your terminal or command prompt to install the HTML parsing library.
Install LXML Parser
Run 'pip install lxml' to install the XML and HTML parser that works with Beautiful Soup for better performance.
Open Your IDE
Launch Jupyter Notebook or your preferred IDE to begin importing the libraries and setting up your scraping environment.
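With the installs above done, a typical scraping session starts with these imports (requests is assumed to be available for fetching pages):

```python
import requests                # fetch web pages over HTTP
from bs4 import BeautifulSoup  # parse the returned HTML
import pandas as pd            # organize extracted data for analysis
```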
Always verify that response.status_code equals 200 before parsing. Some websites block automated requests, so confirm a successful connection before proceeding with scraping.
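One way to sketch that check is to wrap the request in a small helper that refuses to return HTML unless the status code is 200 (the function name and URL here are illustrative):

```python
import requests

def fetch_html(url: str) -> str:
    """Return the page HTML, raising if the request did not succeed."""
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return response.text

# Usage (hypothetical URL):
# html = fetch_html("https://example.com")
```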
The scraping code successfully extracted three key data points: article titles, links, and origin URLs from the target website.
Extracted Data Types
Article Titles
The main headlines or titles of articles found on the scraped webpage.
Article Links
Direct URLs linking to the full articles for further processing or analysis.
Origin URLs
Source website URLs where the original content is hosted.
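A sketch of how those three fields might be collected into a list of dictionaries. The markup, class names, and URLs below are stand-ins; a real site's structure will differ:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for a scraped article listing.
html = """
<div class="article">
  <h2>First Headline</h2>
  <a href="https://news.example.com/post-1">Read</a>
</div>
<div class="article">
  <h2>Second Headline</h2>
  <a href="https://news.example.com/post-2">Read</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

articles = []
for div in soup.find_all("div", class_="article"):
    articles.append({
        "title": div.h2.get_text(),              # article title
        "link": div.a["href"],                   # article link
        "origin": "https://news.example.com",    # origin URL of the source site
    })
print(articles)
```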
Converting to DataFrame
Create List of Dictionaries
Organize your scraped data into a structured list where each item is a dictionary containing the extracted information.
Import Pandas
Import the Pandas library to access data manipulation and analysis tools for your scraped data.
Use DataFrame Method
Pass your list of dictionaries to the pd.DataFrame() constructor to convert it into a structured, analyzable Pandas DataFrame; each dictionary key becomes a column.
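The three steps above can be sketched as follows (the records are illustrative placeholder data):

```python
import pandas as pd

# Step 1: a list of dictionaries, one dict per scraped article.
records = [
    {"title": "First Headline",
     "link": "https://news.example.com/post-1",
     "origin": "https://news.example.com"},
    {"title": "Second Headline",
     "link": "https://news.example.com/post-2",
     "origin": "https://news.example.com"},
]

# Steps 2 and 3: with Pandas imported, the DataFrame constructor
# turns each dictionary key into a column and each dict into a row.
df = pd.DataFrame(records)
print(df)
```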
Part II of this web scraping series will cover XPath techniques and introduce the Scrapy framework for more advanced scraping workflows.
Key Takeaways
HTML documents are trees of document, element, and text nodes; scrapers typically target the text nodes for data extraction.
Elements follow the opening tag, content, closing tag pattern, and attributes such as href and id make specific elements easy to target.
Beautiful Soup with the lxml parser handles HTML parsing; always confirm response.status_code equals 200 before parsing a page.
Organize scraped data as a list of dictionaries, then convert it to a Pandas DataFrame for structured analysis.