Parsing HTML with Ruby

Introduction

Parsing HTML is a common requirement in web scraping and data extraction tasks. In this article, we will explore how to parse HTML with Ruby using the Nokogiri gem: installing it, selecting elements with CSS selectors, and extracting text and attributes from the results.

Installing Nokogiri

Nokogiri is a popular Ruby gem for parsing HTML and XML documents. To install Nokogiri, you can simply add it to your Gemfile:

gem 'nokogiri'

Then run bundle install to install the gem.

Parsing HTML with Nokogiri

Once Nokogiri is installed, you can start parsing HTML documents. Here's a simple example of how to parse an HTML document using Nokogiri:

require 'nokogiri'
require 'open-uri'

# open-uri provides URI.open; a bare open('https://...') no longer fetches URLs on Ruby 3.0+
html = URI.open('https://example.com').read
doc = Nokogiri::HTML(html)

# Extracting all links from the document
links = doc.css('a')
links.each do |link|
  puts link['href']
end

In the above example, we first read the HTML content of a webpage using the open-uri library. Then we create a Nokogiri document object from that HTML. Finally, we use the CSS selector a to select every link on the page and print each link's href attribute.

Using CSS Selectors

Nokogiri allows you to use CSS selectors to target specific elements in the HTML document. Here are some common CSS selectors and their usage:

Element Selector: Selects all elements of a specific type (for example, p)
ID Selector: Selects the element with a specific ID (for example, #header)
Class Selector: Selects elements with a specific class (for example, .content)
Attribute Selector: Selects elements that have a specific attribute (for example, img[alt])

Here's an example of using CSS selectors to extract specific elements:

# Selecting all paragraphs
paragraphs = doc.css('p')

# Selecting the element with ID 'header' (at_css returns the first match, or nil)
header = doc.at_css('#header')

# Selecting elements with class 'content'
content = doc.css('.content')

# Selecting images that have an alt attribute
images = doc.css('img[alt]')

Extracting Text and Attributes

Once you have selected the desired elements using CSS selectors, you can read their text content and attribute values:

# Extracting text from paragraphs
paragraphs.each do |paragraph|
  puts paragraph.text
end

# Extracting href attribute from links
links.each do |link|
  puts link['href']
end

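By calling the text method on an element, you can extract the text content of that element, including the text of any elements nested inside it. Similarly, you can access an attribute by using square brackets and the attribute name; if the element does not have that attribute, the result is nil.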

Handling Nested Elements

HTML documents often contain nested elements, such as lists within divs or tables within sections. Nokogiri allows you to navigate through nested elements using CSS selectors. Here's an example of handling nested elements:

# Selecting every div with class 'container'
container = doc.css('div.container')

# Selecting all lists within the selected container(s)
lists = container.css('ul')

lists.each do |list|
  list_items = list.css('li')
  list_items.each do |item|
    puts item.text
  end
end

In the above example, we first select a div with the class 'container'. Then, we select all unordered lists (ul) within the container and iterate over each list to extract the list items (li) and print their text content.

Conclusion

Nokogiri gives Ruby a compact way to extract information from HTML documents. CSS selectors locate the elements you need, the text method and square-bracket attribute access read their contents, and selectors can be chained to work through nested structures. These few operations cover most of what is needed when scraping websites or processing HTML documents.