When working with web scraping or data extraction tasks, parsing HTML is a common requirement. Ruby, with its powerful libraries and tools, provides a convenient way to parse HTML documents and extract the desired information. In this article, we will explore how to parse HTML with Ruby using the Nokogiri gem.
Nokogiri is a popular Ruby gem for parsing HTML and XML documents. To install Nokogiri, you can simply add it to your Gemfile:
gem 'nokogiri'
Then run bundle install to install the gem.
Once Nokogiri is installed, you can start parsing HTML documents. Here's a simple example of how to parse an HTML document using Nokogiri:
require 'nokogiri'
require 'open-uri'

# Kernel#open no longer accepts URLs as of Ruby 3.0, so call URI.open instead
html = URI.open('https://example.com').read
doc = Nokogiri::HTML(html)

# Extracting all links from the document
links = doc.css('a')
links.each do |link|
  puts link['href']
end
In the above example, we first read the HTML content of a webpage using the open-uri library. Then, we create a Nokogiri document object from the HTML content. We can use CSS selectors to extract specific elements from the document, in this case, all the links (a tags) on the page.
Nokogiri allows you to use CSS selectors to target specific elements in the HTML document. Common selectors include element names (p), IDs (#header), classes (.content), and attribute selectors (img[alt]).
Here's an example of using CSS selectors to extract specific elements:
# Selecting all paragraphs
paragraphs = doc.css('p')

# Selecting an element with ID 'header'
header = doc.css('#header')

# Selecting elements with class 'content'
content = doc.css('.content')

# Selecting images with alt attribute
images = doc.css('img[alt]')
Once you have selected the desired elements using CSS selectors, you can extract text and attributes from them. Here's how you can extract text and attributes using Nokogiri:
# Extracting text from paragraphs
paragraphs.each do |paragraph|
  puts paragraph.text
end

# Extracting href attribute from links
links.each do |link|
  puts link['href']
end
By calling the text method on an element, you can extract the text content of that element. Similarly, you can access attributes of an element by using square brackets and the attribute name.
HTML documents often contain nested elements, such as lists within divs or tables within sections. Nokogiri allows you to navigate through nested elements using CSS selectors. Here's an example of handling nested elements:
# Selecting a div with class 'container'
container = doc.css('.container')

# Selecting all lists within the container
lists = container.css('ul')
lists.each do |list|
  list_items = list.css('li')
  list_items.each do |item|
    puts item.text
  end
end
In the above example, we first select a div with the class 'container'. Then, we select all unordered lists (ul) within the container and iterate over each list to extract the list items (li) and print their text content.
Parsing HTML with Ruby using the Nokogiri gem is a powerful and efficient way to extract information from HTML documents. By leveraging CSS selectors and methods provided by Nokogiri, you can easily navigate through HTML elements and extract the desired data. Whether you are scraping websites for data or processing HTML documents, Nokogiri makes the task straightforward and convenient.
© 2024 RailsInsights. All rights reserved.