Burly

A Ruby gem for extracting URLs from HTML, JSON, and plaintext documents.

Getting Started

Before installing and using Burly, you’ll want to have Ruby 2.6 (or newer) installed. Using a Ruby version management tool like rbenv, chruby, or rvm is recommended.

Burly is developed using Ruby 3.4 and is tested against additional Ruby versions using Forgejo Actions.

Installation

Add Burly to your project’s Gemfile and run bundle install:

source "https://rubygems.org"

gem "burly"

Usage

Using Burly to parse plaintext documents is as straightforward as:

Burly.parse(File.read("example.txt"))

Parsing JSON or HTML documents is only slightly more complicated:

Burly.parse(File.read("example.html"), mime_type: "text/html")

Burly.parse(File.read("example.json"), mime_type: "application/json")
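
When a script handles a mix of file types, the MIME type can be chosen from the file extension. The following is a minimal sketch: `mime_type_for` is a hypothetical helper, not part of Burly, and the `"text/plain"` fallback is an assumption about what makes a sensible default.

```ruby
# Hypothetical helper (not part of Burly): guesses a MIME type from a file's
# extension, falling back to "text/plain" for anything unrecognized.
def mime_type_for(path)
  known_types = {
    ".html" => "text/html",
    ".htm" => "text/html",
    ".json" => "application/json"
  }

  known_types.fetch(File.extname(path).downcase, "text/plain")
end

mime_type_for("example.html") # => "text/html"
mime_type_for("example.JSON") # => "application/json"
mime_type_for("example.txt")  # => "text/plain"
```

The result could then be passed along as `Burly.parse(File.read(path), mime_type: mime_type_for(path))`, assuming Burly accepts an explicit `"text/plain"` value, which this README doesn't confirm.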

Burly uses slightly different parsing rules for each supported MIME type:

  • In plaintext documents, Burly extracts absolute URLs (e.g. https://website.example) from the document.
  • In JSON documents, Burly extracts string values that only contain absolute URLs (e.g. { "url": "https://website.example" } and { "urls": ["https://website.example", "https://another-website.example"] }).
  • In HTML documents, Burly extracts absolute and relative URLs from URL attributes and srcset attributes.

In all cases, neither order nor uniqueness is guaranteed. You may also want to convert relative URLs extracted from HTML documents to absolute URLs using the document’s source URL and/or the `<base>` element’s `href` attribute value (Ruby’s [`URI.join` class method](https://docs.ruby-lang.org/en/master/URI.html#method-c-join) is good for this!).
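
Resolving relative URLs and removing duplicates can be sketched with Ruby's standard library alone. `absolutize` is a hypothetical helper, not part of Burly, and the example assumes the extracted URLs are available as an Array of Strings (the return type of `Burly.parse` isn't described here).

```ruby
require "uri"

# Hypothetical helper (not part of Burly): resolves each URL against a base
# URL, drops anything URI can't parse, and removes duplicates.
def absolutize(urls, base_url)
  resolved = urls.map do |url|
    begin
      URI.join(base_url, url).to_s
    rescue URI::Error
      nil
    end
  end

  resolved.compact.uniq
end

# Example input standing in for URLs extracted from an HTML document.
extracted = ["/about", "../images/photo.jpg", "https://another-website.example/", "/about"]

absolutize(extracted, "https://website.example/articles/post.html")
# => ["https://website.example/about",
#     "https://website.example/images/photo.jpg",
#     "https://another-website.example/"]
```

URLs that are already absolute pass through `URI.join` unchanged, so the same helper works on mixed lists.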

Parser Options

Burly’s HTML parser supports a single option, context, which accepts either a String or an Array of Strings. The values may be either CSS or XPath selectors.

Burly.parse(File.read("example.html"), context: "main", mime_type: "text/html")

Burly.parse(File.read("example.html"), context: ["//main", "//div"], mime_type: "text/html")

In all cases, Burly will search for nodes matching the provided selector(s) and use the first match as the context within which to search for URLs. The context option is a great way to refine the list of extracted URLs based on their presence within the source document.

> [!NOTE]
> If Burly can’t locate a node matching the provided selector(s), the context is reset to the document root.
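
The selection rule described above (try each selector in order, use the first match, otherwise fall back to the document root) can be modeled in a few lines. This is only a sketch of one reading of that rule, not Burly's implementation: `resolve_context` is a hypothetical name, and a plain Hash stands in for a parsed HTML document.

```ruby
# Hypothetical model of the fallback rule, not Burly's implementation.
# `document` is a plain Hash mapping selectors to the node each one matches,
# standing in for a parsed HTML document.
def resolve_context(document, selectors, root: :document_root)
  # Array() lets callers pass either a single String or an Array of Strings.
  Array(selectors).each do |selector|
    node = document[selector]
    return node if node
  end

  # No selector matched, so fall back to the document root.
  root
end

document = { ".h-entry" => :h_entry_node, "body" => :body_node }

resolve_context(document, [".h-entry .e-content", ".h-entry", "body"]) # => :h_entry_node
resolve_context(document, "main")                                      # => :document_root
```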

> [!TIP]
> Passing an Array of Strings can be used to achieve an effect similar to conditional logic with fallback behavior.
>
> require "net/http"
>
> response = Net::HTTP.get(URI.parse("https://jgarber.example"))
>
> Burly.parse(response, context: [".h-entry .e-content", ".h-entry", "body"], mime_type: "text/html")

License

Burly is freely available under the MIT License.