ScrapKit

What does it do?

ScrapKit automates web scraping and converts the results into plain objects, using configuration objects called recipes.

A recipe can be loaded from an object or from a JSON file, and has the following structure:

{
  "url": "https://status.heroku.com/",
  "attributes": {
    "apps": ".subnav__inner .ember-view:nth-child(1) > .status-summary__description",
    "data": ".subnav__inner .ember-view:nth-child(2) > .status-summary__description",
    "tools": ".subnav__inner .ember-view:nth-child(3) > .status-summary__description"
  }
}

url: The web page to scrape.
attributes: An object that maps each attribute name to its corresponding CSS selector.

attributes can have a more complex structure to handle collections. For example:

{
  "url": "https://hpneo.dev/",
  "attributes": {
    "posts": {
      "selector": ".post-item",
      "children_attributes": {
        "title": "h2"
      }
    }
  }
}

In this case, attributes has a posts key that stores the results of a collection, defined by a CSS selector and an object of children attributes.

children_attributes is an object that maps each attribute name to its corresponding CSS selector (similar to how attributes works in its simpler form).
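
To make the two attribute shapes concrete, here is a minimal sketch that inspects a parsed recipe and labels each attribute as a single value or a collection. The attribute_kinds helper is hypothetical and not part of ScrapKit; it only uses Ruby's standard JSON library and does not fetch any page.

```ruby
require "json"

# Hypothetical helper (not part of ScrapKit): labels each attribute in a
# parsed recipe as :single (a plain CSS selector string) or :collection
# (an object with "selector" and "children_attributes").
def attribute_kinds(recipe)
  recipe.fetch("attributes").transform_values do |value|
    value.is_a?(Hash) ? :collection : :single
  end
end

# The simple form: every attribute maps directly to a CSS selector.
simple_recipe = JSON.parse(<<~RECIPE)
  {
    "url": "https://status.heroku.com/",
    "attributes": {
      "apps": ".subnav__inner .ember-view:nth-child(1) > .status-summary__description"
    }
  }
RECIPE

# The collection form: "posts" holds a selector plus children_attributes.
collection_recipe = JSON.parse(<<~RECIPE)
  {
    "url": "https://hpneo.dev/",
    "attributes": {
      "posts": {
        "selector": ".post-item",
        "children_attributes": { "title": "h2" }
      }
    }
  }
RECIPE

p attribute_kinds(simple_recipe)
p attribute_kinds(collection_recipe)
```

Both shapes can presumably be mixed in the same attributes object, since each key is interpreted on its own.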

Installation

Add this line to your application's Gemfile:

gem 'scrap_kit'

And then execute:

$ bundle

Or install it yourself as:

$ gem install scrap_kit

Usage

ScrapKit::Recipe.load takes either an object containing the recipe or the path to a JSON file.

recipe = ScrapKit::Recipe.load(
  url: "https://status.heroku.com/",
  attributes: {
    apps: ".subnav__inner .status-summary:nth-child(1) > .status-summary__description",
    data: ".subnav__inner .status-summary:nth-child(2) > .status-summary__description",
    tools: ".subnav__inner .status-summary:nth-child(3) > .status-summary__description",
  }
)

output = recipe.run
#=> {:apps=>"ok", :data=>"ok", :tools=>"ok"}

For more complex structures, it's recommended to store the recipe in a JSON file:

recipe = ScrapKit::Recipe.load("./spec/fixtures/file.json")

output = recipe.run
#=> {:posts=>[{:title=>"APIs de Internacionalización en JavaScript"}, {:title=>"Ejecutando comandos desde Ruby"}, {:title=>"Usando Higher-Order Components"}]}
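
If a recipe already exists as a Ruby Hash, one way to produce the JSON file is with Ruby's standard JSON library. The sketch below writes the collection recipe from earlier to a temporary file and reads it back; the temporary file is an assumption made for illustration, and the ScrapKit call is left as a comment because it needs the gem and network access.

```ruby
require "json"
require "tempfile"

# The collection recipe described above, written as a Ruby Hash.
recipe = {
  url: "https://hpneo.dev/",
  attributes: {
    posts: {
      selector: ".post-item",
      children_attributes: { title: "h2" }
    }
  }
}

# Write the recipe to a JSON file (a temporary one here, for illustration).
file = Tempfile.new(["recipe", ".json"])
file.write(JSON.pretty_generate(recipe))
file.close

# Reading the file back shows the structure survives the round trip.
loaded = JSON.parse(File.read(file.path))
puts loaded["attributes"]["posts"]["selector"]

# With the gem installed, the path could then be handed to ScrapKit:
#   ScrapKit::Recipe.load(file.path).run
```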