You're Invited:Meet the Socket Team at BlackHat and DEF CON in Las Vegas, Aug 4-6.RSVP
Socket
Book a DemoInstallSign in
Socket

dreader

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

dreader

1.2.0
bundlerRubygems
Version published
Maintainers
1
Created
Source

#+TITLE: Dreader #+AUTHOR: Adolfo Villafiorita #+STARTUP: showall

Dreader is a simple DSL built on top of [[https://github.com/roo-rb/roo][Roo]] to read and process tabular data (CSV, LibreOffice, Excel) in a simple and structured way.

Main advantages:

  • All code to parse input data has the same structure, simplifying code management and understanding (convention over configuration).
  • It favors a declarative approach, clearly identifying from which data has to be read and in which way.
  • Has facilities to run simulations, to debug and check code and data.

We use Dreader for importing fairly big files (in the order of 10K-100K records) in MIP, an ERP to manage distribution of bins to the population. The main issues we had before using Dreader were errors and exceptional cases in the input data. We also had to manage several small variations in the input files (coming from different ERPs) and Dreader helped us standardizing the input code.

The gem depends on =roo=, from which it leverages all data reading/parsing facilities keeping its size in about 250 lines of code.

It should be relatively easy to use; /dreader/ stands for /d/ata /r/eader.

  • Installation

Add this line to your application's Gemfile:

#+BEGIN_EXAMPLE ruby gem 'dreader' #+END_EXAMPLE

And then execute:

#+BEGIN_EXAMPLE $ bundle #+END_EXAMPLE

Or install it yourself as:

#+BEGIN_EXAMPLE $ gem install dreader #+END_EXAMPLE

  • Usage

** Quick start

Print name and age of people from the following data:

| Name | Date of birth | |------------------+-----------------| | Forest Whitaker | July 15, 1961 | | Daniel Day-Lewis | April 29, 1957 | | Sean Penn | August 17, 1960 |

#+BEGIN_EXAMPLE ruby require 'dreader'

class Reader extend Dreader::Engine

options do
  # we start reading from row 2
  first_row 2
end

column :name do
  doc "column A contains :name, a string; doc is optional"
  colref 'A'
end

# column B contains :birthdate, a date. We can use a Hash and omit
# colref
column({ birthdate: 'B' }) do
  process do |c|
    Date.parse(c)
  end
end

# add as many example lines as you want to show examples of good
# records these example lines are added to the template generated with
# generate_template
example { name: "John", birthday: "27/03/2020" }

# for each line, :age is computed from :birthdate
virtual_column :age do
  process do |row|
    birthdate = row[:birthdate][:value]
    birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
    today = Date.today
    [0, today.year - birthdate.year - (birthday < today ? 1 : 0)].max
  end
end

# this is how we process each line of the input file
mapping do |row|
  r = Dreader::Util.simplify(row)
  puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
end

end

reader = Reader

read the file

reader.read filename: "Birthdays.ods"

compute the virtual columns

reader.virtual_columns

run the mapping declaration

reader.mappings

Here we can do further processing on the data

File.open("ages.txt", "w") do |file| reader.table.each do |row| unless row[:row_errors].any? file.puts "#{row[:name][:value]} #{row[:age][:value]}" end end end #+END_EXAMPLE

** Gentler Introduction

To write an import function with Dreader:

  • Declare which is the input file and where we can find data (Sheet and first row) (This can also be specified in each call.)
  • Declare the content of columns and, then, how to check raw data, parse data, and check parsed data
  • Add virtual columns, that is, columns computed from other values in the row
  • Specify what lines you want to reject, if any
  • Specify how to transform lines. This is where you do the actual work (for instance, if you process a file line by line) or put together data for processing after the file has been fully read --- see the next step.

Dreader now knows ho to collect, shape, and tranform (map) data according to your instructions. We are now ready to do the actual work. This consists of the following steps, various of which can be performed together:

  • Read the file
  • Do the parsing/transformations
  • Compute the virtual columns
  • Do the mappings

Each step is described in more details in the following sections.

*** Declare which is the input file and where we can find data

Require =dreader= and declare a class which extends =Dreader::Engine=:

#+BEGIN_EXAMPLE ruby require 'dreader'

class Reader extend Dreader::Engine [...] end #+END_EXAMPLE

Specify parsing option in the class, using the following syntax:

#+BEGIN_EXAMPLE ruby options do filename 'example.ods' # this optional. Use it when the file does not have an extension extension :ods

sheet 'Sheet 1'

first_row 1
last_row 20

# optional (this allows to integrate with other applications already
# using a logger)
logger Logger.new
logger_level Logger::INFO

end #+END_EXAMPLE

where:

  • (optional) =filename= is the file to read. If not specified, you will have to supply a filename when loading the file (see =read=, below). Use =.tsv= for tab-separated files.
  • (optional) =extension= overrides or specify the extension of =filename=. Takes as input a symbol (e.g., =:xlsx=).
    Notice that value of this option is not appended to filename (see =read= below). Filename must thus be a valid reference to a file in the file system. This option is useful in one of these two circumstances:
    • When =filename= has no extension
    • When you want to override the extension of the filename, e.g., to force reading a "file.csv" as a tab separated file
  • (optional) =first_row= is the first line to read (use =2= if your file has a header)
  • (optional) =last_row= is the last line to read. If not specified, we will rely on =roo= to determine the last row. This is useful for those files in which you only want to process some of the content or contain "garbage" after the records.
  • (optional) =sheet= is the sheet name or number to read from. If not specified, the first (default) sheet is used
  • (optional) =debug= specifies that we are debugging
  • (optional) =logger= specifies the logger
  • (optional) =logger_level= specifies the logger level

You can override some of the defaults by passing a hash as argument to the =read= function. For instance:

#+BEGIN_EXAMPLE ruby Reader.read filename: another_filepath #+END_EXAMPLE

will read data from =another_filepath=, rather than from the filename specified in the options. This might be useful, for instance, if the same specification has to be used for different files.

*** Declare the content of columns and how to parse them

Declare the columns you want to read by assigning them a name and a column reference.

There are two notations:

#+BEGIN_EXAMPLE ruby

First notation, colref is put in the block

column :name do colref 'A' end

Second notation, a hash is passed in the name

column({ name: 'A' }) do end #+END_EXAMPLE

The reference to a column can either be a letter or a number. First column is ='A'= or =1=.

The =column= declaration can contain Ruby blocks:

  • one or more =check_raw= block check raw data as read from the input file. They can be used, for instance, to verify presence of a value in the input file. Check must return true if there are no errors; any other value (e.g. an array of messages) is considered an error.
  • =process= can be used to transform data into something closer to the input data required for the importing (e.g., it can be used for downcase or strip a string)
  • one or more =check= block perform a check on the =process=ed data, to check for errors. They can be used, for instance, to check that a model built with =process= is valid. Check must return true if there are no errors.

#+begin_example column({ name: 'A' }) do check_raw do |cell| !cell.nil? end end #+end_example

#+begin_quote If you declare more than a check block of the same type per column, use a unique symbol to distinguish the blocks or the error messages will be overwritten. #+end_quote

#+begin_example column({ name: 'A' }) do check_raw :must_be_non_nil do |cell| !cell.nil? end

check_raw :first_letter_must_be_a do |cell|
  cell[0] == 'A'
end

end #+end_example

#+begin_quote =process= is always executed before =check=. If you want to check raw data use the =check_raw= directive. #+end_quote

#+begin_quote There can be only one process block. If you define more than one per column, only the last one is executed. #+end_quote

#+begin_example column({ name: 'A' }) do check_raw do |cell| # Here cell is like in the input file end

process do |cell|
  cell.upcase
end

check do |cell|
  # Here cell is upcase and
end

end #+end_example

For instance, given the tabular data:

| Name | Date of birth | |------------------+-----------------| | Forest Whitaker | July 15, 1961 | | Daniel Day-Lewis | April 29, 1957 | | Sean Penn | August 17, 1960 |

we could use the following declaration to specify the data to read:

#+BEGIN_EXAMPLE ruby

we want to access column 1 using :name (1 and A are equivalent)

:name should be non nil and of length greater than 0

column :name do colref 1 check do |x| x and x.length > 0 end end

we want to access column 2 (Date of birth) using :birthdate

column :birthdate do colref 2

# make sure the column is transformed into a Date
process do |x|
  Date.parse(x)
end

# check age is a date (check is invoked on the value returned
# by process)
check do |x|
  x.class == Date
end

end #+END_EXAMPLE

#+BEGIN_NOTES

  • The column name can be anything Ruby can use as a key for a Hash, such as, for instance, symbols, strings, and even object instances.
  • =colref= can be a string (e.g., ='A'=) or an integer, with 1 and "A" being the first column.
  • You need to declare only the columns you want to import. For instance, we could skip the declaration for column 1, if 'Date of Birth' is the only data we want to import
  • If =process= and =check= are specified, then =check= will receive the result of invoking =process= on the cell value. This makes sense if process is used to make the cell value more accessible to ruby code (e.g., transforming a string into an integer). #+END_NOTES

If there are different columns that have to be read and processed in the same way, =columns= (notice the plural form) allows for a more compact representation:

#+BEGIN_EXAMPLE ruby columns { a: 'A', b: 'B' } #+END_EXAMPLE

is equivalent to:

#+BEGIN_EXAMPLE ruby column :a do colref 'A' end

column :b do colref 'B' end #+END_EXAMPLE

=columns= accepts a code block, which can be used to add =process= and =check= declarations:

#+BEGIN_EXAMPLE ruby columns({ a: 'A', b: 'B' }) do process do |cell| ... end end #+END_EXAMPLE

See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bulk_declare.rb]] for an example of =columns=.

#+BEGIN_NOTES If you use code blocks, don't forget to put in parentheses the column mapping, or the Ruby parser won't be able to distinguish the hash from the code block. #+END_NOTES

The data read from each row of our input data is stored in a hash. The hash uses column names as the primary key and stores the values in the =:value= key.

*** Add virtual columns

Sometimes it is convenient to aggregate or otherwise manipulate the data read from each row, before doing the actual processing.

For instance, we might have a table with dates of birth, while we are really interested in the age of people.

In such cases, we can use virtual column. A virtual column allows one to add a column to the data read, computed using the values of other cells in the same row.

The following declaration adds an =age= column to each row of the data read from the previous example:

#+BEGIN_EXAMPLE ruby virtual_column :age do process do |row| # the function compute_birthday has to be defined compute_birthday(row[:birthdate]) end end #+END_EXAMPLE

Virtual columns are, of course, available to the =mapping= directive (see below).

*** Specify which lines to reject

You can reject some lines using the =reject= declaration, which is applied row by row, can predicate over columns and virtual columns, and has to return a Boolean value.

All lines returning a truish value will be be rejected, that is, not stored in the =@table= variable (and, consequently, passed to the mapping function).

For instance, the following declaration rejects all lines in which the population column is higher than =3_000_000=:

#+begin_src ruby reject { |row| row[:population][:value] > 3_000_000 } #+end_src

*** Specify how to process each line

The =mapping= directive specifies what to do with each line read. The =mapping= declaration takes an arbitrary piece of ruby code, which can reference the fields using the column names we declared.

For instance the following code gets the value of column =:name=, the value of column =:age= and prints them to standard output

#+BEGIN_EXAMPLE ruby mapping do |row| puts "#{row[:name][:value]} is #{row[:age][:value]} years old" end #+END_EXAMPLE

To invoke the =mapping= declaration on a file, use the =mappings= method, which invokes =map= to each row and it stores in the =@table= variable whatever value mapping returns.

*** Process data

If =mapping= does not work for your data processing activities (e.g., you need to make elaborations on data which span different rows), you can add your perform your elaborations on the data transformed by =mappings=.

A typical scenario works as follows:

  • Reference the class =i = Reader= and use =i.read= or =i.load= (synonyms), to read all data.

#+BEGIN_EXAMPLE ruby i = Reader

read uses the options if defined and takes the same arguments as options

examples:

i.read

i.read filename: "example.ods"

i.read filename: "example.ods", extension: :ods

i.read filename: "example", extension: :ods

(the line above opens the file "example" as an Open Document Spreasdheet)

i.read

alternately

Reader.read #+END_EXAMPLE

  • Use =errors= to see whether any of the check functions failed:

#+BEGIN_EXAMPLE ruby array_of_hashes = i.errors array_of_hashes.each do |error_hash| puts error_hash end #+END_EXAMPLE

  • Use =virtual_columns= to generate the virtual columns:

#+BEGIN_EXAMPLE ruby i.virtual_columns #+END_EXAMPLE

(Optionally: check again for errors.)

  • Use the =mappings= function to execute the =mapping= directive on each line read from the file.

#+BEGIN_EXAMPLE ruby i.mappings #+END_EXAMPLE

(Optionally: check again for errors.)

  • Add your own code to process the data returned after =mappings=, which you can assign to a variable (e.g., =returned_data = i.mappings=) or access with =i.table= or =i.data= (synonyms).

#+begin_quote Notice that =mappings= does a side effect and invoking the mapping twice in a row won't work: you need to reload the file first. #+end_quote

Look in the examples directory for further details and a couple of working examples.

*** Improving performances

While debugging your specification executing =read=, =virtual_columns=, and =mappings= in distinct steps is a good idea. When you go in production, you might want to reduce the number of passes you perform on the data.

You can pass the option =virtual: true= to =read= to compute virtual columns while you are reading data.

You can pass the option =mapping: true= to =read= to compute virtual columns and perform the mapping while you are reading data. Notice that:

  • =mapping= implies =virtual=, that is, if you pass =mapping: true= the read function will also compute virtual columns
  • =mapping= alters the content of =@table= and subsequent calls to =virtual_column= and =mapping= will fail. You have reset by invoking =read= again.

*** Managing Errors

**** Finding errors in input data

Dreader collects errors in three specific ways:

  • In each column specification, using =check_raw= and =check=. This allows to check each field for errors (e.g., a =nil= value in a cell)
  • In virtual columns, using =check_raw= and =check=. This allows to perform more complex checks by putting together all the values read from a row (e.g., =to_date= occurs before =from_date=)

The following, for instance checks that name or surname have a valid value:

#+begin_example ruby virtual_column :global_check do doc "Name or Surname must exist" check :name_or_surname_must_be_defined do |row| row[:name] || row[:surname] end end #+end_example

If you prefer, you can also define a virtual column that contains the value of the check:

#+begin_example ruby virtual_column :name_or_surname_exist do doc "Name or Surname must exist" process do |row| row[:name] || row[:surname] end end #+end_example

You can then act in the mapping directive according to value returned by the virtual column:

#+begin_example ruby mapping do |row| unless row[:global_check][:value] == false [...] end #+end_example

**** Managing Errors

You can check for errors in two different ways:

The first is in the =mapping= directive, where can check whether some checks for the =row= failed, by:

  • checking from the =:error= boolean key associated to each column, that is:

    =row[<column_name>][:error]=

  • looking at the value of the =:row_errors= key, which contains all error messages generated for the row:

    =row[:row_errors]=

  • After the processing, by using the method =errors=, which lists all the errors.

The utility function =Dreader::Util.errors= takes as input the errors generated by Dreader and extract those of a specific row and, optionally column:

#+begin_example ruby

get all the errors at line 2

Dreader::Util.errors i.errors, 2

get all the errors at line 2, column 'C'

Dreader::Util.errors i.errors, 2, 3 #+end_example

  • Generating a Template from the specification

From version 0.6.0 =dreader= allows to generate a template starting from the specification.

The template is generated by the following call:

#+begin_example ruby generate_template template_filename: "template.xlsx" #+end_example

(The =template_filename= directive can also be specified in the =options= section).

The template contains the following rows:

  • The first row contains the names of the columns, as specified in the =columns= declarations and made into a human readable form.
  • The second row contains the doc strings of the columns, if set.
  • The remaining rows contain the example records added with the =example= directive

The position of the first row is determined by the value of =first_row=, that is, if =first_row= is 2 (content starts from the second row), the header row is put in row 1.

Only Excel is supported, at the moment.

An example of template generation can be found in the Examples.

** Digging deeper

If you need to perform elaborations which cannot be performed row by row you can access all data, with the =table= method:

#+BEGIN_EXAMPLE ruby i.read i.table #+END_EXAMPLE

The function =i.table= returns an array of Hashes. Each element of the array is a row of the input file. Each element/row has the following structure:

#+BEGIN_EXAMPLE ruby { col_name1: { }, [...] col_nameN: { }, row_errors: [ ], row_number: } #+END_EXAMPLE

where =col_name1=, ..., =col_nameN= are the names you have assigned to the columns and the information stored for each cell is the following:

#+BEGIN_EXAMPLE ruby { value: ..., # the result of calling process on the cell row_number: ..., # the row number col_number: ..., # the column number error: ... # the result of calling check on the cell processed value } #+END_EXAMPLE

(Note that virtual columns only store =value= and a Boolean =virtual=, which is always =true=.)

Thus, for instance, given the example above returns:

#+BEGIN_EXAMPLE ruby i.table [ { name: { value: "John", row_number: 1, col_number: 1, errors: nil }, age: { value: 30, row_number: 1, col_number: 2, errors: nil } }, { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil }, age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ] #+END_EXAMPLE

  • Simplifying the hash with the data read

The =Dreader::Util= class provides some functions to simplify the hashes built by =dreader=. This is useful to simplify the code you write and to genereate hashes you can pass, for instance, to ActiveRecord creators.

** Simplify removes everything but the values

=Dreader::Util.simplify(hash)= removes all information but the value and makes the value accessible directly from the name of the column.

#+BEGIN_EXAMPLE ruby i.table[0] { name: { value: "John", row_number: 1, col_number: 1, errors: nil }, age: { value: 30, row_number: 1, col_number: 2, errors: nil } }

Dreader::Util.simplify i.table[0] { name: "John", age: 30 } #+END_EXAMPLE

** Slice and Clean select columns

=Dreader::Util.slice(hash, keys)= and =Dreader::Util.clean(hash, keys)=, where =keys= is an arrays of keys, are respectively used to select or remove some keys from the hash returned by Dreader. (Notice that the Ruby Hash class already provides similar methods.)

#+BEGIN_EXAMPLE ruby i.table[0] { name: { value: "John", row_number: 1, col_number: 1, errors: nil }, age: { value: 30, row_number: 1, col_number: 2, errors: nil } }

Dreader::Util.slice i.table[0], :name { name: { value: "John", row_number: 1, col_number: 1, errors: nil}

Dreader::Util.clean i.table[0], :name { age: { value: 30, row_number: 1, col_number: 2, errors: nil } #+END_EXAMPLE

The methods =slice= and =clean= are more useful when used in conjuction with =simplify=:

#+BEGIN_EXAMPLE ruby hash = Dreader::Util.simplify i.table[0] { name: "John", age: 30 }

Dreader::Util.slice hash, [:age] { age: 30 }

Dreader::Util.clean hash, [:age] { name: "John" } #+END_EXAMPLE

The output produced by =slice= and =simplify= is a hash which can be used to create an =ActiveRecord= object.

** Better Integration with ActiveRecord

Finally, the =Dreader::Util.restructure= method helps building hashes to create [[http://api.rubyonrails.org/classes/ActiveModel/Model.html][ActiveModel]] objects with nested attributes.

The starting point is a simplified row.

#+BEGIN_EXAMPLE ruby hash = { name: "John", surname: "Doe", address: "Unknown", city: "NY" }

Dreader::Util.restructure hash, [:name, :surname, :address_attributes, [:address, :city]] { name: "John", surname: "Doe", address_attributes: { address: "Unknown", city: "NY" } } #+END_EXAMPLE

  • Debugging your specification

The =debug= function prints the current configuration, reads some records from the input file(s), and shows the records read:

#+BEGIN_EXAMPLE ruby i.debug i.debug n: 40 # read 40 lines (from first_row) i.debug n: 40, filename: filepath # like above, but read from filepath #+END_EXAMPLE

By default =debug= invokes the =check_raw=, =process=, and =check= directives. Pass the following options, if you want to disable this behavior; this might be useful, for instance, if you intend to check only what data is read:

#+BEGIN_EXAMPLE ruby i.debug process: false, check: false #+END_EXAMPLE

Notice that =check= implies =process=, since =check= is invoked on the output of the =process= directive.`

If you prefer, in alternative to =debug= you can also use configuration variables (but then you need to change the configuration according to the environment):

#+begin_example ruby i.options do debug true end #+end_example

  • Changelog

See [[file:CHANGELOG.org][CHANGELOG]].

  • Known Limitations

At the moment:

  • it is not possible to specify column references using header names (like Roo does).
  • some more testing wouldn't hurt.
  • Known Bugs

Some known bugs and an unknown number of unknown bugs.

(See the open issues for the known bugs.)

  • Development

After checking out the repo, run =bin/setup= to install dependencies. You can also run =bin/console= for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run =bundle exec rake install=. To release a new version, update the version number in =version.rb=, and then run =bundle exec rake release=, which will create a git tag for the version, push git commits and tags, and push the =.gem= file to [[https://rubygems.org][rubygems.org]].

  • Contributing

Bug reports and pull requests are welcome.

You need to get in touch with me by email, till I figure how to enable it in Gitea.

  • License

[[https://opensource.org/licenses/MIT][MIT License]].

FAQs

Package last updated on 02 Nov 2023

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts