csvpack
work with tabular data packages using comma-separated values (CSV) datafiles in text with datapackage.json; download, read into and query comma-separated values (CSV) datafiles with your SQL database (e.g. SQLite, PostgreSQL, ...) of choice and much more
Usage
What's a tabular data package?
Tabular Data Package is a simple structure for publishing and sharing
tabular data with the following key features:
- Data is stored in CSV (comma separated values) files
- Metadata about the dataset both general (e.g. title, author)
and the specific data files (e.g. schema) is stored in a single JSON file
named
datapackage.json
which follows the Data Package format
(Source: Tabular Data Packages, Frictionless Data Initiative • Data Hub.io • Open Knowledge Foundation • Data Protocols.org)
Here's a minimal example of a tabular data package holding two files, that is, data.csv
and datapackage.json
:
data.csv
:
Brewery,City,Name,Abv
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
Augustiner Bräu München,München,Edelstoff,5.6%
Bayerische Staatsbrauerei Weihenstephan,Freising,Hefe Weissbier,5.4%
Brauerei Spezial,Bamberg,Rauchbier Märzen,5.1%
Hacker-Pschorr Bräu,München,Münchner Dunkel,5.0%
Staatliches Hofbräuhaus München,München,Hofbräu Oktoberfestbier,6.3%
...
datapackage.json
:
{
"name": "beer",
"resources": [
{
"path": "data.csv",
"schema": {
"fields": [{ "name": "Brewery", "type": "string" },
{ "name": "City", "type": "string" },
{ "name": "Name", "type": "string" },
{ "name": "Abv", "type": "number" }]
}
}
]
}
Where to find data packages?
For some real world examples see the Data Packages Listing (Sources) at the Data Hub.io • Frictionless Data Initiative
website for a start. Tabular data packages include:
Name | Comments |
---|
country-codes | Comprehensive country codes: ISO 3166, ITU, ISO 4217 currency codes and many more |
language-codes | ISO Language Codes (639-1 and 693-2) |
currency-codes | ISO 4217 Currency Codes |
gdb | Country, Regional and World GDP (Gross Domestic Product) |
s-and-p-500-companies | S&P 500 Companies with Financial Information |
un-locode | UN-LOCODE Codelist |
gold-prices | Gold Prices (Monthly in USD) |
bond-yields-uk-10y | 10 Year UK Government Bond Yields (Long-Term Interest Rate) |
and many more
Code, Code, Code - Script Your Data Workflow with Ruby
require 'csvpack'
CsvPack.import(
's-and-p-500-companies',
'gdb'
)
Using CsvPack.import
will:
-
download all data packages to the ./pack
folder
-
(auto-)add all tables to an in-memory SQLite database using SQL create_table
commands via ActiveRecord
migrations e.g.
create_table :constituents_financials do |t|
t.string :symbol
t.string :name
t.string :sector
t.float :price
t.float :dividend_yield
t.float :price_earnings
t.float :earnings_share
t.float :book_value
t.float :_52_week_low
t.float :_52_week_high
t.float :market_cap
t.float :ebitda
t.float :price_sales
t.float :price_book
t.string :sec_filings
end
- (auto-)import all datasets using SQL inserts e.g.
INSERT INTO constituents_financials
(symbol,
name,
sector,
price,
dividend_yield,
price_earnings,
earnings_share,
book_value,
_52_week_low,
_52_week_high,
market_cap,
ebitda,
price_sales,
price_book,
sec_filings)
VALUES
('MMM',
'3M Company',
'Industrials',
162.27,
2.11,
22.28,
7.284,
25.238,
123.61,
162.92,
104.0,
8.467,
3.28,
6.43,
'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
- (auto-)add ActiveRecord models for all tables.
So what? Now you can use all the "magic" of ActiveRecord to query
the datasets. Example:
pp Constituent.count
pp Constituent.first
pp Constituent.find_by!( symbol: 'MMM' )
pp Constituent.find_by!( name: '3M Company' )
pp Constituent.where( sector: 'Industrials' ).count
pp Constituent.where( sector: 'Industrials' ).all
and so on
Frequently Asked Questions (F.A.Qs) and Answers
Q: How to dowload a data package ("by hand")?
Use the CsvPack::Downloader
class to download a data package
to your disk (by default data packages get stored in ./pack
).
dl = CsvPack::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode')
Will result in:
-- pack
|-- language-codes
| |-- data
| | |-- ietf-language-tags.csv
| | |-- language-codes-3b2.csv
| | |-- language-codes-full.csv
| | `-- language-codes.csv
| `-- datapackage.json
|-- s-and-p-500-companies
| |-- data
| | `-- constituents.csv
| `-- datapackage.json
`-- un-locode
|-- data
| |-- code-list.csv
| |-- country-codes.csv
| |-- function-classifiers.csv
| |-- status-indicators.csv
| `-- subdivision-codes.csv
`-- datapackage.json
Q: How to add and import a data package ("by hand")?
Use the CsvPack::Pack
class to read-in a data package
and add and import into an SQL database.
pack = CsvPack::Pack.new( './pack/un-locode/datapackage.json' )
pack.tables.each do |table|
table.up!
table.import!
end
Q: How to connect to a different SQL database?
You can connect to any database supported by ActiveRecord. If you do NOT
establish a connection in your script - the standard (default fallback)
is using an in-memory SQLite3 database.
SQLite
For example, to create an SQLite3 database on disk - lets say mine.db
-
use in your script (before the CsvPack.import
statement):
ActiveRecord::Base.establish_connection( adapter: 'sqlite3',
database: './mine.db' )
PostgreSQL
For example, to connect to a PostgreSQL database use in your script
(before the CsvPack.import
statement):
require 'pg'
ActiveRecord::Base.establish_connection( adapter: 'postgresql'
username: 'ruby',
password: 'topsecret',
database: 'database' )
Install
Just install the gem:
$ gem install csvpack
Alternatives
See the "Tools and Plugins for working with Data Packages"
page at the Frictionless Data Initiative.
License
The csvpack
scripts are dedicated to the public domain.
Use it as you please with no restrictions whatsoever.
Send them along to the ruby-talk mailing list. Thanks!