github.com/caltechlibrary/dataset

Dataset Project

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The Dataset Project provides tools for working with collections of JSON Object documents stored on the local file system. Two tools are provided.

dataset command line tool

dataset is a command line tool for working with collections of JSON objects. Collections are stored on the file system. JSON objects are stored in collections as plain UTF-8 text files. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The dataset command line tool supports common data management operations such as initialization of collections; document creation, reading, updating and deleting; listing keys of JSON objects in the collection; and associating non-JSON documents (attachments) with specific JSON documents in the collection.

Enhanced features include:

  • aggregate objects into data frames
  • import, export and synchronize JSON objects to and from CSV files
  • generate sample sets of keys and objects

See Getting started with dataset for a tour and tutorial.

dataset as a web service

datasetd is a web service implementation of the dataset command line program. It offers a subset of the capabilities found in the command line tool. This allows dataset collections to be integrated safely into other web applications or used by multiple processes.

Design choices

dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way.

dataset and datasetd are guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called 'mycollection.ds').

  • dataset and datasetd store JSON object documents in collections
    • collections are folder(s) containing
      • collection.json metadata file describing the collection and keys
      • a pairtree of JSON object documents
      • non-JSON attachments can be associated with a JSON document and found in a semver (semantic version number) named sub directory
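To illustrate the pairtree idea, here is a simplified Go sketch that turns a key into a nested directory path by splitting it into two-character segments. It is an illustration under stated assumptions, not dataset's actual code: the function name `pairtreePath` is hypothetical, the character escaping described in the Pairtree specification is omitted, and lowercasing the key before splitting is an assumption based on the note that keys are stored in lower case. Where the JSON file sits inside the final directory is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// pairtreePath lowercases a key and splits it into two-character segments,
// one directory level per segment. A trailing odd character becomes its own
// segment. Simplification: no Pairtree character escaping is applied.
func pairtreePath(key string) string {
	runes := []rune(strings.ToLower(key))
	segments := []string{}
	for i := 0; i < len(runes); i += 2 {
		end := i + 2
		if end > len(runes) {
			end = len(runes)
		}
		segments = append(segments, string(runes[i:end]))
	}
	return strings.Join(segments, "/")
}

func main() {
	// "Miller-1907" is a made-up key used only for illustration.
	fmt.Println(pairtreePath("Miller-1907"))
}
```

Splitting keys this way keeps any single directory from accumulating a very large number of entries, which is the usual motivation for a pairtree layout.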

The choice of plain UTF-8 is intended to help future-proof reading dataset collections. Care has been taken to keep dataset simple and lightweight enough to run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language that supports file input and output, common string operations, and JSON encoding and decoding. The current implementation is in the Go language.

Features

dataset supports

  • Listing Keys in a collection
  • Object level actions
  • Import and export of CSV files
  • The ability to reshape data by performing simple object joins
  • The ability to create data frames from whole collections or based on key lists
    • frames are defined using dot paths describing what is to be pulled out of stored JSON objects
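The dot path idea can be sketched in a few lines of Go. This is an illustration only: the exact dot path syntax dataset accepts is not described here, so the sketch assumes a leading dot followed by field names separated by dots (for example `.name.family`). The functions `lookup` and `frameRow` and the sample object are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookup follows a dot path such as ".name.family" through nested JSON
// objects. The second return value is false when a step is missing or when
// an intermediate value is not an object.
func lookup(obj map[string]interface{}, dotPath string) (interface{}, bool) {
	var current interface{} = obj
	for _, field := range strings.Split(strings.TrimPrefix(dotPath, "."), ".") {
		m, ok := current.(map[string]interface{})
		if !ok {
			return nil, false
		}
		current, ok = m[field]
		if !ok {
			return nil, false
		}
	}
	return current, true
}

// frameRow pulls the labeled dot paths out of one object. Paths that do not
// resolve are left out of the row.
func frameRow(obj map[string]interface{}, paths map[string]string) map[string]interface{} {
	row := map[string]interface{}{}
	for label, p := range paths {
		if v, ok := lookup(obj, p); ok {
			row[label] = v
		}
	}
	return row
}

func main() {
	src := []byte(`{"name": {"family": "Miller", "given": "Frieda"}, "year": 1907}`)
	obj := map[string]interface{}{}
	if err := json.Unmarshal(src, &obj); err != nil {
		fmt.Println(err)
		return
	}
	row := frameRow(obj, map[string]string{"family": ".name.family", "year": ".year"})
	fmt.Println(row["family"], row["year"])
}
```

Applying `frameRow` to every object in a key list would yield one row per object, which is the general shape of a data frame.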

datasetd supports

Both dataset and datasetd may be useful for general data science applications needing intermediate JSON object management but not a full-blown database or repository system.

Limitations of dataset and datasetd

dataset has many limitations. Some are listed below:

  • it is not a multi-process, multi-user data store
  • it is not a general purpose database system
  • it does not supply automatic version control on collections, objects or attachments
  • it stores all keys in lower case in order to deal with file systems that are not case sensitive
  • it does not have a built-in query language, search or sorting
  • it should NOT be used for sensitive or secret information

datasetd is a simple web service intended to run on "localhost:8485".

  • it is not a RESTful service
  • it does not include support for authentication
  • it does not support a query language, search or sorting
  • it does not support data frames
  • it does not support access control by users or roles
  • it does not provide auto key generation or versioning
  • it limits the size of JSON documents stored to less than 1 MiB
  • it limits the size of attached files to less than 250 MiB
  • it does not support partial JSON record updates or retrieval
  • it does not provide an interactive Web UI for working with dataset collections
  • it does not support HTTPS or "at rest" encryption
  • it should NOT be used for sensitive or secret information
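A client can check the JSON size limit before sending a document. The Go sketch below is an assumption-laden illustration: the helper `encodeWithinLimit` is hypothetical, and it assumes the limit applies to the length in bytes of the encoded JSON, which is not spelled out above. It makes no request to datasetd.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxJSONSize is 1 MiB; documents must be smaller than this.
const maxJSONSize = 1 << 20

// encodeWithinLimit marshals obj and returns an error when the encoded
// document is not strictly smaller than 1 MiB.
func encodeWithinLimit(obj interface{}) ([]byte, error) {
	src, err := json.Marshal(obj)
	if err != nil {
		return nil, err
	}
	if len(src) >= maxJSONSize {
		return nil, fmt.Errorf("encoded document is %d bytes, must be under %d", len(src), maxJSONSize)
	}
	return src, nil
}

func main() {
	// A made-up object used only for illustration.
	src, err := encodeWithinLimit(map[string]string{"name": "Frieda"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(src), "bytes, within the limit")
}
```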

Authors and history

  • R. S. Doiel
  • Tommy Morrell

Releases

Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 10 (x86) and Raspberry Pi OS (ARM7).

github.com/caltechlibrary/dataset/releases

You can use dataset from Python via the py_dataset package.

Package last updated on 13 Oct 2021
