Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

shmr

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

shmr

A set of map-reduce high-order functions to use with parallel or xargs

  • 1.4.5
  • PyPI
  • Socket score

Maintainers
1

SHMR

A set of high-order map-reduce functions

PyPI Python GitHub Issues Contributions welcome License

Table of Contents

Introduction

The goal of this library is to make it easier to process large data in parallel while not spending lots of time writing code. The typical workflow of this library is to split a huge file into smaller partitions and process each partition in parallel as the map-reduce framework. However, it does not require any setup as Spark or Hadoop except for one simple pip installation command. It is more suitable if you want to do something quick (and just one time).

Usage

This library, shmr, is best used in the command line with xargs or parallel. You can see the list of supported arguments by printing help:

$ python -m shmr -h
usage: sh map-reduce (1.0.18) [-h] [-v] -i INFILE [--skip_nrows SKIP_NROWS]
                              [-d DESER_FN] [-s SER_FN] <command>
...
optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  -i INFILE, --infile INFILE
                        the path to one partition or list of partitions depend
                        on the sub-program
  --skip_nrows SKIP_NROWS
                        Skip first n rows of each partition
  -d DESER_FN, --deser_fn DESER_FN
                        Deserialization function. Default is `orjson.loads`
  -s SER_FN, --ser_fn SER_FN
                        Serialization function. Default is `orjson.dumps`

The most important argument is the positional argument <command>, which is the operator you want to run. There are two types of operator: the one that is applied on one partition, starts with partition.*, and the other one that is applied on all partitions, starts with partitions.*. The help command above will show you all of the possible commands, which is truncated for readability. You can get details of a command using help as well. For example:

$ python -m shmr partition.map -h
usage: sh map-reduce (1.0.18) partition.map [-h] --fn FN --outfile OUTFILE

Apply a map function on every record of this partition

optional arguments:
  -h, --help         show this help message and exit
  --fn FN            (str) an import path of the map function, which should
                     has this signature: (record: Any) -> Any
  --outfile OUTFILE  (str) output file of the new partition, `*` or `{stem}`
                     in the path are placeholders, which will be replaced by
                     the stem (i.e., file name without extension) of the
                     current partition

Automatic partition naming

There are some variables you can use in naming the output partitions:

  1. {stem}: will be replaced by the stem of the current mapping partition
  2. {auto}: an incremental number of the new partition
  3. * or {}: a special placeholder that is either {stem} or {auto}, depends on the function you are using. If the function generates multiple partitions (e.g., group_by function), then * or {} will be replaced by {auto}, otherwise, it will be replaced by {stem}. Note that {stem} of multiple partitions will always be replaced by an empty string

Installation

From PyPi: pip install shmr

Examples

Below are some examples:

  1. Split one file (partition) to multiple files (partitions)
shmr -i <file_path> partitions.coalesce --outfile <output_files> --num_partitions=128
  1. Parallel applying a mapping function
ls <input_files> | xargs -n 1 -I{} -P <n_threads> shmr -i {} partition.map --fn <func> --outfile <output_file>

If you provide the -v, it will show the progression bar telling you how long it will take to process one partition.

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc