
Example query run on 10 GB of GZIP-compressed JSON data (>60 GB uncompressed)
Amazon S3 Select is one of the coolest features AWS released in 2018. Its benefits are:
Unfortunately, an S3 Select API call is limited to a single file on S3, and its syntax is quite cumbersome, which makes it impractical for daily use. These and other flaws are what the s3select command aims to fix.
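For a sense of how cumbersome the raw API is, here is roughly what a single-object query looks like through the plain aws-cli s3api command (the bucket, key, and output file below are only illustrative):
$ aws s3api select-object-content \
    --bucket testing.bucket \
    --key json_example/example.json \
    --expression "SELECT * FROM S3Object s LIMIT 3" \
    --expression-type SQL \
    --input-serialization '{"JSON": {"Type": "LINES"}}' \
    --output-serialization '{"JSON": {}}' \
    output.json
One object per call, serialization spelled out by hand, and the result written to a local file; s3select hides all of that behind a single prefix-based query.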
Most important features:
s3select is developed in Python and uses pip.
The easiest way to install/upgrade s3select is to use pip in a virtualenv:
$ pip install -U s3select
or, if you are not installing in a virtualenv, to install/upgrade globally:
$ sudo pip install -U s3select
or for your user:
$ pip install --user -U s3select
s3select uses the same authentication and endpoint configuration as aws-cli. If the aws command works on your machine, no additional configuration is needed.
First get some help:
$ s3select -h
usage: s3select [-h] [-w WHERE] [-d FIELD_DELIMITER] [-D RECORD_DELIMITER]
                [-l LIMIT] [-v] [-c] [-H] [-o OUTPUT_FIELDS] [-t THREAD_COUNT]
                [--profile PROFILE] [-M MAX_RETRIES]
                prefixes [prefixes ...]

s3select makes s3 select querying API much easier and faster

positional arguments:
  prefixes              S3 prefix (or more) beneath which all files are queried

optional arguments:
  -h, --help            show this help message and exit
  -w WHERE, --where WHERE
                        WHERE part of the SQL query
  -d FIELD_DELIMITER, --field_delimiter FIELD_DELIMITER
                        Field delimiter to be used for CSV files. If specified
                        CSV parsing will be used. By default we expect JSON input
  -D RECORD_DELIMITER, --record_delimiter RECORD_DELIMITER
                        Record delimiter to be used for CSV files. If specified
                        CSV parsing will be used. By default we expect JSON input
  -l LIMIT, --limit LIMIT
                        Maximum number of results to return
  -v, --verbose         Be more verbose
  -c, --count           Only count records without printing them to stdout
  -H, --with_filename   Output s3 path of a filename that contained the match
  -o OUTPUT_FIELDS, --output_fields OUTPUT_FIELDS
                        What fields or columns to output
  -t THREAD_COUNT, --thread_count THREAD_COUNT
                        How many threads to use when executing s3_select api
                        requests. Default of 150 seems to be on safe side.
                        If you increase this there is a chance you'll need also
                        to increase nr of open files on your OS
  --profile PROFILE     Use a specific AWS profile from your credential file.
  -M MAX_RETRIES, --max_retries MAX_RETRIES
                        Maximum number of retries per queried S3 object in case
                        API request fails
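If your credentials live in a named profile rather than the default one, the --profile option shown in the help selects it; the profile name below is just a placeholder:
$ s3select --profile my-profile -l 3 s3://testing.bucket/json_example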
It's always useful to peek at the first few lines of the input files to figure out their contents:
$ s3select -l 3 s3://testing.bucket/json_example/
{"name":"Gilbert","wins":[["straight","7♣"],["one pair","10♥"]]}
{"name":"Alexa","wins":[["two pair","4♠"],["two pair","9♠"]]}
{"name":"May","wins":[]}
It's JSON. Great - that's s3select's default format. Let's get a subset of its data:
$ s3select -l 3 -w "s.name LIKE '%Gil%'" -o "s.wins" s3://testing.bucket/json_example
{"wins":[["straight","7♣"],["one pair","10♥"]]}
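To see which object under the prefix each match came from, the -H/--with_filename option from the help prepends the S3 path to every result. A sketch against the same example prefix (output omitted here):
$ s3select -H -w "s.name LIKE '%Gil%'" -o "s.wins" s3://testing.bucket/json_example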
What if the input is not JSON?
$ s3select -l 3 s3://testing.bucket/csv_example
Exception caught when querying csv_example/example.csv: An error occurred (JSONParsingError) when calling the SelectObjectContent operation: Error parsing JSON file. Please check the file and try again.
The exception means the input isn't parsable JSON. Let's switch to a CSV file delimited with , (you can specify any other delimiter character; a common choice is TAB, specified with \t):
$ s3select -l 3 -d , s3://testing.bucket/csv_example
Gilbert,straight,7♣,one pair,10♥
Alexa,two pair,4♠,two pair,9♠
May,,,,
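For the tab-delimited case mentioned above, the same kind of query would pass \t as the field delimiter; the prefix below is hypothetical, and the single quotes keep the shell from touching the backslash:
$ s3select -l 3 -d '\t' s3://testing.bucket/tsv_example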
Since using the first line of the CSV as a header isn't supported yet, we'll select a subset of the data using column enumeration:
$ s3select -l 3 -d , -w "s._1 LIKE '%i%'" -o "s._2" s3://testing.bucket/csv_example
straight
three of a kind
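Since the -w and -o values are spliced into an S3 Select SQL statement, a comma-separated column list should also work for -o; this is an assumption rather than something the help text spells out:
$ s3select -l 3 -d , -w "s._1 LIKE '%i%'" -o "s._1, s._2" s3://testing.bucket/csv_example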
If you are interested in pricing for your requests, add -v to increase verbosity; pricing information will be included at the end:
$ s3select -v -c s3://testing.bucket/10G_sample
Files processed: 77/77
Records matched: 5696395
Bytes scanned: 21 GB
Cost for data scanned: $0.02
Cost for data returned: $0.00
Cost for SELECT requests: $0.00
Total cost: $0.02
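On large prefixes like this one you can also tune parallelism and robustness with the -t/--thread_count and -M/--max_retries options from the help above; the values here are only illustrative:
$ s3select -c -t 50 -M 5 s3://testing.bucket/10G_sample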
Distributed under the MIT license. See LICENSE for more information.