format_parser
is a Ruby library for prying open video, image, document, and audio files.
It includes a number of parser modules that try to recover metadata useful for post-processing and layout while reading the absolute
minimum amount of data possible.
format_parser is inspired by imagesize, fastimage and dimensions, borrowing from them where appropriate.
Currently supported filetypes:
- AAC
- AIFF
- ARW
- BMP
- CR2
- CR3
- DOCX
- DPX
- FDX
- FLAC
- GIF
- HEIC
- HEIF
- JPEG
- JSON
- M3U
- M4A
- M4B
- M4P
- M4R
- M4V
- MOV
- MP3
- MP4
- MPEG
- NEF
- OGG
- PDF
- PNG
- PPTX
- PSD
- RW2
- TIFF
- WAV
- WEBP
- XLSX
- ZIP
...with more on the way!
Basic usage
Pass an IO object that responds to read, seek and size to FormatParser.parse and the first confirmed match will be returned.
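require "format_parser"
match = FormatParser.parse(File.open("myimage.jpg", "rb"))
match.nature            # broad category of the file, such as :image
match.format            # detected format, such as :jpg
match.display_width_px  # display width in pixels
match.display_height_px # display height in pixels
match.orientation       # image orientation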
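Any object with those three methods will do, not only a File. The sketch below is not part of format_parser: ByteCountingIO is a hypothetical wrapper around a StringIO that exposes read, seek and size and records how many bytes were read, and png_dimensions is an illustrative stand-in for a parser that looks only at the first 24 bytes of a PNG. It assumes positional semantics similar to Ruby's own IO; the exact contract FormatParser expects may differ.

```ruby
require "stringio"

# Hypothetical wrapper (not part of format_parser): exposes the three
# methods mentioned above and counts the bytes that were actually read.
class ByteCountingIO
  attr_reader :bytes_read

  def initialize(io)
    @io = io
    @bytes_read = 0
  end

  def read(n)
    data = @io.read(n)
    @bytes_read += data.bytesize if data
    data
  end

  def seek(offset)
    @io.seek(offset)
  end

  def size
    @io.size
  end
end

PNG_SIGNATURE = "\x89PNG\r\n\x1a\n".b

# Illustrative stand-in for a parser: reads only the PNG signature and the
# start of the IHDR chunk (24 bytes) to recover width and height.
def png_dimensions(io)
  io.seek(0)
  header = io.read(24)
  return nil unless header && header.bytesize == 24
  return nil unless header.byteslice(0, 8) == PNG_SIGNATURE
  return nil unless header.byteslice(12, 4) == "IHDR"

  width, height = header.byteslice(16, 8).unpack("N2")
  { width_px: width, height_px: height }
end

# A fake PNG: valid signature and IHDR start followed by 10 KB of padding.
fake_png = PNG_SIGNATURE + [13].pack("N") + "IHDR" + [320, 240].pack("N2") + ("\x00" * 10_240)
io = ByteCountingIO.new(StringIO.new(fake_png))

result = png_dimensions(io)
puts result.inspect
puts "read #{io.bytes_read} of #{io.size} bytes"
```

Wrapping the source this way is also a cheap way to check, in a test, that a parser really does stay within the first few bytes of a large file.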
You can also use parse_http, passing a URL, or parse_file_at, passing a path:
match = FormatParser.parse_http('https://upload.wikimedia.org/wikipedia/commons/b/b4/Mardin_1350660_1350692_33_images.jpg')
match.nature
match.format
If you would rather receive all potential results from the gem, call the gem as follows:
array_of_results = FormatParser.parse(File.open("myimage.jpg", "rb"), results: :all)
You can also optimize the metadata extraction by providing hints to the gem:
FormatParser.parse(File.open("myimage", "rb"), natures: [:video, :image], formats: [:jpg, :png, :mp4], results: :all)
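One way to picture what the hints do: they shrink the list of parsers that get a chance to look at the file. The following is only a sketch under that assumption, not format_parser's actual dispatch code. PARSERS, parsers_for and detect are hypothetical names, the "parsers" are lambdas that merely check magic bytes, and :first is this sketch's own label for the default behaviour of returning the first confirmed match.

```ruby
require "stringio"

# Hypothetical registry: each entry declares its nature and format and
# a callable that returns a result Hash or nil.
PARSERS = [
  { nature: :image, format: :png,
    call: ->(io) { io.read(4) == "\x89PNG".b ? { nature: :image, format: :png } : nil } },
  { nature: :image, format: :gif,
    call: ->(io) { io.read(3) == "GIF" ? { nature: :image, format: :gif } : nil } },
  { nature: :document, format: :pdf,
    call: ->(io) { io.read(4) == "%PDF" ? { nature: :document, format: :pdf } : nil } }
].freeze

# Keep only the parsers that match the hints; nil means "no restriction".
def parsers_for(natures: nil, formats: nil)
  PARSERS.select do |parser|
    (natures.nil? || natures.include?(parser[:nature])) &&
      (formats.nil? || formats.include?(parser[:format]))
  end
end

# Try each candidate from the start of the input. With results: :first the
# first confirmed match is returned, with results: :all every match is.
def detect(io, natures: nil, formats: nil, results: :first)
  matches = []
  parsers_for(natures: natures, formats: formats).each do |parser|
    io.seek(0)
    match = parser[:call].call(io)
    next unless match
    return match if results == :first

    matches << match
  end
  results == :first ? nil : matches
end

gif = StringIO.new("GIF89a".b + "\x00".b * 32)
detect(gif)                                   # only the GIF parser confirms
detect(gif, formats: [:png, :pdf])            # GIF parser never runs, so no match
detect(gif, natures: [:image], results: :all) # Array with the single GIF match
```

In this model a hint saves work only because excluded parsers never touch the IO, which is why narrowing natures and formats can reduce the number of reads.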
Return values of all parsers have built-in JSON serialization:
img_info = FormatParser.parse(File.open("myimage.jpg", "rb"))
JSON.pretty_generate(img_info)
To convert the result to a Hash or a structure suitable for JSON serialization:
img_info = FormatParser.parse(File.open("myimage.jpg", "rb"))
img_info.as_json
img_info.as_json(stringify_keys: true)
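For readers wondering how a plain Ruby object can support both calls, here is a minimal sketch of the pattern. ImageInfo is a hypothetical stand-in, not one of format_parser's result classes, and its handling of stringify_keys: is an assumption modelled on the call above.

```ruby
require "json"

# Hypothetical result object (not format_parser's own class).
ImageInfo = Struct.new(:nature, :format, :width_px, :height_px) do
  # Return a plain Hash, optionally with String keys.
  def as_json(stringify_keys: false)
    hash = to_h
    stringify_keys ? hash.transform_keys(&:to_s) : hash
  end

  # JSON.generate and JSON.pretty_generate call to_json on objects
  # they do not serialize natively.
  def to_json(*args)
    as_json.to_json(*args)
  end
end

info = ImageInfo.new(:image, :jpg, 320, 240)
info.as_json                        # Hash with Symbol keys
info.as_json(stringify_keys: true)  # Hash with String keys
puts JSON.pretty_generate(info)
```

Keeping as_json separate from to_json lets callers merge the Hash into a larger payload before anything is turned into a string.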
Creating your own parsers
See the section on writing parsers in CONTRIBUTING.md
Design rationale
We need to recover metadata from various file types, and we need to do so satisfying the following constraints:
- The data in those files can be malicious and/or incomplete, so we need to be failsafe
- The data will be fetched from a remote location (S3), so we want to obtain it with as few HTTP requests as possible
- ...and with the amount of data fetched being small - the number of HTTP requests being of greater concern
- The data can be recognized ambiguously and match more than one format definition (like TIFF sections of camera RAW)
- The information necessary is a small subset of the overall metadata available in the file.
- The number of supported formats is only ever going to increase, not decrease
- The library is likely to be used in multiple consumer applications
- The library is likely to be used in multithreading environments
Deliberate design choices
Therefore we adopt the following approaches:
- Modular parsers per file format, with some degree of code sharing between them (but not too much). Adding new formats
should be low-friction, and testing these format parsers should be possible in isolation
- Modular and configurable IO stack that supports limiting reads/loops from the source entity.
The IO stack is isolated from the parsers, meaning parsers do not need to care about things
like fetches using Range: headers, GZIP compression and the like
- A caching system that allows us to ideally fetch once, and only once, and as little as possible - but still accommodate formats
that have the important information at the end of the file or might need information from the middle of the file
- Minimal dependencies, and if dependencies are to be used they should be very stable and low-level
- Where possible, use small subsets of full-feature format parsers since we only care about a small subset of the data.
- When a choice arises between using a dependency or writing a small parser, write the small parser since less code
is easier to verify and test, and we likely don't care about all the metadata anyway
- Avoid using C libraries which are likely to contain buffer overflows/underflows - we stay memory safe
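To make the IO stack and caching ideas above more concrete, the sketch below layers a read budget and a page cache over a fake remote source. It is an illustration under assumptions, not the library's implementation: FakeRemote, PageCache and ReadLimiter are hypothetical names, the 1 KB page size is arbitrary, and fetch_range stands in for one HTTP request with a Range: header.

```ruby
# Stands in for a remote file; each fetch_range call models one HTTP
# request carrying a Range: header.
class FakeRemote
  attr_reader :requests, :size

  def initialize(bytes)
    @bytes = bytes
    @size = bytes.bytesize
    @requests = 0
  end

  def fetch_range(offset, length)
    @requests += 1
    @bytes.byteslice(offset, length) || "".b
  end
end

# Fetches whole pages and keeps them, so repeated or neighbouring reads
# do not trigger another request.
class PageCache
  def initialize(remote, page_size: 1024)
    @remote = remote
    @page_size = page_size
    @pages = {}
    @pos = 0
  end

  def size
    @remote.size
  end

  def seek(offset)
    @pos = offset
  end

  def read(n)
    return "".b if n.zero?
    return nil if @pos >= size

    first = @pos / @page_size
    last = (@pos + n - 1) / @page_size
    joined = (first..last).map { |index| page(index) }.join
    data = joined.byteslice(@pos - first * @page_size, n)
    @pos += data.bytesize
    data
  end

  private

  def page(index)
    @pages[index] ||= @remote.fetch_range(index * @page_size, @page_size)
  end
end

# Refuses to read past a byte budget, so a malformed file cannot make a
# parser read without bound.
class ReadLimiter
  class BudgetExceeded < StandardError; end

  def initialize(io, max_bytes:)
    @io = io
    @remaining = max_bytes
  end

  def size
    @io.size
  end

  def seek(offset)
    @io.seek(offset)
  end

  def read(n)
    raise BudgetExceeded, "read budget exhausted" if n > @remaining

    @remaining -= n
    @io.read(n)
  end
end

remote = FakeRemote.new(("A" * 4096).b)
io = ReadLimiter.new(PageCache.new(remote), max_bytes: 64)

io.read(16)   # fetches page 0: first request
io.seek(8)
io.read(16)   # served from the cached page, no new request
io.seek(4090)
io.read(6)    # last bytes of the file: fetches page 3, second request
puts remote.requests # prints 2
```

Because both wrappers expose the same read, seek and size methods as the source they wrap, a parser written against that interface never needs to know whether it is reading a local file or a cached remote one.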
Acknowledgements
We are incredibly grateful to Remco van't Veer for exifr and to
Krists Ozols for id3tag that we are using for crucial tasks.
Fixture Sources
Unless specified otherwise in this section, the fixture files are MIT licensed and come from the FastImage and Dimensions projects.
AAC
- Original music files: “Furious Freak” and “Galway”, Kevin MacLeod (incompetech.com), Licensed under Creative Commons: By Attribution 3.0, http://creativecommons.org/licenses/by/3.0/
- The AAC samples were converted from 'wav' format and made available by Espressif Systems, as part of their audio development framework (under the ESPRESSIF MIT License).
- Files:
- ff-16b-2c-44100hz.aac
- ff-16b-1c-44100hz.aac
- gs-16b-2c-44100hz.aac
- gs-16b-1c-44100hz.aac
AIFF
- fixture.aiff was created by one of the project maintainers and is MIT licensed
ARW
CR2
CR3
DOCX
- The .docx files were generated by the project maintainers
DPX
- DPX files were created by one of the project maintainers and may be used with the library for the purposes of testing
ERF
FDX
- fixture.fdx was created by one of the project maintainers and is MIT licensed
FLAC
- atc_fixture_vbr.flac is a converted version of the MP3 with the same name
- c_11k16btipcm.flac is a converted version of the WAV with the same name
JPEG
- divergent_pixel_dimensions_exif.jpg is used with permission from LiveKom GmbH
- extended_reads.jpg has kindly been made available by Raphaelle Pellerin for use exclusively with format_parser
- too_many_APP1_markers_surrogate.jpg was created by the project maintainers
JPEG (EXIF orientation)
KEY
- The keynote_recognized_as_jpeg.key file was created by the project maintainers
M3U
- The M3U fixture files were created by one of the project maintainers
MOV
MP3
- Cassy.mp3 has been produced by WeTransfer and may be used with the library for the purposes of testing
MP4
MPEG
NEF
OGG
- hi.ogg, vorbis.ogg, with_confusing_magic_string.ogg, invalid_with_garbage_at_the_end.ogg have been generated by the project contributors
PDF
- PDF 2.0 files were downloaded from the PDF Association public GitHub repository. These files are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
- Lorem Ipsum PDF files created at WeTransfer for this project.
PNG
RW2
TIFF
- Shinbutsureijoushuincho.tiff is obtained from Wikimedia Commons and is Creative Commons licensed
- IMG_9266_*.tif and all its variations were created by the project maintainers
WAV
- c_11k16bitpcm.wav and c_8kmp316.wav are from the Wikipedia article on WAV, retrieved January 7, 2018
- c_39064__alienbomb__atmo-truck.wav is from freesound and is CC0 licensed
- c_M1F1-Alaw-AFsp.wav and invalid_d_6_Channel_ID.wav are from a McGill Engineering site
WEBP
- With the exception of extended-animation.webp, which was obtained from Wikimedia Commons and is Creative Commons
licensed, all of the WebP fixture files have been created by one of the project maintainers.
ZIP
- The .zip fixture files have been created by the project maintainers
Copyright
Copyright (c) 2020 WeTransfer.
format_parser is distributed under the conditions of the Hippocratic License
- See LICENSE.txt for further details.