cuttlefish
A simple low-level synchronizing library for Joyent Manta.
USAGE
var cuttlefish = require('cuttlefish');
var fishy = cuttlefish({
path: '/isaacs/stor/path/on/manta/to/stuff',
client: myMantaClient,
request: function(filename, cb) {
// provide a stream for this file, if it needs to be sent
},
files: {
'filename.txt': {},
'dir/a.txt': {
size: 1234,
'md5': 'KwAEL3SBx7BWxLQQ0o8zzw==',
}
}
});
// A fuller example, showing every option
var fishy = cuttlefish({
path: '/isaacs/stor/path/on/manta/to/stuff',
client: myMantaClient,
request: function(filename, cb) {
// provide a stream for this file, if it needs to be sent
},
files: {
'filename.txt': {
size: 1234,
type: 'text/plain',
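// md5 can be given as Base64, hex, or a Buffer; pick one form
'md5': 'KwAEL3SBx7BWxLQQ0o8zzw==',
// 'md5': '2b00042f7481c7b056c4b410d28f33cf',
// 'md5': Buffer.from('2b00042f7481c7b056c4b410d28f33cf', 'hex'),
// or use the digest alias, with an md5- prefix
// digest: 'md5-KwAEL3SBx7BWxLQQ0o8zzw==',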
headers: {
'access-control-allow-origin': '*',
'access-control-allow-methods': 'GET',
'x-fry-is': 'the greetest'
},
},
'sub/folder/file.txt': {
},
...
},
headers: {
'access-control-allow-origin': '*',
'access-control-allow-methods': 'GET'
},
delete: true,
onlyDelete: true,
dryRun: true
})
fishy.on('file', function(status, file, data) {
if (status === 'error')
console.error('%s failed: %s', file.name, data.stack)
else if (status === 'match')
console.error('%s already there', file.name)
else
console.error('%s ok!', file.name)
})
fishy.on('complete', function(error, data) {
if (error)
console.error('it did not go well. first error was %s', error.stack)
else
console.log('ok! %d files uploaded', Object.keys(data).length)
})
Options
client {MantaClient object} Required client for accessing Manta
files {Object} The {<name>:<File>,...} hash. See below for
the fields that can be specified on each File object.
path {String} The path on Manta where the stuff gets synced to
request {Function} Function that gets a stream to send, if
appropriate
concurrency {Number} The max number of tasks to be doing at any
one time. Default = 50.
timeout {Number} Optional max amount of time to wait for any remote
task to complete, in ms. Default = Infinity
headers {Object} Optional headers to send with every PUT
operation. Does not check for or overwrite headers on pre-existing
remote objects.
delete {Boolean} Set to true to delete remote files that are not
found in the files hash. Default = false
onlyDelete {Boolean} Set to true to only delete remote files
that are not found in the files hash, but do not send any new
files. Implies delete. Default = false
dryRun {Boolean} Don't actually put or delete any files, but act
as if it would, performing the same length and MD5 comparisons etc.
getMd5 {Function} Optionally provide a getter function to look up
md5 values only when necessary. This is handy if you have large
files, and don't want to look up md5 checksums unless necessary
because the file lengths match.
timingDebug {Boolean} Optionally dump a bunch of timing info to
stderr. Defaults to false, obviously.
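As a rough illustration of the request option, here is a sketch of a function that hands back a readable stream for a local file. It assumes the callback follows the usual Node cb(error, stream) convention, which the option list above does not spell out. The names makeRequest and localRoot are hypothetical and not part of cuttlefish.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: returns a function that could be passed as the
// `request` option, serving files from a local directory.
// ASSUMPTION: cuttlefish calls request(filename, cb) and expects
// cb(error, stream). Check the source before relying on this.
function makeRequest(localRoot) {
  return function request(filename, cb) {
    // filename is the key from the files hash, e.g. 'dir/a.txt'
    var stream = fs.createReadStream(path.join(localRoot, filename));
    cb(null, stream);
  };
}
```

It would then be wired in as request: makeRequest('/some/local/dir') when constructing the cuttlefish object.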
File Objects
Cuttlefish's file objects have the following fields. When you specify
one of the aliases, it'll be changed to the canonical name.
md5 The md5 checksum of the file. Can be in Base64, Hex, or
Buffer format, or come with a md5- prefix. Aliases:
content-md5, computed-md5, digest
size The length of the file in bytes. Aliases: length,
content-length, content_length, contentLength
type The type of the file. Aliases: content-type,
contentType, content_type, mime-type, mime_type, mimeType
headers Additional headers to pass to the Manta PUT operation.
Does not check against headers for pre-existing files.
skip A boolean to say "do not send this file, even if the remote
does not have it, or the md5 doesn't match". This is useful if you
want to only remove certain missing files, but not all. A skipped
file will be emitted as a match if the remote has a file by that
name, without checking length or md5. If the remote does not have a
copy of the file, then it is still not sent, but is not emitted as a
match.
name The key in the files hash. When file objects are cast to a
string, their name field is returned.
mkdirs Boolean true, but only because the file is passed as an
argument to a Manta PUT operation.
started Boolean false before the file is processed, true once
it starts.
error Error object or null depending on whether the file
encountered an error.
status Starts as null, but eventually changes to one of
'sent', 'match', or 'error'
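To make the alias handling concrete, here is a minimal sketch of how aliases could be folded into the canonical names and how the accepted md5 forms could be reduced to one. The alias table is copied from the list above. Everything else (the function names, and the choice of Base64 as the canonical md5 form) is an assumption for illustration, not cuttlefish's actual implementation.

```javascript
// Alias table taken from the field list above. The code around it is
// an illustrative sketch, not cuttlefish's own implementation.
var ALIASES = {
  md5: ['content-md5', 'computed-md5', 'digest'],
  size: ['length', 'content-length', 'content_length', 'contentLength'],
  type: ['content-type', 'contentType', 'content_type',
         'mime-type', 'mime_type', 'mimeType']
};

// Convert any accepted md5 form to a Base64 string.
// ASSUMPTION: Base64 is used as the canonical form.
function normalizeMd5(value) {
  if (Buffer.isBuffer(value)) return value.toString('base64');
  var str = String(value).replace(/^md5-/, '');
  // Exactly 32 hex characters means the checksum is in hex form
  if (/^[0-9a-fA-F]{32}$/.test(str))
    return Buffer.from(str, 'hex').toString('base64');
  return str; // otherwise assume it is already Base64
}

// Return a copy of `file` with aliases renamed to canonical fields.
function canonicalize(file) {
  var out = {};
  Object.keys(file).forEach(function (key) {
    var canonical = key;
    Object.keys(ALIASES).forEach(function (name) {
      if (ALIASES[name].indexOf(key) !== -1) canonical = name;
    });
    out[canonical] = file[key];
  });
  if (out.md5 !== undefined) out.md5 = normalizeMd5(out.md5);
  return out;
}

console.log(canonicalize({
  'content-length': 1234,
  digest: 'md5-KwAEL3SBx7BWxLQQ0o8zzw=='
}));
// { size: 1234, md5: 'KwAEL3SBx7BWxLQQ0o8zzw==' }
```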
Remote Objects
Remote objects will be either of the sort returned by Manta's ftw
operation, or returned by Manta's info operation if an md5 checksum
is provided and the ftw data does not contain it. Additionally,
they will have the following fields:
status One of 'sent', 'match', 'error', or 'deleted'
_path The full path of the remote object in Manta
_remote The path relative to Cuttlefish's directory (corresponding
to the local file.name property)
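The comparison between a local file and its remote counterpart, as this README describes it (length first, md5 only when needed, and the skip rule), can be sketched as a small decision function. This is a simplified model rather than cuttlefish's code. It assumes the remote object exposes comparable size and md5 fields, and the return values 'send', 'skip', and 'need-info' are hypothetical labels.

```javascript
// Hypothetical decision helper. `local` is a file object with the
// canonical size/md5/skip fields. `remote` is the remote object, or
// null when nothing exists under that name.
// ASSUMPTION: the remote exposes comparable `size` and `md5` fields;
// the real field names come from Manta's ftw and info results.
function decide(local, remote) {
  if (local.skip) {
    // Skipped files are never sent. They count as a match only when
    // the remote has something by that name.
    return remote ? 'match' : 'skip';
  }
  if (!remote) return 'send';
  // Cheap check first: files of different lengths cannot match.
  if (local.size !== undefined && local.size !== remote.size) return 'send';
  if (local.md5 !== undefined) {
    // The walk data may lack a checksum, in which case an info
    // lookup is needed before the files can be compared.
    if (remote.md5 === undefined) return 'need-info';
    if (local.md5 !== remote.md5) return 'send';
  }
  return 'match';
}
```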
Events
The cuttlefish object is an event emitter that emits the following
events.
error
Emitted when there is a problem. This means something bad has
happened, which is probably unrecoverable. The error object may
have a file or task object attached with additional information.
complete
results {Object} Collection of result information
Emitted when the sync operation is finished. The results object
contains as much information as cuttlefish has about all the remote
objects it saw, as well as the status of each remote object
('delete', 'sent', or 'match').
file
file {Object} An object representing the file that was processed
status {String} Either the string 'sent' or 'match'
remote {Object} An object representing the remote file
This is emitted whenever a local file is processed, to tell you that
either it was sent, or it was skipped because it matches the remote
file.
task
task {Object} The task being performed
This is emitted whenever a new async task is scheduled.
delete
path {String} The remote path that is deleted (relative to
cuttlefish's root path)
remote {Object} The remote object info
Emitted whenever a remote file or directory is deleted.
send
file {Object} The local file being sent
result {Object} The results of the send operation
Emitted when a file is sent.
match
file {Object} The local file that matches
remote {Object} The remote data that it matches against
ftw
path {String} The remote path being walked
This is emitted when we're about to process the list of remote files.
It will usually be emitted. The only time it wouldn't be emitted is
if there's an error instead, or if the remote path doesn't exist (so
there's nothing to walk), or in the trivial case where we're not
sending any files and not deleting extra files and folders.
entry
entry {Object} Remote object info
Emitted for each remote entry encountered in the ftw process.
unlink
remote {Object} The remote object info
result {Object} The results of the unlink operation
Emitted when a remote object is unlinked.
rmr
remote {Object} The remote object info
result {Object} The results of the rmr operation
Emitted when a remote directory is removed.
info
file {Object} The local file being queried for
result {Object} The results of the info operation
Emitted when cuttlefish has to look up the detailed info about a
remote object. Currently, this is only done when it is necessary to
compare the md5 value.
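Since the cuttlefish object is a plain event emitter, the events above can be consumed with ordinary listeners. The sketch below tallies a few of them. It runs against a stand-in EventEmitter so that it is self-contained, and the tally helper is a hypothetical name, not part of the library.

```javascript
var EventEmitter = require('events').EventEmitter;

// Count how often each named event fires on any emitter.
function tally(emitter, names) {
  var counts = {};
  names.forEach(function (name) {
    counts[name] = 0;
    emitter.on(name, function () { counts[name] += 1; });
  });
  return counts;
}

// Stand-in emitter for demonstration. In real use, pass the object
// returned by cuttlefish(...) instead.
var fake = new EventEmitter();
var counts = tally(fake, ['send', 'match', 'delete']);
fake.emit('send', { name: 'a.txt' }, {});
fake.emit('match', { name: 'b.txt' }, {});
fake.emit('match', { name: 'c.txt' }, {});
console.log(counts); // { send: 1, match: 2, delete: 0 }
```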
FAQ
These questions may or may not be frequently asked, but I predict that
you might ask them, so here they are.
Can it recursively add directories?
No. You feed the cuttlefish a bunch of things that you want it to
sync. It just figures out what to sync, and then tells you when it's
all done. Probably what you want is
manta-sync.
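If you would rather build the files hash yourself, a minimal sketch of walking a local directory might look like the following. The buildFiles helper is hypothetical and not part of cuttlefish. It reads each file fully into memory to compute the md5, so it only suits modest trees.

```javascript
var crypto = require('crypto');
var fs = require('fs');
var path = require('path');

// Hypothetical helper: walk `root` and build a { name: file } hash
// with the size and Base64 md5 of every regular file.
function buildFiles(root, prefix, files) {
  prefix = prefix || '';
  files = files || {};
  fs.readdirSync(path.join(root, prefix)).forEach(function (entry) {
    // Keys use forward slashes, like 'dir/a.txt' in the usage example
    var name = prefix ? prefix + '/' + entry : entry;
    var full = path.join(root, name);
    var stat = fs.statSync(full);
    if (stat.isDirectory()) {
      buildFiles(root, name, files);
    } else if (stat.isFile()) {
      var data = fs.readFileSync(full);
      files[name] = {
        size: stat.size,
        md5: crypto.createHash('md5').update(data).digest('base64')
      };
    }
  });
  return files;
}
```

The result could then be passed as the files option, with a request function that streams the matching local file.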
I have a billion files. The stats won't fit in memory!
Another way to approach this problem would have been to have a more
stream-like fishy.addFile(file) method, instead of requiring that
you provide all the file stat info up front.
However, that approach requires that an extra call be made for each
file to get the remote info, and a ftw at the end to clean up extra
files that need to be deleted. And, in order to handle deleting files
at all, you have to keep the names around anyway, which would
eventually hit a memory limit (albeit a much higher one).
That approach would require about twice as many HTTP calls, and an ftw
walk. If you have a small to medium number of files (i.e., a million or
fewer), many of which are already present on Manta, and are typically
setting delete: true, then cuttlefish's approach is much more
efficient.
A similarly efficient approach would be to require that you provide
another getter function to provide the file stat information, and then
a second getter to provide the file stream if needed, so that nothing
is stored in memory, and everything flows through, synchronizing
elegantly.
That's a much fancier lib with a more elaborate API, which should be
called thaumoctopus. If you have this use case, you should go write
it.
Why "cuttlefish"?
The Joyent Manta service has a venerable tradition of naming things
after sea creatures. The cuttlefish is a little thing with tentacles
that stays on the bottom of the sea, and mirrors whatever it's placed
against. So it is a natural fit for a low-level synchronizing utility.