Sciveyor MongoDB Tool
This is a utility designed to manipulate the MongoDB and Solr servers used to
store and search the documents in Sciveyor.
Building
- Check out the code, including the submodules:
git clone --recurse-submodules ...
- Build:
go build
- Run:
./mongo-tool ...
Requirements
- A MongoDB server, with a collection of documents that follow the schema
spelled out here. FIXME: At
some point in the future, this server will become more complex, with support
for other collections carrying information about disambiguated authors,
journals, and institutions. That support is not currently available in this
tool.
- A Solr server, pre-loaded with the schema described here. (FIXME: Not
currently available for public consumption. Watch this space; it needs more
debugging.)
Usage
A number of sub-commands can be used to perform various maintenance tasks on
the MongoDB and Solr servers. You can get a list of all those tasks by running
mongo-tool --help, and you can get more help on any command by running
mongo-tool <command> --help.
sync: Synchronize MongoDB to Solr
The tool can be used to perform a three-step synchronization of the content from
the MongoDB server to the Solr server. This is an extremely simple sync:
- For each document in the MongoDB database:
  - If it is present in the Solr database, but either its version or its
    dataSourceVersion parameter has changed, delete and re-create it in the
    Solr database.
  - If it is not present in the Solr database, create it.
- For each document in the Solr database:
  - If it is not present in the Mongo database, delete it.
Notably, this is not a proper atomic synchronization. Documents are deleted
and re-created, not partially updated (in Solr's terminology, we do not use
"atomic updates"). We also do not detect any changes other than in the two
version parameters. Version numbers must be bumped to trigger a sync. (This is
an intentional policy choice.)
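The decision logic described above can be sketched in a few lines of Go. This is only an illustration working on in-memory maps, not the tool's actual code: the Versions and Plan types, the planSync function, and the use of integers for the two version fields are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Versions holds the two fields the sync compares (integer types are assumed).
type Versions struct {
	Version           int
	DataSourceVersion int
}

// Plan is a hypothetical summary of what a sync would do to Solr.
type Plan struct {
	Recreate []string // present in both stores, but a version differs
	Create   []string // present only in MongoDB
	Delete   []string // present only in Solr
}

// planSync compares the two stores by document ID and version fields only.
func planSync(mongo, solr map[string]Versions) Plan {
	var p Plan
	for id, mv := range mongo {
		sv, ok := solr[id]
		switch {
		case !ok:
			p.Create = append(p.Create, id)
		case sv != mv:
			p.Recreate = append(p.Recreate, id)
		}
	}
	for id := range solr {
		if _, ok := mongo[id]; !ok {
			p.Delete = append(p.Delete, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(p.Recreate)
	sort.Strings(p.Create)
	sort.Strings(p.Delete)
	return p
}

func main() {
	mongo := map[string]Versions{"a": {1, 1}, "b": {2, 1}, "c": {1, 1}}
	solr := map[string]Versions{"a": {1, 1}, "b": {1, 1}, "d": {1, 1}}
	fmt.Printf("%+v\n", planSync(mongo, solr))
}
```

Note that a document whose content changed without a version bump falls into none of the three lists, which matches the policy stated above.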
To use it, call mongo-tool as follows:
./mongo-tool sync \
--mongo-address=mongodb://localhost \
--mongo-database=YourDatabase \
--mongo-collection=documents \
--mongo-timeout=SECS \
--solr-address=http://localhost:8983/solr \
--solr-collection=sciveyor
The parameters are simply the various connection options for the two servers.
The mongo-address is a URL, which can specify username, password, and port
(mongodb://user:pass@address:port). The mongo-database parameter should be
familiar from any connection to MongoDB. In almost all Sciveyor cases, the
mongo-collection should be set to documents. The mongo-timeout parameter
controls how long, in seconds, we will wait on MongoDB operations before timing
out. It defaults to 30, but might need to be much higher in some applications.
The two Solr parameters are the URL to the root of the server (which will almost
always end with /solr), and the collection or core name currently in use. (The
final Solr URLs, then, will append the collection to the address.)
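As a small illustration of how the two Solr parameters combine, the following Go sketch appends the collection name to the server root. The collectionURL helper is a hypothetical name, and the exact request paths the tool builds beyond this point are not specified here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// collectionURL appends the collection (or core) name to the Solr root URL,
// tolerating a trailing slash on the address.
func collectionURL(solrAddress, collection string) string {
	return strings.TrimRight(solrAddress, "/") + "/" + url.PathEscape(collection)
}

func main() {
	fmt.Println(collectionURL("http://localhost:8983/solr", "sciveyor"))
}
```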
For debugging purposes, it is occasionally helpful to force a sync -- that is,
to delete and re-create every document in Solr with the corresponding copy from
MongoDB. If this behavior is desired, you can pass --force. We strongly
recommend that you do not use this feature.
import: Import JSON Files to MongoDB
This tool should be used whenever you want to import JSON documents (once again,
in the JSON schema specified by Sciveyor) into the MongoDB server.
To use it, call mongo-tool as follows:
./mongo-tool import \
--batch-size=NUM \
--mongo-address=mongodb://localhost \
--mongo-database=YourDatabase \
--mongo-collection=documents \
--mongo-timeout=SECS \
<files> ...
For information about the MongoDB connection parameters, see the sync command
above.
The files parameter may either refer to specific files or to
glob patterns.
The --batch-size parameter can be set to any number of documents (it defaults
to 100). The optimal size will depend on your connection to your MongoDB server,
your document sizes, and your network configuration, but 100 works for most
purposes.
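For readers unfamiliar with batched inserts, this Go sketch shows the general idea of splitting a list of documents into groups of at most batch-size items. The batches helper is hypothetical and says nothing about how the tool talks to MongoDB; falling back to 100 for non-positive sizes is an assumption that mirrors the documented default.

```go
package main

import "fmt"

// batches splits items into consecutive slices of at most size elements.
// A non-positive size falls back to 100 (assumed, mirroring the default).
func batches[T any](items []T, size int) [][]T {
	if size <= 0 {
		size = 100
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	docs := make([]int, 250) // stand-ins for 250 parsed documents
	for i, b := range batches(docs, 100) {
		fmt.Printf("batch %d: %d documents\n", i, len(b))
	}
}
```

Larger batches mean fewer round trips to the server but bigger individual requests, which is why the best value depends on document size and network conditions.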
Note that no schema validation at all will be done on these documents, though
they will be passed through several kinds of essential transformations (for
example, converting the dates from JSON string format to MongoDB date format).
validate: Validate MongoDB Documents
The tool can be used to check whether or not the contents of a given MongoDB
server conform to the Sciveyor JSON schema. To use it, call mongo-tool as
follows:
./mongo-tool validate [--strict] \
--mongo-address=mongodb://localhost \
--mongo-database=YourDatabase \
--mongo-collection=documents \
--mongo-timeout=SECS
For information about the MongoDB connection parameters, see the sync command
above.
If --strict is set (it defaults to true; you may disable it by passing
--strict=false), then the validation will check not only that the attributes of
each document are valid, but also that no document contains fields which do not
appear in the JSON schema (that is, it will print errors on any "extra"
fields). Passing --strict=false will silently ignore the presence of any extra
fields, only printing errors if there are fields containing invalid data.
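The difference between the two modes can be illustrated with a short Go sketch that looks only at top-level field names. The allowed field names, the extraFields and strictErrors helpers, and the error wording are all invented for the example; the real check is driven by the Sciveyor JSON schema.

```go
package main

import (
	"fmt"
	"sort"
)

// extraFields returns the top-level keys of doc that are not in allowed.
func extraFields(doc map[string]any, allowed map[string]bool) []string {
	var extra []string
	for k := range doc {
		if !allowed[k] {
			extra = append(extra, k)
		}
	}
	sort.Strings(extra)
	return extra
}

// strictErrors reports extra fields only when strict mode is enabled.
func strictErrors(doc map[string]any, allowed map[string]bool, strict bool) []string {
	if !strict {
		return nil
	}
	var errs []string
	for _, k := range extraFields(doc, allowed) {
		errs = append(errs, fmt.Sprintf("unexpected field %q", k))
	}
	return errs
}

func main() {
	allowed := map[string]bool{"id": true, "title": true} // hypothetical schema fields
	doc := map[string]any{"id": "doc1", "title": "A Paper", "notes": "scratch"}
	fmt.Println(strictErrors(doc, allowed, true))
	fmt.Println(strictErrors(doc, allowed, false))
}
```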
validate-files: Validate JSON Files
The tool can be used to check whether or not a collection of JSON files on disk
conforms to the Sciveyor JSON schema. To use it, call mongo-tool as follows:
./mongo-tool validate-files [--strict] [--unique] /path/to/*.json
The files parameter may either refer to specific files or to
glob patterns.
For information about the --strict parameter, see the validate command above.
If --unique is passed, then the validation will parse each file, load its ID
value, and check to see if there are any duplicate ID values among the JSON
files that are passed. This will slow down validation, so it is disabled by
default.
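A uniqueness check of this kind can be sketched in Go as follows. The example works on in-memory file contents so that it is self-contained, and it assumes the ID lives in a top-level field named id; both the field name and the duplicateIDs helper are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// duplicateIDs maps each ID that occurs more than once to the sorted list of
// file names carrying it. The "id" field name is an assumption.
func duplicateIDs(files map[string][]byte) (map[string][]string, error) {
	seen := map[string][]string{}
	for name, raw := range files {
		var doc struct {
			ID string `json:"id"`
		}
		if err := json.Unmarshal(raw, &doc); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		seen[doc.ID] = append(seen[doc.ID], name)
	}
	dups := map[string][]string{}
	for id, names := range seen {
		if len(names) > 1 {
			sort.Strings(names)
			dups[id] = names
		}
	}
	return dups, nil
}

func main() {
	files := map[string][]byte{
		"a.json": []byte(`{"id": "doc1"}`),
		"b.json": []byte(`{"id": "doc2"}`),
		"c.json": []byte(`{"id": "doc1"}`),
	}
	dups, err := duplicateIDs(files)
	if err != nil {
		panic(err)
	}
	fmt.Println(dups)
}
```

Every file has to be parsed and its ID kept in memory for the whole run, which is the extra cost referred to above.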
General Options
--verbose, -v: By default, basic information about the sync will be printed to
the console. To see much more information (including printed dumps of the IDs
present in both the Mongo and Solr databases), pass the --verbose flag.
Glob Patterns
All file parameters can also be passed a glob matching pattern. We use an
extended syntax with support for:
- *: any sequence of non-separator characters
- **: any sequence of characters, including separators (recursive glob)
- ?: any single non-separator character
- [class]: character classes, of the form [abcd] (character list), [a-z]
(character range), or [^a-z] (negated class)
- {alt1,alt2,...}: a finite list of alternatives
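To get a feel for the basic operators, the Go sketch below uses path.Match from the standard library, which handles *, ?, and character classes with the same meaning as above. It does not implement the extended ** and {alt1,alt2,...} forms, which need a dedicated glob library; the matches helper is a hypothetical wrapper for the example.

```go
package main

import (
	"fmt"
	"path"
)

// matches reports whether name matches pattern, treating a malformed
// pattern as a non-match. Only *, ?, and [class] are supported here.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	tests := [][2]string{
		{"*.json", "a.json"},
		{"*.json", "sub/a.json"}, // * does not cross a separator
		{"doc-?.json", "doc-7.json"},
		{"doc-[0-9].json", "doc-x.json"},
		{"[^a-z]*.json", "Doc.json"},
	}
	for _, t := range tests {
		fmt.Printf("%-16s %-12s %v\n", t[0], t[1], matches(t[0], t[1]))
	}
}
```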
Changelog
- v0.6: Fix our entirely broken Mongo date handling, and export in a
different format to allow for storing them in Solr date objects. Fix a small
bug with batched import.
- v0.5: Add a batch-size parameter to import.
- v0.4: Move glob handling into the app, allowing for a --unique test in
validate-files.
- v0.3: Port command-line handling to Kong, and introduce a robust
sub-command interface. Rename from mongo-solr to mongo-tool. Integrate the
functionality of schema-tool into mongo-tool.
- v0.2: Store all the date values in documents as ISODate in MongoDB.
- v0.1: Initial support for only the parameters mentioned in the JSON
document schema.
License
The code here is copyright © 2021 Charles H. Pence, and released under the
GNU GPL v3.