|test| |release| |pypi|

mmmeta
======

``mmmeta`` is a command-line toolkit and python library to incrementally
synchronize file metadata between an *archive* that stores all files
and their metadata, *publishers* that add new files to the archive, and
*consumers* that process these files (or a subset of them).

It’s better explained by a concrete example:

- A **publisher** incrementally scrapes documents and stores them with their metadata in the archive.
- A **consumer** wants to import some files by a given filter criterion and keep track of the ones that are already imported.
- As such file collections grow, we want to transfer as little data as possible between archive, publisher and consumer.

To clarify the terms used in this manual:

- *metadir*: the directory ``_mmmeta`` that is `synced <#synchronization>`__ between archive, publishers and consumers
- `store <#store>`__: a simple implementation of a key-value store for additional information

Archive
-------

The archive can be any file-like (remote) location for the actual files,
their *metadata files* and the *metadir*. **Publishers** need write
access to it, **Consumers** only need read access.

``mmmeta`` usually doesn’t operate on the archive itself (as it is most
likely just a data bucket); instead, maintaining the archive is done by
**publishers**.

Publisher
---------

An application that writes to the archive. This can be, for example, a
scraper that incrementally adds new files.

The usual workflow would look like this:

1. `synchronize <#synchronization>`__ *metadir* from archive
2. update *metadir* (see below)
3. `synchronize <#synchronization>`__ *metadir* back to archive

updating metadir
^^^^^^^^^^^^^^^^

::

    mmmeta generate

This will loop through all json files in the current directory and
create or add csv data in ``./_mmmeta/db/``.

For other path locations, see `initialization <#initialization>`__.
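
To illustrate the input ``mmmeta generate`` expects, here is a minimal
sketch of a publisher writing one json metadata file next to its
document; the key names (``_file_name``, ``content_hash``, ...) simply
mirror the config example below and are assumptions, not a fixed schema:

.. code:: python

    import hashlib
    import json

    # hypothetical metadata record -- ``mmmeta generate`` just needs json
    # files in the directory; these keys mirror the config example below
    content = open("contract-001.pdf", "rb").read()
    record = {
        "_file_name": "contract-001.pdf",
        "content_hash": hashlib.sha1(content).hexdigest(),
        "title": "Contract 001",
        "document_type": "contract",
    }

    with open("contract-001.pdf.json", "w") as fh:
        json.dump(record, fh)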

managing files presence
^^^^^^^^^^^^^^^^^^^^^^^

By default, ``mmmeta generate`` only adds new files based on the
metadata files available (it doesn’t even check the presence of the
actual files). To “clean up” (e.g. delete non-existing files), the cli
interface provides the following options:

::

    --replace       Completely replace the meta database
    --ensure        Ensure metadata files are present, soft-delete non-existing
    --ensure-files  Ensure actual files are present (for local store only),
                    soft-delete non-existing
    --no-meta       Read in actual files instead of json metadata files
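
For example, to soft-delete database entries whose metadata files have
disappeared from the current directory (a sketch combining the command
with the flag above):

::

    mmmeta generate --ensure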

Consumer
--------

An application that processes the files, e.g. imports them into a
database.

``mmmeta`` is used to merge *remote* metadata into the local *state db*
(sqlite) and provides some functionality to query and manage this data
in applications.

The usual workflow would look like this:

1. `synchronize <#synchronization>`__ *metadir* from archive
2. update local *state db* (see below)
3. run application that alters local state (see example below)

update local state db
^^^^^^^^^^^^^^^^^^^^^

via cli:

::

    mmmeta update

or via python:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta()
    m.update()

For other path locations, see `initialization <#initialization>`__.

consumer application
^^^^^^^^^^^^^^^^^^^^

The ``files`` object on a metadir is a wrapper to a `dataset
table <https://dataset.readthedocs.io/en/latest/api.html#table>`__ with
all its functionality, with the addition that it yields
``mmmeta.file.File`` objects that have a bit of extra functionality, like
directly saving and access to “proxy values” (see config below).

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta()

    for file in m.files(document_type="contract", imported=False):
        download_url = file.remote.url  # see config below
        process_download(download_url)
        file["downloaded"] = True
        file.save()

See `config <#remote>`__ on how to generate remote urls or uris.

Initialization
--------------

When ``mmmeta`` is `initialized <#initialization>`__ with a path
argument named ``foo``, the directory ``foo/_mmmeta`` will be the
*metadir*.

The path ``foo`` can be set via env var:

::

    MMMETA=./foo/ mmmeta update

or in scripts:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta("./foo/")

On `publishers <#publishers>`__ there is an additional env var
``MMMETA_FILES_ROOT`` if the location for the actual files is
different.
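
For example, a publisher whose actual files live somewhere other than
the metadir might run (paths are illustrative):

::

    MMMETA=./foo/ MMMETA_FILES_ROOT=/data/files/ mmmeta generate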

Synchronization
---------------

This package is totally agnostic about the remote storage backend (it
could be a local filesystem location or cloud storage) and doesn’t
handle any of the local <-> remote synchronization. Therefore,
synchronizing the *metadir* ``./foo/_mmmeta`` is up to you with the
tool of your choice (e.g. ``rsync`` or ``aws s3 sync``).
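
A minimal sketch, assuming the archive lives in an S3 bucket and the
AWS CLI is available (any transfer tool works; this is not part of
``mmmeta``):

::

    # pull the metadir from the archive
    aws s3 sync s3://my_bucket/foo/_mmmeta ./foo/_mmmeta

    # ... run mmmeta update / generate and your application ...

    # publishers push the metadir back afterwards
    aws s3 sync ./foo/_mmmeta s3://my_bucket/foo/_mmmeta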

Config
------

``mmmeta`` can optionally have a config stored in
``./foo/_mmmeta/config.yml``.

Example (all settings are optional):

.. code:: yaml

    metadata:
      file_name: _file_name  # key in json metadata for file name
      include:  # only include these keys from json metadata in meta db
        - reference
        - modified_at
        - title
        - originators
        - publisher:name  # nested keys are flattened with ":" between them
      unique: content_hash  # unique identifier for files
    remote:  # simple string replacement to generate File.remote.<attr> attributes, like:
      url: https://my_bucket.s3.eu-central-1.amazonaws.com/foo/bar/{_file_name}
      uri: s3://my_bucket/foo/bar/{_file_name}

remote
^^^^^^

The configuration section ``remote`` from above ensures that the file
objects have attributes to access the actual files from the remote:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta()

    for file in m.files:
        print(file.remote.uri)

Store
-----

``mmmeta`` ships with a simple key-value store that can be used by both
the *remote* and *client* to store some additional data. The store lives
in the *metadir* ``./foo/_mmmeta/_store``.

You can store any values in it:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta("./path/to/metadir/")
    m.store["new_files"] = 17

Any machine that `synchronizes <#synchronization>`__ the metadir can
read these values:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta("./path/to/metadir/")
    new_files = m.store["new_files"]  # 17

For storing timestamps, there is a shorthand via the ``touch`` function:

.. code:: python

    m.touch("my_ts_key")

This will save the value of the current ``datetime.now()`` to the key
``my_ts_key``. The values are typed (``int``, ``float`` or
``timestamp``), so you can easily do something like this:

.. code:: python

    from mmmeta import mmmeta

    m = mmmeta("./path/to/metadir/")

    if m.store["remote_last_updated"] > m.store["local_last_updated"]:
        ...  # run scraper

Installation
------------

Requires python3. Use of a virtualenv is recommended. Additional
dependencies will be installed automatically:

::

    pip install mmmeta

After this, you should be able to execute in your terminal:

::

    mmmeta --help

You should also be able to import it in your python scripts:

.. code:: python

    from mmmeta import mmmeta

cli
---

.. code:: bash

    Usage: mmmeta [OPTIONS] COMMAND [ARGS]...

    Options:
      --metadir TEXT     Base path for reading meta info and storing state
                         [default: <current/working/dir>]
      --files-root TEXT  Base path for actual files to generate metadir from
                         [default: <current/working/dir>]
      --help             Show this message and exit.

    Commands:
      generate
      inspect
      update
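
The global flags allow running the commands against a location other
than the current working directory, for example (paths are
illustrative):

::

    mmmeta --metadir ./foo/ --files-root ./data/files/ generate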

development
-----------

Install testing requirements:

::

    make install

Test:

::

    make test

.. |test| image:: https://github.com/simonwoerpel/mmmeta/actions/workflows/test.yml/badge.svg
   :target: https://github.com/simonwoerpel/mmmeta/actions/workflows/test.yml
.. |release| image:: https://github.com/simonwoerpel/mmmeta/actions/workflows/release.yml/badge.svg
   :target: https://github.com/simonwoerpel/mmmeta/actions/workflows/release.yml
.. |pypi| image:: https://github.com/simonwoerpel/mmmeta/actions/workflows/publish.yml/badge.svg
   :target: https://github.com/simonwoerpel/mmmeta/actions/workflows/publish.yml