aswan
collect and organize data into a T1 data depot
named after the Aswan Dam
Collect and compress data from the internet for later parsing
- quick, parallel, customizable to collect
- compressed to store
- quick to sync with a remote store
- sync to continue collecting
- sync to parse
- immutable collection
To Set Up a Remote
Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS according to the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
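A quick way to confirm the three variables are present before syncing is sketched below. The variable names come from the text above; the helper `missing_remote_vars` is hypothetical and not part of aswan or zimmauth, and it only checks that the values are set, not that they are valid.

```python
import os

# Variable names are taken from this README; everything else is illustrative.
REMOTE_ENV_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(env=None):
    """Return the names of remote-related variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REMOTE_ENV_VARS if not env.get(name)]
```

Calling `missing_remote_vars()` with no argument inspects the current process environment; an empty list means all three variables are set.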
Concepts
- objects
  - saved by collection events
- events
  - collection
  - registration (v2: registration for parsing)
  - (v2) parsing
- runs
  - manual run vs automated run
    - makes manual adding of urls easy but revertible
  - has unique id
  - generates events
  - linked to a specific version of the code
    - ideally commit hash + pip freeze
- statuses
  - determined by base status + runs integrated
  - contains
    - what urls need to be collected
    - (v2) what collected objects need to be parsed
  - sqlite file, constantly trimmed
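To make the relationship between runs, events and statuses concrete, here is a minimal in-memory sketch. It is not aswan's implementation: the classes `Event` and `Run`, the function `pending_urls`, and the choice to derive the run id from a SHA-256 hash are all assumptions made for illustration. It shows a status being computed from a base status plus the runs integrated into it.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str  # "registration" or "collection" (v1 event types)
    url: str


@dataclass(frozen=True)
class Run:
    commit_hash: str
    pip_freeze: tuple
    events: tuple = ()
    manual: bool = False

    @property
    def run_id(self):
        # Assumption: the unique id is a hash of the run's context and events.
        payload = json.dumps(
            [self.commit_hash, list(self.pip_freeze), self.manual,
             [[e.kind, e.url] for e in self.events]]
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:16]


def pending_urls(base_pending, integrated_runs):
    """Replay the events of integrated runs on top of a base status."""
    pending = set(base_pending)
    for run in integrated_runs:
        for event in run.events:
            if event.kind == "registration":
                pending.add(event.url)
            elif event.kind == "collection":
                pending.discard(event.url)
    return pending
```

Because a status is only ever derived from its base and a list of runs, leaving a manual run out of that list is enough to revert the urls it added.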
Structure
- objects
- runs
  - run-hash
    - context.yaml
      - commit-hash, pip-freeze, ...
    - events.zip
- statuses
  - status-hash
    - context.yaml
      - parent-status, integrated
    - db.sqlite.zip
- current-run
  - context.yaml
  - events
    - these to be compressed into ../runs
  - status.sqlite
- there is a 'TEST' status
  - whatever is based on it cannot be integrated
  - a test run can be made on it...
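The layout above can be sketched with the standard library. The functions `init_depot` and `archive_current_events` are hypothetical names, and the sketch assumes events are stored as plain files directly under current-run/events; it only illustrates how those files could be compressed into runs/run-hash/events.zip.

```python
import zipfile
from pathlib import Path


def init_depot(root):
    """Create the top-level directories of the depot under root."""
    root = Path(root)
    for name in ("objects", "runs", "statuses", "current-run/events"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def archive_current_events(root, run_hash):
    """Zip the files in current-run/events into runs/<run_hash>/events.zip."""
    root = Path(root)
    events_dir = root / "current-run" / "events"
    run_dir = root / "runs" / run_hash
    run_dir.mkdir(parents=True, exist_ok=True)
    archive = run_dir / "events.zip"
    files = sorted(p for p in events_dir.iterdir() if p.is_file())
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, arcname=path.name)
    # Remove the originals only after the archive has been closed.
    for path in files:
        path.unlink()
    return archive
```

Deleting the event files only after the archive is closed keeps current-run intact if compression fails partway through.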
when starting a run:
- check if current-run is empty
- find latest status
- if it has not integrated all past runs, create a new status that has
- start collection (+ registration)
- whether it stops or breaks, all events and objects are saved to disk
- if it stops properly, move and compress the contents of current-run
- based on one that was the starter, and current run id
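The first steps of this procedure can be sketched as follows. This is a guess at the intent, not aswan's code: `Status` and `prepare_status` are hypothetical, the status id is assumed to be a hash of the parent id and the integrated run ids, and the truncated step about creating a new status is read here as "create a status that has integrated the missing runs".

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Status:
    status_id: str
    parent: Optional[str]
    integrated: frozenset  # ids of runs already integrated


def prepare_status(current_run_files, latest, past_run_ids):
    """Return the status a new run should start from."""
    # Step 1: refuse to start if a previous run left files behind.
    if current_run_files:
        raise RuntimeError("current-run is not empty")
    # Step 2: check whether the latest status covers all past runs.
    missing = frozenset(past_run_ids) - latest.integrated
    if not missing:
        return latest
    # Step 3 (assumed reading): derive a new status that integrates them.
    integrated = latest.integrated | missing
    payload = latest.status_id + "|" + ",".join(sorted(integrated))
    new_id = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return Status(status_id=new_id, parent=latest.status_id, integrated=integrated)
```

Collection and registration would then run against the returned status; that part is left out of the sketch.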
Pre v1.0 laundry list