A Python library containing automated data validation tools for the Atmos Data Service
A library containing validation checks to be run on hindcast or measurement data to ensure API compliance and standardization.
Atmos is a project to streamline discoverability and data access for weather data sources through APIs. The first step to building a lean and efficient API is converting data to the standard format.
In order to ingest measurement or hindcast data into the Atmos Data Store, each source file needs to pass validation. Validation ensures that the file is compliant with the data conventions specified in Conventions. The standard file formats for source files are NetCDF files for hindcasts and ASCII or NetCDF for measurements.
Measurements by definition contain data for a single geographical location, whereas hindcasts are larger models containing time series for a multitude of locations in (possibly rotated) grids. A measurement file is therefore expected to contain all of its data in a single file. A hindcast comprises a set of NetCDF files, all with the same coordinates, attributes, and variables, where a single file contains the data for a unique time period. Depending on the size of the files, the time separation should be either monthly or yearly. A rule of thumb is that if a file grows beyond 4 GB, it should be split into smaller files. See the docs or example for how to name hindcast files.
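As a rough illustration of the 4 GB rule of thumb, a small script can flag hindcast files that should be split. This is a hypothetical helper, not part of the library:

```python
from pathlib import Path
from typing import List

MAX_FILE_SIZE_GB = 4  # rule-of-thumb threshold from the conventions above


def files_exceeding_limit(directory: str) -> List[str]:
    """Return names of NetCDF files larger than the 4 GB rule of thumb."""
    too_large = []
    for path in sorted(Path(directory).glob("*.nc")):
        size_gb = path.stat().st_size / 1024**3
        if size_gb > MAX_FILE_SIZE_GB:
            too_large.append(path.name)
    return too_large
```

If any names are returned, consider switching the time separation from yearly to monthly files.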
To run validation on NetCDF and ASCII source files, we have built the atmos_validation CLI/library. The documentation below describes these checks, the standard format, and how to run validation using the CLI tool.
We welcome different types of contributions, including code, bug reports, issues, feature requests, and documentation. The preferred method of submitting a contribution is either to make an issue on GitHub or to fork the project on GitHub and make a pull request. In the case of bug reports, please provide a detailed explanation describing how to reproduce before submitting.
General instruction: when information is not available, "NA" shall be used in its place.
The required common attributes can be seen below:
```python
# ../atmos_validation/schemas/metadata.py#L20-L30
class CommonMetadata(BaseModel, use_enum_values=True):
    """Common required attributes for all data types"""

    comments: Union[List[str], str]
    contractor: str
    classification_level: ClassificationLevel = Field(default="Internal")
    data_type: DataType
    data_history: str
    final_reports: List[str]
    project_name: str
    qc_provider: str
```
- comments: Any relevant comments on how the data has been treated shall be provided, e.g. basic preprocessing steps.
- contractor: Name of the data provider.
- classification_level: Signifies data access according to classification level.
- data_type: Whether the source data is a hindcast, single point hindcast, or measurement.
- data_history: Any information about the origin of the data (if not measured directly by the contractor) or changes made to the data (if there have been previous versions of the same data) shall be stated here. If the data has been measured/created directly by the contractor and this is the first version delivered, "Original data" shall be stated.
- final_reports: A comma-separated list of report file names shall be provided. All stated report files shall follow the data.
- project_name: Name of the project requesting the data.
- qc_provider: Company responsible for the QC. It can differ from the contractor.
where data_type should take one of the values from the enum:
```python
# ../atmos_validation/schemas/metadata.py#L9-L12
class DataType(str, Enum):
    HINDCAST = "Hindcast"
    MEASUREMENT = "Measurement"
    SP_HINDCAST = "SinglePointHindcast"
```
The data_type value defines secondary requirements on the global attributes on the data file.
The classification_level should take one of the values from the enum:
```python
# ../atmos_validation/schemas/classification_level.py#L36-L39
class ClassificationLevel(OrderedEnum):
    OPEN = "Open"
    INTERNAL = "Internal"
    RESTRICTED = "Restricted"
```
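OrderedEnum is a helper defined in the library; conceptually it makes the levels comparable, so that access rules such as Open < Internal < Restricted can be enforced. A minimal stdlib-only sketch of the idea (not the library's actual implementation):

```python
from enum import Enum
from functools import total_ordering


# Illustrative sketch: members compare by their definition order,
# so Open < Internal < Restricted.
@total_ordering
class ClassificationLevel(Enum):
    OPEN = "Open"
    INTERNAL = "Internal"
    RESTRICTED = "Restricted"

    def __lt__(self, other):
        members = list(type(self))
        return members.index(self) < members.index(other)
```

With this in place, a user's clearance level can be compared directly against a file's classification_level.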
Single point hindcast and hindcast both use the hindcast metadata schema.
```python
# ../atmos_validation/schemas/metadata.py#L33-L49
class HindcastMetadata(CommonMetadata, UnprotectedNamespaceModel):
    """Extra global attributes required if data_type == "Hindcast" or data_type == "SinglePointHindcast"."""

    calibration: str
    delivery_date: str
    forcing_data: str
    memos: Union[str, List[str]]
    modelling_software: str
    model_name: str
    nests: Union[str, List[str]]
    setup: str
    spatial_resolution: Union[str, List[str]]
    sst_source: str
    task_manager_external: Union[str, List[str]]
    task_manager_internal: Union[str, List[str]]
    time_resolution: str
    topography_source: str
```
- calibration: Indicates whether calibration is applied to the data: 'yes'/'no'.
- delivery_date: Date of the hindcast delivery.
- forcing_data: Data used as the boundary conditions.
- memos: Filenames of memos shall be specified.
- model_name: Name of the model. It shall be unique within the project.
- modelling_software: Software and version used in the hindcast computation.
- nests: Nests used to create the given data.
- setup: Setup storage place in the cold storage. Valid for internal hindcasts only; for external hindcasts 'NA' shall be specified.
- spatial_resolution: Spatial resolution in km.
- sst_source: Source of SST data.
- task_manager_external: Hindcast provider task manager.
- task_manager_internal: Equinor task manager handling the project.
- time_resolution: Temporal resolution.
- topography_source: Source of topography data.
```python
# ../atmos_validation/schemas/metadata.py#L52-L65
class MeasurementMetadata(CommonMetadata):
    """Extra global attributes if data_type == "Measurement"."""

    asset: Optional[str] = Field(default=None)
    averaging_period: str
    country: str = Field(default="NA")
    data_usability: str
    instrument_types: str
    instrument_specifications: str
    installation_type: str
    location: Union[str, List[str]]
    mooring_name: str
    source_file: str
    total_water_depth: Union[str, float]
```
- asset: Name of the asset that paid for the data. When sharing data with a third party, permission from the asset is required.
- averaging_period: Averaging period of the measurements in minutes.
- country: Country in whose territory the data were acquired. When sharing data with a third party, the country's regulations on data sharing must be obeyed.
- data_usability: Level of data readiness.
- instrument_types: Types of instruments used at the measurement location.
- instrument_specifications: Instrument specifications for the given measurement location. They shall be listed in the same order as the corresponding instrument types.
- installation_type: Measurement installation type.
- location: Latitude and longitude of the measurements in degrees (at least three decimals are required after '.') and the corresponding reference datum. Format of the location: lat lon, reference.
- mooring_name: The mooring name shall be unique for each delivered measurement file (across projects, instruments, data deliveries, etc.) and shall be constructed as follows: project_name + mooring_name + instrument + phase, e.g. FOXTROT_MOOR1_GPS_Ph1. In cases where there are multiple instruments, put only a single instrument in the name.
- source_file: A reference to the original data file the NetCDF file was generated from. "NA" can be used if not applicable.
- total_water_depth: Total water depth in meters. For wind data, total water depth is NA.
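The mooring_name construction rule above can be sketched as a small helper. This is hypothetical illustration code, not part of the library:

```python
def build_mooring_name(project_name: str, mooring: str, instrument: str, phase: str) -> str:
    """Join the parts with underscores, per the convention:
    project_name + mooring_name + instrument + phase.
    """
    return "_".join([project_name, mooring, instrument, phase])


# Example from the convention above:
# build_mooring_name("FOXTROT", "MOOR1", "GPS", "Ph1") == "FOXTROT_MOOR1_GPS_Ph1"
```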
To avoid ambiguous terminology, the values of data_usability and installation_type are validated against documents stored in the database.
Extra attributes relevant for the data source can be added using snake_case.
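As an illustrative, stdlib-only sketch of the presence side of these checks (the library itself validates types and enum values with the pydantic models above), a check that a file's global-attribute dict carries the required common attributes might look like:

```python
from typing import Dict, List

# Field names taken from the CommonMetadata schema above
REQUIRED_COMMON_ATTRS = [
    "comments", "contractor", "classification_level", "data_type",
    "data_history", "final_reports", "project_name", "qc_provider",
]


def missing_common_attrs(global_attrs: Dict[str, object]) -> List[str]:
    """Return the required common attributes absent from a file's global attributes."""
    return [name for name in REQUIRED_COMMON_ATTRS if name not in global_attrs]
```

An empty return value means every required common attribute is present; the data_type value then determines which of the extra hindcast or measurement attributes must also exist.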
Prerequisites: Python >=3.8.
Run in your preferred environment:

```shell
pip install atmos_validation
```

Or, if using poetry as the package manager, replace pip install with poetry add.

If using conda, run the following before the pip install command above:

```shell
conda install git pip
```

After installing, run python -m atmos_validation to see the docstring for available commands and options.
Example usage with the example datasets (you need to clone/download the repository and run from its root for this to work):

```shell
python -m atmos_validation validate-netcdf examples/hindcast_example
python -m atmos_validation validate-netcdf examples/example_netcdf_measurement.nc
python -m atmos_validation validate-ascii examples/example_ascii_measurement.dat
python -m atmos_validation convert-ascii examples/example_ascii_measurement.dat
```
All commands can be run without arguments to print their docstrings, which document the available arguments and options.