# Dapla Toolbelt Pseudo

Pseudonymize, repseudonymize and depseudonymize data on Dapla.

## Features

Other examples can also be viewed in the notebook files for pseudo and depseudo.

### Pseudonymize
```python
from dapla_pseudo import Pseudonymize
import polars as pl

file_path = "data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes)

# Example: pseudonymize a single field
result_df = (
    Pseudonymize.from_polars(df)   # Specify the dataframe to use
    .on_fields("fornavn")          # Select the field to pseudonymize
    .with_default_encryption()     # Select the pseudonymization algorithm
    .run()                         # Apply the pseudonymization
    .to_polars()                   # Get the result as a Polars DataFrame
)

# Example: pseudonymize multiple fields
result_df = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn", "etternavn")
    .with_default_encryption()
    .run()
    .to_polars()
)

# Example: pseudonymize a national identity number with a stable ID
result_df = (
    Pseudonymize.from_polars(df)
    .on_fields("fnr")
    .with_stable_id()              # Map to a stable ID (SID) before pseudonymizing
    .run()
    .to_polars()
)
```
The default encryption algorithm is DAEAD (Deterministic Authenticated Encryption with Associated Data). However, if the
field is a valid Norwegian personal identification number (fnr, dnr), the recommended approach is to use the
`with_stable_id()` function to convert the identification number to a stable ID (SID) prior to pseudonymization.
In that case, the pseudonymization algorithm is FPE (Format-Preserving Encryption).

> [!IMPORTANT]
> FPE requires a minimum of two bytes/characters to perform encryption, and a minimum of four bytes in the case of Unicode.

If a field cannot be converted using `with_stable_id()`, the default behaviour is to use the original value as input to
the FPE encryption function. However, this behaviour can be changed by supplying an `on_map_failure` argument, like this:
```python
from dapla_pseudo import Pseudonymize

result_df = (
    Pseudonymize.from_polars(df)
    .on_fields("fnr")
    .with_stable_id(on_map_failure="RETURN_NULL")  # Return null instead of the original value when SID mapping fails
    .run()
    .to_polars()
)
```
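As a side note to the length requirement above, a plain-Python pre-check can flag values that are too short for FPE before pseudonymizing. The helper below is a sketch for illustration only; it is not part of `dapla_pseudo`, and the actual enforcement is done by the pseudonymization service:

```python
def long_enough_for_fpe(value: str) -> bool:
    """Illustrative check of the FPE length rule described above:
    at least two characters, or at least four bytes for Unicode input."""
    encoded = value.encode("utf-8")
    if len(encoded) > len(value):  # Contains non-ASCII (multi-byte) characters
        return len(encoded) >= 4
    return len(value) >= 2


print(long_enough_for_fpe("12345678901"))  # True: an ordinary 11-digit fnr
print(long_enough_for_fpe("x"))            # False: a single character
print(long_enough_for_fpe("æø"))           # True: two characters, four bytes
```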
### Reading dataframes

Note that you may also use a Pandas DataFrame as input or output by exchanging `from_polars` with `from_pandas` and
`to_polars` with `to_pandas`. However, Pandas is much less performant, so take special care, especially if your
dataset is large.

Example:
```python
df_pandas = (
    Pseudonymize.from_pandas(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
    .to_pandas()
)
```
### Validate SID mapping

```python
from dapla_pseudo import Validator
import polars as pl

file_path = "data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes)

result = (
    Validator.from_polars(df)
    .on_field("fnr")              # Select the field to validate
    .validate_map_to_stable_id()  # Check that the field values can be mapped to a SID
)
result.to_polars()
```
A `sid_snapshot_date` can also be specified to validate that the field values can be mapped to a SID at a specific date:
```python
from dapla_pseudo import Validator
import polars as pl

file_path = "data/personer.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes)

result = (
    Validator.from_polars(df)
    .on_field("fnr")
    .validate_map_to_stable_id(
        sid_snapshot_date="2023-08-29"
    )
)
result.metadata     # Metadata about the validation
result.to_polars()  # The validation result as a Polars DataFrame
```
## Advanced usage

### Pseudonymize

#### Read from file systems

```python
from dapla_pseudo import Pseudonymize
from dapla import AuthClient
import polars as pl

file_path = "data/personer.csv"

options = {
    "dtypes": {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
}

# Read from a local file
result_df = (
    Pseudonymize.from_file(file_path)
    .on_fields("fornavn", "etternavn")
    .with_default_encryption()
    .run()
    .to_polars(**options)  # Reader options are forwarded to Polars
)

# A GCS bucket path can also be read from directly
options = {
    "dtypes": {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
}

gcs_file_path = "gs://ssb-staging-dapla-felles-data-delt/felles/pseudo-examples/andeby_personer.csv"

result_df = (
    Pseudonymize.from_file(gcs_file_path)
    .on_fields("fornavn", "etternavn")
    .with_default_encryption()
    .run()
    .to_polars(**options)
)
```
#### Pseudonymize using custom keys/keysets

```python
from dapla_pseudo import Pseudonymize, PseudoKeyset

# Pseudonymize with the default key
df = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
    .to_polars()
)

# Pseudonymize with a predefined key
df = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption(custom_key="ssb-common-key-2")
    .run()
    .to_polars()
)

# Pseudonymize with a custom keyset
custom_keyset = PseudoKeyset(
    encrypted_keyset="CiQAp91NBhLdknX3j9jF6vwhdyURaqcT9/M/iczV7fLn...8XYFKwxiwMtCzDT6QGzCCCM=",
    keyset_info={
        "primaryKeyId": 1234567890,
        "keyInfo": [
            {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesSivKey",
                "status": "ENABLED",
                "keyId": 1234567890,
                "outputPrefixType": "TINK",
            }
        ],
    },
    kek_uri="gcp-kms://projects/some-project-id/locations/europe-north1/keyRings/some-keyring/cryptoKeys/some-kek-1",
)

df = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption(custom_key="1234567890")  # Reference the keyset by its primary key ID
    .run(custom_keyset=custom_keyset)
    .to_polars()
)
```
#### Pseudonymize using custom rules

Instead of declaring the pseudonymization rules via the `Pseudonymize` functions, the rules can be defined manually via
the `PseudoRule` class, like this:

```python
from dapla_pseudo import Pseudonymize, PseudoRule

rule_json = {
    "name": "my-rule",
    "pattern": "**/identifiers/*",
    "func": "redact(placeholder=#)",
}
rule = PseudoRule.from_json(rule_json)

df = (
    Pseudonymize.from_polars(df)
    .add_rules(rule)
    .run()
    .to_polars()
)
```
Pseudonymization rules can also be read from a file. This is especially handy when there are several rules, or if you
prefer to store and maintain pseudonymization rules externally. For example:

```python
from dapla_pseudo import Pseudonymize, PseudoRule
import json

with open("pseudo-rules.json", "r") as rules_file:
    rules_json = json.load(rules_file)
pseudo_rules = [PseudoRule.from_json(rule) for rule in rules_json]

df = (
    Pseudonymize.from_polars(df)
    .add_rules(pseudo_rules)
    .run()
    .to_polars()
)
```
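For reference, such a rules file is just a JSON array of rule objects with the same keys as `rule_json` above. The file name and the second rule below are made-up examples; a minimal sketch of writing and reading one:

```python
import json

# Hypothetical contents of pseudo-rules.json: a JSON array of rule objects,
# each with the same keys as the inline rule_json example above.
rules = [
    {"name": "my-rule", "pattern": "**/identifiers/*", "func": "redact(placeholder=#)"},
    {"name": "redact-names", "pattern": "**/fornavn", "func": "redact(placeholder=#)"},
]

with open("pseudo-rules.json", "w") as rules_file:
    json.dump(rules, rules_file, indent=2)

# Reading it back gives the list that PseudoRule.from_json is applied to, element by element
with open("pseudo-rules.json") as rules_file:
    loaded = json.load(rules_file)

print(len(loaded))  # 2
```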
### Depseudonymize

The `Depseudonymize` functions are almost exactly the same as their `Pseudonymize` counterparts. In addition, stable IDs
can be mapped back to fnr.

```python
from dapla_pseudo import Depseudonymize
import polars as pl

file_path = "data/personer_pseudonymized.csv"
dtypes = {"fnr": pl.Utf8, "fornavn": pl.Utf8, "etternavn": pl.Utf8, "kjonn": pl.Categorical, "fodselsdato": pl.Utf8}
df = pl.read_csv(file_path, dtypes=dtypes)

# Example: depseudonymize a single field
result_df = (
    Depseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
    .to_polars()
)

# Example: depseudonymize multiple fields
result_df = (
    Depseudonymize.from_polars(df)
    .on_fields("fornavn", "etternavn")
    .with_default_encryption()
    .run()
    .to_polars()
)

# Example: depseudonymize and map stable IDs back to fnr
result_df = (
    Depseudonymize.from_polars(df)
    .on_fields("fnr")
    .with_stable_id()
    .run()
    .to_polars()
)
```
Note that depseudonymization requires elevated access privileges.
### Repseudonymize

Repseudonymization can either 1) change the algorithm used to pseudonymize, and/or 2) change the key used in
pseudonymization while keeping the algorithm.

```python
from dapla_pseudo import Repseudonymize

# Example: change the algorithm from PAPIS-compatible encryption to stable ID
result_df = (
    Repseudonymize.from_polars(df)
    .on_fields("fnr")
    .from_papis_compatible_encryption()  # The algorithm the data is currently pseudonymized with
    .to_stable_id()                      # The algorithm to repseudonymize with
    .run()
    .to_polars()
)

# Example: keep the algorithm, but change the encryption key
result_df = (
    Repseudonymize.from_polars(df)
    .on_fields("fnr")
    .from_papis_compatible_encryption()
    .to_papis_compatible_encryption(key_id="some-key")
    .run()
    .to_polars()
)
```
### Datadoc

Datadoc metadata is gathered while pseudonymizing, and can be inspected like so:
```python
result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)
print(result.datadoc)
```
Datadoc metadata is automatically written to the same folder or bucket as the pseudonymized data when using the
`to_file()` method on the result object. The metadata file has the suffix `__DOC` and is always a `.json` file.
The data and metadata are written to file like so:
```python
result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)
result.to_file("gs://bucket/test.parquet")  # Metadata is written alongside, as "test__DOC.json"
```
Note that if you choose to only use the DataFrame from the result, the metadata will be lost forever!
An example of how this can happen:
```python
import dapla as dp

result = (
    Pseudonymize.from_polars(df)
    .on_fields("fornavn")
    .with_default_encryption()
    .run()
)

# Only the DataFrame is kept; the Datadoc metadata on `result` is discarded
df = result.to_pandas()
dp.write_pandas(df, "gs://bucket/test.parquet", file_format="parquet")
```
## Requirements

- Python >= 3.10
- Dependencies can be found in `pyproject.toml`

## Installation

You can install Dapla Toolbelt Pseudo via pip from PyPI:

```shell
pip install dapla-toolbelt-pseudo
```
## Usage

Please see the Reference Guide for details.

## Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

## License

Distributed under the terms of the MIT license, Dapla Toolbelt Pseudo is free and open source software.

## Issues

If you encounter any problems, please file an issue along with a detailed description.

## Credits

This project was generated from Statistics Norway's SSB PyPI Template.