pyspark-anonymizer

A Python library that makes it possible to dynamically mask/anonymize data in a PySpark environment, using rules defined as a JSON string or a Python dict.

  • Version: 0.5 (PyPI)

Installing

```shell
pip install pyspark-anonymizer
```

Usage

Before Masking

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")
df.limit(5).toPandas()
```
|   | marketplace | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 51163966 | R2RX7KLOQQ5VBG | B00000JBAT | 738692522 | Diamond Rio Digital Player | 3 | 0 | 0 | N | N | Why just 30 minutes? | RIO is really great, but Diamond should increa... | 1999-06-22 | 1999 |
| 1 | US | 30050581 | RPHMRNCGZF2HN | B001BRPLZU | 197287809 | NG 283220 AC Adapter Power Supply for HP Pavil... | 5 | 0 | 0 | N | Y | Five Stars | Great quality for the price!!!! | 2014-11-17 | 2014 |
| 2 | US | 52246039 | R3PD79H9CTER8U | B00000JBAT | 738692522 | Diamond Rio Digital Player | 5 | 1 | 2 | N | N | The digital audio "killer app" | One of several first-generation portable MP3 p... | 1999-06-30 | 1999 |
| 3 | US | 16186332 | R3U6UVNH7HGDMS | B009CY43DK | 856142222 | HDE Mini Portable Capsule Travel Mobile Pocket... | 5 | 0 | 0 | N | Y | Five Stars | I like it, got some for the Grandchilren | 2014-11-17 | 2014 |
| 4 | US | 53068431 | R3SP31LN235GV3 | B00000JBSN | 670078724 | JVC FS-7000 Executive MicroSystem (Discontinue... | 3 | 5 | 5 | N | N | Design flaws ruined the better functions | I returned mine for a couple of reasons: The ... | 1999-07-13 | 1999 |

After Masking

In this example we add the following data anonymizers:

  • drop_column on the "marketplace" column
  • replace all values in the "customer_id" column with "*"
  • replace_with_regex: replace "R\d" ("R" followed by any digit) with "*" in the "review_id" column
  • sha256 on the "product_id" column
  • filter_row with the condition "product_parent != 738692522"
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

dataframe_anonymizers = [
    {
        "method": "drop_column",
        "parameters": {
            "column_name": "marketplace"
        }
    },
    {
        "method": "replace",
        "parameters": {
            "column_name": "customer_id",
            "replace_to": "*"
        }
    },
    {
        "method": "replace_with_regex",
        "parameters": {
            "column_name": "review_id",
            "replace_from_regex": r"R\d",  # raw string avoids an invalid-escape warning
            "replace_to": "*"
        }
    },
    {
        "method": "sha256",
        "parameters": {
            "column_name": "product_id"
        }
    },
    {
        "method": "filter_row",
        "parameters": {
            "where": "product_parent != 738692522"
        }
    }
]

df_parsed = pyspark_anonymizer.Parser(df, dataframe_anonymizers, spark_functions).parse()
df_parsed.limit(5).toPandas()
```
|   | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | * | RPHMRNCGZF2HN | 69031b13080f90ae3bbbb505f5f80716cd11c4eadd8d86... | 197287809 | NG 283220 AC Adapter Power Supply for HP Pavil... | 5 | 0 | 0 | N | Y | Five Stars | Great quality for the price!!!! | 2014-11-17 | 2014 |
| 1 | * | *U6UVNH7HGDMS | c99947c06f65c1398b39d092b50903986854c21fd1aeab... | 856142222 | HDE Mini Portable Capsule Travel Mobile Pocket... | 5 | 0 | 0 | N | Y | Five Stars | I like it, got some for the Grandchilren | 2014-11-17 | 2014 |
| 2 | * | *SP31LN235GV3 | eb6b489524a2fb1d2de5d2e869d600ee2663e952a4b252... | 670078724 | JVC FS-7000 Executive MicroSystem (Discontinue... | 3 | 5 | 5 | N | N | Design flaws ruined the better functions | I returned mine for a couple of reasons: The ... | 1999-07-13 | 1999 |
| 3 | * | *IYAZPPTRJF7E | 2a243d31915e78f260db520d9dcb9b16725191f55c54df... | 503838146 | BlueRigger High Speed HDMI Cable with Ethernet... | 3 | 0 | 0 | N | Y | Never got around to returning the 1 out of 2 ... | Never got around to returning the 1 out of 2 t... | 2014-11-17 | 2014 |
| 4 | * | *RDD9FILG1LSN | c1f5e54677bf48936fb1e9838869630e934d16ac653b15... | 587294791 | Brookstone 2.4GHz Wireless TV Headphones | 5 | 3 | 3 | N | Y | Saved my. marriage, I swear to god. | Saved my.marriage, I swear to god. | 2014-11-17 | 2014 |
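Since the library also accepts rules as a JSON string, the same five rules can be kept as JSON and decoded before being handed to the parser. A minimal sketch (whether `Parser` accepts the raw JSON string directly is not shown here, so we decode it ourselves with the standard library):

```python
import json

# The same rules as a JSON string. Note the escaped backslash: JSON has no raw
# strings, so the regex R\d must be written as "R\\d".
rules_json = r'''
[
  {"method": "drop_column", "parameters": {"column_name": "marketplace"}},
  {"method": "replace", "parameters": {"column_name": "customer_id", "replace_to": "*"}},
  {"method": "replace_with_regex",
   "parameters": {"column_name": "review_id", "replace_from_regex": "R\\d", "replace_to": "*"}},
  {"method": "sha256", "parameters": {"column_name": "product_id"}},
  {"method": "filter_row", "parameters": {"where": "product_parent != 738692522"}}
]
'''
dataframe_anonymizers = json.loads(rules_json)
print(len(dataframe_anonymizers))                                    # 5
print(dataframe_anonymizers[2]["parameters"]["replace_from_regex"])  # R\d
```

This makes it easy to keep the rules outside the application code, e.g. in a config file or, as shown next, in DynamoDB.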

Anonymizers from DynamoDB

You can also store anonymizer rules in DynamoDB.

Creating DynamoDB table

To create the table, follow the steps below.

Using example script

  • Run the examples/create_on_demand_table.py script from the examples directory; the table will be created.

On the AWS console:

  • DynamoDB > Tables > Create table
  • Table name: "pyspark_anonymizer" (or any other of your own)
  • Partition key: "dataframe_name"
  • Customize the settings if you want
  • Create table
Writing Anonymizer on DynamoDB

You can run the example script, then edit your settings from there.
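As a sketch of what gets stored, an item keyed by the table's partition key "dataframe_name" might look like the following. The attribute name holding the rule list ("anonymizers") is an assumption; check the example script for the exact shape:

```python
# Sketch of a DynamoDB item for the "pyspark_anonymizer" table. The partition
# key "dataframe_name" comes from the table definition above; the "anonymizers"
# attribute name is an assumption -- see the example script for the real shape.
def build_anonymizer_item(dataframe_name, anonymizers):
    return {"dataframe_name": dataframe_name, "anonymizers": anonymizers}

item = build_anonymizer_item("table_x", [
    {"method": "drop_column", "parameters": {"column_name": "marketplace"}},
    {"method": "filter_row", "parameters": {"where": "product_parent != 738692522"}},
])
# To write it with boto3:
#   boto3.resource("dynamodb").Table("pyspark_anonymizer").put_item(Item=item)
```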

Parse from DynamoDB

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer
import boto3
from botocore.exceptions import ClientError as client_error

dynamo_table_name = "pyspark_anonymizer"
dataframe_name = "table_x"

dynamo_table = boto3.resource('dynamodb').Table(dynamo_table_name)
spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

df_parsed = pyspark_anonymizer.ParserFromDynamoDB(df, dataframe_name, dynamo_table, spark_functions, client_error).parse()

df_parsed.limit(5).toPandas()
```

The output will be the same as before; the difference is that the anonymization settings are now read from DynamoDB.

Currently supported data masking/anonymization methods

  • drop_column - Drop a column.
  • replace - Replace every value in a column with a fixed string.
  • replace_with_regex - Replace column contents that match a regex.
  • sha256 - Apply the SHA-256 hashing function to a column.
  • filter_row - Apply a filter to the dataframe.
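The per-value behavior of replace_with_regex and sha256 can be sketched with the standard library, using review_id and product_id values from the example output above. This is an illustration only: the library applies these as Spark column expressions, and treating its sha256 as an unsalted hash of the raw value is an assumption.

```python
import re
import hashlib

def replace_with_regex(value, pattern, replacement):
    # Replace every match of the pattern, like the replace_with_regex rule.
    return re.sub(pattern, replacement, value)

def sha256_mask(value):
    # Assumption: an unsalted SHA-256 over the value's UTF-8 bytes.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

print(replace_with_regex("R3U6UVNH7HGDMS", r"R\d", "*"))  # *U6UVNH7HGDMS
print(replace_with_regex("RPHMRNCGZF2HN", r"R\d", "*"))   # unchanged: no "R"+digit
print(len(sha256_mask("B009CY43DK")))                     # 64 hex characters
```

This also explains why "RPHMRNCGZF2HN" survived masking in the example output: "RP" does not match "R" followed by a digit.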
