# -*- coding: utf-8 -*-
		from setuptools import setup

		package_dir = \
		{'': 'src'}

		packages = \
		['drpt']

		package_data = \
		{'': ['*']}

		install_requires = \
		['click>=8.1.3,<9.0.0',
		'dask>=2023.1.0,<2024.0.0',
		'jsonschema>=4.17.3,<5.0.0',
		'pandas>=1.5.2,<2.0.0',
		'pyarrow>=10.0.1,<11.0.0']

		entry_points = \
		{'console_scripts': ['drpt = drpt.__main__:main']}

		setup_kwargs = {
		'name': 'drpt',
		'version': '0.8.2',
		'description': 'Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.',
		'long_description': '# Data Release Preparation Tool\n\n- [Data Release Preparation Tool](#data-release-preparation-tool)\n - [Description](#description)\n - [Installation](#installation)\n - [Usage](#usage)\n - [CLI](#cli)\n - [Recipe Definition](#recipe-definition)\n - [Example](#example)\n - [Thanks](#thanks)\n\n> :warning: This is currently at beta development stage and likely has a lot of bugs. Please use the [issue tracker](https://github.com/ConX/drpt/issues) to report an bugs or feature requests.\n\n## Description\n\nCommand-line tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.\n\nAfter performing the operations defined in the recipe the tool generates the transformed dataset version and a CSV report listing the performed actions.\n\n## Installation\n\nThe tool can be installed using pip:\n\n```shell\npip install drpt\n```\n\n## Usage\n\n### CLI\n\n```txt\nUsage: drpt [OPTIONS] RECIPE_FILE INPUT_FILE\n\nOptions:\n -d, --dry-run Generate only the report without the release dataset\n -v, --verbose Verbose [Not implemented]\n -n, --nrows TEXT Number of rows to read from a CSV file. Doesn\'t work\n with parquet files.\n -l, --limits-file PATH Limits file\n -o, --output-dir PATH Output directory. The default output directory is\n the same as the location of the recipe_file.\n --version Show the version and exit.\n --help Show this message and exit.\n```\n\n### Recipe Definition\n\n#### Overview\nThe recipe is a JSON formatted file that includes what operations should be performed on the dataset. 
For versioning purposes, the recipe also contains a `version` key which is appended in the generated filenames and the report.\n\nDefault recipe:\n```json\n{\n "version": "",\n "actions": {\n "drop": [],\n "drop-constant-columns": false,\n "obfuscate": [],\n "disable-scaling": false,\n "skip-scaling": [],\n "sort-by": [],\n "rename": []\n }\n}\n```\n\nThe currently supported actions, performed in this order, are as follows:\n - `drop`: Column deletion\n - `drop-constant-columns`: Drops all columns that containt only one unique value\n - `obfuscate`: Column obfuscation, where the listed columns are treated as categorical variables and then integer coded.\n - Scaling: By default all columns are Min/Max scaled\n - `disable-scaling`: Can be used to disable scaling for all columns\n - `skip-scaling`: By default all columns are Min/Max scaled, except those excluded (`skip-scaling`)\n - `sort-by`: Sort rows by the listed columns\n - `rename`: Column renaming\n\nAll column definitions above support [regular expressions](https://docs.python.org/3/library/re.html#regular-expression-syntax).\n\n#### Actions\n\n##### _drop_\nThe `drop` action is defined as a list of column names to be dropped.\n\n##### _drop-constant-columns_\nThis is a boolean action, which when set to `true` will drop all the columns that have only a single unique value.\n\n##### _obfuscate_\nThe `obfuscate` action is defined as a list of column names to be obfuscated.\n\n##### _disable-scaling_, _skip-scaling_\nBy default, the tool Min/Max scales all numerical columns. This behavior can be disabled for all columns by setting the `disable-scaling` action to `true`. If scaling must be disabled for only a set of columns these columns can be defined using the `skip-scaling` action, as a list of column names.\n\n##### _sort-by_\nThis is a list of column names by which to sort the rows. 
The order in the list denotes the sorting priority.\n\n##### _rename_\nThe `rename` action is defined as a list of objects whose key is the original name (or regular expression), and their value is the target name. When the target uses matched groups from the regular expression those can be provided with their group number prepended with an escaped backslash (`\\\\1`) [see [example](#example) below].\n\n```json\n{\n //...\n "rename": [{"original_name": "target_name"}]\n //...\n}\n```\n## Example\n\nInput CSV file:\n```csv\ntest1,test2,test3,test4,test5,test6,test7,test8,test9,foo.bar.test,foo.bar.test2,const\n1.1,1,one,2,0.234,0.3,-1,a,e,1,1,1\n2.2,2,two,2,0.555,0.4,0,b,f,2,2,1\n3.3,3,three,4,0.1,5,1,c,g,3,3,1\n2.22,2,two,4,1,0,2.5,d,h,4,4,1\n```\n\nRecipe:\n```json\n{\n "version": "1.0",\n "actions": {\n "drop": ["test2", "test[8-9]"],\n "drop-constant-columns": true,\n "obfuscate": ["test3"],\n "skip-scaling": ["test4"],\n "sort-by": ["test4", "test3"],\n "rename": [\n { "test1": "test1_renamed" },\n { "test([3-4])": "test\\\\1_regex_renamed" },\n { "foo[.]bar[.].*": "foo" }\n ]\n }\n}\n```\n\nGenerated CSV file:\n```csv\ntest3_regex_renamed,test4_regex_renamed,test1_renamed,test5,test6,test7,foo_1,foo_2\n0,2,0.0,0.1488888888888889,0.06,0.0,0.0,0.0\n2,2,0.5000000000000001,0.5055555555555556,0.08,0.2857142857142857,0.3333333333333333,0.3333333333333333\n1,4,1.0,0.0,1.0,0.5714285714285714,0.6666666666666666,0.6666666666666666\n2,4,0.5090909090909091,1.0,0.0,1.0,1.0,1.0\n```\n\nReport:\n```csv\n,action,column,details\n0,recipe_version,,1.0\n1,drpt_version,,0.6.3\n2,DROP,test2,\n3,DROP,test8,\n4,DROP,test9,\n5,DROP_CONSTANT,const,\n6,OBFUSCATE,test3,"{""one"": 0, ""three"": 1, ""two"": 2}"\n7,SCALE_DEFAULT,test1,"[1.1,3.3]"\n8,SCALE_DEFAULT,test5,"[0.1,1.0]"\n9,SCALE_DEFAULT,test6,"[0.0,5.0]"\n10,SCALE_DEFAULT,test7,"[-1.0,2.5]"\n11,SCALE_DEFAULT,foo.bar.test,"[1,4]"\n12,SCALE_DEFAULT,foo.bar.test2,"[1,4]"\n13,SORT,"[\'test4\', 
\'test3\']",\n14,RENAME,test1,test1_renamed\n15,RENAME,test3,test3_regex_renamed\n16,RENAME,test4,test4_regex_renamed\n17,RENAME,foo.bar.test,foo_1\n18,RENAME,foo.bar.test2,foo_2\n```\n\n## Thanks\n\nThis tool was made possible with [Pandas](https://pandas.pydata.org/), [PyArrow](https://arrow.apache.org/docs/python/index.html), [jsonschema](https://pypi.org/project/jsonschema/), and of course [Python](https://www.python.org/).\n\n\n ',
		'author': 'Constantinos Xanthopoulos',
		'author_email': 'conx@xanthopoulos.info',
		'maintainer': 'None',
		'maintainer_email': 'None',
		'url': 'https://github.com/ConX/drpt',
		'package_dir': package_dir,
		'packages': packages,
		'package_data': package_data,
		'install_requires': install_requires,
		'entry_points': entry_points,
		'python_requires': '>=3.9,<4.0',
		}


		setup(**setup_kwargs)

+19

-37

PKG-INFO

		Metadata-Version: 2.1
		Name: drpt
		Version: 0.8.0
		Version: 0.8.2
		Summary: Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.
		Author-email: Constantinos Xanthopoulos <conx@xanthopoulos.info>
		License: BSD 3-Clause License

		Copyright (c) 2022, Constantinos Xanthopoulos
		All rights reserved.

		Redistribution and use in source and binary forms, with or without
		modification, are permitted provided that the following conditions are met:

		1. Redistributions of source code must retain the above copyright notice, this
		list of conditions and the following disclaimer.

		2. Redistributions in binary form must reproduce the above copyright notice,
		this list of conditions and the following disclaimer in the documentation
		and/or other materials provided with the distribution.

		3. Neither the name of the copyright holder nor the names of its
		contributors may be used to endorse or promote products derived from
		this software without specific prior written permission.

		THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
		AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
		IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
		DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
		FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
		DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
		SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
		CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
		OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
		OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

		Project-URL: Homepage, https://github.com/ConX/drpt
		Home-page: https://github.com/ConX/drpt
		License: BSD-3-Clause
		Keywords: data,data science,preprocessing,scaling,obfuscation,data release,data publishing
		Author: Constantinos Xanthopoulos
		Author-email: conx@xanthopoulos.info
		Requires-Python: >=3.9,<4.0
		Classifier: Development Status :: 4 - Beta
		Classifier: Intended Audience :: Science/Research
		Classifier: Topic :: Scientific/Engineering
		Classifier: License :: OSI Approved :: BSD License
		Classifier: Operating System :: OS Independent
		Classifier: Programming Language :: Python
		Classifier: Programming Language :: Python :: 3
		Classifier: Operating System :: OS Independent
		Requires-Python: >=3.9
		Classifier: Programming Language :: Python :: 3.9
		Classifier: Programming Language :: Python :: 3.10
		Classifier: Programming Language :: Python :: 3.11
		Classifier: Programming Language :: Python :: 3
		Classifier: Topic :: Scientific/Engineering
		Requires-Dist: click (>=8.1.3,<9.0.0)
		Requires-Dist: dask (>=2023.1.0,<2024.0.0)
		Requires-Dist: jsonschema (>=4.17.3,<5.0.0)
		Requires-Dist: pandas (>=1.5.2,<2.0.0)
		Requires-Dist: pyarrow (>=10.0.1,<11.0.0)
		Project-URL: Repository, https://github.com/ConX/drpt
		Description-Content-Type: text/markdown
		License-File: LICENSE

		@@ -48,0 +30,0 @@ # Data Release Preparation Tool

+26

-24

pyproject.toml

		@@ -1,12 +0,9 @@
		[build-system]
		requires = ["setuptools>=61.0.0", "wheel"]
		build-backend = "setuptools.build_meta"

		[project]
		[tool.poetry]
		name = "drpt"
		version = "0.8.0"
		version = "0.8.2"
		description = "Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe."
		authors = ["Constantinos Xanthopoulos <conx@xanthopoulos.info>"]
		license = "BSD-3-Clause"
		readme = "README.md"
		authors = [{ name = "Constantinos Xanthopoulos", email = "conx@xanthopoulos.info" }]
		license = { file = "LICENSE" }
		repository = "https://github.com/ConX/drpt"
		classifiers = [
		@@ -22,20 +19,28 @@ "Development Status :: 4 - Beta",

		requires-python = ">=3.9"
		[tool.poetry.dependencies]
		python = "^3.9"
		click = "^8.1.3"
		jsonschema = "^4.17.3"
		pandas = "^1.5.2"
		pyarrow = "^10.0.1"
		dask = "^2023.1.0"

		dependencies = [
		"click >= 8.1.3",
		"jsonschema >=4.16.0",
		"pandas >=1.5.0",
		"pyarrow >=9.0.0",
		"dask >= 2022.9.2"
		]
		[tool.poetry.group.dev.dependencies]
		black = "^22.12.0"
		isort = "^5.11.4"
		ipykernel = "^6.20.1"
		flake8 = "^6.0.0"
		mypy = "^0.991"
		bumpver = "^2022.1120"

		[project.urls]
		Homepage = "https://github.com/ConX/drpt"

		[project.scripts]
		[tool.poetry.scripts]
		drpt = "drpt.__main__:main"

		[build-system]
		requires = ["poetry-core"]
		build-backend = "poetry.core.masonry.api"

		[tool.bumpver]
		current_version = "0.8.0"
		current_version = "0.8.2"
		version_pattern = "MAJOR.MINOR.PATCH[PYTAGNUM]"
		@@ -50,5 +55,2 @@ commit_message = "Bump version {old_version} -> {new_version}"
		'version = "{version}"',
		]
		"src/drpt/__init__.py" = [
		'__version__ = "{version}"',
		]
		]

+3

-1

src/drpt/__init__.py

		@@ -1,1 +0,3 @@
		__version__ = "0.8.0"
		import importlib.metadata

		__version__ = importlib.metadata.version(__name__)
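Note on the change above: `importlib.metadata.version()` reads the version from the installed distribution's metadata, so it raises `PackageNotFoundError` when the package is imported from a source checkout that has not been installed. A minimal sketch of a guarded lookup follows; the helper name `get_version` and the fallback string are hypothetical and not part of drpt.

```python
import importlib.metadata


def get_version(dist_name, default="0.0.0+unknown"):
    """Return the installed version of `dist_name`, or `default` if not installed.

    Hypothetical helper for illustration; drpt itself calls
    importlib.metadata.version(__name__) directly.
    """
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        # Raised when no installed distribution with this name is found,
        # e.g. when running from an uninstalled source tree.
        return default


if __name__ == "__main__":
    print(get_version("drpt"))
```

Passing `__name__` works in `src/drpt/__init__.py` only because the import package and the distribution share the name `drpt`.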

-4

setup.cfg

		[egg_info]
		tag_build =
		tag_date = 0

-1

src/drpt.egg-info/dependency_links.txt

-2

src/drpt.egg-info/entry_points.txt

		[console_scripts]
		drpt = drpt.__main__:main

-222

src/drpt.egg-info/PKG-INFO

		Metadata-Version: 2.1
		Name: drpt
		Version: 0.8.0
		Summary: Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.
		Author-email: Constantinos Xanthopoulos <conx@xanthopoulos.info>
		License: BSD 3-Clause License

		Copyright (c) 2022, Constantinos Xanthopoulos
		All rights reserved.

		Redistribution and use in source and binary forms, with or without
		modification, are permitted provided that the following conditions are met:

		1. Redistributions of source code must retain the above copyright notice, this
		list of conditions and the following disclaimer.

		2. Redistributions in binary form must reproduce the above copyright notice,
		this list of conditions and the following disclaimer in the documentation
		and/or other materials provided with the distribution.

		3. Neither the name of the copyright holder nor the names of its
		contributors may be used to endorse or promote products derived from
		this software without specific prior written permission.

		THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
		AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
		IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
		DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
		FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
		DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
		SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
		CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
		OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
		OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

		Project-URL: Homepage, https://github.com/ConX/drpt
		Keywords: data,data science,preprocessing,scaling,obfuscation,data release,data publishing
		Classifier: Development Status :: 4 - Beta
		Classifier: Intended Audience :: Science/Research
		Classifier: Topic :: Scientific/Engineering
		Classifier: Programming Language :: Python
		Classifier: Programming Language :: Python :: 3
		Classifier: Operating System :: OS Independent
		Requires-Python: >=3.9
		Description-Content-Type: text/markdown
		License-File: LICENSE

		# Data Release Preparation Tool

		- [Data Release Preparation Tool](#data-release-preparation-tool)
		- [Description](#description)
		- [Installation](#installation)
		- [Usage](#usage)
		- [CLI](#cli)
		- [Recipe Definition](#recipe-definition)
		- [Example](#example)
		- [Thanks](#thanks)

		> :warning: This is currently at beta development stage and likely has a lot of bugs. Please use the [issue tracker](https://github.com/ConX/drpt/issues) to report any bugs or feature requests.

		## Description

		Command-line tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.

		After performing the operations defined in the recipe the tool generates the transformed dataset version and a CSV report listing the performed actions.

		## Installation

		The tool can be installed using pip:

		```shell
		pip install drpt
		```

		## Usage

		### CLI

		```txt
		Usage: drpt [OPTIONS] RECIPE_FILE INPUT_FILE

		Options:
		-d, --dry-run Generate only the report without the release dataset
		-v, --verbose Verbose [Not implemented]
		-n, --nrows TEXT Number of rows to read from a CSV file. Doesn't work
		with parquet files.
		-l, --limits-file PATH Limits file
		-o, --output-dir PATH Output directory. The default output directory is
		the same as the location of the recipe_file.
		--version Show the version and exit.
		--help Show this message and exit.
		```

		### Recipe Definition

		#### Overview
		The recipe is a JSON formatted file that includes what operations should be performed on the dataset. For versioning purposes, the recipe also contains a `version` key which is appended in the generated filenames and the report.

		Default recipe:
		```json
		{
		"version": "",
		"actions": {
		"drop": [],
		"drop-constant-columns": false,
		"obfuscate": [],
		"disable-scaling": false,
		"skip-scaling": [],
		"sort-by": [],
		"rename": []
		}
		}
		```

		The currently supported actions, performed in this order, are as follows:
		- `drop`: Column deletion
		- `drop-constant-columns`: Drops all columns that contain only one unique value
		- `obfuscate`: Column obfuscation, where the listed columns are treated as categorical variables and then integer coded.
		- Scaling: By default all columns are Min/Max scaled
		- `disable-scaling`: Can be used to disable scaling for all columns
		- `skip-scaling`: By default all columns are Min/Max scaled, except those excluded (`skip-scaling`)
		- `sort-by`: Sort rows by the listed columns
		- `rename`: Column renaming

		All column definitions above support [regular expressions](https://docs.python.org/3/library/re.html#regular-expression-syntax).

		#### Actions

		##### _drop_
		The `drop` action is defined as a list of column names to be dropped.

		##### _drop-constant-columns_
		This is a boolean action, which when set to `true` will drop all the columns that have only a single unique value.

		##### _obfuscate_
		The `obfuscate` action is defined as a list of column names to be obfuscated.
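As a rough illustration of integer coding, the sketch below maps each distinct value of a column to an integer. Assigning codes in sorted order is an assumption inferred from the example report (`{"one": 0, "three": 1, "two": 2}`), not a confirmed detail of drpt's implementation, and the function name `integer_code` is hypothetical.

```python
def integer_code(values):
    """Treat `values` as categorical and replace each value with an integer code.

    Assumption: codes follow the sorted order of the distinct values.
    Returns the coded column and the value-to-code mapping.
    """
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


if __name__ == "__main__":
    codes, mapping = integer_code(["one", "two", "three", "two"])
    print(codes)
    print(mapping)
```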

		##### _disable-scaling_, _skip-scaling_
		By default, the tool Min/Max scales all numerical columns. This behavior can be disabled for all columns by setting the `disable-scaling` action to `true`. If scaling must be disabled for only a set of columns these columns can be defined using the `skip-scaling` action, as a list of column names.
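For reference, Min/Max scaling maps each value `v` of a column to `(v - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below is plain Python for illustration only; the function name `min_max_scale` and the handling of a constant column are assumptions, not drpt's actual code.

```python
def min_max_scale(values):
    """Scale numeric `values` linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Column `test1` from the example input in this README.
    print(min_max_scale([1.1, 2.2, 3.3, 2.22]))
```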

		##### _sort-by_
		This is a list of column names by which to sort the rows. The order in the list denotes the sorting priority.

		##### _rename_
		The `rename` action is defined as a list of objects whose key is the original name (or regular expression), and their value is the target name. When the target uses matched groups from the regular expression those can be provided with their group number prepended with an escaped backslash (`\\1`) [see [example](#example) below].

		```json
		{
		//...
		"rename": [{"original_name": "target_name"}]
		//...
		}
		```
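The sketch below shows one way regex-based renaming with group references could work, using Python's `re` module. Several details are assumptions inferred from the example in this README rather than from drpt's source: that a pattern must match the whole column name, that the first matching rule wins, and that columns ending up with the same target name receive `_1`, `_2`, ... suffixes. The function name `rename_columns` is hypothetical.

```python
import re
from collections import Counter


def rename_columns(columns, rules):
    """Rename `columns` according to `rules`, a list of {pattern: target} dicts."""
    renamed = []
    for col in columns:
        new_name = col
        for rule in rules:
            for pattern, target in rule.items():
                match = re.fullmatch(pattern, col)  # assumption: whole-name match
                if match:
                    # expand() substitutes group references such as \1.
                    new_name = match.expand(target)
                    break
            if new_name != col:
                break  # assumption: first matching rule wins
        renamed.append(new_name)

    # Assumption: colliding names are numbered in column order.
    totals = Counter(renamed)
    seen = Counter()
    result = []
    for name in renamed:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


if __name__ == "__main__":
    rules = [
        {"test1": "test1_renamed"},
        {"test([3-4])": r"test\1_regex_renamed"},
        {"foo[.]bar[.].*": "foo"},
    ]
    print(rename_columns(["test1", "test3", "test4", "foo.bar.test", "foo.bar.test2"], rules))
```

In a JSON recipe the backslash itself must be escaped, which is why the group reference is written as `\\1` there, while the Python raw string above needs only `\1`.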
		## Example

		Input CSV file:
		```csv
		test1,test2,test3,test4,test5,test6,test7,test8,test9,foo.bar.test,foo.bar.test2,const
		1.1,1,one,2,0.234,0.3,-1,a,e,1,1,1
		2.2,2,two,2,0.555,0.4,0,b,f,2,2,1
		3.3,3,three,4,0.1,5,1,c,g,3,3,1
		2.22,2,two,4,1,0,2.5,d,h,4,4,1
		```

		Recipe:
		```json
		{
		"version": "1.0",
		"actions": {
		"drop": ["test2", "test[8-9]"],
		"drop-constant-columns": true,
		"obfuscate": ["test3"],
		"skip-scaling": ["test4"],
		"sort-by": ["test4", "test3"],
		"rename": [
		{ "test1": "test1_renamed" },
		{ "test([3-4])": "test\\1_regex_renamed" },
		{ "foo[.]bar[.].*": "foo" }
		]
		}
		}
		```

		Generated CSV file:
		```csv
		test3_regex_renamed,test4_regex_renamed,test1_renamed,test5,test6,test7,foo_1,foo_2
		0,2,0.0,0.1488888888888889,0.06,0.0,0.0,0.0
		2,2,0.5000000000000001,0.5055555555555556,0.08,0.2857142857142857,0.3333333333333333,0.3333333333333333
		1,4,1.0,0.0,1.0,0.5714285714285714,0.6666666666666666,0.6666666666666666
		2,4,0.5090909090909091,1.0,0.0,1.0,1.0,1.0
		```

		Report:
		```csv
		,action,column,details
		0,recipe_version,,1.0
		1,drpt_version,,0.6.3
		2,DROP,test2,
		3,DROP,test8,
		4,DROP,test9,
		5,DROP_CONSTANT,const,
		6,OBFUSCATE,test3,"{""one"": 0, ""three"": 1, ""two"": 2}"
		7,SCALE_DEFAULT,test1,"[1.1,3.3]"
		8,SCALE_DEFAULT,test5,"[0.1,1.0]"
		9,SCALE_DEFAULT,test6,"[0.0,5.0]"
		10,SCALE_DEFAULT,test7,"[-1.0,2.5]"
		11,SCALE_DEFAULT,foo.bar.test,"[1,4]"
		12,SCALE_DEFAULT,foo.bar.test2,"[1,4]"
		13,SORT,"['test4', 'test3']",
		14,RENAME,test1,test1_renamed
		15,RENAME,test3,test3_regex_renamed
		16,RENAME,test4,test4_regex_renamed
		17,RENAME,foo.bar.test,foo_1
		18,RENAME,foo.bar.test2,foo_2
		```

		## Thanks

		This tool was made possible with [Pandas](https://pandas.pydata.org/), [PyArrow](https://arrow.apache.org/docs/python/index.html), [jsonschema](https://pypi.org/project/jsonschema/), and of course [Python](https://www.python.org/).

-5

src/drpt.egg-info/requires.txt

		click>=8.1.3
		jsonschema>=4.16.0
		pandas>=1.5.0
		pyarrow>=9.0.0
		dask>=2022.9.2

-12

src/drpt.egg-info/SOURCES.txt

		LICENSE
		README.md
		pyproject.toml
		src/drpt/__init__.py
		src/drpt/__main__.py
		src/drpt/drpt.py
		src/drpt.egg-info/PKG-INFO
		src/drpt.egg-info/SOURCES.txt
		src/drpt.egg-info/dependency_links.txt
		src/drpt.egg-info/entry_points.txt
		src/drpt.egg-info/requires.txt
		src/drpt.egg-info/top_level.txt

-1

src/drpt.egg-info/top_level.txt

drpt
