DataLoader
Requirements
- Python: ~=3.10
- pytest: ~=7.4.3
- setuptools: ~=68.2.2
- pandas: ~=2.2.0
This project requires Python 3.7 or later. Compatibility issues have been identified with the use of dataclasses in Python 3.6 and earlier versions.
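A script that depends on this project could guard against an unsupported interpreter before importing anything else. The following is only a minimal sketch; the helper name `supports_dataclasses` is hypothetical and not part of this project.

```python
import sys

# Minimum version stated in the requirements note above.
MIN_VERSION = (3, 7)


def supports_dataclasses(version_info=sys.version_info):
    """Return True when the given interpreter version meets the 3.7 minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not supports_dataclasses():
    raise RuntimeError("Python 3.7 or later is required")
```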
Getting Started
Installation Methods
Cloning the Repository
- Clone the repository.
- Install the required dependencies:
pip install -r requirements.txt
Install via pip
pip install dynamic-loader
pip install -r requirements.txt
Overview
The DataLoader project is a utility for loading and processing data from specified directories. It is designed to be user-friendly and easy to integrate into your projects.
The DataMetrics class processes data paths and gathers statistics related to the file system and the specified paths. It can also export all statistics to a JSON file.
The Extensions class is a utility that provides a set of default file extensions for the DataLoader class. It is the backbone for mapping each file extension to its respective loading method.
Features
DataLoader
The DataLoader class is specifically designed for loading and processing data from directories. It provides the following key features:
Key Features:
- Dynamic Loading: Load files from a single directory or merge files from multiple directories.
- Flexible Configuration: Set various parameters, such as default file extensions, full POSIX paths, method loader execution, and more.
- Parallel Execution: Leverage parallel execution with the total_workers parameter to enhance performance.
- Verbose Output: Display verbose output to track the loading process.
  - If enabled, the verbose parameter will display the loading process for each file.
  - If disabled, the verbose parameter will write the loading process for each file to a log file.
- Custom Loaders: Implement custom loaders for specific file extensions.
  - Please note that, at the moment, the loading method's kwargs are applied uniformly to all files with the specified extension.
  - Additionally, the first parameter of the loader method is passed automatically and should be omitted. If it is passed, the loader will fail and return the contents of the file as a TextIOWrapper.
Future updates will include the ability to specify which loader method to use for specific files.
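The custom loader behavior described above can be illustrated with a small standalone sketch. It does not use the DataLoader class itself: `read_json` and `load_with_ext_loaders` are hypothetical helpers, and the dispatch logic is an assumption about how an extension-to-loader mapping of the form `{extension: {loader: kwargs}}` might be applied.

```python
import json
import tempfile
from pathlib import Path


def read_json(path, **kwargs):
    """Hypothetical loader: receives the file path first, then the shared kwargs."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle, **kwargs)


def load_with_ext_loaders(path, ext_loaders):
    """Dispatch on the extension; fall back to plain text when none is mapped."""
    path = Path(path)
    entry = ext_loaders.get(path.suffix.lstrip("."))
    if not entry:
        return path.read_text(encoding="utf-8")
    # Each entry has the shape {loader: kwargs}.
    loader, kwargs = next(iter(entry.items()))
    # The path is supplied here, so callers list only the remaining kwargs.
    return loader(path, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text('{"price": 1.5}', encoding="utf-8")
    # The same kwargs apply to every file with the "json" extension.
    loaders = {"json": {read_json: {"parse_float": str}}}
    result = load_with_ext_loaders(sample, loaders)
    print(result)
```

Because the path is injected by the dispatcher, the kwargs dictionary only ever carries the loader's remaining options.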
Parameters:
- path (str or Path): The path of the directory from which to load files.
- directories (Iterable): An iterable of directories from which to load all files.
- default_extensions (Iterable): Default file extensions to be processed.
- full_posix (bool): Indicates whether to display full POSIX paths.
- no_method (bool): Indicates whether to skip loading method matching execution.
- verbose (bool): Indicates whether to display verbose output.
- generator (bool): Indicates whether to return the loaded files as a generator; otherwise, they are returned as a dictionary.
- total_workers (int): Number of workers for parallel execution.
- log (Logger): A configured logger instance for logging messages. (Refer to the GetLogger class for more information on how to create a logger instance.)
- ext_loaders (dict[str, dict[Callable, dict]]): Dictionary mapping each extension to a dictionary of the form {loader: kwargs}. (Refer to the Extensions class for more information.)
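To make the roles of total_workers and generator more concrete, here is a minimal standalone sketch built on the standard library's `ThreadPoolExecutor`. The function names `read_text` and `load_paths` are hypothetical, and the use of a thread pool is an assumption; the project's actual implementation may differ.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_text(path):
    """Read one file as UTF-8 text."""
    return Path(path).read_text(encoding="utf-8")


def load_paths(paths, total_workers=2, generator=True):
    """Load files in parallel; return a generator or a dict of name -> content."""
    paths = list(paths)

    def _pairs():
        with ThreadPoolExecutor(max_workers=total_workers) as pool:
            # Executor.map yields results in the same order as the inputs.
            for path, content in zip(paths, pool.map(read_text, paths)):
                yield Path(path).name, content

    return _pairs() if generator else dict(_pairs())


with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("a.txt", "alpha"), ("b.txt", "beta")):
        (Path(tmp) / name).write_text(text, encoding="utf-8")
    files = sorted(Path(tmp).iterdir())
    as_dict = load_paths(files, total_workers=2, generator=False)
    print(as_dict)
```

With generator=True nothing is read until the result is iterated, which keeps memory use low for large directories; the dictionary form trades that for random access by file name.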
Class & Property Methods:
- load_file (class method): Load a specific file.
- get_files (class method): Retrieve files from a directory based on default extensions and filter out unwanted files.
- dir_files (property): Loaded files from the specified directories.
- files (property): Loaded files from a single directory.
- all_exts (property): Retrieve all supported file extensions with their respective loader methods being used.
- EXTENSIONS (Extensions class instance): Retrieve all default supported file extensions with their respective loader methods.
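The filtering that get_files is described as performing can be sketched with `pathlib`. This is an illustration only: `iter_matching_files` is a hypothetical helper, and treating hidden files and subdirectories as "unwanted" is an assumption rather than documented behavior.

```python
import tempfile
from pathlib import Path


def iter_matching_files(directory, default_extensions=("csv", "json")):
    """Yield files in a directory whose extension is in default_extensions."""
    # Extensions are compared without the leading period, case-insensitively.
    wanted = {ext.lower().lstrip(".") for ext in default_extensions}
    for entry in sorted(Path(directory).iterdir()):
        # Assumption: subdirectories and hidden files count as unwanted.
        if not entry.is_file() or entry.name.startswith("."):
            continue
        if entry.suffix.lower().lstrip(".") in wanted:
            yield entry


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.txt", ".hidden.csv"):
        (Path(tmp) / name).write_text("x", encoding="utf-8")
    (Path(tmp) / "nested").mkdir()
    matched = [p.name for p in iter_matching_files(tmp, ["csv"])]
    print(matched)
```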
DataMetrics
The DataMetrics class focuses on processing data paths and gathering statistics related to the file system. Key features include:
Key Features:
- OS Statistics: Retrieve detailed statistics for each path, including symbolic link status, calculated size, and size in bytes.
- Export to JSON: Export all statistics to a JSON file for further analysis and visualization.
Parameters:
- paths (Iterable): Paths for which to gather statistics.
- file_name (str): The file name to be used when exporting the metadata stats for all files.
- full_posix (bool): Indicates whether to display full POSIX paths.
Property Methods:
- all_stats: Retrieve statistics for all paths.
- total_size: Calculate the total size of all paths.
- total_files: Calculate the total number of files in all paths.
- export_stats(): Export all statistics to a JSON file.
OS Stats Results:
- os_stats_results: OS statistics results for each path.
- Custom Stats:
  - st_fsize: Full file size statistics.
  - st_vsize: Full volume size statistics.
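The kind of per-path statistics described above (symbolic link status, size in bytes, a calculated human-readable size) and the JSON export can be approximated with the standard library. The sketch below is not the DataMetrics implementation; `human_size`, `gather_stats`, and `export_stats_to` are hypothetical names, and the output keys are chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count using 1024-based units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1024


def gather_stats(paths):
    """Collect basic OS statistics for each path."""
    stats = {}
    for path in map(Path, paths):
        info = os.lstat(path)  # lstat reports on a symlink itself, not its target
        stats[path.as_posix()] = {
            "is_symlink": path.is_symlink(),
            "size_bytes": info.st_size,
            "size": human_size(info.st_size),
        }
    return stats


def export_stats_to(stats, file_path):
    """Write the collected statistics to a JSON file."""
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(stats, handle, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    collected = gather_stats([sample])
    sample_entry = collected[sample.as_posix()]
    export_stats_to(collected, Path(tmp) / "stats.json")
    print(sample_entry)
```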
Extensions
The Extensions class is a utility that provides a set of default file extensions for the DataLoader class. It is the backbone for mapping each file extension to its respective loading method. All extensions are stored in a dictionary (no period included), and the Extensions class provides the following key features:
Key Features:
- File Extension Mapping: Retrieve all supported file extensions with their respective loader methods.
- Loader Method Retrieval: Retrieve the loader method for a specific file extension.
- Loader Method Check: Check whether a specific file extension has a loader method implemented other than open.
- Supported Extension Check: Check if a specific file extension is supported.
- Customization: Customize the Extensions class with new file extensions and their respective loader methods.
Parameters:
- No parameters are required for the Extensions class.
- Extensions(): Initializes the Extensions class with all implemented file extensions and their respective loader methods.
- Acts as a dictionary for accessing supported file extensions and their loader methods via Extensions().ALL_EXTS.
Class Methods:
- ALL_EXTS: Retrieve all supported file extensions with their respective loader methods.
- get_loader: Retrieve the loader method for a specific file extension.
- has_loader: Checks whether a specific file extension has a loader method implemented other than open.
- is_supported: Checks if a specific file extension is supported.
- customize: Customize the Extensions class with new file extensions and their respective loader methods.
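The methods listed above can be pictured as thin wrappers around a dictionary keyed by extension. The class below, `MiniExtensions`, is a hypothetical stand-in written for illustration; its default mapping and its simplified `customize` signature (extension to loader, without kwargs) are assumptions and do not reflect the real Extensions class.

```python
import json


class MiniExtensions:
    """Illustrative registry mapping extensions (no period) to loader callables."""

    def __init__(self):
        # Assumed defaults for this sketch only.
        self._loaders = {"txt": open, "json": json.load}

    @staticmethod
    def _normalize(ext):
        """Lowercase the extension and drop any leading period."""
        return ext.lower().lstrip(".")

    def __contains__(self, ext):
        return self.is_supported(ext)

    def is_supported(self, ext):
        return self._normalize(ext) in self._loaders

    def get_loader(self, ext):
        return self._loaders.get(self._normalize(ext))

    def has_loader(self, ext):
        """True only when a loader other than the built-in open is registered."""
        loader = self.get_loader(ext)
        return loader is not None and loader is not open

    def customize(self, new_loaders):
        """Add or override loaders from a {extension: loader} mapping."""
        for ext, loader in new_loaders.items():
            self._loaders[self._normalize(ext)] = loader


exts = MiniExtensions()
print("json" in exts, exts.has_loader("txt"))
```

Separating is_supported from has_loader lets a caller distinguish "this extension is known but is read with plain open" from "this extension has a dedicated parser".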
GetLogger
Overview
The GetLogger class is a utility that provides a method to get a configured logger instance for logging messages. It is designed to be user-friendly and easy to integrate into your projects.
Parameters
- name (str, optional): The name of the logger. Defaults to the name of the calling module.
- level (int, optional): The logging level. Defaults to logging.DEBUG.
- formatter_kwgs (dict, optional): Additional keyword arguments for the log formatter.
- handler_kwgs (dict, optional): Additional keyword arguments for the log handler.
- mode (str, optional): The file mode for opening the log file. Defaults to "a" (append).
Attributes
- refresher (callable): A method to refresh the log file.
- set_verbose (callable): A method to set the verbosity of the logger.
Returns
- Logger: A configured logger instance.
Notes
- This function sets up a logger with a file handler and an optional stream (console) handler for verbose logging.
- If verbose is True, log messages will be printed to the console instead of being written to a file.
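The file-versus-console behavior in these notes can be reproduced with the standard `logging` module. The sketch below is not the GetLogger implementation; `build_logger`, its `log_file` parameter, and the format string are hypothetical choices made for illustration.

```python
import logging
import os
import tempfile


def build_logger(name, log_file, level=logging.DEBUG, verbose=False, mode="a"):
    """Return a logger that writes to the console if verbose, else to log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Drop handlers from earlier calls so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    if verbose:
        handler = logging.StreamHandler()
    else:
        handler = logging.FileHandler(log_file, mode=mode, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "sketch.log")
file_logger = build_logger("logger_sketch", log_path)
file_logger.info("written to the log file")
```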
Usage:
DataLoader
Usage Examples
Load Files from a Single Directory as a Generator
from data_loader import DataLoader
dl_gen = DataLoader(path="path/to/directory")
dl_files_gen = dl_gen.files
print(dl_files_gen)
Load Files from a Single Directory as a Dictionary (Custom-Repr)
from data_loader import DataLoader
dl_dict = DataLoader(path="path/to/directory", generator=False, full_posix=False)
dl_files_dict = dl_dict.files
print(dl_files_dict)
Load Files from Multiple Directories
from data_loader import DataLoader
dl = DataLoader(directories=["path/to/dir1", "path/to/dir2"], generator=False, full_posix=False)
dl_dir_files = dl.dir_files
print(dl_dir_files)
Load Files with Default Extensions
from data_loader import DataLoader
dl_default = DataLoader(path="path/to/directory", default_extensions=["csv"], generator=False, full_posix=False)
dl_default_files = dl_default.files
print(dl_default_files)
Retrieve Data for a Specific File
from data_loader import DataLoader
dl_files = DataLoader(path="path/to/directory", generator=False, full_posix=False).files
dl_specific_file_data = dl_files["file1.csv"]
Load Files with Custom Loader Methods
from data_loader import DataLoader
import pandas as pd
dl_custom = DataLoader(path="path/to/directory", ext_loaders={"csv": {pd.read_csv: {"nrows": 10}}}, generator=False, full_posix=False)
dl_custom_files = dl_custom.files
print(dl_custom_files)
Specify a Custom Logger
from data_loader import DataLoader
import logging
custom_logger = logging.getLogger("DataLoader")
dl_with_logger = DataLoader(path="path/to/directory", log=custom_logger)
dl_logger_files = dl_with_logger.files
print(dl_logger_files)
DataMetrics
Usage
from data_loader import DataMetrics
dm = DataMetrics(files=["path/to/directory1", "path/to/directory2"])
print(dm.all_stats)
print(dm.total_size)
print(dm.total_files)
dm.export_stats()
Extensions
Usage
from data_loader import Extensions
ALL_EXTS = Extensions()
print("csv" in ALL_EXTS)
print(ALL_EXTS.get_loader("csv"))
print(ALL_EXTS.get_loader(".pickle"))
print(ALL_EXTS.has_loader("docx"))
print(ALL_EXTS.is_supported("docx"))
import PIL.Image  # provided by the Pillow package
# Each extension maps to {loader: kwargs}
ALL_EXTS.customize({"docx": {open: {"mode": "rb"}},
"png": {PIL.Image.open: {}}})
print(ALL_EXTS.get_loader("docx"))
GetLogger
Usage:
import logging
from data_loader import GetLogger
logger = GetLogger().logger
logger.info("This is an info message")
logger = GetLogger(name='custom_logger', level=logging.INFO, verbose=True).logger
logger.info("This is an info message")
logger = GetLogger().logger
logger.set_verbose(True)
CustomException("Error Message")
logger.set_verbose(False).logger
CustomException("Error Message")
DataMetrics
Usage Examples
from data_metrics import DataMetrics
dm = DataMetrics(("path/to/directory1", <Dict>),
("path/to/directory2", <Dict>))
metadata_directory1 = dm["path/to/directory1"]
print(metadata_directory1)
dm.export_stats(file_path="all_metadata_stats.json")
total_size = dm.total_size
print(total_size)
total_files = dm.total_files
print(total_files)
Future Updates
Feedback
Feedback is crucial for the improvement of the DataLoader project. If you encounter any issues, have suggestions, or want to share your experience, please consider the following channels:
- GitHub Issues: Open an issue on the GitHub repository to report bugs or suggest enhancements.
- Contact: Reach out to the project maintainer via the following:
Contact Information
Your feedback and contributions play a significant role in making the DataLoader project more robust and valuable for the community. Thank you for being part of this endeavor!