
A Python WebHDFS/HTTPFS based tool for inter/intra-cluster data copying. It is well suited to copying many small or mid-size files across clusters. The normal distcp adds considerable overhead by submitting a map-reduce job and then waiting for YARN to schedule it, whereas pydistcp uses WebHDFS to stream the data from the source cluster datanodes directly to the destination cluster datanodes using multiple parallel threads.
When transferring a few huge files, the normal distcp may be faster, but when transferring many small, mid-size, or relatively big files, pydistcp performs very well.
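The parallel streaming idea described above can be illustrated with a minimal Python sketch. This is not pydistcp's actual implementation: `copy_stream` and `copy_all` are hypothetical helpers, and in-memory `io.BytesIO` buffers stand in for files on the source and destination clusters. It only shows the general pattern of copying several files at once, each in fixed-size parts.

```python
import io
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 65536  # bytes per read, mirroring the tool's default --part-size


def copy_stream(src, dst, part_size=PART_SIZE):
    """Copy src to dst in fixed-size parts; return the number of bytes copied."""
    copied = 0
    while True:
        chunk = src.read(part_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


def copy_all(pairs, threads=10):
    """Copy every (src, dst) pair concurrently and return per-file byte counts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda pair: copy_stream(*pair), pairs))


# Assumption: in-memory buffers replace real WebHDFS read and write streams.
sources = [io.BytesIO(bytes(200_000)) for _ in range(4)]
targets = [io.BytesIO() for _ in sources]
sizes = copy_all(list(zip(sources, targets)))
```

Because each file is handled by its own worker, many small files can be in flight at once, which is the case where the per-job scheduling overhead of a map-reduce copy would otherwise dominate.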
$ pydistcp -f -s staging -d prod /data/outgoing /data/incoming --threads=10 --part-size=131072
27.1% [ pending: 32 | transferring: 6 | complete: 4 ]
Job Status:
{
"Size Failed": 0,
"Size Copied": 257721641,
"Source Path": "/data/t100",
"Size Expected": 257721641,
"Files Expected": 42,
"Files Failed": 0,
"Destination Path": "/data/t200",
"Start Time": "2017-02-22 17:39:29",
"Files Skipped": 0,
"Size Deleted": 0,
"End Time": "2017-02-22 17:39:50",
"Files Copied": 42,
"Files Deleted": 0,
"Duration": 20.756325006484985,
"Outcome": "Successful",
"Size Skipped": 0
}
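As a rough illustration of how this job status could be checked in a script, the sketch below assumes the status block has been captured as JSON text. The `summarize` helper is hypothetical and not part of pydistcp; it only uses fields that appear in the output above.

```python
import json

# Trimmed copy of the job status shown above.
status_text = """
{
  "Size Copied": 257721641,
  "Size Expected": 257721641,
  "Files Expected": 42,
  "Files Copied": 42,
  "Files Failed": 0,
  "Duration": 20.756325006484985,
  "Outcome": "Successful"
}
"""


def summarize(status):
    """Return (everything_copied, throughput in MiB/s) for a job status dict."""
    complete = (
        status["Files Copied"] == status["Files Expected"]
        and status["Size Copied"] == status["Size Expected"]
        and status["Files Failed"] == 0
    )
    mib_per_s = status["Size Copied"] / status["Duration"] / (1024 * 1024)
    return complete, mib_per_s


complete, rate = summarize(json.loads(status_text))
print(f"complete={complete} throughput={rate:.1f} MiB/s")
```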
Pydistcp uses pywhdfs to establish connections with the WebHDFS/HTTPFS source and destination clusters.
$ easy_install pydistcp
Pydistcp shares the same JSON configuration file used by pywhdfs. Please refer to the pywhdfs project readme file for details about the JSON configuration schema.
There are multiple arguments you can use to alter the way the copy works, or to enhance the performance of the job depending on the size of the server you use. Use the help argument to display the full list of supported parameters:
$ pydistcp --help
pydistcp: A python Web HDFS based tool for inter/intra-cluster data copying.

Usage:
  pydistcp [-fp] [--no-checksum] [--silent] (-s CLUSTER -d CLUSTER) [-v...] [--part-size=PART_SIZE] [--threads=THREADS] SRC_PATH DEST_PATH
  pydistcp (--version | -h)

Options:
  --version                     Show version and exit.
  -h --help                     Show help and exit.
  -s CLUSTER --src=CLUSTER      Alias of source namenode to connect to (valid only with dist).
  -d CLUSTER --dest=CLUSTER     Alias of destination namenode to connect to (valid only with dist).
  -v --verbose                  Enable log output. Can be specified multiple times to increase verbosity each time.
  --no-checksum                 Disable checksum check prior to file transfer. This will force overwrite.
  --silent                      Don't display progress status.
  -f --force                    Allow overwriting any existing files.
  -p --preserve                 Preserve file attributes.
  --threads=THREADS             Number of threads to use for parallelization.
                                zero limits the concurrency to the maximum concurrent threads
                                supported by the cluster. [default: 0]
  --part-size=PART_SIZE         Interval in bytes by which the files will be copied
                                needs to be a Powers of 2. [default: 65536]

Examples:
  pydistcp -s prod -d preprod -v /tmp/src /tmp/dest
All cluster connection parameters will be fetched from the json configuration file.
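Since the help text requires `--part-size` to be a power of 2, a wrapper script might validate the value before launching the tool. The sketch below is one way to do that. `is_power_of_two` and `build_command` are hypothetical helpers, not part of pydistcp, and the code only assembles the argument list from the flags documented above without running anything.

```python
import shlex


def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def build_command(src_alias, dest_alias, src_path, dest_path,
                  threads=0, part_size=65536, force=False):
    """Assemble a pydistcp argument list, rejecting an invalid part size."""
    if not is_power_of_two(part_size):
        raise ValueError(f"part size {part_size} is not a power of 2")
    args = ["pydistcp", "-s", src_alias, "-d", dest_alias]
    if force:
        args.append("-f")
    args += [f"--threads={threads}", f"--part-size={part_size}",
             src_path, dest_path]
    return args


cmd = build_command("staging", "prod", "/data/outgoing", "/data/incoming",
                    threads=10, part_size=131072, force=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the cluster aliases exist in the JSON configuration file.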
Below are some benchmarks showing the impact of data size on copy performance using pydistcp:
File Count | Data Size | Time |
---|---|---|
2379 | 11.4 G | 4m39.069s |
242 | 25.9 G | 5m39.348s |
869 | 116.9 G | 25m53.231s |
42 | 545.8 M | 0m19.946s |
1788 | 5.2 G | 2m25.649s |
4428 | 35.7 G | 10m20.129s |
2357 | 5.6 G | 3m2.598s |
180 | 2.3 G | 0m33.133s |
334 | 7.6 G | 1m26.260s |
Note that all test cases were executed with 10 concurrent threads on a machine with 6 cores supporting up to 12 threads, and no files were skipped during the copy. Both the source and destination clusters are secured with Kerberos and use SSL to encrypt transferred data.
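To compare the benchmark rows on a common scale, the figures can be converted to throughput. The short sketch below does this arithmetic on the values from the table. It assumes that "G" and "M" denote binary units (GiB and MiB), which the table does not state, and `to_mib` and `to_seconds` are hypothetical helpers.

```python
import re

# (file count, data size, wall-clock time) copied from the benchmark table.
ROWS = [
    (2379, "11.4 G", "4m39.069s"),
    (242, "25.9 G", "5m39.348s"),
    (869, "116.9 G", "25m53.231s"),
    (42, "545.8 M", "0m19.946s"),
    (1788, "5.2 G", "2m25.649s"),
    (4428, "35.7 G", "10m20.129s"),
    (2357, "5.6 G", "3m2.598s"),
    (180, "2.3 G", "0m33.133s"),
    (334, "7.6 G", "1m26.260s"),
]

UNITS = {"M": 1, "G": 1024}  # assumption: binary units, expressed in MiB


def to_mib(size):
    """Convert a size such as '11.4 G' to MiB."""
    value, unit = size.split()
    return float(value) * UNITS[unit]


def to_seconds(duration):
    """Convert a duration such as '4m39.069s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", duration).groups()
    return int(minutes) * 60 + float(seconds)


for files, size, duration in ROWS:
    secs = to_seconds(duration)
    print(f"{files:>5} files  {size:>8}  "
          f"{to_mib(size) / secs:7.1f} MiB/s  {files / secs:5.1f} files/s")
```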
Pydistcp performance may be affected by many parameters.
Feedback and Pull requests are very welcome!