mlbench-core: Distributed Machine Learning Benchmark Core Library
MLBench is a Benchmarking Framework for Distributed Machine Learning algorithms.
This repository contains the core Python library for MLBench, which is used to share code between benchmark implementations and to communicate with the dashboard.
For more information, refer to the MLBench Core Documentation or the Main Documentation.
Changelog
v3.0.0 (2020-12-07)
Full Changelog
Implemented enhancements:
- Support multiple clusters in CLI #91
- Add notebook/code to visualize results #72
- Support AWS in CLI #33
- Fix rnn language model #303 (ehoelzl)
- Transformer language translation #99 (ehoelzl)
Fixed bugs:
- Training code keeps running for PyTorch after training is done #26
Closed issues:
- Remove loss argument for metric computation #295
- Update PyTorch to 1.7 #286
- Refactor optimizer and choose more appropriate names #284
- fails to create kind cluster #277
- Refactor CLI #253
- Dependabot couldn't authenticate with https://pypi.python.org/simple/ #252
- Unify requirements/setup.py versions #244
- isort failing on all PRs #227
- torch.div is not supported in PyTorch 1.6 #223
- Refactor common functionality for tiller and helm #108
- Add GPU support for AWS in CLI #104
- Change CPU limit to #CPUs - 1 #101
- Add --version flag #97
- Cluster creation/deletion errors with non-default zone #94
- Add command to list runs #86
- RefreshError from gcloud #83
- Run new benchmarks and document costs #82
- Make nvidia k80 default GPU #80
- Fix random seeds #79
- benchmark against torch.nn.parallel.DistributedDataParallel MPSG #75
- upgrade to pytorch 1.5 #74
- Provide comparison to competitors #66
- Add some integration tests #64
- Remove stale branches #62
- Add PowerSGD optimizer #59
- Add RNN Language Model #54
- Use torch.nn.DataParallel for intra-node computation #46
- Add CLI support for DIND #42
- Port over functionality from Language Model benchmark to the core library #34
- make results reproducible from command-line #24
- Contribution and docs section on README.md #17
- test new torch.distributed #15
v2.4.0 (2020-04-20)
Full Changelog
Implemented enhancements:
- Switch to black for code formatting #35
Closed issues:
- Travis tests run only for Python 3.6 #65
- Downloading results fails if --output option is not provided #57
- Remember user input in mlbench run #56
- Aggregate the gradients by model, instead of by layers. #45
- Update docker images to CUDA10, mlbench-core module to newest #43
- Upgrade PyTorch to 1.4 #40
v2.3.2 (2020-04-07)
Full Changelog
Implemented enhancements:
- Add NCCL & GLOO Backend support #49
- Add NCCL & GLOO Backend support #47 (giorgiosav)
Fixed bugs:
- math ValueError with 1-node cluster #38
2.3.1 (2020-03-09)
Full Changelog
Implemented enhancements:
- Customize Communication Scheme For Sparsified/Quantized/Decentralized scenarios #12
v2.3.0 (2019-12-23)
Full Changelog
v2.2.1 (2019-12-16)
Full Changelog
v2.2.0 (2019-11-11)
Full Changelog
Implemented enhancements:
- initialize_backends can now be called as a context manager
- Improved CLI to run multiple runs in parallel
v2.1.1 (2019-11-11)
Full Changelog
Implemented enhancements:
- Added CLI for MLBench runs
v1.4.4 (2019-05-28)
Full Changelog
v1.4.3 (2019-05-23)
Full Changelog
v1.4.2 (2019-05-21)
Full Changelog
v1.4.1 (2019-05-16)
Full Changelog
v1.4.0 (2019-05-02)
Full Changelog
Implemented enhancements:
- Split Train and Validation in Tensorflow #22
v1.3.4 (2019-03-20)
Full Changelog
Implemented enhancements:
- in controlflow, don't mix train and validation #20
Fixed bugs:
- Add metrics logging for Tensorflow #19
v1.3.3 (2019-02-26)
Full Changelog
v1.3.2 (2019-02-13)
Full Changelog
v1.3.1 (2019-02-13)
Full Changelog
v1.3.0 (2019-02-12)
Full Changelog
v1.2.1 (2019-01-31)
Full Changelog
v1.2.0 (2019-01-30)
Full Changelog
v1.1.1 (2019-01-09)
Full Changelog
v1.1.0 (2018-12-06)
Full Changelog
Fixed bugs:
- Bug when saving checkpoints #13
v1.0.0 (2018-11-20)
Full Changelog
Implemented enhancements:
- Add API Client to mlbench-core #6
- Move to google-style docs #4
- Add Imagenet Dataset for pytorch #3
- Move worker code to mlbench-core repo #1
Change Log
1.4.2 (2019-05-21)
Full Changelog
Implemented enhancements:
- Split Train and Validation in Tensorflow #22
- in controlflow, don't mix train and validation #20
Fixed bugs:
- Add metrics logging for Tensorflow #19
- Bug when saving checkpoints #13
Change Log
v1.4.1 (2019-05-16)
Full Changelog
1.4.0 (2019-05-02)
Full Changelog
Implemented enhancements:
- Split Train and Validation in Tensorflow #22
- in controlflow, don't mix train and validation #20
Fixed bugs:
- Add metrics logging for Tensorflow #19
- Bug when saving checkpoints #13
Change Log
v1.1.0 (2018-12-06)
Full Changelog
Fixed bugs:
- Bug when saving checkpoints #13
- Adds Tensorflow Controlflow, Dataset and Model code
- Adds Pytorch linear models
- Adds sparsified and decentralized optimizers
v1.0.0 (2018-11-15)
Implemented enhancements:
- Add API Client to mlbench-core #6
- Move to google-style docs #4
- Add Imagenet Dataset for pytorch #3
- Move worker code to mlbench-core repo #1
0.1.0 (2018-09-14)
Implemented enhancements:
- Add documentation in reference implementation to docs #46
- Replace cAdvisor with Kubernetes stats for Resource usage #38
- Rename folders #31
- Change docker image names #30
- Add continuous output for mpirun #27
- Replace SQLite with Postgres #25
- Fix unittest #23
- Add/Fix CI/Automated build #22
- Cleanup unneeded project files #21
- Remove hardcoded values #20
- Improves Notes.txt #19
- Rename components #15
Fixed bugs:
- 504 Error when downloading metrics for long runs #61
Closed issues:
- small doc improvements for first release #54
- Check mlbench works on Google Cloud #51
- learning rate scheduler #50
- Add Nvidia k8s-device-plugin to charts #48
- Add Weave to Helm Chart #41
- Allow limiting of resources for experiments #39
- Allow downloading of Run measurements #35
- Worker Details page #33
- Run Visualizations #32
- Show experiment history in Dashboard #18
- Show model progress in Dashboard #13
- Report cluster status in Dashboard #12
- Send metrics from SGD example to metrics api #11
- Add metrics endpoint for experiments #10
- Let Coordinator Dashboard start a distributed Experiment #9
- Add mini-batch SGD model experiment #8
- add benchmark code for MPI #7
- add benchmark code for tensorflow #6
- add benchmark code for apache reef #5
- add benchmark code for apache flink #4
- get initial benchmark numbers (spark reference implementation and mllib/ml) #3
- evaluate script (framework-independent) and algorithm output format #2
- bench-spark: remove prepare-data for now, comment on solver prerequisites #1
* This Change Log was automatically generated by github_changelog_generator