.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop Streaming jobs.

* `Stable version (v0.7.4) documentation <http://mrjob.readthedocs.org/en/stable/>`_
* `Development version documentation <http://mrjob.readthedocs.org/en/latest/>`_

.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.

Some important features:

* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing)
* Write multi-step jobs (one map-reduce step feeds into the next)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run ``make`` and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Setup handled by the ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to hadoop job tracker (EMR only)
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``


``pip install mrjob``

As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``

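
If you are unsure whether the optional dependencies are present, a quick check
like the following can help. This is a minimal sketch, not part of mrjob:
``is_installed`` is a hypothetical helper, and it assumes that ``boto3`` is one
of the packages pulled in by the ``aws`` target.

.. code-block:: python

    import importlib.util


    def is_installed(module_name):
        # True if the top-level module can be located without importing it
        return importlib.util.find_spec(module_name) is not None


    if __name__ == '__main__':
        # Assumption: boto3 is among the packages the aws target installs
        if not is_installed('boto3'):
            print('boto3 not found; try: pip install mrjob[aws]')
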

Code for this example and more lives in ``mrjob/examples``.


.. code-block:: python

    """The classic MapReduce job: count the frequency of words."""
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            # emit (word, 1) for every word on the input line
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            # partial sum of the counts seen so far for this word
            yield (word, sum(counts))

        def reducer(self, word, counts):
            # final total for each word
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()

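
To make the data flow of the job above concrete, here is a minimal pure-Python
sketch of the same map, group, and reduce sequence. It does not use mrjob at
all: ``group_by_key`` and ``simulate_word_count`` are hypothetical names, the
grouping step is only a stand-in for Hadoop's shuffle and sort, and the
combiner is left out because it only pre-sums counts before the reducer does
the same thing.

.. code-block:: python

    import re
    from collections import defaultdict

    WORD_RE = re.compile(r"[\w']+")


    def mapper(line):
        # Emit (word, 1) for every word on the line, like the job's mapper
        for word in WORD_RE.findall(line):
            yield word.lower(), 1


    def group_by_key(pairs):
        # Stand-in for the shuffle/sort phase: collect all values per key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups


    def simulate_word_count(lines):
        # Map every line, group by word, then sum each group (the reducer)
        pairs = (pair for line in lines for pair in mapper(line))
        groups = group_by_key(pairs)
        return {word: sum(counts) for word, counts in sorted(groups.items())}


    if __name__ == '__main__':
        for word, count in simulate_word_count(["the cat", "The hat"]).items():
            print(word, count)

A real runner spreads these phases across processes or machines, but the
per-key contract seen by the mapper and reducer is the same.
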

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts


To set up EMR on Amazon:

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Get your access and secret keys (see `your account page <http://aws.amazon.com/account/>`_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY`` accordingly

To set up Dataproc on Google:

* `Create a Google Cloud Platform account <http://cloud.google.com/>`_, see top-right
* `Learn about Google Cloud Platform "projects" <https://cloud.google.com/docs/overview/#projects>`_
* `Select or create a Cloud Platform Console project <https://console.cloud.google.com/project>`_
* `Enable billing for your project <https://console.cloud.google.com/billing>`_
* Go to the `API Manager <https://console.cloud.google.com/apis>`_ and search for / enable the following APIs...
* Under Credentials, **Create Credentials** and select **Service account key**. Then, select **New service account**, enter a Name and select **Key type JSON**.
* Install the `Google Cloud SDK <https://cloud.google.com/sdk/>`_

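
Before launching a job with ``-r emr`` or ``-r dataproc``, it can be handy to
confirm that the environment variables named in this README are set. The
following is a minimal sketch under stated assumptions: ``REQUIRED_ENV`` and
``missing_credentials`` are hypothetical names, not part of mrjob, and the
mapping only reflects the variables mentioned here.

.. code-block:: python

    import os

    # Environment variables named in this README, keyed by the value passed
    # to -r. The mapping itself is an illustrative assumption.
    REQUIRED_ENV = {
        'emr': ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'],
        'dataproc': ['GOOGLE_APPLICATION_CREDENTIALS'],
    }


    def missing_credentials(runner, environ=None):
        # Hypothetical helper: return the variables that are unset or empty
        environ = os.environ if environ is None else environ
        required = REQUIRED_ENV.get(runner, [])
        return [name for name in required if not environ.get(name)]


    if __name__ == '__main__':
        for runner in ('emr', 'dataproc'):
            print(runner, 'missing:', missing_credentials(runner))

Credentials may also be configured elsewhere (for example in ``mrjob.conf``),
so treat a non-empty result as a hint rather than a hard failure.
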

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:

* ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See the `mrjob.conf documentation <https://mrjob.readthedocs.io/en/latest/guides/configs-basics.html>`_ for more
information.

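
The lookup described above can be pictured with a short sketch. It is an
illustration only, not mrjob's own code: ``candidate_conf_paths`` and
``find_conf_path`` are hypothetical names, and the sketch assumes the
locations are tried in the order listed and that the first existing file wins.

.. code-block:: python

    import os


    def candidate_conf_paths(environ=None):
        # Hypothetical helper: the locations listed above, in that order
        environ = os.environ if environ is None else environ
        paths = []
        if 'MRJOB_CONF' in environ:
            paths.append(os.path.expanduser(environ['MRJOB_CONF']))
        paths.append(os.path.expanduser('~/.mrjob.conf'))
        paths.append('/etc/mrjob.conf')
        return paths


    def find_conf_path(environ=None):
        # Return the first candidate that exists, or None if none do
        for path in candidate_conf_paths(environ):
            if os.path.exists(path):
                return path
        return None


    if __name__ == '__main__':
        print(find_conf_path())
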

* `Source code <http://github.com/Yelp/mrjob>`__
* `Documentation <https://mrjob.readthedocs.io/en/latest/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_
* `Hadoop Streaming <http://hadoop.apache.org/docs/stable1/streaming.html>`_
* `Elastic MapReduce <http://aws.amazon.com/documentation/elasticmapreduce/>`_
* `Google Cloud Dataproc <https://cloud.google.com/dataproc/overview>`_
* `PyCon 2011 mrjob overview <http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/>`_
* `Introduction to Recommendations and MapReduce with mrjob <http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html>`_
  (`source code <https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob>`__)
* `Social Graph Analysis Using Elastic MapReduce and PyPy <http://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy>`_

Thanks to `Greg Killion <mailto:greg@blind-works.net>`_
(`ROMEO ECHO_DELTA <http://www.romeoechodelta.net/>`_) for the logo.