probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods. This is based on usaddress, a python library for parsing addresses.
Try it out on our web interface! For those who aren't python developers, we also have an API.
What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.
What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.
probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.
Install probablepeople with pip, a tool for installing and managing python packages (beginner's guide here)
In the terminal,
pip install probablepeople
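To confirm the install worked, you can import the package and print the version pip installed (a minimal sketch; importlib.metadata requires python 3.8 or newer):
# quick sanity check that probablepeople installed correctly
from importlib.metadata import version
import probablepeople as pp
print(version('probablepeople'))  # the installed version string
print(pp.parse('Jane Doe'))       # should print a list of (token, label) tuples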
Parse some names/companies! Note that parse and tag are different methods:
import probablepeople as pp
name_str='Mr George "Gob" Bluth II'
corp_str='Sitwell Housing Inc'
# The parse method will split your string into components, and label each component.
pp.parse(name_str) # expected output: [('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
pp.parse(corp_str) # expected output: [('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]
# The tag method will try to be a little smarter
# it will merge consecutive components, strip commas, & return a string type
pp.tag(name_str) # expected output: (OrderedDict([('PrefixMarital', 'Mr'), ('GivenName', 'George'), ('Nickname', '"Gob"'), ('Surname', 'Bluth'), ('SuffixGenerational', 'II')]), 'Person')
pp.tag(corp_str) # expected output: (OrderedDict([('CorporationName', 'Sitwell Housing'), ('CorporationLegalType', 'Inc')]), 'Corporation')
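On ambiguous strings, tag can fail where parse still returns something usable. A hedged sketch, assuming probablepeople raises a RepeatedLabelError when it can't assign each label exactly once (as its sibling library usaddress does):
# fall back to parse() when tag() can't produce a clean result
import probablepeople as pp

def tag_or_parse(raw_string):
    try:
        return pp.tag(raw_string)                 # (OrderedDict of components, 'Person' or 'Corporation')
    except pp.RepeatedLabelError:                 # assumption: raised when a label would repeat
        return pp.parse(raw_string), 'Ambiguous'  # raw (token, label) pairs instead

print(tag_or_parse('Mr George "Gob" Bluth II'))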
Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.
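For a sense of what that .crfsuite settings file is, here is a bare-bones python-crfsuite training loop, independent of parserator. Illustrative only: the feature strings below are toy placeholders, and parserator's real feature extraction is much richer.
# illustrative only: train and query a tiny CRF with python-crfsuite directly
import pycrfsuite

# one training example: a feature sequence and its label sequence
xseq = [['token=mr', 'capitalized=True'], ['token=bluth', 'capitalized=True']]
yseq = ['PrefixMarital', 'Surname']

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)      # add as many labeled sequences as you have
trainer.train('toy.crfsuite')   # writes the model settings file to disk

tagger = pycrfsuite.Tagger()
tagger.open('toy.crfsuite')
print(tagger.tag(xseq))         # predicts one label per token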
To work on probablepeople itself, clone the repository, install it in editable mode, and run the tests:
git clone https://github.com/datamade/probablepeople.git
cd probablepeople
pip install -e .
pytest
If there are name/company formats that the parser isn't performing well on, you can add them to the training data. As probablepeople learns about new cases, it will become smarter and more robust.
NOTE: The model doesn't need many examples to learn about new patterns - if you are trying to get probablepeople to perform better on a specific type of name, start with a few (<5) examples, check performance, and then add more examples as necessary.
For this parser, we are keeping person names and organization names separate in the training data. The two training files used to produce the model are:
name_data/labeled/person_labeled.xml for people
name_data/labeled/company_labeled.xml for organizations
To add your own training examples, first put your unlabeled raw data in a csv. Then:
parserator label [infile] [outfile] probablepeople
[infile] is your raw csv and [outfile] is the appropriate training file to write to. For example, if you put raw strings in my_companies.csv, you'd use:
parserator label my_companies.csv name_data/labeled/company_labeled.xml probablepeople
The parserator label command will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
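Before retraining, it can be handy to check how much labeled data a training file holds. A sketch using the standard library, assuming the parserator layout of one child element per labeled string:
# count labeled examples in a parserator training file
import xml.etree.ElementTree as ET

root = ET.parse('name_data/labeled/company_labeled.xml').getroot()
print(len(root), 'labeled examples')  # one child element per labeled string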
If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.
parserator train [traindata] probablepeople
probablepeople allows for multiple model files: person for person names only, company for company names only, or generic (both). Here are examples of commands for training models:
parserator train name_data/labeled/person_labeled.xml,name_data/labeled/company_labeled.xml probablepeople --modelfile=generic
parserator train name_data/labeled/person_labeled.xml probablepeople --modelfile=person
parserator train name_data/labeled/company_labeled.xml probablepeople --modelfile=company
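After retraining, a quick spot check is to re-run strings you just labeled and confirm the parser now returns the type you expect (a minimal sketch; the example string and expected type are placeholders):
# spot-check a retrained model on strings you just labeled
import probablepeople as pp

checks = {'Sitwell Housing Inc': 'Corporation'}  # placeholder string and expected type
for raw, expected in checks.items():
    components, found = pp.tag(raw)
    print(raw, '->', found, '(expected', expected + ')')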
If something is not behaving intuitively, it is a bug and should be reported. Report it here by creating an issue: https://github.com/datamade/probablepeople/issues
Help us fix the problem as quickly as possible by following Mozilla's guidelines for reporting bugs.
Your patches are welcome; please send us a pull request.
Copyright (c) 2014 Atlanta Journal Constitution. Released under the MIT License.
FAQs
Parse romanized names & companies using advanced NLP methods
We found that probablepeople demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 2 open source maintainers collaborating on the project.