Security News
Fluent Assertions Faces Backlash After Abandoning Open Source Licensing
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
A tool for learning vector representations of words and entities from Wikipedia
Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.
This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.
This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.
Documentation are available online at http://wikipedia2vec.github.io/.
Wikipedia2Vec can be installed via PyPI:
% pip install wikipedia2vec
With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:
% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE
Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.
Wikipedia2Vec has been applied to the following tasks:
If you use Wikipedia2Vec in a scientific publication, please cite the following paper:
Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.
@inproceedings{yamada2020wikipedia2vec,
title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
year = {2020},
publisher = {Association for Computational Linguistics},
pages = {23--30}
}
The embedding model was originally proposed in the following paper:
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.
@inproceedings{yamada2016joint,
title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
year={2016},
publisher={Association for Computational Linguistics},
pages={250--259}
}
The text classification model implemented in this example was proposed in the following paper:
Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.
@article{yamada2019neural,
title={Neural Attentive Bag-of-Entities Model for Text Classification},
author={Yamada, Ikuya and Shindo, Hiroyuki},
booktitle={Proceedings of The 23th SIGNLL Conference on Computational Natural Language Learning},
year={2019},
publisher={Association for Computational Linguistics},
pages = {563--573}
}
FAQs
A tool for learning vector representations of words and entities from Wikipedia
We found that wikipedia2vec demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
Research
Security News
Socket researchers uncover the risks of a malicious Python package targeting Discord developers.
Security News
The UK is proposing a bold ban on ransomware payments by public entities to disrupt cybercrime, protect critical services, and lead global cybersecurity efforts.