nlp4bia

Download NLP4BIA benchmarks and load datasets in their format

  • 2.2.0
  • PyPI

NLP4BIA Library

This repository provides a Python library for loading, processing, and utilizing biomedical datasets curated by the NLP4BIA research group at the Barcelona Supercomputing Center (BSC). The datasets are specifically designed for natural language processing (NLP) tasks in the biomedical domain.


Available Dataset Loaders

The library currently provides loaders for the following datasets, all of which are part of public benchmarks:

1. Distemist

  • Description: A dataset for disease mention recognition and normalization in Spanish medical texts.
  • Zenodo Repository: Distemist Zenodo

2. Meddoplace

  • Description: A dataset for place name recognition in Spanish medical texts.
  • Zenodo Repository: Meddoplace Zenodo

3. Medprocner

  • Description: A dataset for procedure name recognition in Spanish medical texts.
  • Zenodo Repository: Medprocner Zenodo

4. Symptemist

  • Description: A dataset for symptom mention recognition in Spanish medical texts.
  • Zenodo Repository: Symptemist Zenodo

Installation

pip install nlp4bia

Quick Start Guide

Example Usage

Dataset Loaders

Here's how to use one of the dataset loaders, such as DistemistLoader:

from nlp4bia.datasets.benchmark.distemist import DistemistLoader

# Initialize loader
distemist_loader = DistemistLoader(lang="es", download_if_missing=True)

# Load and preprocess data
dis_df = distemist_loader.df
print(dis_df.head())

Dataset folders are automatically downloaded and extracted to the ~/.nlp4bia directory.
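To see what has already been downloaded, the cache directory can be inspected with the standard library. The following is a minimal sketch that relies only on the ~/.nlp4bia location stated above. The helper name list_cached_datasets is hypothetical and not part of nlp4bia, and the folder names it returns are whatever the loaders happened to create.

```python
from pathlib import Path


def list_cached_datasets(cache_dir=None):
    """Return the sorted names of the sub-folders found in the cache directory.

    Hypothetical helper, not part of nlp4bia. Defaults to ~/.nlp4bia.
    """
    base = Path(cache_dir) if cache_dir is not None else Path.home() / ".nlp4bia"
    if not base.is_dir():
        # Nothing has been downloaded yet, or the cache lives elsewhere.
        return []
    # Only directories are reported; stray files (e.g. archives) are ignored.
    return sorted(p.name for p in base.iterdir() if p.is_dir())
```

Calling list_cached_datasets() with no arguments inspects the default location; passing a path is mainly useful for testing.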

Preprocessor

Deduplication

from nlp4bia.preprocessor.deduplicator import HashDeduplicator

# Define the list of files to deduplicate
ls_files = ["path/to/file1.txt", "path/to/file2.txt"]

# Instantiate the deduplicator with 8 worker processes
hd = HashDeduplicator(ls_files, num_processes=8)

# Deduplicate the files and save the results to a CSV file
hd.get_deduplicated_files("path/to/deduplicated_contents.csv")
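
The general idea behind hash-based deduplication can be illustrated with the standard library alone. The sketch below is not HashDeduplicator's implementation: its internals, hash function, and CSV layout are not described here. It assumes exact byte-level duplicates, uses SHA-256, and runs in a single process. All function names are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_content(paths):
    """Map each content digest to the list of paths sharing that content."""
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(str(path))
    return groups


def unique_files(paths):
    """Keep the first path seen for each distinct content digest."""
    return [same[0] for same in group_by_content(paths).values()]
```

Files that differ only in whitespace or encoding would count as distinct under this scheme; catching those would require normalizing the text before hashing.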

Document Parser

PDFs

from nlp4bia.preprocessor.pdfparser import PDFParserMuPDF

# Define the path to the PDF file
pdf_path = "path/to/file.pdf"

# Instantiate the PDF parser
pdf_parser = PDFParserMuPDF(pdf_path)

# Extract the text from the PDF file
pdf_text = pdf_parser.extract_text()

Dataset Columns

  • filenameid: Unique identifier combining filename and offset information.
  • mention_class: The class of the mention (e.g., disease, symptom, etc.).
  • span: Text span corresponding to the mention.
  • code: The code the mention is normalized to (usually a SNOMED CT code).
  • sem_rel: Semantic relationships associated with the mention.
  • is_abbreviation: Indicates if the mention is an abbreviation.
  • is_composite: Indicates if the mention is a composite term.
  • needs_context: Indicates if the mention requires additional context.
  • extension_esp: Additional information specific to Spanish texts.
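
As an illustration of how the span and code columns relate, the sketch below collects the distinct surface forms annotated with each code. It works on plain dictionaries so that it runs without any dataset download. The rows are made-up placeholders rather than real annotations, and spans_by_code is a hypothetical helper, not part of nlp4bia.

```python
from collections import defaultdict

# Made-up placeholder rows; identifiers, spans, and codes are NOT real data.
rows = [
    {"filenameid": "example-1", "mention_class": "disease", "span": "Diabetes", "code": "CODE_A"},
    {"filenameid": "example-1", "mention_class": "disease", "span": "diabetes mellitus", "code": "CODE_A"},
    {"filenameid": "example-2", "mention_class": "disease", "span": "diabetes", "code": "CODE_A"},
    {"filenameid": "example-2", "mention_class": "disease", "span": "asma", "code": "CODE_B"},
]


def spans_by_code(records):
    """Collect the distinct lowercased spans annotated with each code."""
    grouped = defaultdict(set)
    for rec in records:
        grouped[rec["code"]].add(rec["span"].lower())
    return {code: sorted(spans) for code, spans in grouped.items()}


print(spans_by_code(rows))
```

If the loader's df attribute is a pandas DataFrame, which the head() call in the example above suggests but does not confirm, rows in this shape could be obtained with df.to_dict("records").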

Gazetteer Columns

  • code: Normalized code for the term.
  • language: Language of the term.
  • term: The term itself.
  • semantic_tag: Semantic tag associated with the term.
  • mainterm: Indicates if the term is a primary term.
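
One common use of a gazetteer with these columns is to turn a normalized code back into a readable label. The sketch below builds a code-to-main-term lookup from plain dictionaries. It assumes that mainterm holds a truthy value (such as 1 or True) for primary terms, which the column description does not confirm. The rows are made-up placeholders and main_terms is a hypothetical helper.

```python
# Made-up placeholder rows; codes and terms are NOT taken from a real gazetteer.
gazetteer = [
    {"code": "CODE_A", "language": "es", "term": "diabetes mellitus", "semantic_tag": "disorder", "mainterm": 1},
    {"code": "CODE_A", "language": "es", "term": "diabetes", "semantic_tag": "disorder", "mainterm": 0},
    {"code": "CODE_B", "language": "es", "term": "asma", "semantic_tag": "disorder", "mainterm": 1},
    {"code": "CODE_B", "language": "en", "term": "asthma", "semantic_tag": "disorder", "mainterm": 1},
]


def main_terms(gazetteer_rows, language="es"):
    """Map each code to its main term in the given language.

    Assumes `mainterm` is truthy for the primary term of a code.
    """
    lookup = {}
    for row in gazetteer_rows:
        if row["language"] == language and row["mainterm"]:
            # Keep the first main term seen if a code has several.
            lookup.setdefault(row["code"], row["term"])
    return lookup


print(main_terms(gazetteer))
```

If mainterm turns out to be stored as a string such as "0" or "1", the truthiness check would need to be replaced with an explicit comparison.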

Contributing

Contributions to expand the dataset loaders or improve existing functionality are welcome! Please open an issue or submit a pull request.


License

This project is licensed under the MIT License. See the LICENSE file for details.


References

If you use this library or its datasets in your research, please cite the corresponding Zenodo repositories or related publications.


Instructions for Maintainers

  1. Update the version in nlp4bia/__init__.py and in pyproject.toml.
  2. Remove the dist folder (rm -rf dist).
  3. Build the package (python -m build).
  4. Check the package (twine check dist/*).
  5. Upload the package (twine upload dist/*).
  6. Install the package (pip install nlp4bia).
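
Because step 1 requires the version to be updated in two places, a quick consistency check before building can catch a forgotten edit. The sketch below assumes that nlp4bia/__init__.py contains an assignment of the form __version__ = "..." and that pyproject.toml contains a line of the form version = "...". Neither layout is confirmed here. It uses a simple regular expression rather than a TOML parser, so it is only a rough check, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as `__version__ = "2.2.0"` or `version = "2.2.0"`.
VERSION_RE = re.compile(
    r'^\s*(?:__version__|version)\s*=\s*["\']([^"\']+)["\']', re.MULTILINE
)


def read_version(path):
    """Return the first version string assigned in the file, or None."""
    match = VERSION_RE.search(Path(path).read_text(encoding="utf-8"))
    return match.group(1) if match else None


def versions_match(init_path="nlp4bia/__init__.py", pyproject_path="pyproject.toml"):
    """Return True when both files declare the same, non-missing version."""
    init_version = read_version(init_path)
    project_version = read_version(pyproject_path)
    return init_version is not None and init_version == project_version
```

Running versions_match() from the repository root before python -m build would flag a mismatch early; a proper TOML parser would be more robust for projects with unusual pyproject.toml layouts.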
