
A Python package providing a language identification subpackage and an OCR quality assessment (OCR QA) score subpackage, each exposed through a pipeline-style interface.
This repository contains a Python package designed for efficient and modular processing. It currently includes subpackages for language identification (langident) and OCR quality assessment (ocrqa).
To install the package, use:
pip install impresso_pipelines[all]
If you want to install only the language identification pipeline, use:
pip install impresso_pipelines[langident]
If you want to install only the OCR QA pipeline, use:
pip install impresso_pipelines[ocrqa]
Import and use the subpackages as follows:
from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
# Initialize the pipeline
lang_pipeline = LangIdentPipeline()
# Example text in German
de_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf. Jeden Tag rannte er durch die Straßen und spielte mit den Kindern. Eines Tages fand er einen geheimen Garten, den niemand kannte. Max entschied sich, den Garten zu erkunden und entdeckte viele schöne Blumen und Tiere. Von diesem Tag an besuchte er den Garten jeden Nachmittag."
# Detect language
result = lang_pipeline(de_text)
print(result)
Expected Output:
{'language': 'de', 'score': 1.0}
The score is the probability assigned to the detected language for the input text.
# Initialize the pipeline
ocrqa_pipeline = OCRQAPipeline()
# Example text extracted from OCR
de_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf. Jeden Tag rannte er durch die Straßen und spielte mit den Kindern. Eines Tages fand er einen geheimen Garten, den niemand kannte. Max entschied sich, den Garten zu erkunden und entdeckte viele schöne Blumen und Tiere. Von diesem Tag an besuchte er den Garten jeden Nachmittag."
# Compute the OCR quality score
result = ocrqa_pipeline(de_text)
print(result)
Expected Output:
{'language': 'de', 'score': 1.0}
The score roughly represents the proportion of words in the text that are known, as opposed to unknown, when looked up in the language-specific Bloom filter database.
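To make the idea of a known-word score concrete, here is a minimal sketch. It is not the package's implementation: the function name `known_word_ratio` is hypothetical, the tokenization is a simple regular expression, and a plain Python set stands in for the Bloom filter.

```python
import re

def known_word_ratio(text: str, known_words: set) -> float:
    """Return the fraction of tokens in `text` that appear in `known_words`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

# Assumption: a small set replaces the language-specific Bloom filter.
lexicon = {"ein", "kleiner", "hund", "lebte", "in", "einem", "dorf"}

clean = "Ein kleiner Hund lebte in einem Dorf"
noisy = "Ein kleincr Hnnd lebte in einem Dorf"  # two simulated OCR errors

print(known_word_ratio(clean, lexicon))  # 1.0
print(known_word_ratio(noisy, lexicon))  # 5 of 7 tokens are known
```

A real Bloom filter trades a small false-positive rate for a much smaller memory footprint than a set, so a score computed that way can be slightly optimistic.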
flowchart TD
subgraph s1["(4) Mallet vectorizers"]
n3["Mallet input<br>converting pipeline"]
end
subgraph s2["(5) Mallet inferences"]
n5["mallet topic <br>modeling inference"]
end
subgraph s3["(6) JSONification"]
n6["Produce <br>JSON output"]
end
A["(1) Input text (str)"] --> n1["(2) Langident"]
n1 -- de/fr/lb --> n2["(3) Tokenizer<br>POS tagging<br>Lemmatizer<br>(spaCy)"]
n2 --> n3
n3 --> n5
s2 --> n6
n3@{ shape: rounded}
n5@{ shape: rounded}
n6@{ shape: rounded}
A@{ shape: rounded}
n1@{ shape: rounded}
n2@{ shape: rounded}
The pipeline starts with a text input in string format. This could be any textual data that needs to be analyzed.
The system uses a language identification tool to detect the language of the input text. Based on the output, the text is classified as German (de), French (fr), or Luxembourgish (lb).
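The language gate can be sketched as follows. This is an illustration under stated assumptions: the function `route_language` and the `min_score` threshold are hypothetical, and the input dictionary only mirrors the shape of the language identification output shown earlier.

```python
SUPPORTED_LANGUAGES = {"de", "fr", "lb"}

def route_language(langident_result: dict, min_score: float = 0.5):
    """Return the language code if it is supported and confident enough, else None.

    `min_score` is a hypothetical threshold; the source does not specify one.
    """
    language = langident_result.get("language")
    score = langident_result.get("score", 0.0)
    if language in SUPPORTED_LANGUAGES and score >= min_score:
        return language
    return None

print(route_language({"language": "de", "score": 1.0}))   # de
print(route_language({"language": "en", "score": 0.99}))  # None
```

Returning None for unsupported languages lets the caller decide whether to skip the document or raise an error.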
Once the language is identified, the text undergoes several preprocessing steps with spaCy: tokenization, part-of-speech tagging, and lemmatization.
The output is a list of lemmatized tokens, for example: ['ein', 'klein', 'Hund', 'namens', 'Max', 'leben', 'in', 'ein', 'ruhig', 'Dorf', ...]
The processed text is converted into a format suitable for MALLET topic modeling. This step likely includes text vectorization, where words are transformed into numerical representations.
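As one possible shape for this conversion, the sketch below serializes a lemma list into the one-instance-per-line layout (name, label, text) that MALLET's import-file command reads by default. Whether the package uses this layout is an assumption, and `to_mallet_line` is a hypothetical helper.

```python
def to_mallet_line(doc_id: str, language: str, lemmas: list) -> str:
    """Format one document as a tab-separated 'name, label, text' line."""
    # Whitespace inside a token would split it into several features, so drop it.
    cleaned = ["".join(lemma.split()) for lemma in lemmas]
    tokens = [token for token in cleaned if token]
    return f"{doc_id}\t{language}\t{' '.join(tokens)}"

lemmas = ["ein", "klein", "Hund", "namens", "Max", "leben", "in", "ein", "ruhig", "Dorf"]
line = to_mallet_line("doc1", "de", lemmas)
print(line)
```

Here the language code is placed in the label field purely for illustration.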
MALLET applies topic modeling, typically using Latent Dirichlet Allocation (LDA) or another probabilistic model. The system infers topics from the text.
The topic modeling results are formatted into JSON output. This output is likely structured with topic distributions, keywords, and document-topic probabilities, making it easier to use for downstream applications.
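A minimal sketch of such a JSON serialization is shown below. The field names (`doc_id`, `language`, `topics`, `topic_id`, `probability`) and the `min_prob` cut-off are hypothetical and are not taken from the package.

```python
import json

def topics_to_json(doc_id: str, language: str, topic_probs: list, min_prob: float = 0.05) -> str:
    """Serialize a document-topic distribution, keeping topics at or above `min_prob`."""
    topics = [
        {"topic_id": idx, "probability": round(prob, 3)}
        for idx, prob in enumerate(topic_probs)
        if prob >= min_prob
    ]
    # Most probable topics first, so consumers can truncate the list easily.
    topics.sort(key=lambda item: item["probability"], reverse=True)
    return json.dumps(
        {"doc_id": doc_id, "language": language, "topics": topics},
        ensure_ascii=False,
    )

output = topics_to_json("doc1", "de", [0.02, 0.7, 0.28])
print(output)
```

Dropping low-probability topics keeps the output compact, at the cost of the remaining probabilities no longer summing to one.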
For more examples, see the documentation notebooks langident_pipeline_demo.ipynb and ocrqa_pipeline_demo.ipynb.
More pipelines and subpackages will be added to enhance functionality and broaden use cases. Stay tuned!
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2024 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.