[ Paper ] [ Website ] [ Dataset (OpenDataLab)] [ Dataset (Hugging Face) ]
[Models 🤗(Hugging Face)] [Models (ModelScope)]
🔥🔥 CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Welcome to the official repository of UniMERNet, a solution that converts images of mathematical expressions into LaTeX, suitable for a wide range of real-world scenarios.
2024.09.06 🎉🎉 UniMERNet Update: The new version features a smaller model and faster inference. Training code is now open-sourced. For details, see the latest paper UniMERNet.
2024.09.06 🎉🎉 Introducing a new metric for formula recognition: CDM. Compared to BLEU/EditDistance, CDM provides a more intuitive and accurate evaluation score, allowing for fair comparison of different models without being affected by formula expression diversity.
2024.07.21 🎉🎉 Added a Math Formula Detection (MFD) tutorial based on the PDF-Extract-Kit MFD model.
2024.06.06 🎉🎉 Open-sourced evaluation code for UniMER dataset.
2024.05.06 🎉🎉 Open-sourced UniMER dataset, including UniMER-1M for model training and UniMER-Test for MER evaluation.
2024.05.06 🎉🎉 Added a Streamlit formula recognition demo and a local deployment app.
2024.04.24 🎉🎉 Paper now available on ArXiv.
2024.04.24 🎉🎉 Inference code and checkpoints have been released.
https://github.com/opendatalab/UniMERNet/assets/69186975/ac54c6b9-442c-48b0-95f9-a4a3fce8780b
https://github.com/opendatalab/UniMERNet/assets/69186975/09b71c55-c58a-4792-afc1-d5774880ccf8
git clone https://github.com/opendatalab/UniMERNet.git
cd UniMERNet/models
# Download the model and tokenizer individually or use git-lfs
git lfs install
git clone https://huggingface.co/wanderkid/unimernet_base # 1.3GB
git clone https://huggingface.co/wanderkid/unimernet_small # 773MB
git clone https://huggingface.co/wanderkid/unimernet_tiny # 441MB
# you can also download the model from ModelScope
git clone https://www.modelscope.cn/wanderkid/unimernet_base.git
git clone https://www.modelscope.cn/wanderkid/unimernet_small.git
git clone https://www.modelscope.cn/wanderkid/unimernet_tiny.git
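If you only need one checkpoint, the clone commands above can be scripted. The following is a minimal sketch, not part of UniMERNet: the repository URLs are taken from the commands above, while `HOSTS`, `SIZES`, `clone_command`, and `download` are hypothetical helper names introduced for illustration. It assumes `git` and `git-lfs` are installed and that you run it from `UniMERNet/models`.

```python
# Sketch: build (and optionally run) the `git clone` command for one checkpoint.
# URLs come from the README commands; all names here are hypothetical helpers.
import subprocess

HOSTS = {
    "huggingface": "https://huggingface.co/wanderkid/unimernet_{size}",
    "modelscope": "https://www.modelscope.cn/wanderkid/unimernet_{size}.git",
}
SIZES = ("base", "small", "tiny")


def clone_command(size: str, host: str = "huggingface") -> list[str]:
    """Return the git command that clones one checkpoint repository."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if host not in HOSTS:
        raise ValueError(f"unknown host {host!r}; expected one of {tuple(HOSTS)}")
    return ["git", "clone", HOSTS[host].format(size=size)]


def download(size: str, host: str = "huggingface", dry_run: bool = True) -> list[str]:
    """Print the command, and execute it only when dry_run is False."""
    cmd = clone_command(size, host)
    print(" ".join(cmd))
    if not dry_run:
        # Needs git and git-lfs on PATH; clones into the current directory.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `download("tiny")` only prints the command; pass `dry_run=False` to actually clone the repository.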
Create a clean Conda environment
conda create -n unimernet python=3.10
conda activate unimernet
Method 1: Install via pip (recommended for general users)
pip install -U "unimernet[full]"
Method 2: Local installation (recommended for developers)
pip install -e ."[full]"
Streamlit Application: For an interactive and user-friendly experience, use our Streamlit-based GUI. This application allows real-time formula recognition and rendering.
unimernet_gui
Ensure you have the latest version of UniMERNet installed (pip install --upgrade unimernet && pip install "unimernet[full]") for the Streamlit GUI application.
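To see which version is currently installed before launching the GUI, you can query the package metadata from Python. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `unimernet` as in the pip commands above.

```python
# Sketch: report the installed unimernet version, if any.
# `installed_version` is a hypothetical helper, not part of UniMERNet.
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "unimernet") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print('unimernet is not installed; run: pip install -U "unimernet[full]"')
    else:
        print(f"unimernet {found} is installed")
```

The check only reports the local version; it does not compare against the latest release on PyPI.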
Command-line Demo: Predict LaTeX code from an image.
python demo.py
Jupyter Notebook Demo: Recognize and render formula from an image.
jupyter-lab ./demo.ipynb
UniMERNet significantly outperforms mainstream models in recognizing real-world mathematical expressions, demonstrating superior performance across Simple Printed Expressions (SPE), Complex Printed Expressions (CPE), Screen-Captured Expressions (SCE), and Handwritten Expressions (HWE), as evidenced by the comparative BLEU Score evaluation.
Because the same formula can be written in many equivalent ways, comparing models by BLEU alone is unfair. We therefore also evaluate with CDM, a metric designed specifically for formula recognition. Under CDM, our method far outperforms open-source models and performs on par with the commercial software Mathpix. CDM@ExpRate is the proportion of formulas that are predicted completely correctly. Refer to the CDM paper for details.
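As an illustration of the CDM@ExpRate definition, the sketch below computes the proportion of fully correct formulas from a list of per-formula scores. It rests on assumptions not stated here: that each formula already has a CDM score in [0, 1] produced by the CDM tooling, and that a score of 1.0 marks a completely correct prediction. The function name `exp_rate` is hypothetical; see the CDM paper for the authoritative definition.

```python
# Sketch: expression rate from precomputed per-formula CDM scores.
# Assumes scores lie in [0, 1] and 1.0 means a completely correct formula.
from typing import Sequence


def exp_rate(cdm_scores: Sequence[float], threshold: float = 1.0) -> float:
    """Return the fraction of formulas whose score reaches the threshold."""
    if not cdm_scores:
        raise ValueError("cdm_scores must not be empty")
    correct = sum(1 for score in cdm_scores if score >= threshold)
    return correct / len(cdm_scores)


# Two of the four formulas are fully correct.
print(exp_rate([1.0, 0.92, 1.0, 0.5]))  # → 0.5
```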
UniMERNet excels in visual recognition of challenging samples, outperforming other methods.
The UniMER dataset is a specialized collection curated to advance the field of Mathematical Expression Recognition (MER). It encompasses the comprehensive UniMER-1M training set, featuring over one million instances that represent a diverse and intricate range of mathematical expressions, coupled with the UniMER Test Set, meticulously designed to benchmark MER models against real-world scenarios. The dataset details are as follows:
UniMER-1M Training Set:
UniMER Test Set:
You can download the dataset from OpenDataLab (recommended for users in China) or HuggingFace.
Download the UniMER-1M dataset and extract it to the following directory:
./data/UniMER-1M
Download the UniMER-Test dataset and extract it to the following directory:
./data/UniMER-Test
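Before training or testing, it can help to confirm that both datasets were extracted to the directories listed above. The following sketch checks only that the two directories exist relative to the repository root; it makes no assumption about their contents. `EXPECTED_DIRS` and `missing_dataset_dirs` are hypothetical names introduced for illustration.

```python
# Sketch: verify that the dataset directories from the README exist.
# Only directory presence is checked; file layout inside is not assumed.
from pathlib import Path

EXPECTED_DIRS = ("data/UniMER-1M", "data/UniMER-Test")


def missing_dataset_dirs(repo_root: str = ".") -> list[str]:
    """Return the expected dataset directories that are not present."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("Both dataset directories are in place.")
```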
To train the UniMERNet model, follow these steps:
Specify the Training Dataset Path: Open the configs/train folder and set the path to your training dataset.
Run the Training Script: Execute the following command to start the training process.
bash script/train.sh
Ensure that the dataset path set in the configs/train folder is correct and accessible.
To test the UniMERNet model, follow these steps:
Specify the Test Dataset Path: Open the configs/val folder and set the path to your test dataset.
Run the Test Script: Execute the following command to start the testing process.
bash script/test.sh
Ensure that the dataset path set in the configs/val folder is correct and accessible. The test.py script will use the specified test dataset for evaluation. Remember to change the test set path in test.py to your actual path.
Formula recognition first requires detecting the regions of a PDF or webpage screenshot that contain formulas. PDF-Extract-Kit includes a powerful model for detecting formulas. If you wish to perform both formula detection and recognition yourself, refer to the Formula Detection Tutorial for guidance on deploying and using the formula detection model.
[✅] Release inference code and checkpoints of UniMERNet.
[✅] Release UniMER-1M and UniMER-Test.
[✅] Open-source the Streamlit formula recognition GUI application.
[✅] Release the training code for UniMERNet.
If you find our models / code / papers useful in your research, please consider giving us a star ⭐ and citing our work 📝, thank you :)
@misc{wang2024unimernetuniversalnetworkrealworld,
title={UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition},
author={Bin Wang and Zhuangcheng Gu and Guang Liang and Chao Xu and Bo Zhang and Botian Shi and Conghui He},
year={2024},
eprint={2404.15254},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2404.15254},
}
@misc{wang2024cdmreliablemetricfair,
title={CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation},
author={Bin Wang and Fan Wu and Linke Ouyang and Zhuangcheng Gu and Rui Zhang and Renqiu Xia and Bo Zhang and Conghui He},
year={2024},
eprint={2409.03643},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.03643},
}
If you have any questions, comments, or suggestions, please do not hesitate to contact us at wangbin@pjlab.org.cn.