
Texify is an OCR model that converts images or PDFs containing math into markdown and LaTeX that can be rendered by MathJax (`$$` and `$` are the delimiters). It can run on CPU, GPU, or MPS.
https://github.com/VikParuchuri/texify/assets/913340/882022a6-020d-4796-af02-67cb77bc084c
Texify can work with block equations, or equations mixed with text (inline). It will convert both the equations and the text.
The closest open source comparisons to texify are pix2tex and nougat, although they're designed for different purposes.
Pix2tex is trained on im2latex, and nougat is trained on arxiv. Texify is trained on a more diverse set of web data, and works on a range of images.
See more details in the benchmarks section.
Discord is where we discuss future development.
Note I added spaces after _ symbols and removed , because Github math formatting is broken.

Detected Text The potential $V_ i$ of cell $\mathcal{C}_ i$ centred at position $\mathbf{r}_ i$ is related to the surface charge densities $\sigma_ j$ of cells $\mathcal{C}_ j$ $j\in[1,N]$ through the superposition principle as: $$V_ i = \sum_ {j=0}^{N} \frac{\sigma_ j}{4\pi\varepsilon_ 0} \int_ {\mathcal{C}_ j} \frac{1}{|\mathbf{r}_ i-\mathbf{r}'|} \mathrm{d}^2\mathbf{r}' = \sum_{j=0}^{N} Q_ {ij} \sigma_ j,$$ where the integral over the surface of cell $\mathcal{C}_ j$ only depends on $\mathcal{C}_ j$ shape and on the relative position of the target point $\mathbf{r}_ i$ with respect to $\mathcal{C}_ j$ location, as $\sigma_ j$ is assumed constant over the whole surface of cell $\mathcal{C}_ j$.
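The detected text above mixes inline math (`$...$`) with a block equation (`$$...$$`). If you want to handle the two kinds separately, a minimal sketch like the following can split them apart. It is not part of texify; `split_math` and `MATH_PATTERN` are hypothetical names, and the regex assumes the delimiters are balanced and that no literal `\$` appears in the text.

```python
import re

# Block math ($$...$$) is listed first in the alternation so that the
# inline pattern ($...$) does not consume the double delimiters.
MATH_PATTERN = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def split_math(markdown: str) -> dict:
    """Return the block and inline LaTeX snippets found in texify-style output."""
    found = {"block": [], "inline": []}
    for match in MATH_PATTERN.finditer(markdown):
        block, inline = match.groups()
        if block is not None:
            found["block"].append(block.strip())
        else:
            found["inline"].append(inline.strip())
    return found


# A shortened version of the detected text, used only for illustration.
sample = (
    r"The potential $V_i$ of cell $\mathcal{C}_i$ is "
    r"$$V_i = \sum_{j} Q_{ij} \sigma_j,$$ where $\sigma_j$ is constant."
)
parts = split_math(sample)
print(parts["inline"])
print(parts["block"])
```

Matching the block form first is the key design choice here; a production parser would also need to handle escaped dollar signs and code spans.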
| Image | OCR Markdown | 
|---|---|
| 1 | 1 | 
| 2 | 2 | 
| 3 | 3 | 
You'll need Python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See the PyTorch installation instructions for more details.
Install with:
`pip install texify`
Model weights will automatically download the first time you run it.
You can inspect the settings in `texify/settings.py`. You can override any settings with environment variables, for example `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. There is also a `TEMPERATURE` setting.

I've included a streamlit app that lets you interactively select and convert equations from images or PDF files. Run it with:
```shell
pip install streamlit streamlit-drawable-canvas-jsretry watchdog
texify_gui
```
The app will allow you to select the specific equations you want to convert on each page, then render the results with KaTeX and enable easy copying.
You can OCR a single image or a folder of images with:
```shell
texify /path/to/folder_or_file --max 8 --json_path results.json
```
- `--max` is how many images in the folder to convert at most. Omit this to convert all images in the folder.
- `--json_path` is an optional path to a json file where the results will be saved. If you omit this, the results will be saved to `data/results.json`.
- `--katex_compatible` will make the output more compatible with KaTeX.

You can import texify and run it in python code:
```python
from texify.inference import batch_inference
from texify.model.model import load_model
from texify.model.processor import load_processor
from PIL import Image

# Load the model and processor once, then reuse them for every image
model = load_model()
processor = load_processor()

img = Image.open("test.png") # Your image name here
results = batch_inference([img], model, processor)
```
See texify/output.py:replace_katex_invalid if you want to make the output more compatible with KaTeX.
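To run the same Python API over a folder of images, a sketch along these lines may help. It uses only the texify calls shown above; `list_images`, `ocr_folder`, and `IMAGE_SUFFIXES` are hypothetical helpers, not part of texify. It assumes that `batch_inference` returns one string per input image, and that `TORCH_DEVICE` has to be set before texify is first imported for the override to take effect.

```python
import os
from pathlib import Path

# Assumption: these are the image formats you want to pick up from the folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(folder, limit=None):
    """Return image paths in `folder`, sorted, optionally capped (similar to --max)."""
    paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return paths if limit is None else paths[:limit]


def ocr_folder(folder, limit=None, device=None):
    """Run texify on the images in a folder and map file name -> output text."""
    if device is not None:
        # Assumption: settings are read from the environment at import time,
        # so the override is set before texify is imported below.
        os.environ["TORCH_DEVICE"] = device

    # Imported lazily so that list_images works without texify installed.
    from PIL import Image
    from texify.inference import batch_inference
    from texify.model.model import load_model
    from texify.model.processor import load_processor

    model = load_model()
    processor = load_processor()
    paths = list_images(folder, limit)
    images = [Image.open(p) for p in paths]
    # Assumption: one result string per image, in input order.
    results = batch_inference(images, model, processor)
    return {p.name: text for p, text in zip(paths, results)}
```

Calling `ocr_folder("/path/to/folder", limit=8, device="cuda")` would then roughly mirror the command line example. This sketch passes every image in a single call, so for large folders you may want to split `images` into smaller batches.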
If you want to develop texify, you can install it manually:
```shell
git clone https://github.com/VikParuchuri/texify.git
cd texify
poetry install # Installs main and dev dependencies
```

OCR is complicated, and texify is not perfect. Here are some known limitations:

- [...] `TEMPERATURE` setting.

Benchmarking OCR quality is hard - you ideally need a parallel corpus that models haven't been trained on. I sampled from arxiv and im2latex to create the benchmark set.

Each model is trained on one of the benchmark tasks: pix2tex on im2latex, nougat on arxiv, and texify on data that includes im2latex.
Although this makes the benchmark results biased, it does seem like a good compromise, since nougat and pix2tex don't work as well out of domain. Note that neither pix2tex nor nougat is really designed for this task (OCR of inline equations and text), so this is not a perfect comparison.
| Model | BLEU ⬆ | METEOR ⬆ | Edit Distance ⬇ | 
|---|---|---|---|
| pix2tex | 0.382659 | 0.543363 | 0.352533 | 
| nougat | 0.697667 | 0.668331 | 0.288159 | 
| texify | 0.842349 | 0.885731 | 0.0651534 | 
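The edit distance values in the table lie between 0 and 1, which suggests a normalized character-level distance. The exact normalization used by the benchmark is not stated here, so the following is only a sketch of one common choice: Levenshtein distance divided by the length of the longer string. `levenshtein` and `normalized_edit_distance` are illustrative names, not functions from texify.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means identical strings (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance(r"\frac{a}{b}", r"\frac{a}{c}"))
```

Under this definition lower is better, matching the ⬇ in the table header: a score near 0 means the predicted LaTeX is almost character-for-character identical to the reference.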
You can benchmark the performance of texify on your machine.
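- If you want to run pix2tex, install it with `pip install pix2tex`.
- If you want to run nougat, install it with `pip install nougat-ocr`.
- Put the benchmark data in the `data` folder.
- Run `benchmark.py` like this:

```shell
pip install tabulate
python benchmark.py --max 100 --pix2tex --nougat --data_path data/bench_data.json --result_path data/bench_results.json
```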
This will benchmark texify against pix2tex and nougat. It will do batch inference with texify and nougat, but not with pix2tex, since I couldn't find an option for batching.
- `--max` is how many benchmark images to convert at most.
- `--data_path` is the path to the benchmark data. If you omit this, it will use the default path.
- `--result_path` is the path to the benchmark results. If you omit this, it will use the default path.
- `--pix2tex` specifies whether to run pix2tex (Latex-OCR) or not.
- `--nougat` specifies whether to run nougat or not.

Texify was trained on latex images and paired equations from across the web. It includes the im2latex dataset. Training happened on 4x A6000s for 2 days (~6 epochs).
This model is trained on top of the openly licensed Donut model, and thus can be used for commercial purposes. Model weights are licensed under the CC BY-SA 4.0 license.
This work would not have been possible without lots of amazing open source work. I particularly want to acknowledge Lukas Blecher, whose work on Nougat and pix2tex was key for this project. I learned a lot from his code, and used parts of it for texify.