This article presents a naive but informative benchmark comparing traditional OCR technologies with Vision Language Models (VLMs) for processing scanned documents. Using the FUNSD dataset of noisy scanned forms, the study evaluates 10 different OCR solutions across multiple metrics including text similarity, word error rate, character error rate, and processing time. The results show VLMs (particularly Qwen and Mistral) significantly outperform traditional OCR methods in accuracy, especially for complex layouts and poor scan quality, though at the cost of longer processing times. The article provides practical recommendations for when to use each approach based on document complexity, volume requirements, and cost considerations. Complete benchmark code is available on GitHub for further exploration and improvement.
This article will be the stepping stone for a long series of articles about designing a robust and highly efficient knowledge base.
Why did I start with this benchmark? Because scanned documents are a huge pain to deal with, and they still make up a big part of the work of a lot of companies.
When I first searched for material on this subject, I found a surprising lack of comprehensive benchmarks comparing traditional OCR methods with VLM-based approaches, particularly for scanned document processing.
That gap motivated me to create this quick, naive benchmark.
Why naive? Because I didn't want to spend too much time on this, so I just used the FUNSD dataset and a few models. I also didn't do any pre-processing of the images, since I was only after a quick benchmark and pre-processing is a whole can of worms.
I did explore some quick preprocessing (denoising, binarization, resizing), but I didn't pursue it since I wasn't seeing a meaningful difference in accuracy (and sometimes even a big drop).
The benchmark compares various OCR technologies against ground truth data from the FUNSD dataset (Form Understanding in Noisy Scanned Documents), with a specific focus on how VLMs perform relative to established OCR methods.
The point is mainly to see how well VLMs perform compared to traditional OCR methods.
In this post, we will focus on the benchmark and the results, since the whole code is available on GitHub with an exhaustive README on how to recreate and run it locally.
Don't hesitate to clone it, improve it, and open a PR :)
I put together a quick benchmarking toolkit that evaluates multiple OCR methods across several key metrics.
Compared 10 different OCR solutions across three categories:
- Traditional OCR engines (Tesseract)
- Deep learning OCR models (EasyOCR, PaddleOCR, DocTR, Docling, KerasOCR)
- Vision-Language Models (Qwen, Mistral, Pixtral, Gemini)
- Amazon Textract (for comparison)
Used the FUNSD dataset of noisy scanned forms with precise annotations for consistent evaluation.
Example images from the dataset:
Assessed performance using complementary metrics including text similarity, word error rate (WER), character error rate (CER), common word accuracy, and processing time.
Generated ground truth from both dataset annotations and VLM (Gemini 2.5 with reflection) outputs to provide multiple reference points.
The benchmark uses the FUNSD (Form Understanding in Noisy Scanned Documents) dataset, which consists of noisy scanned forms with annotations for text, layout, and form understanding tasks.
You can find the complete, reproducible sampling process from the FUNSD dataset on GitHub.
This ensures consistent testing across OCR methods and lets others replicate the benchmark using the same document set I did.
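For context, the sampling boils down to something like the sketch below. The paths, sample size, and seed are placeholders (and the `images`/`annotations` folder names assume the standard FUNSD archive layout); the exact script is in the repo.

```python
import random
import shutil
from pathlib import Path

# Placeholder paths and parameters -- adjust to wherever FUNSD was extracted.
FUNSD_TEST_DIR = Path("dataset/testing_data")
SAMPLE_DIR = Path("benchmark_sample")
SAMPLE_SIZE = 20
SEED = 42

def sample_funsd_images() -> None:
    """Copy a fixed, reproducible subset of FUNSD test images and their annotations."""
    images = sorted((FUNSD_TEST_DIR / "images").glob("*.png"))
    random.seed(SEED)
    subset = random.sample(images, SAMPLE_SIZE)

    (SAMPLE_DIR / "images").mkdir(parents=True, exist_ok=True)
    (SAMPLE_DIR / "annotations").mkdir(parents=True, exist_ok=True)
    for image in subset:
        shutil.copy(image, SAMPLE_DIR / "images" / image.name)
        annotation = FUNSD_TEST_DIR / "annotations" / f"{image.stem}.json"
        shutil.copy(annotation, SAMPLE_DIR / "annotations" / annotation.name)
```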
The benchmark has one annotation-based ground truth source and one VLM-based high-performance reference:
Annotation-based ground truth extracted from the FUNSD dataset:
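As a minimal sketch, assuming the standard FUNSD JSON layout (a top-level "form" list where each text block carries its transcription), the annotation-based extraction looks roughly like this; the version in the repo may differ in details:

```python
import json

def ground_truth_from_annotations(annotation_path: str) -> str:
    """Flatten a FUNSD annotation file into plain text, one form block per line."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Each entry in "form" is a text block (question, answer, header, ...)
    # whose "text" field holds the human transcription.
    lines = [block["text"] for block in data.get("form", []) if block.get("text")]
    return "\n".join(lines)
```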
VLM-based ground truth using high-performance models like Gemini 2.5 with reflection:
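And a sketch of the VLM side, going through OpenRouter's OpenAI-compatible API. The model slug and prompt are placeholders, and the reflection/second-pass step is omitted here; it is only meant to show the shape of the call.

```python
import base64
from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")
MODEL = "google/gemini-2.5-pro"  # placeholder slug -- adjust to the model you actually use

PROMPT = (
    "Transcribe all text visible in this scanned form. "
    "Preserve the reading order and keep labels next to their values. "
    "Return plain text only, no commentary."
)

def ground_truth_from_vlm(image_path: str) -> str:
    """Ask a high-end VLM to transcribe the scanned form and use it as a reference."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```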
The VLM-based approach generates more structured and sometimes more complete text extractions (with the risk of hallucination).
This dual approach allows evaluation against both human-annotated data and state-of-the-art VLM interpretations. The VLM approach often captures more context and formatting, while annotation-based ground truth tends to be more direct.
I created a modular framework for running different OCR methods with a consistent interface:
Each method follows the same pattern, allowing straightforward comparison and extension to new OCR technologies.
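Here is the kind of interface I mean, as a simplified sketch. The registry, the `register` decorator, and the two example backends are illustrative names, not the exact code from the repo.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class OCRResult:
    text: str
    processing_time: float

# Registry of OCR backends: each entry maps a method name to a function
# that takes an image path and returns the extracted text.
OCR_METHODS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that adds an OCR backend to the registry."""
    def wrapper(fn: Callable[[str], str]):
        OCR_METHODS[name] = fn
        return fn
    return wrapper

@register("tesseract")
def run_tesseract(image_path: str) -> str:
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))

@register("easyocr")
def run_easyocr(image_path: str) -> str:
    import easyocr
    reader = easyocr.Reader(["en"], gpu=False)
    return "\n".join(reader.readtext(image_path, detail=0))

def run_method(name: str, image_path: str) -> OCRResult:
    """Run one registered OCR backend on a single image and time it."""
    start = time.perf_counter()
    text = OCR_METHODS[name](image_path)
    return OCRResult(text=text, processing_time=time.perf_counter() - start)
```

Adding a new OCR technology is then just a matter of writing one function and decorating it.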
The benchmark runs all OCR methods against the sample dataset:
This processes each image with every specified OCR method and measures execution time.
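Conceptually, the run is just a nested loop over images and methods; a minimal sketch (output file name arbitrary, `OCR_METHODS` reused from the previous sketch in the usage comment):

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict

def run_benchmark(image_dir: str, methods: Dict[str, Callable[[str], str]]) -> dict:
    """Run every OCR method on every sampled image, recording text and timing."""
    results: dict = {}
    for image_path in sorted(Path(image_dir).glob("*.png")):
        per_image = {}
        for name, ocr_fn in methods.items():
            start = time.perf_counter()
            text = ocr_fn(str(image_path))
            per_image[name] = {
                "text": text,
                "processing_time": time.perf_counter() - start,
            }
        results[image_path.name] = per_image

    # Persist raw outputs so the metrics can be computed (and re-computed) separately.
    with open("ocr_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return results

# Example, using the registry from the previous sketch:
# run_benchmark("benchmark_sample/images", OCR_METHODS)
```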
I evaluated each OCR method using multiple complementary metrics:
- Text Similarity: Overall textual similarity using difflib's SequenceMatcher
- Word Error Rate (WER): Word-level edit distance normalized by reference length
- Character Error Rate (CER): Character-level edit distance for finer-grained assessment
- Common Word Accuracy: Percentage of reference words present in the OCR output
- Processing Time: Execution time per image
Here's how each metric is implemented:
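The exact implementation is in the repo; as a compact sketch, the metrics look roughly like this. I use the `jiwer` package here as a stand-in for WER/CER (the repo may compute the edit distances directly), while text similarity uses difflib's SequenceMatcher as described above.

```python
from difflib import SequenceMatcher
import jiwer  # pip install jiwer -- stand-in for word/character error rates

def text_similarity(reference: str, hypothesis: str) -> float:
    """Overall similarity ratio between reference and OCR output (0..1)."""
    return SequenceMatcher(None, reference, hypothesis).ratio()

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance normalized by the reference length."""
    return jiwer.wer(reference, hypothesis)

def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance for finer-grained assessment."""
    return jiwer.cer(reference, hypothesis)

def common_word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of unique reference words that appear in the OCR output, order ignored."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    if not ref_words:
        return 0.0
    return len(ref_words & hyp_words) / len(ref_words)
```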
These metrics provide different perspectives on OCR quality:
- Text similarity gives an overall view of how close the extracted text is to the ground truth
- WER focuses on correctly identified words (with position/order)
- CER provides character-level accuracy for fine-grained analysis
- Common word accuracy shows how well key terms are captured regardless of order
The benchmark revealed several important insights about the OCR landscape:
1. As expected, VLMs Outperform Traditional OCR for Accuracy
- VLM models (particularly Qwen and Mistral) achieved text similarity scores up to 3-4 times higher than traditional OCR methods on complex scanned documents.
- VLMs demonstrated superior performance on documents with complex layouts, handwriting, or poor scan quality.
2. Performance Trade-offs
- While VLMs delivered higher accuracy, they had significantly longer processing times (5-10x slower than traditional OCR engines).
- Deep learning OCR methods like PaddleOCR and EasyOCR offered a middle ground with better accuracy than Tesseract and faster processing than VLMs.
3. Error Pattern Differences
- Traditional OCR methods struggled with layout interpretation, often failing to properly follow multi-column formats, which is expected since they were not trained to detect layout.
- VLMs excelled at contextual understanding, correctly interpreting forms with tables, checkboxes, and mixed formatting.
- Character-level errors were most common in traditional OCR, while VLMs occasionally hallucinated text or made semantic interpretation errors.
4. Method-Specific Strengths
- Tesseract: Fast but struggled with complex layouts
- PaddleOCR: Good balance of speed and accuracy
- Qwen-VLM: Highest overall accuracy but slowest processing time
- Mistral-VLM: Strong layout understanding with competitive accuracy
This visualization shows how each OCR method performs across text similarity, word error rate, character error rate, and common word accuracy. The VLM models (particularly Qwen and Mistral) consistently show higher accuracy metrics than traditional OCR methods.
The timing comparison is not so relevant, since I sometimes ran on a T4 GPU and sometimes on Apple MPS.
For the VLMs, I used OpenRouter without any effort to optimize inference time, so the routing was not the best.
For time-sensitive applications, the processing time differences are significant:
- Traditional OCR (Tesseract): ~0.5 seconds per image
- Deep Learning OCR (EasyOCR, PaddleOCR, DocTR, Docling): 1-3 seconds per image
- VLM-based OCR (Qwen, Mistral, Pixtral, Gemini): 5-15 seconds per image
(This is highly biased: Pixtral's inference time depended heavily on the fact that no provider offered good inference speed for it, and I didn't want to host it myself.)
Based on the benchmark results, here are my recommendations for selecting an OCR solution:
1. High-Volume, Basic Documents, Very Low Cost: Traditional OCR engines like Tesseract remain the best choice for processing large volumes of simple, well-formatted documents where speed and cost are critical. No expensive backend, no GPU, no fancy models needed.
2. Balanced Requirements: Deep learning OCR methods like PaddleOCR and DocTR offer a good compromise between accuracy and processing time for most business applications, and with fine-tuning they can be even better for the right use case.
3. Complex Documents with High-Value Information: VLM-based approaches justify their higher cost and processing time when working with complex documents where accuracy is paramount, such as legal contracts, medical records, or financial statements.
There are 3 main drawbacks to using VLM-based approaches:
- Cost, which can be mitigated by using the right model and provider (I think of Gemini 2.0: cheap, accurate, and quick) or by self-hosting (Qwen2.5 72B, for example).
- The non-deterministic nature of VLM output.
- Data privacy concerns, which can be mitigated by self-hosting the model (though that requires GPU infrastructure, which is generally more costly).
4. Handwritten Content: VLMs significantly outperform traditional OCR for handwritten text, making them the clear choice for documents with substantial handwritten components.
This naive and very humble benchmark provides a foundation for OCR technology selection, but several areas warrant further investigation:
1. Domain-Specific Testing: Expanding the benchmark to industry-specific document types (invoices, medical records, etc.)
2. Multi-Language Evaluation: Assessing OCR performance across different languages and scripts
3. Cost-Benefit Analysis: Developing a framework to balance accuracy improvements against increased processing costs for VLM approaches
If you're working on document digitization challenges or have any insights about this, I'd love to connect. Drop a comment below or reach out directly to discuss.
For those interested in running the benchmark themselves or extending it to include additional OCR methods, the complete toolkit is available on GitHub at OCR Benchmarking.