
OCR vs VLM-OCR: Naive Benchmarking Accuracy for Scanned Documents

This article presents a naive but informative benchmark comparing traditional OCR technologies with Vision Language Models (VLMs) for processing scanned documents. Using the FUNSD dataset of noisy scanned forms, the study evaluates 10 different OCR solutions across multiple metrics including text similarity, word error rate, character error rate, and processing time. The results show VLMs (particularly Qwen and Mistral) significantly outperform traditional OCR methods in accuracy, especially for complex layouts and poor scan quality, though at the cost of longer processing times. The article provides practical recommendations for when to use each approach based on document complexity, volume requirements, and cost considerations. Complete benchmark code is available on GitHub for further exploration and improvement.


This article is the stepping stone for a long series of articles about designing a robust and highly efficient knowledge base.

Problem Statement

Why did I start with this benchmark? Because scanned documents are a huge pain to deal with and still make up a large share of the workload at many companies.

When I first searched for this subject, I found a surprising lack of comprehensive benchmarks comparing traditional OCR methods with VLM-based approaches, particularly for scanned document processing.

This motivated me to create this quick and naive benchmark.

Why naive? Because I didn't want to spend much time on it: I just used the FUNSD dataset and a handful of models. I also skipped image pre-processing, since I was only after a quick benchmark and pre-processing is a whole can of worms.

I did explore some quick pre-processing (denoising, binarization, resizing), but dropped it since it made little difference in accuracy, and sometimes even caused a big drop.

Image pre-processing attempt

The benchmark compares various OCR technologies against ground truth data from the FUNSD dataset (Form Understanding in Noisy Scanned Documents), with a specific focus on understanding how VLMs perform relative to established OCR methods.

In this post, we will focus on the benchmark and the results, as the whole code is available on GitHub with an exhaustive README on how to reproduce and run it locally.

Don't hesitate to clone it, improve it, and open a PR :)

Approach Overview

I built a quick benchmarking toolkit that evaluates multiple OCR methods across several key metrics:

1. Multiple OCR Technologies:

Compared 10 different OCR solutions across three categories, plus Amazon Textract for comparison:

  - Traditional OCR engines (Tesseract)

  - Deep learning OCR models (EasyOCR, PaddleOCR, DocTR, Docling, KerasOCR)

  - Vision-Language Models (Qwen, Mistral, Pixtral, Gemini)

  - Amazon Textract (for comparison)

2. Standardized Dataset:

Used the FUNSD dataset of noisy scanned forms with precise annotations for consistent evaluation.

Example images from the FUNSD dataset (noisy scanned forms).

3. Multi-faceted Evaluation:

Assessed performance using complementary metrics including text similarity, word error rate (WER), character error rate (CER), common word accuracy, and processing time.

4. Ground Truth Validation:

Generated ground truth from both dataset annotations and VLM (Gemini 2.5 with reflection) outputs to provide multiple reference points.

Step-by-Step Process

1. Dataset Preparation

The benchmark uses the FUNSD (Form Understanding in Noisy Scanned Documents) dataset, which consists of noisy scanned forms with annotations for text, layout, and form understanding tasks.


You can find the complete, reproducible sampling process for the FUNSD dataset on GitHub.

This ensures consistent testing across OCR methods and enables others to replicate the benchmark using the same document set I used.
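To give an idea of what that sampling can look like, here is a minimal sketch; the seed, sample size, and directory layout are illustrative assumptions, not the repository's exact values:

```python
import random
from pathlib import Path

SEED = 42          # fixed seed so the same forms are drawn on every run (illustrative)
SAMPLE_SIZE = 20   # number of documents to benchmark (illustrative)

# Assumed layout of the downloaded FUNSD archive.
images = sorted(Path("dataset/testing_data/images").glob("*.png"))

random.seed(SEED)
sampled_images = random.sample(images, SAMPLE_SIZE)
```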

2. Ground Truth Generation

The benchmark combines one annotation-based ground truth source with one VLM-based high-performance reference:

Annotation-based ground truth extracted from the FUNSD dataset:
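A minimal sketch of that extraction, based on the public FUNSD annotation schema (each JSON file has a `form` list whose entries carry a `text` field); the directory path is an assumption:

```python
import json
from pathlib import Path


def funsd_ground_truth(annotation_path: Path) -> str:
    """Join the text of every annotated block in a FUNSD annotation file."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        annotation = json.load(f)
    # Each entry under "form" is a labelled text block with its own "text" field.
    return "\n".join(item["text"] for item in annotation["form"] if item["text"])


# Assumed location of the annotation files in the downloaded dataset.
annotations_dir = Path("dataset/testing_data/annotations")
ground_truths = {p.stem: funsd_ground_truth(p) for p in sorted(annotations_dir.glob("*.json"))}
```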

VLM-based ground truth using high-performance models like Gemini 2.5 with reflection:
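As a rough illustration only (not the exact prompt, client, or model used in the repo), such a reference can be generated through OpenRouter's OpenAI-compatible API; the model identifier and prompt below are assumptions:

```python
import base64

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible endpoint

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder
)


def vlm_reference(image_path: str, model: str = "google/gemini-2.5-pro-preview") -> str:
    """Ask a VLM to transcribe a scanned form; the model slug is an assumption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all the text in this scanned form, top to bottom. Return plain text only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```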

The VLM-based approach generates more structured and sometimes more complete text extractions, though with a risk of hallucination.

This dual approach allows evaluation against both human-annotated data and state-of-the-art VLM interpretations. The VLM approach often captures more context and formatting, while annotation-based ground truth tends to be more direct.

3. OCR Implementation

I created a modular framework for running different OCR methods with a consistent interface:
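The snippet below is a simplified sketch of that idea rather than the repository's actual code: a registry maps each method name to a function that takes an image path and returns extracted text, with Tesseract and EasyOCR shown as examples:

```python
from typing import Callable, Dict

import pytesseract
from PIL import Image

# Registry mapping a method name to a function: image path -> extracted text.
OCR_METHODS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds an OCR function to the registry under a given name."""
    def wrapper(fn: Callable[[str], str]) -> Callable[[str], str]:
        OCR_METHODS[name] = fn
        return fn
    return wrapper


@register("tesseract")
def tesseract_ocr(image_path: str) -> str:
    return pytesseract.image_to_string(Image.open(image_path))


@register("easyocr")
def easyocr_ocr(image_path: str) -> str:
    import easyocr  # imported lazily so the dependency stays optional
    reader = easyocr.Reader(["en"], gpu=False)
    return "\n".join(reader.readtext(image_path, detail=0))
```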

Each method follows the same pattern, allowing straightforward comparison and extension to new OCR technologies.

4. Benchmark Execution

The benchmark runs all OCR methods against the sample dataset:
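A stripped-down version of that runner, reusing the `OCR_METHODS` registry and `sampled_images` list sketched above (again an approximation, not the repo's exact code):

```python
import time


def run_benchmark(image_paths, methods):
    """Run every OCR method on every image, recording the extracted text and timing."""
    results = {name: [] for name in methods}
    for name, ocr_fn in methods.items():
        for image_path in image_paths:
            start = time.perf_counter()
            text = ocr_fn(str(image_path))
            elapsed = time.perf_counter() - start
            results[name].append(
                {"image": str(image_path), "text": text, "seconds": elapsed}
            )
    return results


benchmark_results = run_benchmark(sampled_images, OCR_METHODS)
```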

This processes each image with every specified OCR method and measures execution time.

5. Comprehensive Evaluation

I evaluated each OCR method using multiple complementary metrics:

- Text Similarity: Overall textual similarity using difflib's SequenceMatcher

- Word Error Rate (WER): Word-level edit distance normalized by reference length

- Character Error Rate (CER): Character-level edit distance for finer-grained assessment

- Common Word Accuracy: Percentage of reference words present in the OCR output

- Processing Time: Execution time per image

Here's how each metric is implemented:
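The sketch below approximates those metrics: `difflib.SequenceMatcher` is what the article uses for text similarity, while relying on the `jiwer` package for WER/CER is my assumption (any normalized edit-distance implementation would do):

```python
import difflib

import jiwer  # assumed helper for WER/CER; a hand-rolled edit distance works too


def text_similarity(reference: str, hypothesis: str) -> float:
    """Overall similarity between reference and OCR output, in [0, 1]."""
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance normalized by the reference length (lower is better)."""
    return jiwer.wer(reference, hypothesis)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance normalized by the reference length (lower is better)."""
    return jiwer.cer(reference, hypothesis)


def common_word_accuracy(reference: str, hypothesis: str) -> float:
    """Share of reference words found anywhere in the OCR output, order ignored."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return len(ref_words & hyp_words) / len(ref_words) if ref_words else 0.0
```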

These metrics provide different perspectives on OCR quality:

- Text similarity gives an overall view of how close the extracted text is to the ground truth

- WER focuses on correctly identified words (with position/order)

- CER provides character-level accuracy for fine-grained analysis

- Common word accuracy shows how well key terms are captured regardless of order

Results

The benchmark revealed several important insights about the OCR landscape:

Key Findings

1. As expected, VLMs Outperform Traditional OCR for Accuracy

  - VLM models (particularly Qwen and Mistral) achieved text similarity scores up to 3-4 times higher than traditional OCR methods on complex scanned documents.

  - VLMs demonstrated superior performance on documents with complex layouts, handwriting, or poor scan quality.

2. Performance Trade-offs

  - While VLMs delivered higher accuracy, they had significantly longer processing times (5-10x slower than traditional OCR engines).

  - Deep learning OCR methods like PaddleOCR and EasyOCR offered a middle ground with better accuracy than Tesseract and faster processing than VLMs.

3. Error Pattern Differences

  - Traditional OCR methods struggled with layout interpretation, often failing to properly follow multi-column formats. This is expected, as they were not trained to detect layout.

  - VLMs excelled at contextual understanding, correctly interpreting forms with tables, checkboxes, and mixed formatting.

  - Character-level errors were most common in traditional OCR, while VLMs occasionally hallucinated text or made semantic interpretation errors.

4. Method-Specific Strengths

  - Tesseract: Fast but struggled with complex layouts

  - PaddleOCR: Good balance of speed and accuracy

  - Qwen-VLM: Highest overall accuracy but slowest processing time

  - Mistral-VLM: Strong layout understanding with competitive accuracy

Visual Comparison

This visualization shows how each OCR method performs across text similarity, word error rate, character error rate, and common word accuracy. The VLM models (particularly Qwen and Mistral) consistently show higher accuracy metrics than traditional OCR methods.

Processing Time Analysis

Take these timings with a grain of salt: I sometimes ran on a T4 GPU and sometimes on Apple MPS.

For the VLMs, I used OpenRouter with no attempt to optimize inference time, so the routing was not always the best.

For time-sensitive applications, the processing time differences are significant:

- Traditional OCR (Tesseract): ~0.5 seconds per image

- Deep Learning OCR (EasyOCR, PaddleOCR, DocTR, Docling): 1-3 seconds per image

- VLM-based OCR (Qwen, Mistral, Pixtral, Gemini): 5-15 seconds per image

(These numbers are highly biased: Pixtral's inference time in particular suffered because no provider offered fast inference for it and I didn't want to host it myself.)

When to Choose Each Approach

Based on the benchmark results, here are my recommendations for selecting an OCR solution:

1. High-Volume, Basic Documents, Very Low Cost: Traditional OCR engines like Tesseract remain the best choice for processing large volumes of simple, well-formatted documents where speed and cost are critical. No expensive backend, no GPU, no fancy models needed.

2. Balanced Requirements: Deep learning OCR methods like PaddleOCR and DocTR offer a good compromise between accuracy and processing time for most business applications, and with fine-tuning they can be even better for the right use case.

3. Complex Documents with High-Value Information: VLM-based approaches justify their higher cost and processing time when working with complex documents where accuracy is paramount, such as legal contracts, medical records, or financial statements.

There are three main drawbacks to VLM-based approaches:

- Cost, which can be mitigated by choosing the right model and provider (Gemini 2.0 is cheap, accurate, and quick) or by self-hosting (Qwen2.5 72B, for example).

- The non-deterministic nature of their output.

- Data privacy concerns, which can be mitigated by self-hosting the model (though the GPU infrastructure required is generally more costly).

4. Handwritten Content: VLMs significantly outperform traditional OCR for handwritten text, making them the clear choice for documents with substantial handwritten components.

Future Work

This naive and very humble benchmark provides a foundation for OCR technology selection, but several areas warrant further investigation:

1. Domain-Specific Testing: Expanding the benchmark to industry-specific document types (invoices, medical records, etc.)

2. Multi-Language Evaluation: Assessing OCR performance across different languages and scripts

3. Cost-Benefit Analysis: Developing a framework to balance accuracy improvements against increased processing costs for VLM approaches

Connect With Me

If you're working on document digitization challenges or have any insights about this, I'd love to connect. Drop a comment below or reach out directly to discuss.

For those interested in running the benchmark themselves or extending it to include additional OCR methods, the complete toolkit is available on GitHub at OCR Benchmarking.
