Why Arabic OCR still fails, and how we fixed it

TL;DR

· The Arabic alphabet is not the problem. Ligatures, diacritics, and digit direction are.
· Mixed-direction layouts are the real killer: a top commercial engine scored 71% field-level accuracy on real Gulf documents.
· Three pipeline changes took field-level accuracy from 71% to 99.2%.
· Everything runs on-prem, on the customer's own servers. The constraint forced a better pipeline.

A typical Gulf tax document is a small act of typographic chaos. The vendor name runs right-to-left in Arabic, the line items are bilingual, the registration number marches left-to-right, and the totals sit in a table whose columns flip direction halfway down the page. To a human, it is obvious. To most OCR engines, it is noise.

We learned this the hard way. Our first pipeline used a best-in-class commercial OCR API, the kind that tops English benchmarks. On clean English receipts it was flawless. On the first batch of real Gulf documents, it returned 71% field-level accuracy. For a back office, 71% is not "mostly working." It means roughly one in three documents needs a human to re-key it, which defeats the entire point.

The problem isn't the alphabet

It is tempting to assume Arabic is hard because it has different letters. It doesn't work that way. The alphabet is small: 28 base letters. The difficulty is that each letter changes shape depending on where it sits in a word, and adjacent letters fuse into ligatures that share strokes. English OCR leans heavily on the gaps between characters. Arabic gives it far fewer.
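To make the "one letter, several shapes" point concrete, here is a small illustration using Unicode's Arabic Presentation Forms-B block, which encodes each contextual shape as its own compatibility code point. It is not part of the pipeline described in this post. The helper name `base_letters` is made up for the example.

```python
import unicodedata

# The four contextual shapes of the letter BEH (base code point U+0628),
# as encoded in Unicode's Arabic Presentation Forms-B block.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letters(glyph: str) -> str:
    """Fold a presentation form or ligature back to its base letter(s)."""
    return unicodedata.normalize("NFKC", glyph)


for position, glyph in BEH_FORMS.items():
    # Four distinct shapes on the page, one underlying letter.
    print(position, unicodedata.name(glyph), "->", unicodedata.name(base_letters(glyph)))

# A ligature is a single glyph standing for two letters: LAM followed by ALEF.
LAM_ALEF = "\uFEFB"
print([unicodedata.name(ch) for ch in base_letters(LAM_ALEF)])
```

A recognizer sees the shapes, but downstream systems want the base letters, so the mapping from many glyphs to few letters is part of the recognition problem.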

Mixed direction is where it really breaks

Even when the characters are read correctly, layout destroys them. Gulf documents are full of bidirectional tables. A description column flows right-to-left, the quantity and price columns are left-to-right, and the engine has to decide reading order before it can decide what belongs to which row. Get the order wrong and you get something worse than garbage: a plausible number attached to the wrong field.
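One way to picture the row-assignment problem is to ignore text direction entirely and attach cells to rows and columns by position. The sketch below is illustrative only and is not our production model: the `Cell` type, the `assign_rows` function, the column x-ranges, and the 5-unit row tolerance are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    x: float  # horizontal centre of the cell's bounding box
    y: float  # vertical centre of the cell's bounding box


def assign_rows(cells, columns, row_tolerance=5.0):
    """Group cells into rows by y, then name each cell by the column range containing its x.

    columns maps a field name to an (x_min, x_max) range on the page.
    """
    rows = []
    for cell in sorted(cells, key=lambda c: c.y):
        # Start a new row when the cell sits too far below the current row's first cell.
        if rows and abs(cell.y - rows[-1]["_y"]) <= row_tolerance:
            row = rows[-1]
        else:
            row = {"_y": cell.y}
            rows.append(row)
        for name, (x_min, x_max) in columns.items():
            if x_min <= cell.x < x_max:
                row[name] = cell.text
                break
    # Drop the internal y marker before returning.
    return [{k: v for k, v in row.items() if k != "_y"} for row in rows]


# Hypothetical layout: price and quantity on the left, Arabic description on the right.
columns = {"price": (0, 100), "quantity": (100, 200), "description": (200, 500)}
cells = [
    Cell("15.00", 50, 31),
    Cell("ورق", 350, 10),
    Cell("2", 150, 11),
    Cell("40.00", 50, 9),
    Cell("حبر", 350, 30),
    Cell("1", 150, 32),
]
rows = assign_rows(cells, columns)
print(rows)
```

Because the field name comes from where the cell sits rather than from the order in which characters were emitted, a right-to-left description cannot pull a price into the wrong row.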

A wrong number that looks right is more expensive than no number at all.

What we actually changed

1. Layout-aware segmentation first.
Before any text recognition, a vision model maps the page into directional regions, so reading order is decided from geometry, not guessed from characters.
2. An Arabic-first recognizer.
We fine-tuned an open model on hundreds of thousands of real document crops: ligatures, stamps, low-contrast scans, and margin handwriting, instead of clean synthetic text.
3. Numbers must reconcile.
Every extracted figure is checked against the document's own arithmetic. A number that does not balance is flagged, never silently posted.
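The reconciliation step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production checker: the field names (`line_totals`, `subtotal`, `tax`, `total`), the two balance rules, and the 0.01 tolerance are hypothetical. The digit handling covers Arabic-Indic digits (U+0660 to U+0669) and the Arabic decimal and thousands separators (U+066B, U+066C).

```python
from decimal import Decimal

# Map Arabic-Indic digits and the Arabic decimal separator to ASCII
# so amounts parse the same way whichever digit set the document uses.
_ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10)) + "\u066b"
_TO_ASCII = str.maketrans(_ARABIC_INDIC, "0123456789.")


def parse_amount(text: str) -> Decimal:
    """Parse an amount written in ASCII or Arabic-Indic digits."""
    cleaned = text.translate(_TO_ASCII)
    # Remove thousands separators (Arabic U+066C and ASCII comma).
    cleaned = cleaned.replace("\u066c", "").replace(",", "").strip()
    return Decimal(cleaned)


def reconcile(line_totals, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return a list of flags; an empty list means the figures balance."""
    flags = []
    if abs(sum(line_totals, Decimal(0)) - subtotal) > tolerance:
        flags.append("line items do not sum to subtotal")
    if abs(subtotal + tax - total) > tolerance:
        flags.append("subtotal + tax != total")
    return flags


# A document whose total was misread (172.50 read as 127.50) gets flagged.
lines = [parse_amount("100.00"), parse_amount("50.00")]
flags = reconcile(lines, parse_amount("150.00"), parse_amount("22.50"), parse_amount("127.50"))
print(flags)
```

Returning flags rather than raising keeps the decision with the caller: a flagged document goes to review, and nothing that fails to balance is posted automatically.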

99.2% field-level accuracy, up from 71%, measured on 4,000 held-out Gulf documents across 60 vendors, including scans, photographs, and exported PDFs.
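For readers unfamiliar with the metric, field-level accuracy can be computed along these lines. The exact-match rule and the dictionary layout below are assumptions made for illustration, not a description of our evaluation harness. The document-level variant is included to show how the two numbers differ.

```python
def field_accuracy(predictions, ground_truth):
    """Share of ground-truth fields whose extracted value matches exactly."""
    correct = total = 0
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field, value in truth.items():
            total += 1
            correct += predicted.get(field) == value
    return correct / total if total else 0.0


def document_accuracy(predictions, ground_truth):
    """Share of documents in which every ground-truth field matches."""
    if not ground_truth:
        return 0.0
    clean = sum(
        all(predictions.get(doc_id, {}).get(f) == v for f, v in truth.items())
        for doc_id, truth in ground_truth.items()
    )
    return clean / len(ground_truth)


# Hypothetical two-document example with two fields each.
truth = {
    "doc1": {"vendor": "ACME", "total": "172.50"},
    "doc2": {"vendor": "Gulf Paper", "total": "90.00"},
}
preds = {
    "doc1": {"vendor": "ACME", "total": "172.50"},
    "doc2": {"vendor": "Gulf Paper", "total": "60.00"},
}
print(field_accuracy(preds, truth), document_accuracy(preds, truth))
```

A single wrong field is enough to send a whole document to manual review, so the document-level figure is always at or below the field-level one.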

Why on-prem made it harder, and better

We could have shipped a cloud API and called it done. But teams in the region cannot send this data to a third party, so the whole pipeline had to run on the customer's own GPUs. That constraint forced a discipline that improved accuracy: every model had to be small enough to run locally, so each layer had to earn its place. The result reads an Arabic document as well as a careful human clerk, and the data never leaves the building.
