🕌 I Built a Tool to Translate Arabic Islamic Texts — Fully Offline

Why I Built This

I wanted to read kitab — classical Arabic Islamic scholarly texts — but I don’t read Arabic fluently enough to get through them on my own. These books carry centuries of knowledge: fiqh, aqeedah, tafsir, hadith commentary. Scholars have written in Arabic for over a millennium, and most of that wisdom has never been fully translated into Indonesian or English.

My copy of Al-Qawa’id Al-Arba’ by Imam Muhammad ibn Abd al-Wahhab sits as a scanned PDF. So does Ifadatul Mustafid. Most of the kitab I’ve collected are scanned images — not searchable text, just photographs of old book pages.

The existing options were frustrating:

  • Google Translate works on typed text, not on scanned PDFs
  • Commercial tools either require internet, cost money, or don’t handle Arabic well
  • Copying text from a scanned PDF is impossible — it’s just pixels

So I built Tarjim (تَرجِم — Arabic for “translate”).


What Tarjim Does

Tarjim takes a scanned Arabic PDF, runs OCR to extract the text line by line, translates each line offline, and writes the translated text back onto the same PDF — replacing the original Arabic in-place.

The result is a new PDF that looks like the original layout, but in English (or Indonesian, or other languages).

Input:  Scanned Arabic PDF  (just images of Arabic text)
            ↓
Output: Translated PDF  (English/Indonesian text in the same positions)

The entire pipeline runs offline after the initial model download. No API keys. No internet. No data leaves your machine. This was a hard requirement for me — these are religious texts, and privacy matters.


How It Works

The pipeline has five steps:

graph TD
    A[Scanned Arabic PDF] --> B[Render pages to images<br/>PyMuPDF @ 300 DPI]
    B --> C[Arabic OCR<br/>Surya OCR → text lines + bounding boxes]
    C --> D{Direct package available?}
    D -->|ar → en| E[Direct translation<br/>Argos Translate]
    D -->|ar → id / others| F[Pivot translation<br/>ar → en → id]
    E --> G[Overlay translated text<br/>PIL with white box + word-wrap]
    F --> G
    G --> H[Save as new PDF]

Step 1 — Render PDF to Images

Since the input is a scanned document (images, not text), we first render each page into a high-resolution PIL image using PyMuPDF at 300 DPI. Higher DPI means better OCR accuracy.

import pymupdf  # PyMuPDF
from PIL import Image

page = pymupdf.open("kitab.pdf")[0]
zoom = 300 / 72.0  # PDF baseline is 72 DPI
matrix = pymupdf.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=matrix)
image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

Step 2 — OCR with Surya

This is the most critical step. I tried three OCR engines before settling on Surya:

  • Tesseract — The classic. Arabic support exists, but accuracy on old scanned texts is poor. Struggles with naskh and ruq’ah scripts.
  • PaddleOCR — Better accuracy, but heavier setup and the Arabic model needed extra configuration.
  • Surya OCR — Built specifically for document-level OCR, supports Arabic out of the box, and returns text lines with precise bounding boxes. This is what I use.

Surya returns a structured prediction: for each page, you get a list of text_lines, each with .text (the recognized Arabic string) and .bbox (the pixel coordinates x1, y1, x2, y2).

from surya.foundation import FoundationPredictor
from surya.recognition import RecognitionPredictor
from surya.detection import DetectionPredictor

foundation_predictor = FoundationPredictor()
recognition_predictor = RecognitionPredictor(foundation_predictor)
detection_predictor = DetectionPredictor()

predictions = recognition_predictor([image], det_predictor=detection_predictor)
page = predictions[0]
page = predictions[0]

for line in page.text_lines:
    print(line.text)   # "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
    print(line.bbox)   # (42, 120, 890, 158)

Step 3 — Translate with Argos Translate

Argos Translate is a fully offline, open-source machine translation library backed by OpenNMT models. Once packages are downloaded, no network is needed.

The interesting engineering problem here was supporting Indonesian (Bahasa Indonesia) — there is no direct Arabic → Indonesian package in Argos. The solution: pivot translation.

I’ll explain this in detail in the next section.

Step 4 — Overlay Translated Text

For each detected text line, I:

  1. Draw a white rectangle over the original Arabic text (covering it completely)
  2. Calculate an appropriate font size based on the bounding box height
  3. Draw the translated text with word-wrapping to fit inside the same bounding box

# Cover the original Arabic
draw.rectangle([x1, y1, x2, y2], fill="white", outline="white")

# Calculate font size from bbox height
font_size = max(1, int((y2 - y1) * 0.8))
font = ImageFont.truetype(font_path, font_size)

# Word-wrap and draw translated text
draw_text_in_box(draw, translated_text, bbox, font)
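
At its core, draw_text_in_box is greedy word-wrapping. Here is a minimal sketch of that logic with a pluggable width function — the measure argument stands in for Pillow's font.getlength, and the function name is illustrative, not Tarjim's actual helper:

```python
def wrap_to_width(text, max_width, measure):
    """Greedy word-wrap: pack words into lines no wider than max_width.

    measure(s) returns the rendered pixel width of string s; with Pillow
    you would pass font.getlength. It is injected here so the wrapping
    logic can run (and be tested) without an image backend.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and measure(candidate) > max_width:
            lines.append(current)   # line is full; start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

# With a fixed-width stand-in font (8 px per character):
print(wrap_to_width("In the name of God", 80, lambda s: 8 * len(s)))
# ['In the', 'name of', 'God']
```

A word that is wider than the box on its own still gets its own line — shrinking the font at that point is what the dynamic font sizing step handles.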

There’s also a clean mode that starts from a blank white page — useful if you just want the translated text without the original scanned image behind it.

Step 5 — Save as PDF

All modified page images are saved back into a multi-page PDF using Pillow.
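
Pillow's save_all path does the heavy lifting here. A minimal sketch (the page sizes are placeholders):

```python
from io import BytesIO

from PIL import Image

# Stand-in "pages" — in Tarjim these are the rendered, overlaid page images.
pages = [Image.new("RGB", (200, 300), "white") for _ in range(3)]

buf = BytesIO()
# save_all + append_images writes every image as one page of a single PDF.
pages[0].save(buf, format="PDF", save_all=True, append_images=pages[1:])
```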


The Engineering Decision I’m Most Proud Of: Smart Translation Routing

This was the most interesting design challenge.

Argos Translate has language packages that you install individually — ar→en, en→id, en→fr, etc. There is an ar→en package (Arabic to English). But there is no ar→id package (Arabic to Indonesian).

So how do you translate Arabic to Indonesian? You chain two translations:

Arabic text → [ar→en model] → English text → [en→id model] → Indonesian text

This is called pivot translation through English.

The challenge is making this seamless. A user just wants to say --lang id and get Indonesian output. They shouldn’t need to know about pivoting.

Here’s how I implemented it:

def setup_argos_translation(from_code: str, to_code: str) -> str:
    """
    Returns 'direct' or 'pivot:en' depending on available packages.
    Auto-installs whatever is needed.
    """
    # 1. Try to find a direct package (e.g., ar→id)
    direct_package = find_package(from_code, to_code)
    if direct_package:
        install(direct_package)
        return "direct"

    # 2. No direct package — install pivot packages (ar→en + en→id)
    install(find_package(from_code, "en"))
    install(find_package("en", to_code))
    return "pivot:en"

And in translate_text(), if the pair is known to be a pivot pair, it chains two calls:

def translate_text(text, from_code="ar", to_code="id"):
    if f"{from_code}->{to_code}" in _PIVOT_PAIRS:
        # Two-step, e.g. ar → en → id
        en_text = argos.translate(text, from_code, "en")
        return argos.translate(en_text, "en", to_code)

    # Direct package exists, e.g. ar → en
    return argos.translate(text, from_code, to_code)

The system remembers which pairs need pivoting at runtime. It even auto-discovers the need: if a direct translation call fails, it automatically retries via the English pivot and marks that pair for future calls.

The selected route is reported transparently in the CLI:

$ python -m src.cli -i kitab.pdf -o kitab_id.pdf --lang id --verbose

Translation route: ar → en → id (pivot through English, no direct package available)

Translation Routing Table

Target Language    Route     Packages Needed
English (en)       Direct    ar→en
Indonesian (id)    Pivot     ar→en + en→id
Malay (ms)         Pivot     ar→en + en→ms
French (fr)        Pivot     ar→en + en→fr
German (de)        Pivot     ar→en + en→de
Spanish (es)       Pivot     ar→en + en→es
Turkish (tr)       Pivot     ar→en + en→tr

Getting Started

Installation

git clone https://github.com/scrowten/tarjim.git
cd tarjim

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt

On first run, Surya OCR models (~2–3 GB) and Argos translation packages download automatically. After that: fully offline.

Translate Arabic → English

python -m src.cli --input kitab.pdf --output kitab_en.pdf --lang en

Translate Arabic → Indonesian

python -m src.cli --input kitab.pdf --output kitab_id.pdf --lang id

The system automatically uses the ar→en→id pivot route. You don’t need to do anything special.

Web UI

uvicorn src.api:app --port 8000
# Open http://localhost:8000

Upload a PDF, pick a language, click Translate. The UI shows the translation route so you know what’s happening under the hood.

Docker

docker build -t tarjim .
docker run -p 8000:8000 tarjim

The Docker image pre-downloads both the ar→en and en→id packages at build time.

Python API

from src.core.pdf_handler import process_pdf

# Arabic → Indonesian
process_pdf(
    input_path="al-qawaid-al-arba.pdf",
    output_path="al-qawaid-al-arba_id.pdf",
    target_lang="id",
    overlay_mode="replace",  # or "clean" for white background
)

Project Structure

tarjim/
├── src/
│   ├── cli.py                    # Command-line interface
│   ├── api.py                    # FastAPI web server
│   └── core/
│       ├── pdf_handler.py        # Pipeline orchestrator
│       ├── ocr_surya.py          # Surya OCR wrapper
│       ├── translator_argos.py   # Argos Translate + pivot routing
│       └── utils.py              # Text overlay, fonts, word-wrap
├── static/
│   ├── index.html                # Web UI
│   └── styles.css
├── tests/                        # Unit tests
├── Dockerfile
└── requirements.txt

The core modules are intentionally small and single-purpose. translator_argos.py knows nothing about PDFs. ocr_surya.py knows nothing about translation. pdf_handler.py orchestrates everything.


Challenges & Lessons Learned

OCR accuracy on old texts is hard. Surya is the best open-source option available, but classical Arabic manuscript fonts are different from modern typeset text. Vowel diacritics (tashkeel) especially trip up OCR models. For perfectly clean modern typeset kitab, results are good. For old handwritten manuscripts, there’s room to improve.

Translation quality through pivoting is lossy. Chaining ar→en→id means any error in the first step gets propagated and amplified in the second. The ar→en Argos model performs reasonably well on straightforward prose, but classical Arabic religious vocabulary — tawhid, shirk, sunnah, bid’ah — often gets transliterated rather than translated, which is actually appropriate for Islamic texts.

Font fitting is a solved problem — just tedious. Getting English text to fit into the same bounding box as Arabic text is fiddly. Arabic is compact horizontally; English words for the same concept are often longer. I handle this with dynamic font sizing and word-wrapping, but very dense Arabic pages still get tight.

Lazy imports matter for usability. Early versions of the code eagerly imported PyMuPDF and Surya at the top of every file. This meant that even import src.core.utils would try to load multi-gigabyte models. Moving to lazy imports (__getattr__ on the package __init__.py) made the module usable in testing environments without all dependencies present.
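
The lazy-import trick generalizes beyond this project. Here is a self-contained sketch of the PEP 562 pattern, using stdlib json as a stand-in for a heavy model-loading module (all names are illustrative, not Tarjim's actual layout):

```python
import importlib
import types

def make_lazy_package(pkg_name, lazy_map):
    """Build a module whose listed attributes import only on first access.

    lazy_map: attribute name -> real module path to import lazily.
    In a package, the same idea is a module-level __getattr__ defined in
    __init__.py (PEP 562), so importing a light submodule never triggers
    the heavy ones.
    """
    pkg = types.ModuleType(pkg_name)

    def __getattr__(name):
        if name in lazy_map:
            mod = importlib.import_module(lazy_map[name])
            setattr(pkg, name, mod)  # cache: __getattr__ fires only once
            return mod
        raise AttributeError(f"module {pkg_name!r} has no attribute {name!r}")

    pkg.__getattr__ = __getattr__
    return pkg

# Stand-in: 'heavy' maps to stdlib json instead of a model-loading module.
core = make_lazy_package("core", {"heavy": "json"})
print(core.heavy.loads("[1, 2]"))  # [1, 2]
```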


What’s Next

The project is functional but there’s a clear path forward:

  • Better handling of dense pages — Auto-shrink font, or flow overflow text to margins
  • Bilingual output mode — Show original Arabic alongside the translation (side-by-side columns or interleaved pages)
  • Batch processing — Translate an entire folder of kitab PDFs in one command
  • Confidence scoring — Flag lines where OCR or translation confidence is low
  • Better support for tashkeel — Strip or handle vowel diacritics before OCR to improve accuracy
  • Table of contents preservation — Detect and translate structural elements, not just body text
  • Optional cloud API mode — Hook into a higher-quality translation API (OpenAI, Google) for users who prefer accuracy over privacy
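
The tashkeel idea above is simple to prototype: Arabic short vowels are Unicode combining marks, so the stdlib alone can strip them. A sketch of that preprocessing step (not yet part of the pipeline):

```python
import unicodedata

def strip_tashkeel(text: str) -> str:
    """Drop combining marks (fatha, kasra, damma, sukun, shadda, tanwin)
    while keeping the base Arabic letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(strip_tashkeel("بِسْمِ"))  # بسم
```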

Try It

The project is open source under MIT license.

GitHub: github.com/scrowten/tarjim

git clone https://github.com/scrowten/tarjim.git
pip install -r requirements.txt
python -m src.cli --input your_kitab.pdf --output translated.pdf --lang en

If you work with Arabic documents — religious texts, academic papers, historical archives — I hope this tool is useful to you. Contributions, issues, and suggestions are very welcome.


Built with Surya OCR, Argos Translate, PyMuPDF, and FastAPI.



