Open Source

Scans to structured text

Transform scanned books into structured, searchable digital text with AI-powered OCR and intelligent document analysis.

Processing Pipeline

📷

OCR Pages

Extract text from scanned page images using vision AI models

mistral
Extract text using Mistral vision model
olm
Extract text using OLM OCR model
paddle
Extract text using PaddleOCR
blend
Combine OCR outputs into best-quality text
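
As a rough illustration of what a blend step could look like, the sketch below takes a per-line majority vote across the three providers. This is an assumption made for illustration only: the function name `blend_lines` is hypothetical, the sketch assumes every provider returns the same line breaks, and the actual blend stage may work quite differently (for example, by asking an LLM to reconcile the outputs).

```python
from collections import Counter


def blend_lines(outputs: dict[str, list[str]]) -> list[str]:
    """Pick, for each line position, the text most providers agree on.

    `outputs` maps a provider name (e.g. "mistral") to its list of lines.
    Assumes the providers' lines are already aligned by position.
    """
    n_lines = max(len(lines) for lines in outputs.values())
    blended = []
    for i in range(n_lines):
        # Collect every provider's version of line i, skipping short outputs.
        candidates = [lines[i].strip() for lines in outputs.values() if i < len(lines)]
        # On a tie, most_common returns the first-seen candidate,
        # so providers listed earlier in `outputs` win.
        winner, _count = Counter(candidates).most_common(1)[0]
        blended.append(winner)
    return blended


page = {
    "mistral": ["The quick brown fox", "jumps over"],
    "olm": ["The quick brown fox", "jumps ovcr"],
    "paddle": ["The qu1ck brown fox", "jumps over"],
}
print(blend_lines(page))
```

A vote like this only repairs errors that a single provider makes; when all providers disagree, it simply falls back to the first one.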
🏷️

Label Structure

Classify content blocks as body text, headers, footnotes, or page numbers

mechanical
Extract patterns using regex and heuristics
unified
Classify page elements with vision LLM
gap_analysis
Identify pages with missing or uncertain labels
agent_healing
Fix classification gaps using LLM agent
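
The mechanical step can be pictured as a handful of regular expressions applied to each text block. The patterns and the `label_block` helper below are hypothetical and deliberately crude; they are a minimal sketch of regex-based labeling, not the rules this project actually ships.

```python
import re

# Crude patterns: a bare arabic or roman numeral, a "Chapter N" style heading,
# and a line that opens with a footnote marker followed by text.
PAGE_NUMBER = re.compile(r"^\s*(\d{1,4}|[ivxlcdm]+)\s*$", re.IGNORECASE)
HEADER = re.compile(r"^\s*(chapter|part|book)\s+([ivxlcdm]+|\d+)\b", re.IGNORECASE)
FOOTNOTE = re.compile(r"^\s*(\d{1,3}|[*†‡])[.)]?\s+\S")


def label_block(text: str) -> str:
    """Return one of: page_number, header, footnote, body."""
    if PAGE_NUMBER.match(text):
        # Note: the roman-numeral branch also matches short words like "mix".
        return "page_number"
    if HEADER.match(text) or (text.isupper() and len(text.split()) <= 8):
        return "header"
    if FOOTNOTE.match(text):
        return "footnote"
    return "body"


for block in ["42", "CHAPTER IV", "1. See the preface.", "It was a bright cold day."]:
    print(f"{label_block(block):12} {block}")
```

Heuristics like these are cheap but ambiguous (a sentence beginning "3 men walked" would be labeled a footnote), which is where a vision LLM pass and gap analysis become useful.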
📑

Extract ToC

Identify and extract the table of contents from OCR output

find
Locate ToC pages using vision agent
extract
Parse ToC entries from identified pages
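
To make the extract step concrete, here is a minimal sketch that parses ToC lines of the form "title, dot leader, page number". The `TocEntry` class, the `TOC_LINE` pattern, and `parse_toc` are hypothetical names chosen for illustration, and the sketch assumes the ToC text has already been OCR'd into one entry per line.

```python
import re
from dataclasses import dataclass


@dataclass
class TocEntry:
    title: str
    printed_page: str  # kept as text because front matter often uses roman numerals


# Title, optional dot leader, then an arabic or lowercase-roman page number.
# Crude: a title ending in a roman-looking lowercase word would be misread.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.·…]*\s(?P<page>\d+|[ivxlcdm]+)\s*$")


def parse_toc(lines: list[str]) -> list[TocEntry]:
    """Return an entry for every line that looks like 'title ... page'."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append(TocEntry(match["title"].strip(), match["page"]))
    return entries


toc_page = [
    "Contents",
    "Preface ix",
    "Chapter 1: The Beginning ........ 12",
    "",
    "Chapter 2: The Journey . . . 34",
]
for entry in parse_toc(toc_page):
    print(entry)
```

Lines without a trailing page number, such as the "Contents" heading, are skipped.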
🔗

Link ToC

Map table of contents entries to their corresponding page numbers

find_entries
Locate each ToC entry in page content
pattern
Analyze heading patterns to find candidates
evaluation
Evaluate candidate headings with vision LLM
merge
Merge results into enriched ToC
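
Linking a ToC entry to a page is essentially a fuzzy-matching problem, because OCR'd headings rarely match the ToC text exactly. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the `link_entry` function, the `headings` mapping, and the 0.8 threshold are assumptions for illustration, not this project's actual matching logic.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def link_entry(
    title: str, headings: dict[int, list[str]], threshold: float = 0.8
) -> Optional[int]:
    """Return the scan page whose heading best matches `title`, or None.

    `headings` maps a scan page number to the candidate headings found on it.
    """
    best_page, best_score = None, 0.0
    for page, candidates in sorted(headings.items()):
        for heading in candidates:
            score = SequenceMatcher(None, normalize(title), normalize(heading)).ratio()
            if score > best_score:
                best_page, best_score = page, score
    # Below the threshold the match is too weak to trust.
    return best_page if best_score >= threshold else None


headings = {12: ["CHAPTER 1", "THE BEGINNlNG"], 34: ["The Journey"]}
print(link_entry("The Beginning", headings))  # tolerates the OCR error "l" for "I"
print(link_entry("Epilogue", headings))
```

Entries that fall below the threshold are the natural candidates to hand to a vision LLM for evaluation.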
🏗️

Build Structure

Assemble unified document structure with chapter text and metadata

build_structure
Build document skeleton from ToC and detected headings
polish_entries
Extract and polish chapter text with LLM (parallel)
merge
Merge polished entries into final structure.json
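
Polishing chapters in parallel might be sketched with a thread pool, as below. The `polish_chapter` function here is a stand-in that only rejoins hyphenated words and collapses whitespace; in the real pipeline this step calls an LLM, and all names in the sketch are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def polish_chapter(raw: str) -> str:
    """Placeholder for an LLM call: rejoin words split across lines, tidy spaces."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return " ".join(text.split())


def polish_all(chapters: dict[str, str], max_workers: int = 4) -> dict[str, str]:
    """Polish every chapter concurrently and return results keyed by title."""
    titles = list(chapters)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so titles and texts stay aligned.
        polished = list(pool.map(polish_chapter, (chapters[t] for t in titles)))
    return dict(zip(titles, polished))


chapters = {
    "One": "It was a dark and stor-\nmy night.\n",
    "Two": "The   end.",
}
print(polish_all(chapters))
```

Threads suit this step because the time is spent waiting on network calls to a model rather than on local computation.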
📖

Generate Output

Create ePub files, audiobook scripts, or structured API output

generate_epub
Build and validate ePub 3.0 file

Features

🤖

AI-Powered OCR

Three OCR engines (Mistral, OLM, and PaddleOCR) each extract text from scanned pages, and a blend step combines their outputs into the best-quality text.

🧠

Intelligent Structure Detection

Automatically classify content blocks as headers, body text, footnotes, or page numbers using regex heuristics and a vision LLM.

📚

Table of Contents Extraction

Locate and extract the table of contents, then link each entry to its page to create a navigable document structure.

🔧

Self-Healing Pipeline

Gap analysis flags pages with missing or uncertain labels, and an LLM agent then repairs those gaps in document classification and structure.

📱

Multiple Output Formats

Generate ePub files, structured JSON, or audiobook scripts from your processed documents.

Parallel Processing

Stages and chapters are processed in parallel where possible; for example, chapter text is polished concurrently during the Build Structure stage.

Quick Start

$ git clone https://github.com/jackzampolin/shelf.git
$ cd shelf
$ pip install -r requirements.txt
$ python -m shelf.cli process /path/to/scanned/pages
✓ Processing complete! Generated structure.json and output.epub