Paper Title
HANDWRITTEN DOCUMENT ANALYSIS USING OCR AND TRANSFORMER-BASED MODEL
Abstract
Detecting AI-generated content in handwritten academic submissions presents a unique challenge. Most existing detection tools assume clean digital text and cannot operate directly on OCR-extracted output. This paper proposes an end-to-end pipeline that processes a handwritten document and detects AI-generated content. Scanned PDF documents are first digitized using Tesseract. To assess extraction quality, a three-metric validation is applied, comprising Character Error Rate (CER), word coverage, and layout score. A LangChain-based prompt using GPT-4o is then used to remove residual OCR noise, producing cleaned text for further analysis. Finally, a fine-tuned DistilBERT model is applied to the cleaned text to predict both a document-level AI percentage and a sentence-level binary classification indicating whether each sentence is human-written or AI-generated. The proposed system is intended to determine whether LLM-assisted post-correction through LangChain significantly improves extraction quality, and whether AI content detection is feasible on digitized handwritten documents.
Keywords - Optical Character Recognition, AI Content Detection, DistilBERT, LangChain
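As an illustration of the extraction-quality metrics named in the abstract, the following minimal sketch computes CER as the Levenshtein edit distance between a reference transcription and the OCR output, divided by the reference length. The abstract does not define word coverage, so the version shown here (the fraction of reference words that appear in the OCR output) is an assumption made for illustration only. The layout score is omitted because its definition is not given. All function names are hypothetical and are not taken from the proposed system.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of the
    # reference and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def word_coverage(reference: str, hypothesis: str) -> float:
    """Assumed definition: share of reference words found in the OCR output."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    hyp_words = set(hypothesis.lower().split())
    return sum(word in hyp_words for word in ref_words) / len(ref_words)
```

Under these assumptions, a lower CER and a higher word coverage both indicate better extraction. Both metrics require a manually transcribed reference text for each scanned page.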