ChatGPT can correct text that was garbled by OCR. My feature request is a tool that can do this for uploaded texts larger than 2 MB.
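One way such a tool might cope with large files is to split the text into smaller pieces and correct each piece separately. The following is only a minimal sketch of that idea, not a description of how ChatGPT works internally. The function name `split_into_chunks` and the 200,000-byte default are hypothetical and chosen purely for illustration.

```python
def split_into_chunks(text: str, max_bytes: int = 200_000) -> list[str]:
    """Split text at blank lines into chunks of roughly max_bytes (UTF-8).

    Assumption: paragraphs are separated by a blank line. A single paragraph
    larger than max_bytes is kept whole, so a chunk can exceed the limit.
    """
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0

    for index, paragraph in enumerate(paragraphs):
        # Re-attach the separator to every paragraph except the last one,
        # so that joining the chunks reproduces the input exactly.
        is_last = index == len(paragraphs) - 1
        piece = paragraph if is_last else paragraph + "\n\n"
        if not piece:
            continue
        size = len(piece.encode("utf-8"))

        # Start a new chunk when adding this piece would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0

        current.append(piece)
        current_size += size

    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk could then be sent together with a correction prompt, and the corrected chunks concatenated in order. Splitting at paragraph boundaries keeps sentences intact, but errors that span a chunk boundary (for example, misplaced lines from a multi-column layout) would not be fixed by this approach.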
Example prompt:
You will receive text that likely contains errors introduced by Optical Character Recognition (OCR). Your primary objective is to reconstruct the text accurately, correcting only those errors clearly attributable to the OCR process. You must preserve the original meaning, structure, and formatting of the source as faithfully as possible.
Instructions:
  1. Character and Word-Level Corrections:
- Identify and correct commonly confused OCR characters (e.g., “0” mistaken for “O”, “1” for “I”, “rn” for “m”).
- Rectify misread letters or punctuation if the intended character or word is unambiguously clear from standard language use and spelling.
- If multiple plausible corrections exist, choose the most likely valid word that does not alter the text’s original meaning.
  2. Line Order and Logical Structure:
- For multi-column or complex layouts, reorder lines that have been misplaced by the OCR process so the text follows a coherent reading order.
- Preserve paragraphs, headings, bullet points, and other formatting elements. If the structure is unclear, reconstruct it logically without changing the text’s intended content.
- Retain incomplete or fragmented lines, placing them in the most sensible context rather than discarding them.
  3. Hyphenation and Word Splitting:
- Remove end-of-line hyphens used solely to indicate line breaks, recombining words correctly.
- Correct unintended internal hyphenations introduced by the OCR process.
  4. Punctuation and Typography:
- Standardize and correct punctuation marks (e.g., commas, periods, semicolons, colons, question marks, exclamation points) and ensure proper spacing.
- Replace incorrect quotation marks, dashes, and other special characters with their proper typographic equivalents, following conventional usage.
  5. Preserving Content Integrity:
- Do not modify the text’s meaning, insert new content, or omit meaningful elements.
- Align all substantive text elements coherently, even if incomplete or partially damaged, to reflect the original text’s intent as closely as possible.
Goal:
Deliver a clean, coherent, and typographically correct version of the text by fixing verifiable OCR errors. Ensure the final output remains true to the original content’s intent and structure, applying logical, context-based corrections where needed.
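Some of the rules in the prompt, such as removing end-of-line hyphens, could in principle be handled by a simple preprocessing step before the text is sent to the model. Below is a rough sketch under the assumption that a hyphen between two lowercase letters at a line break is a line-break hyphen. The function name `join_hyphenated_line_breaks` is hypothetical.

```python
import re

# Assumption: lowercase letter + hyphen + newline + lowercase letter
# marks a word that was split only because of a line break.
_LINE_BREAK_HYPHEN = re.compile(r"([a-z])-\n([a-z])")


def join_hyphenated_line_breaks(text: str) -> str:
    """Rejoin words split by an end-of-line hyphen (the newline is removed too)."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)
```

This heuristic is deliberately crude: a genuine compound such as "well-known" that happens to break at the hyphen would be merged incorrectly, which is exactly the kind of case where a context-aware correction by the model is still needed.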