Expected duration: 1 day or less
The objective is to build a structured blood test database that allows pathology results to be viewed, edited, filtered, and exported to Excel via a web-based HTML interface. The system stores results in a clean, standardised format so trends can be analysed accurately over time.
Using AI-assisted OCR, I have built a local Python extraction pipeline that converts PDF pathology reports into machine-readable text and inserts structured data into a SQLite database. The majority of blood tests extract correctly, including canonical test name, result value, unit, and reference range.
However, I have reached a specific technical issue with three markers:
• CRP (C-reactive protein)
• ESR
• GLU (Glucose)
The OCR output clearly contains the correct lines, and debug logs confirm they are processed. Yet no rows are inserted for these markers.
The failure appears to occur at one of three stages: canonical matching, numeric extraction, or validation.
Current System Architecture
The system runs locally and consists of:
• extraction_core_2.py (main engine)
• Supporting modules for OCR preprocessing, lab dictionary building, regex matching, and validation
• SQLite backend
• Schema-driven canonical lab dictionary
• Controlled fuzzy fallback logic
• HTML viewer for results display and Excel export
Pipeline flow:
1. Convert PDF to image (pdf2image)
2. Preprocess
3. Run Tesseract OCR
4. Clean and normalise text
5. Match against canonical lab dictionary
6. Extract:
   • canonical test name
   • numeric result
   • unit
   • reference range
7. Validate
8. Insert into SQLite
The engine is deterministic and rule-based.
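For reference, the final insert step can be sketched as below. This is only an illustration under stated assumptions: the table name `lab_results`, its columns, and the function `insert_results` are hypothetical and are not taken from extraction_core_2.py. The point is that the "Inserted N rows" figure should come from the database cursor, so a count of 0 means no row ever reached this step.

```python
import sqlite3


def insert_results(conn, source_file, rows):
    """Insert validated rows and return how many were written.

    `rows` is a list of (test_name, value, unit, ref_range) tuples.
    The schema below is a hypothetical stand-in for the real one.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS lab_results (
               source_file TEXT NOT NULL,
               test_name   TEXT NOT NULL,
               value       REAL NOT NULL,
               unit        TEXT,
               ref_range   TEXT
           )"""
    )
    cur = conn.executemany(
        "INSERT INTO lab_results (source_file, test_name, value, unit, ref_range) "
        "VALUES (?, ?, ?, ?, ?)",
        [(source_file, *row) for row in rows],
    )
    conn.commit()
    # For executemany, sqlite3 sums the modifications into rowcount.
    return cur.rowcount
```

Running it against an in-memory database (`sqlite3.connect(":memory:")`) is a quick way to confirm that the insert itself works when it is handed a well-formed row.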
The Specific Problem
Example OCR line:
CRP H 5.2 mg/L 0-5
OCR text is correct. NUMBER_PATTERN matches. The canonical dictionary contains the test.
Yet:
Inserted 0 rows from 0126251OrderReport_23B00006604_CRP.pdf
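One quick check is to parse the example line with a pattern that explicitly allows an optional H/L flag between the test name and the value. The sketch below is an assumption-laden illustration: `LINE_PATTERN` and `parse_line` are hypothetical names, and the pattern is not the `NUMBER_PATTERN` used in the real engine. If this parses the line while the engine does not, the flag token or the anchoring of the numeric capture is a likely suspect.

```python
import re

# Hypothetical line pattern for illustration only; it is NOT the
# NUMBER_PATTERN from extraction_core_2.py.
LINE_PATTERN = re.compile(
    r"""
    ^\s*
    (?P<name>[A-Za-z][A-Za-z0-9 ()\-]*?)        # test name, matched lazily
    \s+
    (?:(?P<flag>[HL])\s+)?                      # optional abnormal flag, H or L
    (?P<value>\d+(?:\.\d+)?)                    # numeric result
    \s+
    (?P<unit>[A-Za-z%/]+)                       # unit such as mg/L or mmol/L
    \s+
    (?P<ref>\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?)  # reference range, low-high
    \s*$
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "flag": m.group("flag"),  # None when no flag is present
        "value": float(m.group("value")),
        "unit": m.group("unit"),
        "ref": m.group("ref"),
    }


print(parse_line("CRP H 5.2 mg/L 0-5"))
```

Because the flag is captured in its own optional group, the same pattern handles lines with and without a flag, and the flag never ends up inside the name or the value.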
Likely failure points include:
• Canonical containment match failing due to normalisation
• Flag tokens (“H”, “L”) interfering with numeric capture
• Numeric extraction anchored incorrectly
• Validation rejecting due to strict range formatting
• Unit pattern mismatch (e.g. mmol/L)
• Dictionary indexing issue
• Match overridden by another lab name
• Guard conditions too strict
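To narrow these candidates down, each stage can be run separately on a single OCR line to see which one first returns nothing. The sketch below shows the idea under loud assumptions: `CANONICAL`, the three patterns, `trace_line`, and `first_failure` are all hypothetical stand-ins, and the real dictionary and regexes from the project would need to be substituted to reproduce the actual failure.

```python
import re

# Hypothetical stand-ins for the project's dictionary and patterns.
CANONICAL = {"crp": "C-reactive protein", "esr": "ESR", "glu": "Glucose"}
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")
UNIT_PATTERN = re.compile(r"(?<!\S)(?:mg/L|mmol/L|mm/hr)(?!\S)", re.IGNORECASE)
RANGE_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?")


def trace_line(line):
    """Run each stage in order and record (stage, ok, detail) for each."""
    text = re.sub(r"\s+", " ", line).strip()
    tokens = text.lower().split()
    key = next((t for t in tokens if t in CANONICAL), None)
    number = NUMBER_PATTERN.search(text)
    unit = UNIT_PATTERN.search(text)
    # Only look for the range after the unit, so the result value
    # cannot be mistaken for part of the range.
    ref = RANGE_PATTERN.search(text[unit.end():]) if unit else None
    return [
        ("canonical", key is not None, key),
        ("number", number is not None, number.group() if number else None),
        ("unit", unit is not None, unit.group() if unit else None),
        ("range", ref is not None, ref.group() if ref else None),
    ]


def first_failure(line):
    """Return the name of the first stage that produced nothing, or None."""
    for stage, ok, _detail in trace_line(line):
        if not ok:
            return stage
    return None


for stage, ok, detail in trace_line("CRP H 5.2 mg/L 0-5"):
    print(f"{stage:10} {'ok' if ok else 'FAIL':4} {detail}")
```

Feeding the exact OCR lines for CRP, ESR, and GLU through a tracer like this, with the production patterns swapped in, should point at a single stage instead of eight possible causes.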
If validation fails, the row is rejected silently.
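Silent rejection is what makes this hard to debug, so one practical change is to have validation return its reasons and log them. The following is a minimal sketch, assuming the validator receives the four extracted fields; the function names, the logger name, and the specific rules (including the strict low-high range format) are placeholders and not the project's actual checks.

```python
import logging
import re
from typing import List, Optional

log = logging.getLogger("extraction.validation")  # hypothetical logger name

# Placeholder rule: a range must look like "low-high", e.g. "0-5".
STRICT_RANGE = re.compile(r"^\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?$")


def rejection_reasons(
    name: Optional[str],
    value: Optional[float],
    unit: Optional[str],
    ref_range: Optional[str],
) -> List[str]:
    """Return every reason a candidate row would be rejected (empty if valid)."""
    reasons = []
    if not name:
        reasons.append("missing canonical name")
    if value is None:
        reasons.append("missing numeric result")
    if not unit:
        reasons.append("missing unit")
    if not ref_range or not STRICT_RANGE.match(ref_range.strip()):
        reasons.append(f"unrecognised reference range: {ref_range!r}")
    return reasons


def is_valid(source_file, name, value, unit, ref_range) -> bool:
    """Validate a row, logging each rejection reason instead of failing silently."""
    reasons = rejection_reasons(name, value, unit, ref_range)
    for reason in reasons:
        log.warning("%s: rejected %r: %s", source_file, name, reason)
    return not reasons
```

With `logging.basicConfig(level=logging.WARNING)` enabled, every dropped row then leaves a line in the log naming the file, the marker, and the failed check.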
All other panels extract correctly. The issue appears isolated.
This is a focused debugging and refinement request. I have spent many hours attempting to isolate the issue and now require an experienced developer to identify the blocking condition and implement a practical fix.
I have been advised this should take 1–2 hours for a senior developer.