Document Extraction Named Entity Recognition March 2024:

Developed an AI model for extracting metadata from various document formats, including images (PNG, JPG) and documents (DOC), with a focus on Table of Contents extraction.

Demo Video :


Technical stack Used in the Project -

  • Implemented a Jupyter Notebook (extraction_(doc,png,jpg)_to_text.ipynb) detailing the extraction process using pytesseract for image files and the open-docx library for document files.Created training and testing CSV files with attributes such as file name and extracted text to facilitate model training.
  • Utilized the zero-shot prompting method outlined in the notebook using_openai.ipynb to extract required labels from files or images, assessing its effectiveness and limitations.
  • Employed OpenAI chat completion with a prompting approach in one_shot_prompting.ipynb, comparing results with the zero-shot prompting method.
  • Streamlined the process by integrating it into a Streamlit application, leveraging helper functions from helper_function.py, and incorporating fastapi for API calls, ensuring reliability and efficiency in document extraction.

The Github code is here