Document Extraction Named Entity Recognition March 2024:
Developed an AI model for extracting metadata from various document formats, including images (PNG, JPG) and documents (DOC), with a focus on Table of Contents extraction.
Demo Video :
Technical stack Used in the Project -
- Implemented a Jupyter Notebook (extraction_(doc,png,jpg)_to_text.ipynb) detailing the extraction process using pytesseract for image files and the open-docx library for document files.Created training and testing CSV files with attributes such as file name and extracted text to facilitate model training.
- Utilized the zero-shot prompting method outlined in the notebook using_openai.ipynb to extract required labels from files or images, assessing its effectiveness and limitations.
- Employed OpenAI chat completion with a prompting approach in one_shot_prompting.ipynb, comparing results with the zero-shot prompting method.
- Streamlined the process by integrating it into a Streamlit application, leveraging helper functions from helper_function.py, and incorporating fastapi for API calls, ensuring reliability and efficiency in document extraction.
The Github code is here