StudyBuddy | Jae Kim

Study Buddy

Features of the AI Program Using Retrieval-Augmented Generation

This AI program uses Retrieval-Augmented Generation (RAG) to provide precise and relevant answers to user queries. Here's an overview of its key functionalities:

1. Data Sourcing and Question Answering

The program utilizes the RAG model to locate the necessary data and generate responses. It supports two modes of data sourcing:

Local Database Mode:
- The program scans a designated folder (gemini_folder) to create a persistent database from PDF files.
- This process needs to be run only once unless additional files are added to the folder. When new PDFs are added, the database can be updated by rerunning the scan.
- Answers are based solely on the scanned database. If the required information is absent, the program will not provide a response.

Web Database Mode:
- The program attempts to answer questions by first checking the local database.
- If the relevant data is not found locally, it switches to sourcing information from the world wide web.
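
The two modes above can be sketched as a single routing function. This is a minimal illustration, not the program's actual code: the `Retriever` type, `answer`, `demo_local`, and `demo_web` are hypothetical names, and a retriever is assumed to return `None` when it finds nothing relevant.

```python
from typing import Callable, Optional

# Hypothetical signature: a retriever takes a question and returns an answer,
# or None when it finds nothing relevant.
Retriever = Callable[[str], Optional[str]]


def answer(question: str, local: Retriever, web: Optional[Retriever] = None) -> Optional[str]:
    """Answer from the local database first; use the web only as a fallback."""
    local_answer = local(question)
    if local_answer is not None:
        return local_answer
    if web is None:
        # Local Database Mode: there is no web fallback, so no response is given.
        return None
    # Web Database Mode: the local lookup came up empty, so ask the web source.
    return web(question)


# Stand-in retrievers, for illustration only.
def demo_local(question: str) -> Optional[str]:
    notes = {"what is rag?": "Retrieval-Augmented Generation (from local PDFs)."}
    return notes.get(question.lower())


def demo_web(question: str) -> Optional[str]:
    return f"Web result for: {question}"
```

Calling `answer(question, demo_local)` behaves like Local Database Mode, while `answer(question, demo_local, demo_web)` behaves like Web Database Mode.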

2. Quiz Generation

The program can generate:

- Multiple-choice quizzes
- Open-ended question quizzes

These quizzes are crafted using the data from the local database, offering a personalized and targeted learning experience.
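
One way to represent the two quiz types in code is sketched below. The class and function names (`MultipleChoiceQuestion`, `OpenQuestion`, `score`) are assumptions made for illustration; the README does not describe how the program stores or grades quizzes.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceQuestion:
    prompt: str
    choices: list[str]
    correct_index: int  # position of the right answer within `choices`

    def is_correct(self, picked_index: int) -> bool:
        return picked_index == self.correct_index


@dataclass
class OpenQuestion:
    prompt: str
    reference_answer: str  # model answer drawn from the local database


def score(questions: list[MultipleChoiceQuestion], picks: list[int]) -> float:
    """Return the fraction of multiple-choice questions answered correctly."""
    if not questions:
        return 0.0
    correct = sum(q.is_correct(p) for q, p in zip(questions, picks))
    return correct / len(questions)
```

Open-ended questions carry only a reference answer here, since grading free text would need the language model itself.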

3. Summarization Tool

The program summarizes topics based on queries using the information in the local database. This feature provides concise, context-specific overviews for quick reference.

4. Personality Switch Mode

A unique "Switch Mode" transforms the AI's personality into a sassy mom for entertaining and conversational responses. This mode adds a playful twist to the user experience without compromising the AI's functionality.
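
A persona switch like this is often implemented by changing only the instruction text sent to the model, which leaves retrieval untouched. The sketch below assumes that approach; the persona strings and `build_prompt` are hypothetical and not taken from the program.

```python
# Hypothetical persona instructions; the program's real wording is not documented here.
DEFAULT_PERSONA = "You are a helpful study assistant. Answer using only the provided context."
SASSY_MOM_PERSONA = "You are a sassy mom. Stay playful, but answer using only the provided context."


def build_prompt(question: str, context: str, switch_mode: bool = False) -> str:
    """Assemble the model prompt; only the persona line depends on Switch Mode."""
    persona = SASSY_MOM_PERSONA if switch_mode else DEFAULT_PERSONA
    return f"{persona}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Because the context and question are passed through unchanged, the answer content stays grounded in the same data in either mode.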

These features make the program versatile, user-friendly, and adaptable to various use cases, from information retrieval and summarization to interactive learning and playful interactions.

Python Version 3.12.3

PyCharm Community Edition 2024.2.4

You will need a ".env" file, formatted as

GOOGLEAI_API_KEY= your api key

Replace "your api key" with your own Google API key.

To get a Google API key, follow the instructions here: https://developers.google.com/maps/documentation/embed/get-api-key
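
For reference, the sketch below shows how the key could be read from the ".env" file using only the standard library. The program itself lists dotenv as a dependency and most likely loads the file through that package, so `read_env_file` and `get_api_key` are hypothetical stand-ins for illustration.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Spaces around "=" and surrounding quotes are tolerated.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env") -> str:
    """Prefer an already-set environment variable, then fall back to the file."""
    key = os.environ.get("GOOGLEAI_API_KEY") or read_env_file(path).get("GOOGLEAI_API_KEY", "")
    if not key:
        raise RuntimeError("GOOGLEAI_API_KEY is missing; add it to the .env file.")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API client.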

-----Dependencies------

python-dotenv 1.0.1

langchain 0.3.7

langchain_chroma 0.1.4

langchain_community 0.3.5

langchain_core 0.3.15

langchain_google_genai 2.0.4

langchain-openai 0.2.6

langchain_text_splitters 0.3.2

NumPy 1.26.4

Custom AI Model for Your Dataset

This program allows you to create a customized AI model tailored to your specific dataset, ensuring highly relevant and precise responses to your queries.

Unlike generic AI models that often provide broad or unrelated information, this program focuses exclusively on your provided data. This eliminates the need for repeated queries and delivers answers directly aligned with your needs, such as class materials or specialized topics.

Powered by the Retrieval-Augmented Generation (RAG) method, the AI efficiently retrieves and processes the necessary data to provide accurate, context-specific answers every time.

Understanding the Hybrid AI Data Sourcing System

Implementing a Hybrid System for AI Data Sourcing
The ideal solution for our system is to adopt a hybrid AI data sourcing approach. This combines the strengths of Retrieval-Augmented Generation (RAG) and external internet searches. The hybrid system would work as follows:

Step 1: Retrieve Relevant Data from the Local Database
Using RAG, the system would first scan and retrieve data from the local database that aligns with the user’s query. This ensures that answers are contextually relevant to the data we have on hand, providing precise and controlled results.

Step 2: Expand the Answer Using External Sources
Once the local retrieval is complete, the program would expand on the initial result by supplementing it with broader information pulled from the internet. This dual-layered approach enriches the output by balancing local accuracy with global relevance.
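
The two steps can be sketched as follows. This is a proposal-level illustration under stated assumptions: `hybrid_answer`, its parameters, and the idea that both sources return lists of text passages are hypothetical, since the hybrid system is not implemented yet.

```python
from typing import Callable

# Hypothetical signature: a source takes a question and returns text passages.
PassageSource = Callable[[str], list[str]]


def hybrid_answer(
    question: str,
    retrieve_local: PassageSource,
    search_web: PassageSource,
    max_web: int = 2,
) -> dict[str, list[str]]:
    """Collect local passages first, then add a few non-duplicate web passages."""
    # Step 1: retrieve relevant data from the local database.
    local_passages = retrieve_local(question)

    # Step 2: expand with external sources, skipping anything already found locally.
    seen = set(local_passages)
    web_passages = [p for p in search_web(question) if p not in seen][:max_web]

    # Keeping the two sources separate lets the final prompt label where each came from.
    return {"local": local_passages, "web": web_passages}
```

Returning the sources under separate keys, rather than one merged list, makes it possible to tell the model (and the user) which statements come from the local files and which come from the internet.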

Current System Limitations
At present, our system operates on an "all or nothing" model:

- It either sources exclusively from the local database or relies entirely on the internet.
- This siloed behavior limits its ability to provide both precise and enriched responses simultaneously.

The hybrid system would eliminate this limitation by enabling sequential or parallel data sourcing, depending on the use case.

Testing Retrieval Augmented Generation on PDFs
During testing, we evaluated whether the file name of a PDF affects the performance of our RAG model when categorizing documents. The results showed that file names, such as "12389_3218231" or "Math," have no impact on categorization accuracy.

Instead, the model relies on:

- The content of the document.
- The title embedded within the file.
- Any other relevant metadata that can be extracted.

This finding demonstrates the robustness of the model in focusing on meaningful data rather than superficial naming conventions.

Conclusion
Transitioning to a hybrid system would significantly enhance the flexibility and utility of our AI model. By leveraging both local and global data sources, we can:

- Deliver more accurate and comprehensive results.
- Ensure adaptability for various use cases.
- Maintain the integrity of document categorization, regardless of file naming practices.

This approach represents a balanced and future-proof strategy for improving the system’s performance and user satisfaction.

Examples

Option 2: Querying based on the scanned local database

Story inspired by "The Little Prince"

Option 4: Querying multiple choice and open questions