I’m building an educational chatbot that answers students’ academic questions. To keep its answers accurate, I want to ground it in a database of textbooks (PDF and Word files) along with sample questions and their answers from recent years. My plan is to use RAG: split these resources into chunks and then, when a user asks a question, have the system search the chunks, retrieve the most relevant ones, and send them along with the user’s query to the model.
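To make the plan concrete, here is a minimal sketch of the flow I have in mind. It is only an illustration and not how your platform works internally: the function names (`chunk_text`, `retrieve`, `build_prompt`), the chunk size, and the simple word-count similarity are all placeholder assumptions of mine, and a real setup would presumably use an embedding model for the search step.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _vectorize(text: str) -> Counter:
    """Turn text into a bag of lowercase word counts (stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = _vectorize(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vectorize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the user's question into one model prompt."""
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    chunks = [
        "Newton's second law states that force equals mass times acceleration.",
        "The quadratic formula gives the roots of a quadratic equation.",
    ]
    question = "What is the quadratic formula?"
    print(build_prompt(question, retrieve(question, chunks)))
```

If a formula ends up in the index only as an image with no extracted text, a search like this has nothing to match on, which may be part of what I’m seeing.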
However, my dataset is unstructured: it consists of Word and PDF files for subjects like math, physics, etc. These files include text, math formulas, diagrams, and tables in image format. I’ve tried the RAG feature on your platform and uploaded a sample file to the document section, but when I test it by asking questions, especially math-related ones, the bot struggles to find the relevant chunks to send to the model.
How can I solve this issue? I’m also attaching a sample image from the content of my files so you can see the type of material I’m trying to chunk.
I’d appreciate your guidance on how to address these challenges so that the bot retrieves the correct chunks for users’ questions and the model can generate accurate answers.