This project evaluates the accuracy and relevance of responses generated by ChatDKU Advising using RAGAS. The evaluation is conducted by comparing ChatDKU Advising's answers to the official FAQ responses provided by the Academic Advising Office at Duke Kunshan University (DKU).
ChatDKU is a RAG-based AI chatbot designed to enhance campus interaction by providing academic advising and administrative support. The system integrates multi-source retrieval, query optimization, and context-aware prompt engineering to deliver high-quality responses.
⚠️ Due to its potential future role as an official DKU resource, the complete project code remains private. However, this evaluation project provides insight into its effectiveness.
- Campus LLM ChatDKU launches at Duke Kunshan University (Bilibili)
- ChatDKU-Advising (requires DKU VPN & NetID login to access)
- Uses `ragas.dataset_schema.SingleTurnSample` to structure FAQ data.
- Evaluates generated responses using RAGAS metrics (a scoring sketch follows this list):
  - BLEU Score: measures n-gram overlap between generated and reference responses.
  - ROUGE Score: measures recall-oriented textual overlap with the reference.
  - Non-LLM String Similarity (Levenshtein distance): measures character-level lexical similarity between responses, without an LLM judge.
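A minimal sketch of scoring one FAQ pair with these metrics, assuming a recent ragas release (0.2 or later) where `SingleTurnSample` and the metrics are importable as shown; the question and answer strings are placeholders, not actual FAQ content, and BLEU/ROUGE may additionally require the `sacrebleu` and `rouge_score` packages:

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import BleuScore, NonLLMStringSimilarity, RougeScore

async def main():
    # One FAQ pair: the chatbot's generated answer vs. the official answer.
    # The strings below are placeholders, not real advising content.
    sample = SingleTurnSample(
        user_input="Can I repeat a course I passed?",
        response="Yes, but only the most recent grade counts toward your GPA.",
        reference="Students may repeat a course; only the latest grade is counted.",
    )

    metrics = {
        "BLEU": BleuScore(),
        "ROUGE": RougeScore(),
        "String Similarity": NonLLMStringSimilarity(),  # Levenshtein by default
    }
    for name, metric in metrics.items():
        score = await metric.single_turn_ascore(sample)
        print(f"{name}: {score:.4f}")

asyncio.run(main())
```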
The FAQ dataset consists of officially provided advising questions and answers (a hypothetical record layout is sketched after the list), covering topics such as:
- Academic Honors
- Academic Standing
- CR/NC Grading
- Course Load
- Course Registration
- Course Repeat
- Course Withdrawal
- Credits Transfer
- Global Education
- Graduation
- Incomplete Grade
- Leave of Absence
- PE & NSPHST
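The exact file layout of the FAQ dataset is not shown in this repository; the record below is purely illustrative, and every field name is an assumption:

```python
# Hypothetical shape of one FAQ record; the real dataset's schema may differ.
faq_record = {
    "category": "Course Repeat",
    "question": "Can I repeat a course I have already passed?",
    "reference_answer": "Students may repeat a course; only the latest grade is counted.",
}
```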
Run the script to test ChatDKU Advising's responses against reference answers and obtain performance metrics.
Dependencies:
- `ragas`
- `asyncio` (Python standard library)
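Since `asyncio` is listed alongside `ragas`, per-sample scoring can be run concurrently across the whole FAQ set. A sketch under those assumptions (`score_all` and its inputs are illustrative names, not the project's actual code):

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMStringSimilarity

async def score_all(pairs):
    """Average string-similarity score over (response, reference) pairs."""
    metric = NonLLMStringSimilarity()  # Levenshtein distance by default
    samples = [SingleTurnSample(response=r, reference=ref) for r, ref in pairs]
    # Score every sample concurrently.
    scores = await asyncio.gather(*(metric.single_turn_ascore(s) for s in samples))
    return sum(scores) / len(scores)

# Illustrative pairs only; the real run uses the 104 official FAQ items.
pairs = [
    ("You may take up to 18 credits.", "Students may register for at most 18 credits."),
    ("CR/NC must be declared by the deadline.", "Declare CR/NC before the posted deadline."),
]
print(f"Average similarity: {asyncio.run(score_all(pairs)):.4f}")
```

Concurrency matters little for these CPU-bound string metrics, but the same pattern scales if LLM-based metrics are added later.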
This project is part of my signature work and graduation project at DKU. It aims to assess the reliability of ChatDKU Advising in providing accurate academic guidance, ensuring alignment with DKU’s official advising policies.
This section provides visualizations of the evaluation results, including average scores, radar charts, score distributions, and bar charts.


[Figure: Score Distributions Histogram]
[Figure: Bar Chart of Average Scores]
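The bar chart can be reproduced from the average scores reported below; a minimal matplotlib sketch (figure styling and the output file name are arbitrary choices, not from the project):

```python
import matplotlib.pyplot as plt

# Average scores reported in the evaluation summary below.
metrics = ["Levenshtein", "ROUGE", "BLEU"]
averages = [0.7877, 0.8111, 0.6018]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(metrics, averages)
ax.set_ylim(0, 1)
ax.set_ylabel("Average score")
ax.set_title("ChatDKU Advising: average evaluation scores")
for i, v in enumerate(averages):
    ax.text(i, v + 0.02, f"{v:.4f}", ha="center")  # annotate each bar
fig.tight_layout()
fig.savefig("average_scores.png")
```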
ChatDKU was evaluated using 104 questions categorized into 13 domains from Duke Kunshan University’s official FAQ documents. The system’s responses were compared against reference answers using three metrics: Levenshtein (textual similarity), BLEU (phrase-level precision), and ROUGE (content recall). Key findings include:
- Overall Performance: ChatDKU achieved strong results in Levenshtein (avg. 0.7877) and ROUGE (avg. 0.8111), indicating high structural and content-level alignment with official answers. However, BLEU scores (avg. 0.6018) highlighted inconsistencies in exact phrase matching.
- Top Categories: categories such as Course Repeat (Levenshtein: 0.9377) and Credits Transfer (ROUGE: 0.9600) excelled due to standardized responses.
- Challenges: Categories like Leave of Absence (Levenshtein: 0.6143) and Graduation (BLEU: 0.4739) showed lower scores, suggesting structural or content gaps.
- Strengths: ChatDKU demonstrates robust retrieval capabilities and context-aware response generation, particularly for structured queries.
- Limitations: Phrase-level precision and handling open-ended questions remain areas for improvement. Traditional metrics like BLEU may understate the value of semantically accurate but paraphrased answers.
Future work includes enhancing prompt engineering, expanding dataset coverage, and integrating more advanced evaluation methods.