A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
We're looking for contributors! See the Contributing section below for ways to help.
NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:
We're actively looking for contributors in these key areas:
- Unit Tests: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- Examples: Create real-world evaluation examples and use cases
- Guides & Notebooks: Write evaluation guides and interactive Jupyter notebooks
- Documentation: Improve API documentation and user guides
- RAG Metrics: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- Agent Evaluation: Build frameworks for evaluating AI agents and multi-turn conversations
- Start Small: Pick up issues labeled `good first issue` or `help wanted`
- Join Discussions: Share your ideas in GitHub Discussions
- Review Code: Help review pull requests and provide feedback
- Report Issues: Found a bug? Report it in GitHub Issues
- Spread the Word: Star the repository and share with your network
- Multi-Model Support: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- Extensible Scoring: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- Dataset Integration: Support for MMLU, HuggingFace datasets, custom datasets, and more
- Production Ready: Docker support, Kubernetes deployment, and cloud integrations
- Comprehensive Reporting: Detailed evaluation reports, artifacts, and visualizations
- Secure: Built-in credential management and secret store integration
- Scalable: Designed for both local testing and large-scale production evaluations
- Cross-Platform: Tested on macOS, Linux, and Windows with comprehensive CI/CD
```bash
pip install novaeval
```
```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```
```bash
docker pull noveum/novaeval:latest
```
```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```
```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```
NovaEval provides a comprehensive CLI for running evaluations:
```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```
Complete CLI Reference - Detailed documentation for all CLI commands and options
```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```
NovaEval provides a FastAPI-based HTTP API for programmatic access to evaluation capabilities. This enables easy integration with web applications, microservices, and CI/CD pipelines.
```bash
# Install API dependencies
pip install -e ".[api]"

# Run the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000

# Access interactive documentation
open http://localhost:8000/docs
```
- Health Check: `GET /health` - Service health status
- Component Discovery: `GET /api/v1/components/` - List available models, datasets, scorers
- Model Operations: `POST /api/v1/models/{model}/predict` - Generate predictions
- Dataset Operations: `POST /api/v1/datasets/{dataset}/load` - Load and query datasets
- Scorer Operations: `POST /api/v1/scorers/{scorer}/score` - Score predictions
- Evaluation Jobs: `POST /api/v1/evaluations/submit` - Submit async evaluation jobs
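You can exercise the documented endpoints directly, for example:

```bash
# Check service health
curl http://localhost:8000/health

# Discover available models, datasets, and scorers
curl http://localhost:8000/api/v1/components/
```

Submitting an evaluation job from Python: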
```python
import requests

# Submit evaluation via API
evaluation_config = {
    "name": "api_evaluation",
    "models": [{"provider": "openai", "identifier": "gpt-3.5-turbo"}],
    "datasets": [{"name": "mmlu", "split": "test", "limit": 10}],
    "scorers": [{"name": "accuracy"}]
}

response = requests.post(
    "http://localhost:8000/api/v1/evaluations/submit",
    json=evaluation_config
)
task_id = response.json()["task_id"]
print(f"Evaluation started: {task_id}")
```
- Docker: `docker run -p 8000:8000 novaeval-api:latest`
- Kubernetes: Full manifests provided in `kubernetes/`
- Cloud Platforms: Supports AWS, GCP, Azure with environment variable configuration
Complete API Documentation - Detailed API reference, examples, and deployment guide
NovaEval is built with extensibility and modularity in mind:
```
src/novaeval/
├── datasets/      # Dataset loaders and processors
├── evaluators/    # Core evaluation logic
├── integrations/  # External service integrations
├── models/        # Model interfaces and adapters
├── reporting/     # Report generation and visualization
├── scorers/       # Scoring mechanisms and metrics
└── utils/         # Utility functions and helpers
```
- Datasets: Standardized interface for loading evaluation datasets
- Models: Unified API for different AI model providers
- Scorers: Pluggable scoring mechanisms for various evaluation metrics
- Evaluators: Orchestrates the evaluation process
- Reporting: Generates comprehensive reports and artifacts
- Integrations: Handles external services (S3, credential stores, etc.)
- MMLU: Massive Multitask Language Understanding
- HuggingFace: Any dataset from the HuggingFace Hub
- Custom: JSON, CSV, or programmatic dataset definitions
- Code Evaluation: Programming benchmarks and code generation tasks
- Agent Traces: Multi-turn conversation and agent evaluation
- OpenAI: GPT-3.5, GPT-4, and newer models
- Anthropic: Claude family models
- AWS Bedrock: Amazon's managed AI services
- Noveum AI Gateway: Integration with Noveum's model gateway
- Custom: Extensible interface for any API-based model
NovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the `BaseScorer` interface and support both synchronous and asynchronous evaluation.
- Purpose: Performs exact string matching between prediction and ground truth
- Features:
- Case-sensitive/insensitive matching options
- Whitespace normalization and stripping
- Perfect for classification tasks with exact expected outputs
- Use Cases: Multiple choice questions, command validation, exact answer matching
- Configuration: `case_sensitive`, `strip_whitespace`, `normalize_whitespace`
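A minimal sketch, assuming the class is exposed as `ExactMatchScorer` under `novaeval.scorers`:

```python
from novaeval.scorers import ExactMatchScorer  # assumed import path

scorer = ExactMatchScorer(
    case_sensitive=False,       # "b" matches "B"
    strip_whitespace=True,      # ignore leading/trailing spaces
    normalize_whitespace=True,  # collapse internal whitespace runs
)
score = scorer.score("  B ", "B")  # 1.0 on a match, 0.0 otherwise
```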
- Purpose: Advanced classification accuracy with answer extraction capabilities
- Features:
- Intelligent answer extraction from model responses using multiple regex patterns
- Support for MMLU-style multiple choice questions (A, B, C, D)
- Letter-to-choice text conversion
- Robust parsing of various answer formats
- Use Cases: MMLU evaluations, multiple choice tests, classification benchmarks
- Configuration: `extract_answer`, `answer_pattern`, `choices`
- Purpose: Token-level F1 score for partial matching scenarios
- Features:
- Calculates precision, recall, and F1 score
- Configurable tokenization (word-level or character-level)
- Case-sensitive/insensitive options
- Use Cases: Question answering, text summarization, partial credit evaluation
- Returns: Dictionary with `precision`, `recall`, `f1`, and `score` values
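A minimal sketch, assuming the class is exposed as `F1Scorer` under `novaeval.scorers`:

```python
from novaeval.scorers import F1Scorer  # assumed import path

scorer = F1Scorer(case_sensitive=False)
result = scorer.score(
    prediction="Paris is the capital of France",
    ground_truth="The capital of France is Paris",
)
# Token-level partial credit
print(result["precision"], result["recall"], result["f1"], result["score"])
```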
- Purpose: Evaluates if the LLM retains information provided by users throughout conversations
- Features:
- Sophisticated knowledge extraction from conversation history
- Sliding window approach for relevant context (configurable window size)
- Detects when LLM asks for previously provided information
- Tracks knowledge items with confidence scores
- Use Cases: Chatbots, virtual assistants, multi-turn conversations
- Requirements: LLM model for knowledge extraction, conversation context
- Purpose: Measures response relevance to recent conversation context
- Features:
- Sliding window context analysis
- LLM-based relevance assessment (1-5 scale)
- Context coherence evaluation
- Conversation flow maintenance tracking
- Use Cases: Dialogue systems, context-aware assistants
- Configuration: `window_size` for context scope
- Purpose: Assesses whether user intentions and requests are fully addressed
- Features:
- Extracts user intentions from conversation history
- Evaluates fulfillment level of each intention
- Comprehensive coverage analysis
- Outcome-based evaluation
- Use Cases: Customer service bots, task-oriented dialogue systems
- Purpose: Evaluates consistency with assigned persona or role
- Features:
- Role consistency tracking throughout conversations
- Character maintenance assessment
- Persona adherence evaluation
- Customizable role expectations
- Use Cases: Character-based chatbots, role-playing AI, specialized assistants
- Configuration: `expected_role` parameter
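A minimal sketch, assuming the class is exposed as `RoleAdherenceScorer` (the `your_llm_model`, `input_text`, `output_text`, and `conv_context` names follow the usage examples later in this README):

```python
from novaeval.scorers import RoleAdherenceScorer  # assumed import path

scorer = RoleAdherenceScorer(
    model=your_llm_model,  # LLM used as the judge
    expected_role="empathetic customer support agent",
)
result = await scorer.evaluate(input_text, output_text, context=conv_context)
```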
- Purpose: Comprehensive conversational evaluation combining multiple metrics
- Features:
- Combines knowledge retention, relevancy, completeness, and role adherence
- Configurable metric inclusion/exclusion
- Weighted aggregation of individual scores
- Detailed per-metric breakdown
- Use Cases: Holistic conversation quality assessment
- Configuration: Enable/disable individual metrics, window sizes, role expectations
- Purpose: Evaluates how relevant answers are to given questions
- Features:
- Generates questions from answers using LLM
- Semantic similarity comparison using embeddings (SentenceTransformers)
- Multiple question generation for robust evaluation
- Cosine similarity scoring
- Use Cases: RAG systems, Q&A applications, knowledge bases
- Configuration: `threshold`, `embedding_model`
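A minimal sketch, assuming the class is exposed as `AnswerRelevancyScorer` and that `embedding_model` accepts a SentenceTransformers model name:

```python
from novaeval.scorers import AnswerRelevancyScorer  # assumed import path

scorer = AnswerRelevancyScorer(
    model=your_llm_model,
    threshold=0.7,                       # pass/fail cutoff
    embedding_model="all-MiniLM-L6-v2",  # assumed model identifier
)
result = await scorer.evaluate(question, answer)
```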
- Purpose: Measures if responses are faithful to provided context without hallucinations
- Features:
- Extracts factual claims from responses
- Verifies each claim against source context
- Three-tier verification: SUPPORTED/PARTIALLY_SUPPORTED/NOT_SUPPORTED
- Detailed claim-by-claim analysis
- Use Cases: RAG faithfulness, fact-checking, source attribution
- Configuration: `threshold` for pass/fail determination
- Purpose: Evaluates precision of retrieved context relevance
- Features:
- Splits context into chunks for granular analysis
- Relevance scoring per chunk (1-5 scale)
- Intelligent context segmentation
- Average relevance calculation
- Use Cases: Retrieval system evaluation, context quality assessment
- Requirements: Context must be provided for evaluation
- Purpose: Measures if all necessary information for answering is present in context
- Features:
- Extracts key information from expected outputs
- Checks presence of each key fact in provided context
- Three-tier presence detection: PRESENT/PARTIALLY_PRESENT/NOT_PRESENT
- Comprehensive information coverage analysis
- Use Cases: Retrieval completeness, context sufficiency evaluation
- Requirements: Both context and expected output required
- Purpose: Composite RAGAS methodology combining multiple RAG metrics
- Features:
- Integrates Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall
- Configurable weighted aggregation
- Parallel execution of individual metrics
- Comprehensive RAG pipeline evaluation
- Use Cases: Complete RAG system assessment, benchmark evaluation
- Configuration: Custom weights for each metric component
- Purpose: Uses LLMs with chain-of-thought reasoning for custom evaluation criteria
- Features:
- Based on G-Eval research paper methodology
- Configurable evaluation criteria and steps
- Chain-of-thought reasoning support
- Multiple evaluation iterations for consistency
- Custom score ranges and thresholds
- Use Cases: Custom evaluation criteria, human-aligned assessment, complex judgments
- Configuration: `criteria`, `use_cot`, `num_iterations`, `threshold`
- Correctness: Factual accuracy and completeness assessment
- Relevance: Topic adherence and query alignment evaluation
- Coherence: Logical flow and structural consistency analysis
- Helpfulness: Practical value and actionability assessment
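A minimal sketch, assuming the class is exposed as `GEvalScorer`:

```python
from novaeval.scorers import GEvalScorer  # assumed import path

scorer = GEvalScorer(
    model=your_llm_model,
    criteria="Rate the response for factual correctness and completeness",
    use_cot=True,      # chain-of-thought reasoning before scoring
    num_iterations=3,  # repeat and aggregate for consistency
    threshold=0.6,
)
result = await scorer.evaluate(input_text, output_text)
```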
- Purpose: Multi-LLM evaluation with diverse perspectives and aggregation
- Features:
- Multiple LLM judges with individual weights and specialties
- Configurable aggregation methods (mean, median, weighted, consensus, etc.)
- Consensus requirement and threshold controls
- Parallel judge evaluation for efficiency
- Detailed individual and aggregate reasoning
- Use Cases: High-stakes evaluation, bias reduction, robust assessment
- Configuration: Judge models, weights, specialties, aggregation method
- Diverse Panel: Different models with varied specialties (accuracy, clarity, completeness)
- Consensus Panel: High-consensus requirement for agreement-based decisions
- Weighted Expert Panel: Domain experts with expertise-based weighting
- Purpose: Evaluates appropriateness of tool calls given available tools
- Features: Compares selected tools against available tool catalog
- Use Cases: Agent tool selection assessment, action planning evaluation
- Purpose: Compares actual tool calls against expected tool calls
- Features: Detailed tool call comparison and correctness assessment
- Use Cases: Agent behavior validation, expected action verification
- Purpose: Evaluates correctness of parameters passed to tool calls
- Features: Parameter validation against tool call results and expectations
- Use Cases: Tool usage quality, parameter selection accuracy
- Purpose: Measures agent progress toward assigned tasks
- Features: Analyzes task completion status and advancement quality
- Use Cases: Agent effectiveness measurement, task completion tracking
- Purpose: Assesses response appropriateness given agent's role and task
- Features: Role-task-response alignment evaluation
- Use Cases: Agent behavior consistency, contextual appropriateness
- Purpose: Evaluates consistency with assigned agent role across actions
- Features: Comprehensive role consistency across tool calls and responses
- Use Cases: Agent persona maintenance, role-based behavior validation
- Purpose: Measures overall goal accomplishment using complete interaction traces
- Features: End-to-end goal evaluation with G-Eval methodology
- Use Cases: Agent effectiveness assessment, outcome-based evaluation
- Purpose: Evaluates logical flow and context maintenance in agent conversations
- Features: Conversational coherence and context tracking analysis
- Use Cases: Agent dialogue quality, conversation flow assessment
- Purpose: Unified interface for all agent evaluation metrics
- Features: Single class providing access to all agent scorers with consistent LLM model
- Methods: Individual scoring methods plus `score_all()` for comprehensive evaluation
All scorers inherit from `BaseScorer`, which provides:
- Statistics Tracking: Automatic score history and statistics
- Batch Processing: Efficient batch scoring capabilities
- Input Validation: Robust input validation and error handling
- Configuration Support: Flexible configuration from dictionaries
- Metadata Reporting: Detailed scoring metadata and information
Comprehensive scoring results include:
- Numerical Score: Primary evaluation score
- Pass/Fail Status: Threshold-based binary result
- Detailed Reasoning: Human-readable evaluation explanation
- Rich Metadata: Additional context and scoring details
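A minimal sketch of consuming a result; the attribute names here mirror the fields listed above but are an assumption, so inspect the returned object in your version:

```python
result = await scorer.evaluate(input_text, output_text)

print(result.score)      # primary numerical score
print(result.passed)     # threshold-based pass/fail
print(result.reasoning)  # human-readable explanation
print(result.metadata)   # additional scoring context
```

More end-to-end usage examples: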
```python
# Basic accuracy scoring
scorer = AccuracyScorer(extract_answer=True)
score = scorer.score("The answer is B", "B")

# Advanced conversational evaluation
conv_scorer = ConversationalMetricsScorer(
    model=your_llm_model,
    include_knowledge_retention=True,
    include_relevancy=True,
    window_size=10
)
result = await conv_scorer.evaluate(input_text, output_text, context=conv_context)

# RAG system evaluation
ragas = RAGASScorer(
    model=your_llm_model,
    weights={"faithfulness": 0.4, "answer_relevancy": 0.3, "contextual_precision": 0.3}
)
result = await ragas.evaluate(question, answer, context=retrieved_context)

# Panel-based evaluation
panel = SpecializedPanelScorer.create_diverse_panel(
    models=[model1, model2, model3],
    evaluation_criteria="overall quality and helpfulness"
)
result = await panel.evaluate(input_text, output_text)

# Agent evaluation
agent_scorers = AgentScorers(model=your_llm_model)
all_scores = agent_scorers.score_all(agent_data)
```
```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```
```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```
```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```
NovaEval supports configuration through:
- YAML/JSON files: Declarative configuration
- Environment variables: Runtime configuration
- Python code: Programmatic configuration
- CLI arguments: Command-line overrides
```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```
NovaEval includes optimized GitHub Actions workflows:
- Unit tests run on all PRs and pushes for quick feedback
- Integration tests run on main branch only to minimize API costs
- Cross-platform testing on macOS, Linux, and Windows
NovaEval generates comprehensive evaluation reports:
- Summary Reports: High-level metrics and insights
- Detailed Results: Per-sample predictions and scores
- Visualizations: Charts and graphs for result analysis
- Artifacts: Model outputs, intermediate results, and debug information
- Export Formats: JSON, CSV, HTML, PDF
```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```
```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Load and return the full list of evaluation samples
        samples = [{"input": "2 + 2 = ?", "expected": "4"}]
        return samples

    def get_sample(self, index):
        # Return an individual sample by index
        return self.load_data()[index]
```
```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic; return a float in [0, 1]
        return float(prediction.strip() == ground_truth.strip())
```
```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Call your provider's API here; echoed response as a placeholder
        response = f"echo: {prompt}"
        return response
```
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our Contributing Guide for detailed guidelines.
As mentioned in the We Need Your Help section, we're particularly looking for help with:
- Unit Tests - Expand test coverage beyond the current 23%
- Examples - Real-world evaluation scenarios and use cases
- Guides & Notebooks - Interactive evaluation tutorials
- Documentation - API docs, user guides, and tutorials
- RAG Metrics - Specialized metrics for retrieval-augmented generation
- Agent Evaluation - Frameworks for multi-turn and agent-based evaluations
```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes following our coding standards
4. Add tests for your changes
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
- Code Quality: Follow PEP 8 and use the provided pre-commit hooks
- Testing: Add unit tests for new features and bug fixes
- Documentation: Update documentation for API changes
- Commit Messages: Use conventional commit format
- Issues: Reference relevant issues in your PR description
Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community
- Documentation: https://noveum.github.io/NovaEval
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Made with ❤️ by the Noveum.ai team