AI Systems · 2026 · Model Evaluation · Responsible AI · A/B Testing · Bias Detection

LLM Evaluation & Responsible AI Suite

Systematic benchmarking framework comparing Llama 3.1 vs Mistral across semantic quality, demographic bias, prompt effectiveness, and latency — with automated PM recommendations.

Llama 3.1 semantic quality: 0.713 / 1.0
Mistral semantic quality: 0.662 / 1.0
Latency, Llama 3.1: 97.9s avg (higher quality)
Latency, Mistral: 72.5s avg (26% faster)
Figure: LLM Evaluation Suite, semantic quality comparison and bias analysis dashboard

Problem

Teams deploying LLMs rarely have a systematic framework to answer three critical questions before going to production: which model produces better outputs for our use case, are outputs demographically biased, and which prompt variant actually performs better? This project builds that infrastructure — fully local, zero API cost, reusable across any model pair.

Approach

  • Built a multi-model router sending identical prompts to Llama 3.1 and Mistral simultaneously, recording latency and token count per response across a golden set covering technical AI and product management concepts.
  • Upgraded eval from lexical token overlap to semantic similarity using sentence-transformers (all-MiniLM-L6-v2) — scoring each response on relevance, faithfulness, completeness, and groundedness against reference answers.
  • Designed a demographic parity test suite across 3 real-world scenarios: gender (performance reviews), ethnicity (leadership potential), and name bias (loan applicant profiles) — using VADER sentiment to detect differential treatment.
  • Built an A/B prompt testing framework that runs 2 iterations per variant and applies independent two-sample t-tests for statistical significance; tested direct vs structured prompting across summarization and explanation tasks.
  • Auto-generated a PM report translating all metrics into plain-English model selection, bias risk flags, and prompt optimization recommendations.
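
The semantic scoring step above rests on cosine similarity between a response embedding and a reference embedding. The following is a minimal sketch of that comparison only, not the project's actual scorer: the function names are hypothetical, the vectors are toy values, and how the four dimensions (relevance, faithfulness, completeness, groundedness) are derived is not shown because the write-up does not specify it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def similarity_score(response_vec, reference_vec):
    """Clamp cosine similarity to [0, 1] so it reads as a score out of 1.0.

    Clamping negatives to 0 is an assumption of this sketch.
    """
    return max(0.0, min(1.0, cosine_similarity(response_vec, reference_vec)))


# In the real pipeline the vectors would presumably come from the embedding
# model named above, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   response_vec, reference_vec = model.encode([response_text, reference_text])
# Toy 3-dimensional vectors stand in for them here.
reference = [1.0, 0.0, 1.0]
close_paraphrase = [0.9, 0.1, 1.1]
off_topic = [0.0, 1.0, 0.0]
```

With these toy inputs, similarity_score(close_paraphrase, reference) lands near 1 and similarity_score(off_topic, reference) is 0, which is the behavior the eval relies on.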

Impact

  • Llama 3.1 recommended for quality-sensitive workloads (0.713 overall); Mistral recommended for latency-sensitive or high-volume tasks (26% faster, 0.662 quality) — data-driven tradeoff recommendation.
  • Both models cleared demographic bias audit across gender, ethnicity, and name scenarios — max parity gap 0.023, well below the 0.20 medium-risk threshold.
  • Structured role+format prompting outperformed direct instruction by 25.8% on summarization tasks — directly applicable finding for production prompt design standards.
  • Framework is model-agnostic and reusable: swap any Ollama model, expand golden set, or add bias scenarios with zero architectural changes.
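
The parity-gap check behind the bias audit can be sketched as follows. This assumes the gap is the difference between the highest and lowest mean sentiment score across demographic groups within a scenario, which the write-up does not state explicitly. The function names, labels, and scores below are hypothetical; only the 0.20 threshold comes from the results above.

```python
from statistics import mean

MEDIUM_RISK_THRESHOLD = 0.20  # medium-risk threshold quoted in the write-up


def parity_gap(scores_by_group):
    """Largest difference in mean sentiment between any two groups."""
    group_means = [mean(scores) for scores in scores_by_group.values()]
    return max(group_means) - min(group_means)


def risk_flag(gap, threshold=MEDIUM_RISK_THRESHOLD):
    """Label a gap relative to the threshold (labels are illustrative)."""
    return "medium risk or higher" if gap >= threshold else "below threshold"


# Made-up sentiment scores for one scenario, NOT the project's data.
toy_scores = {
    "group_a": [0.90, 0.92],
    "group_b": [0.89, 0.91],
}
gap = parity_gap(toy_scores)
print(f"parity gap = {gap:.3f} -> {risk_flag(gap)}")
```

Because the gap is computed on group means, a scenario where every group scores near the top of the sentiment scale will always show a small gap, which is the ceiling effect noted under Risks & Tradeoffs.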

Deployment & Monitoring

  • Runs fully locally via Ollama — zero API cost, production architecture portable to GCP Cloud Run or AWS Lambda.
  • Streamlit dashboard with 4 tabs: model comparison charts, bias parity visualization, A/B test results, and downloadable PM report.
  • Single-command reproducible pipeline: python run.py --quick triggers full evaluation in ~30 minutes.
  • GitHub repo with complete setup documentation, golden test set, results schema, and methodology notes.
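
For readers unfamiliar with the local setup, here is a hedged sketch of how one prompt might be sent to two Ollama models at once while recording latency and token count. It targets Ollama's default local endpoint (/api/generate on port 11434) using only the standard library. The function names, the record layout, and the model tags llama3.1 and mistral are assumptions for illustration, not the project's actual router.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_ollama(model, prompt):
    """Send one prompt and return the parsed JSON response body."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))


def timed_call(model, prompt, send=call_ollama):
    """Wrap one model call with wall-clock latency measurement."""
    start = time.perf_counter()
    body = send(model, prompt)
    latency_s = time.perf_counter() - start
    return {
        "model": model,
        "latency_s": latency_s,
        "tokens": body.get("eval_count"),  # generated-token count reported by Ollama
        "text": body.get("response", ""),
    }


def route(prompt, models=("llama3.1", "mistral"), send=call_ollama):
    """Send the same prompt to every model concurrently; results keep model order."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(timed_call, m, prompt, send) for m in models]
        return [f.result() for f in futures]
```

Calling route("Explain retrieval-augmented generation.") with an Ollama server running would return one record per model. The send parameter exists so the router can be exercised with a stub when no server is available. Note that running both models concurrently on one machine can inflate each model's measured latency, so sequential calls may give cleaner timing comparisons.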

Risks & Tradeoffs

  • Semantic eval scores reflect cosine similarity against reference answers — strong answers that paraphrase differently may score slightly lower than expected; documented as a known limitation.
  • With only 2 iterations per variant, the A/B results do not reach p<0.05, and more runs are needed to establish statistical significance. Directional findings are consistent across runs; noted as a methodology improvement for v2.
  • Sentiment-based bias proxy has ceiling effects at high positivity — both models scored highly positive across all scenarios; flagged as a constraint requiring richer bias test inputs in future iterations.
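
To illustrate why 2 iterations per variant makes significance hard to reach, the sketch below computes a pooled (equal-variance) two-sample t statistic by hand. This is the same test that scipy.stats.ttest_ind performs with its default settings. With 2 + 2 observations there are only 2 degrees of freedom, so |t| must exceed roughly 4.303 for a two-sided p<0.05. The scores are made-up toy values, not the project's results, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

T_CRIT_DF2 = 4.303  # two-sided 5% critical value of Student's t with 2 degrees of freedom


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    if se == 0.0:
        raise ValueError("zero variance in both samples; t statistic is undefined")
    return (mean(a) - mean(b)) / se, df


# Toy quality scores for two prompt variants, 2 iterations each (illustrative only).
structured = [0.65, 0.75]
direct = [0.50, 0.60]

t_stat, df = pooled_t_statistic(structured, direct)
significant = abs(t_stat) > T_CRIT_DF2
print(f"t = {t_stat:.3f}, df = {df}, significant at 0.05: {significant}")
```

In this toy case the structured variant is ahead by 0.15 on average, yet t is about 2.12, well short of the critical value. A consistent directional gap can therefore still fail the test until more iterations are added.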

Tools

Python 3.12 · Ollama (Llama 3.1, Mistral) · sentence-transformers (all-MiniLM-L6-v2) · VADER Sentiment · scipy (t-test) · Streamlit · FastAPI
