AI Systems • 2026
Model Evaluation • Responsible AI • A/B Testing • Bias Detection

LLM Evaluation & Responsible AI Suite

Systematic benchmarking framework comparing Llama 3.1 vs Mistral across semantic quality, demographic bias, prompt effectiveness, and latency — with automated PM recommendations.

Llama 3.1 semantic quality

0.713 / 1.0

Mistral semantic quality

0.662 / 1.0

Latency — Llama 3.1

97.9s avg (higher quality)

Latency — Mistral

72.5s avg (26% faster)

LLM Evaluation Suite — semantic quality comparison and bias analysis dashboard

Problem

Teams deploying LLMs rarely have a systematic framework to answer three critical questions before going to production: which model produces better outputs for our use case, are outputs demographically biased, and which prompt variant actually performs better? This project builds that infrastructure — fully local, zero API cost, reusable across any model pair.

Approach

Built a multi-model router sending identical prompts to Llama 3.1 and Mistral simultaneously, recording latency and token count per response across a golden set covering technical AI and product management concepts.
Upgraded eval from lexical token overlap to semantic similarity using sentence-transformers (all-MiniLM-L6-v2) — scoring each response on relevance, faithfulness, completeness, and groundedness against reference answers.
Designed a demographic parity test suite across 3 real-world scenarios: gender (performance reviews), ethnicity (leadership potential), and name bias (loan applicant profiles) — using VADER sentiment to detect differential treatment.
Built an A/B prompt testing framework that runs 2 iterations per variant and applies independent two-sample t-tests for statistical significance; tested direct vs structured prompting across summarization and explanation tasks.
Auto-generated a PM report translating all metrics into plain-English model selection, bias risk flags, and prompt optimization recommendations.
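
The semantic scoring step can be illustrated with a minimal sketch. It assumes a response is scored by the cosine similarity between its embedding and the reference answer's embedding, clipped to the 0 to 1 range. The `embed` parameter and the `toy_embed` helper are hypothetical stand-ins so the example runs without downloading a model, and the sketch does not attempt to reproduce how the four sub-scores (relevance, faithfulness, completeness, groundedness) are derived.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def semantic_score(response: str, reference: str,
                   embed: Callable[[str], Vector]) -> float:
    """Score a response against a reference answer, clipped to [0, 1]."""
    similarity = cosine_similarity(embed(response), embed(reference))
    return max(0.0, min(1.0, similarity))


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedder: counts of a tiny fixed vocabulary."""
    vocab = ("model", "latency", "bias", "prompt")
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


if __name__ == "__main__":
    print(semantic_score("model latency matters", "latency of the model", toy_embed))
```

In a real run, `embed` would presumably wrap `SentenceTransformer("all-MiniLM-L6-v2").encode` from sentence-transformers. Because the score depends only on embedding direction, a correct answer worded very differently from the reference can land below a weaker answer that echoes the reference's phrasing.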

Impact

Llama 3.1 recommended for quality-sensitive workloads (0.713 overall); Mistral recommended for latency-sensitive or high-volume tasks (26% faster, 0.662 quality) — data-driven tradeoff recommendation.
Both models cleared demographic bias audit across gender, ethnicity, and name scenarios — max parity gap 0.023, well below the 0.20 medium-risk threshold.
Structured role+format prompting outperformed direct instruction by 25.8% on summarization tasks. The result is directional (2 iterations per variant, not yet significant at p<0.05) but gives a clear candidate for production prompt design standards.
Framework is model-agnostic and reusable: swap any Ollama model, expand golden set, or add bias scenarios with zero architectural changes.
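
The parity check behind the bias audit can be sketched as follows. This is an illustration under stated assumptions: the parity gap is taken to be the difference between the highest and lowest group mean sentiment score, and each score is assumed to be a VADER compound value in the range -1 to 1. The function names and the example scores are hypothetical; only the 0.20 medium-risk threshold comes from the results above.

```python
from statistics import mean
from typing import Dict, List

MEDIUM_RISK_THRESHOLD = 0.20  # medium-risk threshold quoted in the audit result


def parity_gap(scores_by_group: Dict[str, List[float]]) -> float:
    """Largest difference between any two groups' mean sentiment score."""
    if len(scores_by_group) < 2:
        raise ValueError("need at least two groups to compare")
    group_means = [mean(scores) for scores in scores_by_group.values()]
    return max(group_means) - min(group_means)


def clears_audit(scores_by_group: Dict[str, List[float]],
                 threshold: float = MEDIUM_RISK_THRESHOLD) -> bool:
    """True when the parity gap stays below the risk threshold."""
    return parity_gap(scores_by_group) < threshold


# Hypothetical sentiment scores, for illustration only (not project data)
example_scores = {
    "group_a": [0.90, 0.94],
    "group_b": [0.91, 0.89],
}

if __name__ == "__main__":
    print(f"gap={parity_gap(example_scores):.3f}",
          "pass" if clears_audit(example_scores) else "flag")
```

A gap computed this way only reflects differences in average sentiment. When every group scores near the top of the scale, the gap is compressed toward zero, which is the ceiling effect noted under Risks & Tradeoffs.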

Deployment & Monitoring

Runs fully locally via Ollama — zero API cost, production architecture portable to GCP Cloud Run or AWS Lambda.
Streamlit dashboard with 4 tabs: model comparison charts, bias parity visualization, A/B test results, and downloadable PM report.
Single-command reproducible pipeline: python run.py --quick triggers full evaluation in ~30 minutes.
GitHub repo with complete setup documentation, golden test set, results schema, and methodology notes.
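
The router stage of the pipeline can be sketched independently of Ollama. The version below is an assumption about the shape of such a router, not the repository's code: each model is represented by a plain callable that maps a prompt to text (in practice this would wrap a call to the local Ollama server), and token count is approximated by whitespace splitting. All names here (`route`, `RoutedResponse`, `Generator`) are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# A generator takes a prompt and returns the model's text response.
Generator = Callable[[str], str]


@dataclass
class RoutedResponse:
    model: str
    prompt: str
    text: str
    latency_s: float
    token_count: int  # whitespace-split count, a rough proxy for tokens


def _timed_call(model: str, generate: Generator, prompt: str) -> RoutedResponse:
    """Call one model and record wall-clock latency for the response."""
    start = time.perf_counter()
    text = generate(prompt)
    latency = time.perf_counter() - start
    return RoutedResponse(model, prompt, text, latency, len(text.split()))


def route(prompt: str, models: Dict[str, Generator]) -> List[RoutedResponse]:
    """Send one prompt to every model concurrently, in registration order."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = [pool.submit(_timed_call, name, generate, prompt)
                   for name, generate in models.items()]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in generators so the sketch runs without any model installed
    fake_models = {"model_a": lambda p: "echo " + p, "model_b": lambda p: p.upper()}
    for response in route("explain cosine similarity", fake_models):
        print(response.model, f"{response.latency_s:.4f}s", response.token_count)
```

Keeping the model behind a callable is what makes swapping models a configuration change. One caveat of concurrent dispatch on a single machine is that the models compete for the same hardware, so per-model latency measured this way may be higher than in a sequential run.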

Risks & Tradeoffs

Semantic eval scores reflect cosine similarity against reference answers — strong answers that paraphrase differently may score slightly lower than expected; documented as a known limitation.
At 2 iterations per variant, the A/B t-tests lack the statistical power to reach p<0.05. Findings are directionally consistent across runs; increasing the number of runs is noted as a methodology improvement for v2.
Sentiment-based bias proxy has ceiling effects at high positivity — both models scored highly positive across all scenarios; flagged as a constraint requiring richer bias test inputs in future iterations.
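
To see why 2 iterations per variant makes p<0.05 hard to reach, here is a small worked sketch. It assumes the equal-variance form of the independent two-sample t-test (the default behaviour of `scipy.stats.ttest_ind`); whether the project used that default is an assumption. With 2 observations per group the test has only 2 degrees of freedom, and at 2 degrees of freedom the t distribution's CDF has a closed form, so no scipy dependency is needed here. Function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def p_two_sided_df2(t: float) -> float:
    """Two-sided p-value for Student's t with 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    """
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)


def pooled_t_two_by_two(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Equal-variance t-test for exactly 2 observations per group.

    Returns (t statistic, two-sided p-value).
    """
    if len(a) != 2 or len(b) != 2:
        raise ValueError("this sketch only handles 2 observations per group")
    pooled_var = (variance(a) + variance(b)) / 2  # equal n, so a plain average
    if pooled_var == 0:
        raise ValueError("zero variance: t statistic is undefined")
    std_err = math.sqrt(pooled_var * (1 / 2 + 1 / 2))
    t = (mean(a) - mean(b)) / std_err
    return t, p_two_sided_df2(t)


# |t| needed for p = 0.05 at 2 degrees of freedom: about 4.30,
# compared with about 1.96 when samples are large.
T_CRITICAL_DF2 = math.sqrt(2 * 0.95 ** 2 / (1 - 0.95 ** 2))

if __name__ == "__main__":
    # Hypothetical scores for two prompt variants (not project data)
    t_stat, p_value = pooled_t_two_by_two([0.60, 0.62], [0.75, 0.79])
    print(f"t={t_stat:.2f}, p={p_value:.4f}, critical |t|={T_CRITICAL_DF2:.2f}")
```

The practical reading is that, with 2 runs per variant, the difference in means has to exceed roughly 4.3 standard errors before it counts as significant, so even a consistent improvement can fall short unless run-to-run noise is very small.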

Tools

Python 3.12 • Ollama (Llama 3.1, Mistral) • sentence-transformers (all-MiniLM-L6-v2) • VADER Sentiment • scipy (t-test) • Streamlit • FastAPI

Deep Dive

GitHub Repo
