Evaluation.

Twenty large language models, fifty queries, five patient records, seven query categories. The full benchmark behind every number we’ve published.

Models Tested: 20 · Queries: 50 · Best Composite: 4.70 · Best Faithfulness: 0.96 · Vector vs BM25: +50%

01 Methodology

How we benchmarked every model.

MINO’s evaluation framework employs a three-phase pipeline. First, retrieval — vector similarity search via Supabase pgvector with configurable top-K and threshold parameters. Second, generation — the retrieved context is formatted into a structured prompt and passed to the LLM. Third, judging — an independent judge model scores each response against the reference answer. The judge model is always different from the generator to prevent self-evaluation bias.
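
As a rough sketch of what those three phases look like in code (assuming an OpenAI-style client for embedding, generation, and judging, plus a Supabase RPC wrapping the pgvector query; the `match_documents` function name, the prompts, and the JSON rubric are illustrative, not MINO's actual implementation):

```python
import json

from openai import OpenAI
from supabase import create_client

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
supabase = create_client("https://<project>.supabase.co", "<service-role-key>")

def retrieve(query: str, top_k: int = 10, threshold: float = 0.5) -> list[str]:
    """Phase 1: embed the query and run a pgvector similarity search."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # "match_documents" stands in for a SQL function that runs a cosine-similarity
    # search over chunk embeddings with a score threshold and LIMIT top_k.
    rows = supabase.rpc("match_documents", {
        "query_embedding": emb,
        "match_threshold": threshold,
        "match_count": top_k,
    }).execute()
    return [row["content"] for row in rows.data]

def generate(query: str, chunks: list[str], model: str) -> str:
    """Phase 2: format the retrieved context into a structured prompt."""
    context = "\n\n".join(chunks)
    resp = openai_client.chat.completions.create(model=model, messages=[
        {"role": "system", "content": "Answer strictly from the patient-record context provided."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ])
    return resp.choices[0].message.content

def judge(query: str, answer: str, reference: str, judge_model: str) -> dict:
    """Phase 3: an independent judge model scores the answer 1-5 on each axis."""
    rubric = ("Score the answer against the reference on correctness, completeness, "
              "relevance and safety, each 1-5. Reply as a JSON object with those four keys.")
    resp = openai_client.chat.completions.create(
        model=judge_model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {query}\n\nReference: {reference}\n\nAnswer: {answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```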

The dataset consists of 50 samples — 8 real patient queries and 42 synthetic queries generated from 4 clinically diverse patient profiles covering Type 2 Diabetes with cardiovascular risk, Hypertension with CKD Stage 3b, Hypothyroidism with iron deficiency anemia, and a healthy baseline with acute bronchitis. The queries span seven categories: Labs (21), Medications (8), Diagnosis (6), Negation (5), Adversarial (4), Temporal (3) and Comparison (3).

| Metric | Description |
| --- | --- |
| Correctness (1–5) | LLM-as-Judge score measuring medical fact accuracy against the reference. |
| Completeness (1–5) | LLM-as-Judge score measuring coverage of key points from the reference. |
| Relevance (1–5) | LLM-as-Judge score measuring topic adherence and directness of the answer. |
| Safety (1–5) | LLM-as-Judge score checking for the absence of harmful advice and the presence of clinical caveats. |
| Faithfulness (0–1) | RAGAS metric evaluating whether the response is grounded in the retrieved context. |
| Answer Relevancy (0–1) | RAGAS metric measuring response relevance via reverse entailment. |
| Context Recall (0–1) | RAGAS metric assessing retrieval completeness against the reference answer. |
| Context Precision (0–1) | RAGAS metric measuring the quality of retrieved chunks for the query. |
| Latency (ms) | End-to-end response time from query submission to completed generation. |
| Cost per Query | Computed from input/output token counts and model-specific pricing. |
| Retrieval Hit Rate | Fraction of queries where the retrieved chunks overlap the reference answer. |
Evaluation metrics used in MINO
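
Two of these are simple arithmetic worth spelling out. The composite reported throughout is consistent with an unweighted mean of the four judge scores, and cost per query is token counts times per-token prices. A minimal sketch (the token counts and per-million-token rates below are placeholders, not actual billing data):

```python
def composite_score(correctness: float, completeness: float,
                    relevance: float, safety: float) -> float:
    """Unweighted mean of the four 1-5 judge scores. This reproduces the
    composite column in the result tables, e.g. gpt-4.1-mini:
    (4.64 + 4.30 + 4.86 + 4.98) / 4 = 4.695, reported as 4.70."""
    return (correctness + completeness + relevance + safety) / 4

def cost_per_query(input_tokens: int, output_tokens: int,
                   price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Dollar cost from token counts and model-specific per-million-token pricing."""
    return (input_tokens * price_in_per_1m
            + output_tokens * price_out_per_1m) / 1_000_000

print(composite_score(4.64, 4.30, 4.86, 4.98))   # ~4.695
print(cost_per_query(3_000, 250, 0.40, 1.60))    # 0.0016 at these illustrative rates
```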

02 Multi-Model Results

Twenty models, ranked.

Six Gemini, fourteen OpenAI. All evaluated on the same cached retrieval results (top-10, threshold 0.5) to isolate generation quality.
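
A minimal sketch of that caching step, reusing the `retrieve()` helper sketched in the methodology section (the cache file name and key format are illustrative):

```python
import json
import os

CACHE_PATH = "retrieval_cache.json"  # illustrative location

def cached_retrieve(query: str, top_k: int = 10, threshold: float = 0.5) -> list[str]:
    """Run retrieval once per query and reuse the identical chunks for every model,
    so score differences between models reflect generation quality only."""
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    key = f"{query}|k={top_k}|t={threshold}"
    if key not in cache:
        cache[key] = retrieve(query, top_k=top_k, threshold=threshold)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[key]
```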

Composite scores across all 20 models

Gemini Family

| Model | Correct. | Complete. | Relev. | Safety | Composite | Latency | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-3-pro | 4.45 | 4.20 | 4.69 | 4.98 | 4.58 | 12,388 ms | 0.00417 |
| gemini-3-flash | 4.34 | 4.20 | 4.62 | 4.94 | 4.53 | 4,992 ms | 0.00109 |
| gemini-3.1-pro | 4.32 | 4.06 | 4.64 | 4.98 | 4.50 | 12,897 ms | 0.00407 |
| gemini-2.5-flash | 4.34 | 3.78 | 4.62 | 4.98 | 4.43 | 3,185 ms | 0.00058 |
| gemini-2.5-pro | 4.28 | 3.82 | 4.62 | 4.96 | 4.42 | 10,849 ms | 0.00248 |
| gemini-2.5-flash-lite | 4.30 | 3.46 | 4.40 | 4.92 | 4.27 | 1,137 ms | 0.00011 |
Multi-model evaluation results · Gemini family
Gemini 3 Pro Preview obtained the highest Gemini composite (4.58/5). The best quality–latency trade-off belongs to Gemini 3 Flash Preview (4.53 at 4,992 ms). Most strikingly, Gemini 2.5 Pro did not outperform the Flash configuration (4.42 vs 4.43) — at 4.3× the cost.

OpenAI Family

| Model | Correct. | Complete. | Relev. | Safety | Composite | Latency | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4.1-mini | 4.64 | 4.30 | 4.86 | 4.98 | 4.70 | 4,635 ms | 0.00025 |
| o3 | 4.52 | 4.28 | 4.86 | 5.00 | 4.67 | 8,682 ms | 0.00476 |
| gpt-5.4 | 4.60 | 4.32 | 4.78 | 4.96 | 4.67 | 4,607 ms | 0.00527 |
| gpt-4.1 | 4.45 | 4.33 | 4.88 | 4.96 | 4.66 | 3,850 ms | 0.00301 |
| gpt-5.4-nano | 4.58 | 4.18 | 4.80 | 5.00 | 4.64 | 2,229 ms | 0.00040 |
| gpt-5 | 4.44 | 4.26 | 4.74 | 5.00 | 4.61 | 13,258 ms | 0.00546 |
| o4-mini | 4.46 | 4.14 | 4.94 | 4.88 | 4.61 | 5,245 ms | 0.00270 |
| gpt-5-mini | 4.48 | 4.38 | 4.54 | 4.98 | 4.60 | 14,529 ms | 0.00202 |
| gpt-5.4-mini | 4.49 | 4.10 | 4.80 | 4.92 | 4.58 | 1,848 ms | 0.00117 |
| o3-mini | 4.28 | 4.04 | 4.86 | 4.92 | 4.53 | 4,883 ms | 0.00253 |
| gpt-4o | 4.40 | 3.98 | 4.72 | 4.94 | 4.51 | 3,273 ms | 0.00297 |
| gpt-4o-mini | 4.34 | 3.98 | 4.70 | 4.94 | 4.49 | 5,740 ms | 0.00023 |
| gpt-4.1-nano | 4.20 | 4.12 | 4.78 | 4.84 | 4.49 | 2,650 ms | 0.00006 |
| gpt-5-nano | 4.34 | 4.08 | 4.48 | 4.98 | 4.47 | 18,071 ms | 0.00089 |
Multi-model evaluation results · OpenAI family
GPT-4.1 Mini took the top composite (4.70/5) with the highest correctness (4.64), strong relevance (4.86), and near-perfect safety (4.98), at $0.00025/query, among the cheapest non-nano options. o3 and GPT-5.4 tied for second (4.67) but cost 19× and 21× more. GPT-5 and GPT-5 Nano came in slow (13–18 s) with no quality advantage.

03 Cost vs Quality

The smallest models won the value bracket.

Across the full sweep, premium models gave only marginal quality gains for an order-of-magnitude cost increase.

Cost vs quality scatter for all 20 models

| Model | Composite | Cost / Query | Latency | vs Cheapest |
| --- | --- | --- | --- | --- |
| gpt-4.1-nano | 4.49 | $0.000064 | 2,650 ms | 1.0× |
| gemini-2.5-flash-lite | 4.27 | $0.000105 | 1,137 ms | 1.6× |
| gpt-4o-mini | 4.49 | $0.000231 | 5,740 ms | 3.6× |
| gpt-4.1-mini | 4.70 | $0.000252 | 4,635 ms | 3.9× |
| gpt-5.4-nano | 4.64 | $0.000401 | 2,229 ms | 6.3× |
| gemini-2.5-flash | 4.43 | $0.000578 | 3,185 ms | 9.0× |
| gemini-3-flash | 4.53 | $0.001088 | 4,992 ms | 17.0× |
| gpt-5.4-mini | 4.58 | $0.001168 | 1,848 ms | 18.3× |
| gemini-2.5-pro | 4.42 | $0.002482 | 10,849 ms | 38.8× |
| o3 | 4.67 | $0.004761 | 8,682 ms | 74.4× |
| gpt-5.4 | 4.67 | $0.005271 | 4,607 ms | 82.4× |
| gpt-5 | 4.61 | $0.005456 | 13,258 ms | 85.3× |
Cost efficiency ranking (sorted by cost)
GPT-4.1 Mini is the top performer while sitting near the bottom of the cost range, beating o3 (4.67) at roughly one-nineteenth the price. From GPT-4.1 Nano to GPT-5.4, you pay 82× more for a +4% quality bump. Gemini 2.5 Pro delivers no advantage over its Flash sibling (4.42 vs 4.43) at 4.3× the price.

| Metric | Gemini (6 models) | OpenAI (14 models) |
| --- | --- | --- |
| Best Composite | 4.58 (3-pro) | 4.70 (gpt-4.1-mini) |
| Avg Composite | 4.44 | 4.57 |
| Cheapest | $0.000105 (flash-lite) | $0.000064 (4.1-nano) |
| Fastest | 1,137 ms (flash-lite) | 1,848 ms (5.4-mini) |
| Best Value | flash-lite (4.27 @ $0.0001) | 4.1-mini (4.70 @ $0.0003) |
Gemini vs OpenAI · provider comparison

04 Vector vs BM25

The retrieval layer does most of the work.

To verify vector search was actually pulling its weight, we ran a BM25 keyword baseline as a control. The gap was bigger than expected.

Vector search vs BM25 comparison

| Metric | Vector | BM25 | Δ |
| --- | --- | --- | --- |
| Composite | 4.70 | 3.14 | +50% |
| Correctness | 4.64 | 2.66 | +74% |
| Completeness | 4.30 | 2.16 | +99% |
| Relevance | 4.86 | 2.98 | +63% |
| Safety | 4.98 | 4.76 | +5% |
| Avg Chunks | 10.0 | 1.7 | 6× more |
| Hit Rate | 48% | 18% | 2.7× |
BM25 vs vector search · headline metrics

| Category | BM25 | Vector | BM25 Chunks |
| --- | --- | --- | --- |
| Adversarial | 4.50 | 4.50 | 0.0 |
| Labs | 3.27 | 4.67 | 2.3 |
| Temporal | 3.25 | 3.08 | 4.3 |
| Diagnosis | 3.08 | 4.54 | 0.5 |
| Negation | 3.00 | 5.00 | 0.4 |
| Comparison | 2.67 | 3.25 | 5.0 |
| Medications | 2.38 | 4.28 | 0.4 |
BM25 vs vector search · per-category breakdown
BM25 retrieved zero chunks on 22 of 50 queries (44%) — it completely fails on synonyms and semantic queries. “What medications is the patient taking?” doesn’t keyword-match against “Medication: Metformin · Dosage: 1000mg · Frequency: twice daily.” Vector search’s advantage isn’t generation — it’s retrieval. When BM25 does retrieve relevant context, it scores comparably.
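
To make that failure mode concrete, here is a small sketch using the rank-bm25 package on that exact example; it is illustrative, not necessarily the baseline implementation MINO used:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Toy patient-record chunks; the first one mirrors the example quoted above.
chunks = [
    "Medication: Metformin · Dosage: 1000mg · Frequency: twice daily",
    "Lab: HbA1c 7.2% (2024-01-15), reference range 4.0-5.6%",
    "Diagnosis: Type 2 Diabetes Mellitus, well controlled",
]

def tokenize(text: str) -> list[str]:
    # Naive lowercase word split: no stemming, no synonym handling.
    return text.lower().split()

bm25 = BM25Okapi([tokenize(c) for c in chunks])

query = "What medications is the patient taking?"
print(bm25.get_scores(tokenize(query)))
# All three scores come back 0.0: "medications" never literally matches
# "Medication:", so keyword retrieval surfaces nothing for this query,
# while an embedding model places the query and the Metformin chunk close together.
```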

05 Top-K Ablation

Top-15 was the sweet spot.

We swept retrieval depth from K=3 to K=20 to find where adding more context starts to hurt instead of help.

Top-K retrieval ablation study

| Top-K | Correct. | Complete. | Relev. | Safety | Composite | Hit Rate |
| --- | --- | --- | --- | --- | --- | --- |
| top-3 | 4.40 | 3.46 | 4.66 | 4.94 | 4.37 | 38% |
| top-5 | 4.32 | 3.96 | 4.80 | 4.92 | 4.50 | 42% |
| top-10 | 4.44 | 4.14 | 4.66 | 4.94 | 4.54 | 48% |
| top-15 | 4.57 | 4.39 | 4.88 | 4.96 | 4.70 | 47% |
| top-20 | 4.46 | 4.26 | 4.72 | 4.96 | 4.60 | 48% |
Top-K retrieval ablation results
Top-15 (4.70) outperformed both top-10 (4.54) and top-20 (4.60). Completeness scaled with K, from 3.46 at top-3 to 4.39 at top-15, a +27% improvement. But top-20 regressed slightly: too much context introduces noise that dilutes the model's focus. Hit rate plateaued at top-10 (48%); beyond that, additional chunks are mostly relevant but don't add new information coverage.
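
The sweep itself is a small loop over K, reusing the retrieve/generate/judge helpers and the composite_score function sketched earlier (all of which are illustrative stand-ins for MINO's pipeline, not its actual code):

```python
from statistics import mean

K_VALUES = [3, 5, 10, 15, 20]

def run_topk_ablation(samples: list[dict], model: str, judge_model: str) -> dict:
    """samples: [{'query': ..., 'reference': ...}, ...]. Returns mean composite per K."""
    results = {}
    for k in K_VALUES:
        composites = []
        for s in samples:
            chunks = retrieve(s["query"], top_k=k, threshold=0.5)
            answer = generate(s["query"], chunks, model=model)
            scores = judge(s["query"], answer, s["reference"], judge_model=judge_model)
            composites.append(composite_score(
                scores["correctness"], scores["completeness"],
                scores["relevance"], scores["safety"],
            ))
        results[f"top-{k}"] = round(mean(composites), 2)
    return results
```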

06 Per-Category

Where the pipeline breaks.

The 50 queries span 7 categories of varying difficulty. Some are nearly perfect; some are still rough.

Per-category performance breakdown

| Category | N | Correct. | Complete. | Relev. | Safety | Composite |
| --- | --- | --- | --- | --- | --- | --- |
| Negation | 5 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| Labs | 21 | 4.60 | 4.20 | 4.90 | 5.00 | 4.67 |
| Diagnosis | 6 | 4.30 | 3.80 | 5.00 | 5.00 | 4.54 |
| Adversarial | 4 | 4.80 | 3.50 | 4.80 | 5.00 | 4.50 |
| Medications | 8 | 4.10 | 3.60 | 4.40 | 5.00 | 4.28 |
| Comparison | 3 | 2.70 | 1.70 | 3.70 | 5.00 | 3.25 |
| Temporal | 3 | 3.00 | 1.70 | 2.70 | 5.00 | 3.08 |
Per-category performance · Gemini 2.5 Flash
Negation queries scored a perfect 5.00 — the model correctly identifies when data is absent. Labs and Diagnosis (4.67, 4.54) are strong because they’re factual lookups that RAG handles well. Temporal and Comparison (3.08, 3.25) are the weak spots — cross-visit trend analysis requires connecting data across multiple chunks with date ordering, which the current retrieval layer doesn’t surface well. Date-aware chunking is the next obvious improvement.

07 RAGAS Faithfulness

A counterintuitive hallucination tradeoff.

We computed RAGAS metrics across 14 models. Faithfulness — the rate at which a model stays grounded in retrieved context — turned out to be inversely correlated with composite score.
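
As a sketch, those four RAGAS metrics can be computed with the ragas package roughly as follows; the exact API has shifted between ragas versions, and the column contents and example output here are illustrative rather than MINO's configuration (ragas calls an evaluator LLM under the hood, so an API key is required):

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# One row per evaluated query: the generated answer, the retrieved chunks,
# and the reference ("ground truth") answer.
rows = {
    "question":     ["What medications is the patient taking?"],
    "answer":       ["The patient takes Metformin 1000mg twice daily."],
    "contexts":     [["Medication: Metformin · Dosage: 1000mg · Frequency: twice daily"]],
    "ground_truth": ["Metformin 1000mg, twice daily."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)   # e.g. {'faithfulness': 0.96, 'answer_relevancy': 0.87, ...}
```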

RAGAS faithfulness comparison

| Model | Faithfulness | Answer Rel. | Context Prec. | Context Recall |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash-lite | 0.962 | 0.822 | 0.707 | 0.633 |
| gemini-2.5-pro | 0.913 | 0.778 | 0.937 | 0.615 |
| gemini-2.5-flash | 0.911 | 0.819 | 0.875 | 0.630 |
| gpt-5.4-mini | 0.881 | 0.875 | 0.806 | 0.633 |
| gpt-5.4-nano | 0.870 | 0.834 | 0.815 | 0.633 |
| gpt-5.4 | 0.852 | 0.828 | 0.714 | 0.628 |
| gemini-3-pro | 0.841 | 0.783 | 0.737 | 0.628 |
| o3-mini | 0.788 | 0.859 | 0.815 | 0.645 |
| gpt-4.1-mini | 0.763 | 0.864 | 0.779 | 0.658 |
| gpt-5-mini | 0.723 | 0.610 | 0.690 | 0.620 |
| gpt-4.1-nano | 0.687 | 0.807 | 0.747 | 0.663 |
| gpt-4o | 0.683 | 0.822 | 0.764 | 0.643 |
| gpt-4.1 | 0.660 | 0.851 | 0.754 | 0.645 |
| o3 | 0.623 | 0.806 | 0.740 | 0.658 |
RAGAS metrics across 14 models
Gemini 2.5 Flash Lite scored the highest faithfulness (0.962), staying the most grounded in the retrieved context. Notably, the Gemini 2.5 models all score above 0.91 faithfulness, while the OpenAI models at the top of the composite ranking (GPT-4.1 Mini, GPT-4.1, o3) all sit below 0.77. Those models win on composite score but lose on faithfulness, suggesting they confabulate more beyond the retrieved context. For medical RAG, faithfulness may matter more than composite: Flash Lite's 0.962 with a 4.27 composite could be the safer production choice than GPT-4.1 Mini's 4.70 composite with 0.763 faithfulness.

08 Sample Size Stability

Why eight queries weren't enough.

Our pilot used 8 real patient queries. Scaling to 50 queries surfaced ranking shifts that small samples completely missed.

Evaluation stability · 8 vs 50 samples

| Model | n=8 | n=50 | Δ |
| --- | --- | --- | --- |
| gpt-4.1-mini | 4.56 | 4.70 | +0.14 |
| o3 | 4.91 | 4.67 | −0.24 |
| gpt-5.4 | 4.72 | 4.67 | −0.05 |
| gemini-3-pro | 4.19 | 4.58 | +0.39 |
| gemini-3-flash | 4.66 | 4.53 | −0.13 |
| gemini-2.5-flash | 4.53 | 4.43 | −0.10 |
| gemini-2.5-flash-lite | 4.16 | 4.27 | +0.11 |
Composite score comparison · n=8 vs n=50
o3 dropped from first to third (−0.24) — its 4.91 at n=8 was inflated by small-sample variance. GPT-4.1 Mini rose to first (+0.14), proving consistent across diverse query types. Gemini 3 Pro had the biggest jump (+0.39), having been underrepresented in the 8-sample pilot. The score range compressed from 0.75 (n=8) to 0.43 (n=50). Lesson: 50+ diverse samples are the minimum for trustworthy LLM benchmarking — anything less is anecdote.
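
A quick way to see why n=8 is unstable is to bootstrap per-query composite scores and compare confidence-interval widths at the two sample sizes; the scores below are simulated with a plausible spread, not MINO's raw per-query data:

```python
import random
from statistics import mean

random.seed(0)

def bootstrap_ci(scores: list[float], n_boot: int = 2000) -> tuple[float, float]:
    """95% bootstrap confidence interval for the mean composite score."""
    means = sorted(mean(random.choices(scores, k=len(scores))) for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Simulated per-query composites for one model, with a plausible spread.
population = [random.gauss(4.6, 0.5) for _ in range(50)]

for n in (8, 50):
    lo, hi = bootstrap_ci(population[:n])
    print(f"n={n}: 95% CI width = {hi - lo:.2f}")
# The n=8 interval is roughly sqrt(50/8) ~ 2.5x wider than the n=50 one,
# which is enough to reorder models whose scores differ by only ~0.1-0.2.
```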

09 Recommendations

What we'd actually ship.

Pick by use case. There's no single best — there's the best at quality, the best at speed, and the best at cost.

Best Overall

GPT-4.1 Mini

$0.000252 / query · 4.6 s latency · 4.70 composite

Highest composite score across all 20 models, at one of the cheapest price points tested. The default choice for production medical RAG.

Best Budget

GPT-4.1 Nano

$0.000064 / query · 2.7 s latency · 4.49 composite

Cheapest model that still scores well. 85× cheaper than GPT-5 for comparable quality.

Best Speed

Gemini 2.5 Flash Lite

$0.000105 / query · 1.1 s latency · 4.27 composite

Fastest response time of any model tested. Also the highest faithfulness (0.962). Best for real-time UX.

Not Recommended

GPT-5 · GPT-5 Nano · Gemini 2.5 Pro

Slow, expensive, no quality advantage

GPT-5 and GPT-5 Nano take 13–18 s per query with no measurable quality gain. Gemini 2.5 Pro performs identically to its Flash sibling at 4.3× the cost.