Date run: 24 June 2025


1 | Executive Summary

We benchmarked five large language models (LLMs), namely OpenAI (o3), Claude (Opus), Replit (Chat), DGi, and v0 (high), on four practical data tasks:

| Task | Goal (one-shot prompt) |
| --- | --- |
| csv_repair | Fix corrupted numeric / categorical fields in iris_dirty.csv and return a clean file. |
| dedup | Drop duplicate customers (case-insensitive) in customers_dup.csv. |
| session_anl | Return {"total_users": …, "total_sessions": …} from sessions.csv (see the sketch after this table). |
| endpoint_metrics | Same JSON, but from a raw-event HTTP endpoint with no schema hints (2025-04-01 → 2025-06-20). |
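
To make the analytics tasks concrete, here is a minimal sketch of what a correct session_anl answer computes, assuming sessions.csv has a `user_id` column and one row per session; the column name and row-per-session layout are assumptions, not part of the benchmark prompt.

```python
import csv
import json


def session_counts(path="sessions.csv"):
    """Count distinct users and sessions in a sessions CSV.

    Assumes a `user_id` column identifies users and that each row is one
    session; the actual file schema may differ.
    """
    users = set()
    total_sessions = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            users.add(row["user_id"])
            total_sessions += 1
    return {"total_users": len(users), "total_sessions": total_sessions}


if __name__ == "__main__":
    print(json.dumps(session_counts()))
```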

Key findings

A bar chart of overall scores is shown below.


2 | Methodology

Datasets – The first three tasks use small synthetic files (≤150 rows). endpoint_metrics queries a live sessions endpoint; ground-truth counts were pre-computed and hidden.
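
For the endpoint task, the sketch below shows one plausible way to aggregate raw events into the required JSON. The endpoint URL, the `timestamp` and `user_id` field names, and the one-event-per-session simplification are all illustrative assumptions, since the task deliberately provides no schema hints.

```python
import json
import urllib.request
from datetime import date

START, END = date(2025, 4, 1), date(2025, 6, 20)
ENDPOINT = "https://example.com/raw-events"  # hypothetical URL, not the benchmark's


def endpoint_metrics(url=ENDPOINT):
    """Aggregate raw events into {"total_users": ..., "total_sessions": ...}.

    Assumes the endpoint returns a JSON array of events carrying an ISO-8601
    `timestamp` and a `user_id`, and treats each in-window event as one
    session; the real payload and session definition are unspecified.
    """
    with urllib.request.urlopen(url) as resp:
        events = json.load(resp)
    users, sessions = set(), 0
    for event in events:
        day = date.fromisoformat(event["timestamp"][:10])
        if START <= day <= END:
            users.add(event["user_id"])
            sessions += 1
    return {"total_users": len(users), "total_sessions": sessions}
```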

Scoring – CSV tasks: row-level overlap with the ground-truth file. Analytics tasks: 1 − MAPE on the required keys. Composite: mean of the four task scores.
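
These rules translate directly into a few lines of code. The sketch below is one reading of them, where "row-level overlap" is taken to mean the fraction of ground-truth rows reproduced exactly; the benchmark's precise overlap definition may differ.

```python
def row_overlap(truth_rows, model_rows):
    """CSV tasks: fraction of ground-truth rows that appear in the model output.

    Rows are compared as exact-match tuples (an illustrative assumption).
    """
    truth = set(map(tuple, truth_rows))
    returned = set(map(tuple, model_rows))
    return len(truth & returned) / len(truth) if truth else 0.0


def analytics_score(truth, answer):
    """Analytics tasks: 1 - MAPE over the required keys, floored at 0."""
    errors = [abs(answer.get(key, 0) - value) / value for key, value in truth.items()]
    return max(0.0, 1.0 - sum(errors) / len(errors))


def composite(task_scores):
    """Composite: unweighted mean of the four task scores."""
    return sum(task_scores) / len(task_scores)
```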

Runs – One-shot prompt per task, default model settings, no retries.

Note: Scores of 0 likely mean the run failed or the output fell outside the accepted bounds.