Date run: 24 June 2025
We benchmarked five large language models (LLMs), OpenAI (o3), Claude (Opus), Replit (Chat), DGi, and v0 (high), on four practical data tasks:
| Task | Goal (one-shot prompt) |
| --- | --- |
| csv_repair | Fix corrupted numeric / categorical fields in iris_dirty.csv and return a clean file. |
| dedup | Drop duplicate customers (case-insensitive) in customers_dup.csv. |
| session_anl | Return {"total_users": …, "total_sessions": …} from sessions.csv (a sketch follows the table). |
| endpoint_metrics | Same JSON, but computed from a raw-event HTTP endpoint with no schema hints (2025-04-01 → 2025-06-20). |
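For reference, a minimal sketch of what a correct session_anl answer could look like, assuming sessions.csv contains `user_id` and `session_id` columns (the column names are assumptions, not part of the prompt); the same JSON shape is expected for endpoint_metrics:

```python
import json

import pandas as pd

# Assumed layout of sessions.csv: one row per session, with a user identifier.
# The column names "user_id" and "session_id" are illustrative only.
sessions = pd.read_csv("sessions.csv")

result = {
    "total_users": int(sessions["user_id"].nunique()),        # distinct users
    "total_sessions": int(sessions["session_id"].nunique()),  # distinct sessions
}
print(json.dumps(result))  # e.g. {"total_users": 42, "total_sessions": 120}
```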
Key findings
A bar chart of overall scores is shown below.
Datasets – The first three tasks use small synthetic files (≤150 rows). endpoint_metrics queries a live sessions endpoint; ground-truth counts were pre-computed and hidden.
Scoring – CSV tasks: row-level overlap with the reference output. Analytics tasks: 1 − MAPE on the required keys. Composite: mean of the four task scores (a scoring sketch follows these notes).
Runs – One-shot prompt per task, default model settings, no retries.
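To make the scoring concrete, here is a small sketch of how the analytics score (1 − MAPE) and the composite could be computed; the function names, the clipping to [0, 1], and the example numbers are illustrative assumptions, not the actual harness code:

```python
def analytics_score(predicted: dict, truth: dict) -> float:
    """1 - MAPE over the required keys, clipped to [0, 1] (clipping is an assumption)."""
    errors = [abs(predicted.get(k, 0) - v) / abs(v) for k, v in truth.items()]
    mape = sum(errors) / len(errors)
    return max(0.0, 1.0 - mape)


def composite(task_scores: list[float]) -> float:
    """Composite score: mean of the four per-task scores."""
    return sum(task_scores) / len(task_scores)


# Example: slightly-off user count, exact session count -> 0.99.
print(analytics_score({"total_users": 98, "total_sessions": 400},
                      {"total_users": 100, "total_sessions": 400}))
# Example composite over four hypothetical task scores -> 0.6975.
print(composite([1.0, 0.8, 0.99, 0.0]))
```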
Note: Scores of 0 likely mean the attempt failed or the output fell outside the accepted bounds.