Date run: 24 June 2025
We benchmarked six large language models (LLMs), namely OpenAI o3, Claude Opus, Replit Chat, DGi, Julius, and v0 (high), on four practical data tasks:
| Task | Goal (one-shot prompt) |
|---|---|
| `csv_repair` | Fix corrupted numeric/categorical fields in `iris_dirty.csv` and return a clean file. |
| `dedup` | Drop duplicate customers (case-insensitive) in `customers_dup.csv`. |
| `session_anl` | Return `{"total_users": …, "total_sessions": …}` from `sessions.csv`. |
| `endpoint_metrics` | Same JSON, but computed from a raw-event HTTP endpoint with no schema hints (2025-04-01 → 2025-06-20). |
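
As a rough illustration of what a correct answer looks like for the file-based tasks, here is a minimal pandas sketch of the `dedup` and `session_anl` logic. The column names (`email`, `user_id`, `session_id`) are assumptions for illustration; the benchmark files' actual schemas are not documented here.

```python
import json

import pandas as pd

# dedup: drop duplicate customers, matching case-insensitively.
# Assumes the duplicate key is an "email" column (hypothetical).
customers = pd.read_csv("customers_dup.csv")
customers = customers.loc[~customers["email"].str.lower().duplicated()]
customers.to_csv("customers_clean.csv", index=False)

# session_anl: count distinct users and sessions.
# Assumes "user_id" and "session_id" columns (hypothetical).
sessions = pd.read_csv("sessions.csv")
print(json.dumps({
    "total_users": int(sessions["user_id"].nunique()),
    "total_sessions": int(sessions["session_id"].nunique()),
}))
```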
Key findings – A bar chart of overall scores is shown below.

Datasets – The first three tasks use small synthetic files (≤150 rows). `endpoint_metrics` queries a live sessions endpoint; ground-truth counts were pre-computed and hidden from the models.
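
For a sense of how such ground truth could be pre-computed, here is a minimal sketch that tallies distinct users and sessions from a raw-event endpoint over the evaluation window. The URL, query parameters, and event fields below are all assumptions for illustration, not the benchmark's actual endpoint.

```python
import json
from datetime import date

import requests

# Hypothetical endpoint and payload shape; the real benchmark endpoint
# and its schema are intentionally undocumented.
URL = "https://example.com/events"
START, END = date(2025, 4, 1), date(2025, 6, 20)

users, sessions = set(), set()
resp = requests.get(URL, params={"from": START.isoformat(), "to": END.isoformat()})
resp.raise_for_status()
for event in resp.json():
    # Assumes ISO-8601 timestamps; re-filter to the evaluation window
    # in case the endpoint ignores the query parameters.
    day = date.fromisoformat(event["timestamp"][:10])
    if START <= day <= END:
        users.add(event["user_id"])
        sessions.add(event["session_id"])

print(json.dumps({"total_users": len(users), "total_sessions": len(sessions)}))
```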