Date run: 24 June 2025


1 | Executive Summary

We benchmarked six large language models (LLMs), namely OpenAI (o3), Claude (Opus), Replit (Chat), DGi, Julius, and v0 (high), on four practical data tasks:

| Task | Goal (one-shot prompt) |
| --- | --- |
| `csv_repair` | Fix corrupted numeric / categorical fields in `iris_dirty.csv` and return a clean file. |
| `dedup` | Drop duplicate customers (case-insensitive) in `customers_dup.csv`. |
| `session_anl` | Return `{"total_users": …, "total_sessions": …}` from `sessions.csv`. |
| `endpoint_metrics` | Same JSON, but computed from a raw-event HTTP endpoint with no schema hints (2025-04-01 → 2025-06-20). |
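To make the table above concrete, here is a minimal sketch of what a passing answer for `dedup` and `session_anl` might look like. The column names (`email`, `user_id`, `session_id`) are assumptions for illustration; the benchmark's actual file schemas are not reproduced in this summary.

```python
import json
import pandas as pd

# dedup: drop duplicate customers, comparing the key case-insensitively.
# The key column name ("email") is an assumed placeholder.
customers = pd.read_csv("customers_dup.csv")
customers["_email_key"] = customers["email"].str.lower()
customers = customers.drop_duplicates(subset="_email_key").drop(columns="_email_key")
customers.to_csv("customers_clean.csv", index=False)

# session_anl: count distinct users and sessions in sessions.csv.
sessions = pd.read_csv("sessions.csv")
metrics = {
    "total_users": int(sessions["user_id"].nunique()),
    "total_sessions": int(sessions["session_id"].nunique()),
}
print(json.dumps(metrics))
```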

Key findings

[Figure: key-findings summary]

A bar chart of overall scores is shown below.

[Figure: leaderboard bar chart of overall scores]


2 | Methodology

Datasets – The first three tasks use small synthetic files (≤150 rows each). endpoint_metrics queries a live sessions endpoint; ground-truth counts were pre-computed and hidden from the models.
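As an illustration of what endpoint_metrics demands, the sketch below shows one plausible approach: pull raw events for the scoring window, guess which fields identify users and sessions, and aggregate. The URL, query parameters, and field names are all hypothetical; the benchmark does not publish the real endpoint or its payload shape.

```python
import json
import requests

# Hypothetical endpoint and query parameters; the real benchmark
# endpoint and its schema are not public.
EVENTS_URL = "https://example.com/api/events"
params = {"start": "2025-04-01", "end": "2025-06-20"}

events = requests.get(EVENTS_URL, params=params, timeout=30).json()

# With no schema hints, a solver must infer which keys identify users
# and sessions; "user_id" / "session_id" are assumed names here.
users = {e.get("user_id") for e in events if e.get("user_id") is not None}
sessions = {e.get("session_id") for e in events if e.get("session_id") is not None}

print(json.dumps({"total_users": len(users), "total_sessions": len(sessions)}))
```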