Date run: 24 June 2025


1 | Executive Summary

We benchmarked five large language models (LLMs), namely OpenAI (o3), Claude (Opus), Replit (Chat), DGi, and v0 (high), on four practical data tasks:

| Task | Goal (one-shot prompt) |
| --- | --- |
| csv_repair | Fix corrupted numeric / categorical fields in iris_dirty.csv and return a clean file. |
| dedup | Drop duplicate customers (case-insensitive) in customers_dup.csv. |
| session_anl | Return {"total_users": …, "total_sessions": …} from sessions.csv (see the sketch after this table). |
| endpoint_metrics | Same JSON, but from a raw-event HTTP endpoint with no schema hints (2025-04-01 → 2025-06-20). |
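
To make the analytics tasks concrete, here is a minimal sketch of what a correct session_anl answer computes, assuming sessions.csv has a `user_id` column and one row per session; the column name and row-per-session layout are assumptions, not part of the benchmark prompt.

```python
import csv
import json


def session_counts(path="sessions.csv"):
    """Count distinct users and sessions in a sessions CSV.

    Assumes a `user_id` column identifies users and that each row is one
    session; the actual file schema may differ.
    """
    users = set()
    total_sessions = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            users.add(row["user_id"])
            total_sessions += 1
    return {"total_users": len(users), "total_sessions": total_sessions}


if __name__ == "__main__":
    print(json.dumps(session_counts()))
```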

Key findings

A bar chart of overall scores is shown below.


2 | Methodology

Datasets – The first three tasks use small synthetic files (≤150 rows). endpoint_metrics queries a live sessions endpoint; ground-truth counts were pre-computed and hidden.
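
For the endpoint task, the sketch below shows one plausible way to aggregate raw events into the required JSON. The endpoint URL, the `timestamp` and `user_id` field names, and the one-event-per-session simplification are all illustrative assumptions, since the task deliberately provides no schema hints.

```python
import json
import urllib.request
from datetime import date

START, END = date(2025, 4, 1), date(2025, 6, 20)
ENDPOINT = "https://example.com/raw-events"  # hypothetical URL, not the benchmark's


def endpoint_metrics(url=ENDPOINT):
    """Aggregate raw events into {"total_users": ..., "total_sessions": ...}.

    Assumes the endpoint returns a JSON array of events carrying an ISO-8601
    `timestamp` and a `user_id`, and treats each in-window event as one
    session; the real payload and session definition are unspecified.
    """
    with urllib.request.urlopen(url) as resp:
        events = json.load(resp)
    users, sessions = set(), 0
    for event in events:
        day = date.fromisoformat(event["timestamp"][:10])
        if START <= day <= END:
            users.add(event["user_id"])
            sessions += 1
    return {"total_users": len(users), "total_sessions": sessions}
```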

Scoring – CSV tasks: row-level overlap with the ground-truth file. Analytics tasks: 1 − MAPE on the required keys. Composite: mean of the four task scores.
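
These rules translate directly into a few lines of code. The sketch below is one reading of them, where "row-level overlap" is taken to mean the fraction of ground-truth rows reproduced exactly; the benchmark's precise overlap definition may differ.

```python
def row_overlap(truth_rows, model_rows):
    """CSV tasks: fraction of ground-truth rows that appear in the model output.

    Rows are compared as exact-match tuples (an illustrative assumption).
    """
    truth = set(map(tuple, truth_rows))
    returned = set(map(tuple, model_rows))
    return len(truth & returned) / len(truth) if truth else 0.0


def analytics_score(truth, answer):
    """Analytics tasks: 1 - MAPE over the required keys, floored at 0."""
    errors = [abs(answer.get(key, 0) - value) / value for key, value in truth.items()]
    return max(0.0, 1.0 - sum(errors) / len(errors))


def composite(task_scores):
    """Composite: unweighted mean of the four task scores."""
    return sum(task_scores) / len(task_scores)
```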

Runs – One-shot prompt per task, default model settings, no retries.

Note: Scores of 0 likely mean the run failed or the output fell outside the accepted bounds.