Large language models (LLMs) are revolutionizing how developers code and how they learn to code. For seasoned and junior developers alike, today’s state-of-the-art models can generate Python scripts, React-based websites, and more. In the future, powerful AI models will assist developers in writing high-performance GPU code. This raises an important question: how do we determine whether an LLM can handle the intricacies of CUDA programming?
ComputeEval is an open-source framework and dataset for evaluating LLMs on CUDA code generation. The dataset tests whether an LLM can generate correct CUDA code across core areas of parallel programming, such as memory management and thread synchronization, while the framework simplifies evaluating the generated code.
This post looks at how ComputeEval works as an evaluation framework, presents the results of our evaluation of state-of-the-art models, and discusses what those results mean for the future of AI-assisted GPU development.
A new benchmark for high-performance GPU code generation
ComputeEval aims to provide a trusted, community-driven benchmark specifically for CUDA and high-performance GPU code. It is inspired by benchmarks for other languages, such as HumanEval, but targets a domain where precision, parallelism, and performance are critical.
ComputeEval consists of the following:
- Handcrafted real-world CUDA problems: Our team has curated a set of challenges that cover everything from kernel launches and thread management to memory layouts and shared memory utilization. Our initial release features 128 CUDA problems, serving as the foundation for evaluating how well LLMs tackle GPU programming challenges.
- Functional correctness tests: Code is provided to run functional correctness tests in a sandboxed environment. This means you can safely execute generated code and verify that it works as intended.
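To give a flavor of what a dataset entry exercises, here is a minimal, hypothetical sketch in the same spirit (not an actual ComputeEval problem): complete a shared-memory block reduction kernel, with a host-side check standing in for a functional correctness test. All names in the snippet, such as `blockSumKernel`, are illustrative only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative task: compute per-block partial sums of an array
// using shared memory and thread synchronization.
__global__ void blockSumKernel(const float* in, float* blockSums, int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 if out of range).
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block; __syncthreads() keeps threads in step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) {
        blockSums[blockIdx.x] = sdata[0];
    }
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> hostIn(n, 1.0f);
    float *devIn = nullptr, *devBlockSums = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devBlockSums, blocks * sizeof(float));
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, threads, threads * sizeof(float)>>>(devIn, devBlockSums, n);

    std::vector<float> hostBlockSums(blocks);
    cudaMemcpy(hostBlockSums.data(), devBlockSums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Functional correctness check: the partial sums must add up to n * 1.0f.
    double total = 0.0;
    for (float s : hostBlockSums) total += s;
    printf("%s (expected %d, got %.0f)\n",
           total == static_cast<double>(n) ? "PASS" : "FAIL", n, total);

    cudaFree(devIn);
    cudaFree(devBlockSums);
    return 0;
}
```

A self-contained sketch like this compiles with nvcc and runs directly; the actual problems and their sandboxed tests live in the repository linked below.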
To see the code, visit the nvidia/compute-eval GitHub repo. Find the dataset on Hugging Face.
Model performance
Our team evaluated several leading LLMs on ComputeEval to establish baseline performance metrics and understand the current state of AI-assisted CUDA programming (Table 1).
| Model | pass@1 | pass@3 |
|---|---|---|
| OpenAI o3-mini | 0.61 | 0.74 |
| Anthropic Claude 3.7 Sonnet | 0.54 | 0.60 |
| Llama 3.1 405B | 0.40 | 0.55 |
| Google Gemini 2.0 Flash Thinking | 0.37 | 0.52 |

Table 1. Baseline pass@k results for leading LLMs on ComputeEval
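For context, pass@k is the standard functional correctness metric popularized by HumanEval-style benchmarks: a problem counts as solved if at least one of k generated samples passes all of its tests. Assuming ComputeEval follows the usual unbiased estimator, with n samples generated per problem of which c pass, the metric is:

```latex
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```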
These results highlight that while LLMs can generate valid CUDA code in basic cases, even the best models still fail to produce correct CUDA code for complex problems and, in some cases, don’t follow basic instructions that they handle correctly in other languages, indicating substantial room for improvement in this domain.
Get started
ComputeEval isn’t just about measuring how well current models perform; it’s about setting a standard that drives continuous improvement in AI-assisted CUDA programming. Our team wants to push the limits of what LLMs can do in high-performance computing. As an open-source platform, ComputeEval is a resource that the community can trust and build on. By presenting challenges that span expert topics across CUDA-X libraries and GPU architectures, ComputeEval also encourages models to adopt modern CUDA best practices by default.
In this first release, you’ll find 128 carefully designed CUDA challenges. But we’re not stopping there. We are already collecting more problems with our internal teams and partners, and we will open source those as well. Future updates will include refined tests and more detailed metrics that capture not only correctness but also performance.
Seasoned HPC professionals, students, and hobbyists are invited to contribute by running the benchmark on additional models, submitting new CUDA and CUDA library problems through pull requests, and providing general feedback in GitHub Issues. Your contributions will help shape the future of this benchmark and make accelerated computing better for everyone. To get started, visit the nvidia/compute-eval GitHub repo and find the dataset on Hugging Face.