Creating a Python Leaderboard

This section describes how to create Python-based leaderboards, which expect Python submissions (though submissions may still inline-compile CUDA code). To create leaderboards on a Discord server, the Discord bot requires you to have a Leaderboard Admin or Leaderboard Creator role; these can be assigned by admins or owners of the server. This section is also useful for participants who want to understand how their submissions are evaluated.

As mentioned before, each leaderboard specifies a set of GPUs to evaluate on, chosen by the creator. You can think of each (task, GPU) pair as essentially its own independent leaderboard: for example, a softmax kernel on an NVIDIA T4 may perform very differently than on an NVIDIA H100. Leaderboard creators can therefore select exactly which GPUs matter for their leaderboard -- for example, they may only care about NVIDIA A100 and NVIDIA H100 performance.

To create a leaderboard you can run:

/leaderboard create {leaderboard_name: str} {deadline: str} {task_zip: .zipped folder}
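For example, a hypothetical invocation (the exact argument entry depends on Discord's slash-command UI) might look like:

/leaderboard create leaderboard_name:identity-py deadline:2025-12-31 task_zip:identity-py.zip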

After running this, similar to leaderboard submissions, a UI window will pop up asking which GPUs the leaderboard creator wants to enable submissions on. In the remainder of this section, we detail how the unzipped task_zip folder should be structured. Examples of these folders can be found here.
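For reference, an unzipped task folder for the identity-py example (assuming the files listed in its task.yml, plus the optional template.py) might look like:

identity-py/
├── task.yml
├── task.py
├── utils.py
├── reference.py
├── eval.py
└── template.py

Note that submission.py is not included; it is supplied by each participant at submission time.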

The task.yml specification

When a user submits a kernel, it is launched inside a leaderboard-specific evaluation harness, and we provide several copyable examples of leaderboard folders in our GitHub. The relevant files are defined in a task.yml -- for example, in the identity-py leaderboard, the YAML looks as follows:

task.yml
# What files are involved in leaderboard evaluation
files:
  - {"name": "submission.py", "source": "@SUBMISSION@"}
  - {"name": "task.py", "source": "task.py"}
  - {"name": "utils.py", "source": "utils.py"}
  - {"name": "reference.py", "source": "reference.py"}
  - {"name": "eval.py", "source": "eval.py"}

# Leaderboard language
lang: "py"

# Description of leaderboard task
description: |
  Identity kernel in Python.

# Compilation flag for what to target as main
config:
  main: "eval.py"

# An example to provide to participants for writing a leaderboard submission
templates:
  Python: "template.py"

tests:
  - {"size": 128, "seed": 5236}
  - {"size": 129, "seed": 1001}
  - {"size": 256, "seed": 5531}

benchmarks:
  - {"size": 1024, "seed": 54352}
  - {"size": 4096, "seed": 6256}
  - {"size": 16384, "seed": 6252}
  - {"size": 65536, "seed": 125432}

This config file controls all relevant details about how participants will interact with the leaderboard. We will discuss each parameter in detail. Some of the simpler keys are:

  • lang controls the language of the leaderboard (py or cu)
  • config.main controls which file is treated as main. It usually should not be edited and should remain eval.py.
  • templates is an optional way to provide users with an example template for a kernel submission.
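The templates key points at a starter file such as template.py. As a minimal sketch (assuming, as in our examples, that the harness calls a custom_kernel function defined in the submission), such a template might look like:

from task import input_t, output_t


def custom_kernel(data: input_t) -> output_t:
    # Participants replace this body with their own implementation.
    return data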

Required files in the leaderboard .zip

Other than task.yml, the files key controls the list of files that the evaluation harness expects. The leaderboard creator has to include all of these files, but we provide examples to make this much easier. The name key is the filename under which the file is imported locally, and the source key is the name of the actual file in the folder.

  • submission.py: This is a special entry (denoted by the @SUBMISSION@ value) that stands in for the user-submitted kernel (it should not exist in the .zip).
  • task.py ⭐: Specifies constants and the input / output types (e.g. arguments) that the leaderboard kernel should expect.
  • utils.py: Extra utilities that can be used for leaderboard logic.
  • reference.py ⭐: Leaderboard-specific logic for generating input data, the reference kernel, and correctness logic to compare user and reference kernel outputs.
  • eval.py: Runs the user and reference kernels to check correctness, then measures the runtime of the user kernel if it passes the correctness checks. Usually does not need to be edited.

In short, most leaderboard creators will only have to edit task.py and reference.py; we go over how to edit these in more detail below.
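To make the division of labor concrete, here is a simplified sketch of the evaluation flow (our actual eval.py also handles timing, multiple test cases, and result reporting; the custom_kernel entry point is an assumption carried over from our examples):

from submission import custom_kernel
from reference import generate_input, check_implementation


def run_case(**spec) -> str:
    # Build the input for this case from the task.yml arguments,
    # run the user's kernel, and return any correctness error message.
    data = generate_input(**spec)
    output = custom_kernel(data)
    return check_implementation(data, output)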

A simple task.py and reference.py example

To keep this simple, a leaderboard creator really only needs to specify:

  1. The input / output types of the desired leaderboard kernel.
  2. A generator that generates input data with specific properties.
  3. An actual example reference kernel that serves as ground truth.
  4. A comparison function to check for correctness of a user submitted kernel against the reference. We allow leaderboard creators full flexibility to specify things like margin of error.

We recommend following our examples for simplicity, but our task definition allows leaderboard creators to fully modify their evaluation harness. In the remaining sections, we will go over how to use our pre-defined examples. In all of our examples, the task.py file handles (1) and part of (2), while the reference.py file handles (2,3,4). Below, we provide the task.py for the identity-py leaderboard.

task.py
from typing import TypedDict
import torch


# Define input / output types for the kernel
input_t = torch.Tensor
output_t = input_t


# Define modifiable arguments for input data generation
class TestSpec(TypedDict):
    size: int
    seed: int

The example above specifies aliases for the input (input_t) and output (output_t) types of the kernel task. It also specifies a TypedDict called TestSpec, which specifies what arguments are passed into the input data generator at runtime. We distinguish between test cases and benchmark cases: the former are smaller cases for users to check correctness and debug their code, while the latter are the actual timed leaderboard cases. Using this TestSpec specification, we provide the cases in task.yml and fill in the arguments, as shown below:

task.yml
...
tests:
  - {"size": 128, "seed": 5236}
  - {"size": 129, "seed": 1001}
  - {"size": 256, "seed": 5531}

benchmarks:
  - {"size": 1024, "seed": 54352}
  - {"size": 4096, "seed": 6256}
  - {"size": 16384, "seed": 6252}
  - {"size": 65536, "seed": 125432}

Finally, we fill in details for the input data generator, reference kernel, and correctness checker for identity-py below:

reference.py
import torch
from task import input_t, output_t
from utils import verbose_allclose


# Input data generator. Arguments must match TestSpec in task.py
def generate_input(size: int, seed: int) -> input_t:
    gen = torch.Generator(device='cuda')
    gen.manual_seed(seed)
    data = torch.empty(size, device='cuda', dtype=torch.float16)
    data.uniform_(0, 1, generator=gen)
    return data


# Reference kernel. Must take `input_t` and produce `output_t`
def ref_kernel(data: input_t) -> output_t:
    return data


# Returns an error message (empty string if the output is correct)
def check_implementation(data: input_t, output: output_t) -> str:
    expected = ref_kernel(data)
    reasons = verbose_allclose(output, expected)
    if len(reasons) > 0:
        return "Mismatch found! Custom implementation doesn't match reference: " + reasons[0]
    return ''

As mentioned earlier, based on task.yml and task.py, each test case will pass a specified set of arguments to generate_input(...) to produce the input data for that case. We recommend accepting a seed argument so inputs are randomized in a reproducible manner. Furthermore, check_implementation returns a string, which gives leaderboard creators the flexibility to provide error messages that help participants debug.
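Because check_implementation is ordinary Python, creators can implement whatever comparison policy they want, including an explicit margin of error. A hypothetical drop-in replacement inside reference.py, using torch.allclose with illustrative tolerances, might look like:

import torch
from task import input_t, output_t


def check_implementation(data: input_t, output: output_t) -> str:
    # ref_kernel is defined earlier in reference.py.
    expected = ref_kernel(data)
    # rtol / atol are illustrative; pick tolerances suited to your kernel.
    if not torch.allclose(output, expected, rtol=1e-3, atol=1e-3):
        return "Mismatch found! Output differs from reference beyond tolerance."
    return ''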

Remark. Leaderboard creators have the flexibility to edit the logic in eval.py, which uses all of these functions to evaluate and time the user-submitted kernels. The examples above assume the use of our eval.py implementation, but it can be modified if desired.

Deleting a Leaderboard

If you have sufficient permissions on the server, you can also delete leaderboards with:

/leaderboard delete {leaderboard_name: str}
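For example, a hypothetical invocation following the signature above might be:

/leaderboard delete leaderboard_name:identity-py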

This command will display a UI window with a list of available leaderboards. Select the leaderboard you want to delete from the list. Once confirmed, the leaderboard and all associated submissions will be permanently removed. Please use this command with caution, as it also deletes the leaderboard's history.

Existing Leaderboard Examples

We provide examples of leaderboards here that can be quickly copied and modified for new leaderboards. Most leaderboard creators should only need to modify these files.