HCQ Compatible Runtime¤

Overview¤

The defining feature of HCQ-compatible runtimes is how they interact with devices. In HCQ, all interaction with a device goes through command queues, a hardware-friendly mechanism that issues commands directly to the device and bypasses the overhead of runtimes such as HIP or CUDA. Runtimes that use the HCQ API also benefit from its optimizations and features, including HCQGraph and built-in profiling.

Command Queues¤

To interact with a device, create an HWQueue. Some methods are required on every queue: timestamp and the synchronization methods signal and wait. Others depend on whether the queue is a compute queue or a copy queue.

For example, the following Python code enqueues wait, exec, and signal commands and submits them to an HCQ-compatible device:

HWQueue().wait(signal_to_wait, value_to_wait) \
         .exec(program, args_state, global_dims, local_dims) \
         .signal(signal_to_fire, value_to_fire) \
         .submit(your_device)

Each runtime should implement the required functions that are defined in the HWQueue classes.
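
The chaining style used above works because each enqueue method returns the queue itself. The sketch below is a toy model of that pattern only: RecordingQueue, its list-based device, and the string placeholders are hypothetical names for illustration and are not part of the HCQ API. A real backend would encode hardware commands instead of recording tuples.

```python
from typing import Any, List, Tuple


class RecordingQueue:
    """Toy stand-in for an HWQueue: records commands instead of encoding them for hardware."""

    def __init__(self) -> None:
        self.cmds: List[Tuple[str, tuple]] = []

    def _enqueue(self, name: str, *args: Any) -> "RecordingQueue":
        self.cmds.append((name, args))
        return self  # returning the queue is what makes method chaining possible

    def wait(self, signal: Any, value: int) -> "RecordingQueue":
        return self._enqueue("wait", signal, value)

    def exec(self, prg: Any, args_state: Any, global_size: tuple, local_size: tuple) -> "RecordingQueue":
        return self._enqueue("exec", prg, args_state, global_size, local_size)

    def signal(self, signal: Any, value: int) -> "RecordingQueue":
        return self._enqueue("signal", signal, value)

    def submit(self, dev: list) -> None:
        # Assumption: the "device" is a plain list that receives the recorded commands.
        dev.extend(self.cmds)


device_log: list = []
queue = RecordingQueue().wait("sig_a", 1).exec("kernel", None, (4, 1, 1), (8, 1, 1)).signal("sig_b", 2)
queue.submit(device_log)
print([name for name, _ in device_log])
```

The commands reach the toy device in the order they were enqueued, which mirrors the ordering the wait, exec, signal sequence relies on.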

HWQueue ¤

HWQueue()

Bases: Generic[SignalType, DeviceType, ProgramType, ArgsStateType]

A base class for hardware command queues in the HCQ (Hardware Command Queue) API. Both compute and copy queues should have the following commands implemented.

Methods:

  • signal

    Enqueues a signal command which sets the signal to the given value, ensuring all previous operations are completed.

  • wait

    Enqueues a wait command which halts execution until the signal is greater than or equal to a specific value.

  • timestamp

    Enqueues a timestamp command which records the current time in a signal after all previously enqueued commands are completed.

  • update_signal

    Updates a previously queued signal command.

  • update_wait

    Updates a previously queued wait command.

  • bind

    Associates the queue with a specific device for optimized execution.

  • submit

    Submits the command queue to a specific device for execution.

  • memory_barrier

    Enqueues a memory barrier command to ensure memory coherence between agents. Only on compute queues.

  • exec

    Enqueues an execution command for a kernel program. Only on compute queues.

  • update_exec

    Updates a previously queued execution command. Only on compute queues.

  • copy

    Enqueues a copy command to transfer data. Only on copy queues.

  • update_copy

    Updates a previously queued copy command. Only on copy queues.

signal ¤

signal(signal: SignalType, value: int)

Enqueues a signal command which sets the signal to the given value, ensuring all previous operations are completed.

Parameters:

  • signal (SignalType) –

    The signal to set

  • value (int) –

    The value to set the signal to

wait ¤

wait(signal: SignalType, value: int)

Enqueues a wait command which halts execution until the signal is greater than or equal to a specific value.

Parameters:

  • signal (SignalType) –

    The signal to wait on

  • value (int) –

    The value to wait for

timestamp ¤

timestamp(signal: SignalType)

Enqueues a timestamp command which records the current time in a signal after all previously enqueued commands are completed.

Parameters:

  • signal (SignalType) –

    The signal to store the timestamp

update_signal ¤

update_signal(
    cmd_idx: int,
    signal: Optional[SignalType] = None,
    value: Optional[int] = None,
)

Updates a previously queued signal command.

Parameters:

  • cmd_idx (int) –

    Index of the signal command to update

  • signal (Optional[SignalType], default: None ) –

    New signal to set (if None, keeps the original)

  • value (Optional[int], default: None ) –

    New value to set (if None, keeps the original)

update_wait ¤

update_wait(
    cmd_idx: int,
    signal: Optional[SignalType] = None,
    value: Optional[int] = None,
)

Updates a previously queued wait command.

Parameters:

  • cmd_idx (int) –

    Index of the wait command to update

  • signal (Optional[SignalType], default: None ) –

    New signal to wait on (if None, keeps the original)

  • value (Optional[int], default: None ) –

    New value to wait for (if None, keeps the original)

bind ¤

bind(dev: DeviceType)

Associates the queue with a specific device for optimized execution.

This optional method allows backend implementations to tailor the queue for efficient use on the given device. When implemented, it can eliminate the need to copy queues into the device, thereby enhancing performance.

Parameters:

  • dev (DeviceType) –

    The target device for queue optimization.

Note

Implementing this method is optional but recommended for performance gains.

submit ¤

submit(dev: DeviceType)

Submits the command queue to a specific device for execution.

Parameters:

  • dev (DeviceType) –

    The device to submit the queue to

memory_barrier ¤

memory_barrier()

Enqueues a memory barrier command to ensure memory coherence between agents. Only on compute queues.

exec ¤

exec(
    prg: ProgramType,
    args_state: ArgsStateType,
    global_size: Tuple[int, int, int],
    local_size: Tuple[int, int, int],
)

Enqueues an execution command for a kernel program. Only on compute queues.

Parameters:

  • prg (ProgramType) –

    The program to execute

  • args_state (ArgsStateType) –

    The args state to execute program with

  • global_size (Tuple[int, int, int]) –

    The global work size

  • local_size (Tuple[int, int, int]) –

    The local work size

update_exec ¤

update_exec(
    cmd_idx: int,
    global_size: Optional[Tuple[int, int, int]] = None,
    local_size: Optional[Tuple[int, int, int]] = None,
)

Updates a previously queued execution command. Only on compute queues.

Parameters:

  • cmd_idx (int) –

    Index of the execution command to update

  • global_size (Optional[Tuple[int, int, int]], default: None ) –

    New global work size (if None, keeps the original)

  • local_size (Optional[Tuple[int, int, int]], default: None ) –

    New local work size (if None, keeps the original)

copy ¤

copy(dest: int, src: int, copy_size: int)

Enqueues a copy command to transfer data. Only on copy queues.

Parameters:

  • dest (int) –

    The destination of the copy

  • src (int) –

    The source of the copy

  • copy_size (int) –

    The size of data to copy

update_copy ¤

update_copy(
    cmd_idx: int,
    dest: Optional[int] = None,
    src: Optional[int] = None,
)

Updates a previously queued copy command. Only on copy queues.

Parameters:

  • cmd_idx (int) –

    Index of the copy command to update

  • dest (Optional[int], default: None ) –

    New destination of the copy (if None, keeps the original)

  • src (Optional[int], default: None ) –

    New source of the copy (if None, keeps the original)

Implementing custom commands¤

To implement custom commands in the queue, use the @hcq_command decorator for your command implementations.

hcq_command ¤

hcq_command(
    func: Callable[Concatenate[QueueType, P], None]
) -> Callable[Concatenate[QueueType, P], QueueType]

Decorator for HWQueue commands. Enables command indexing and stores metadata for command updates.

For example
  @hcq_command
  def command_method(self, ...): ...
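
The type signature suggests that the decorator wraps a method returning None into one that returns the queue, which is what allows chaining. The following is a minimal sketch of how such a decorator could behave, not the actual implementation: toy_hcq_command, ToyQueue, the nop command, and the metadata fields are all hypothetical.

```python
import functools
from typing import Any, Callable, Dict, List


def toy_hcq_command(func: Callable[..., None]) -> Callable[..., Any]:
    """Hypothetical decorator: stores per-command metadata, then returns the queue."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Remember where this command starts so it could be located and patched later.
        self.cmds_meta.append({"name": func.__name__, "offset": len(self.words),
                               "args": args, "kwargs": kwargs})
        func(self, *args, **kwargs)
        return self  # the wrapped method returns the queue, enabling chaining

    return wrapper


class ToyQueue:
    def __init__(self) -> None:
        self.words: List[int] = []                # stand-in for the encoded command stream
        self.cmds_meta: List[Dict[str, Any]] = []  # one metadata entry per command index

    @toy_hcq_command
    def nop(self, count: int = 1) -> None:
        # Hypothetical custom command that emits `count` zero words.
        self.words.extend([0] * count)


q = ToyQueue().nop().nop(count=2)
print(len(q.words), [m["offset"] for m in q.cmds_meta])
```

Each call gets its own index in cmds_meta, so an update method could use that index to find the command's offset in the stream.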

HCQ Compatible Device¤

The HCQCompiled class defines the API for HCQ-compatible devices. This class serves as an abstract base class that device-specific implementations should inherit from and implement.

HCQCompiled ¤

HCQCompiled(
    device: str,
    allocator: HCQAllocator,
    renderer: Renderer,
    compiler: Compiler,
    runtime,
    signal_t: Type[SignalType],
    comp_queue_t: Type[HWQueue],
    copy_queue_t: Optional[Type[HWQueue]],
)

Bases: Compiled, Generic[SignalType]

A base class for devices compatible with the HCQ (Hardware Command Queue) API.

Signals¤

Signals are device-dependent structures used for synchronization and timing in HCQ-compatible devices. They should be designed to record both a value and a timestamp within the same signal. HCQ-compatible backend implementations should use HCQSignal as a base class.

HCQSignal ¤

HCQSignal(value: int = 0, is_timeline: bool = False)

Methods:

  • wait

    Waits until the signal is greater than or equal to a specific value.

Attributes:

value property writable ¤

value: int

timestamp property ¤

timestamp: Decimal

Get the timestamp field of the signal.

This property provides read-only access to the signal's timestamp.

Returns:

  • Decimal

    The timestamp in microseconds.

wait ¤

wait(
    value: int,
    timeout: int = getenv("HCQDEV_WAIT_TIMEOUT_MS", 30000),
)

Waits until the signal is greater than or equal to a specific value.

Parameters:

  • value (int) –

    The value to wait for.

  • timeout (int, default: getenv('HCQDEV_WAIT_TIMEOUT_MS', 30000) ) –

    Maximum time to wait in milliseconds. Defaults to 30000 ms (30 s) unless the HCQDEV_WAIT_TIMEOUT_MS environment variable overrides it.
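
The wait semantics (block until the value is greater than or equal to the target, give up after the timeout) can be sketched with a simple polling loop. This is an illustration under stated assumptions, not the backend code: ToySignal is hypothetical, the value lives in ordinary Python memory rather than device-visible memory, and raising RuntimeError on timeout is an assumption.

```python
import time


class ToySignal:
    """Toy signal whose value is a plain Python int (a real signal lives in device-visible memory)."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def wait(self, value: int, timeout: int = 30000) -> None:
        """Poll until self.value >= value; raise if `timeout` milliseconds pass first."""
        start = time.perf_counter()
        while self.value < value:
            if (time.perf_counter() - start) * 1000 > timeout:
                # Assumption: a timeout is reported as a RuntimeError.
                raise RuntimeError(f"wait timed out after {timeout} ms: value={self.value}, expected >= {value}")
            time.sleep(0.001)


sig = ToySignal(value=5)
sig.wait(3)  # returns immediately because 5 >= 3
```

Because the comparison is greater than or equal, a waiter is released even if the signal has already moved past the value it asked for.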

The following Python code demonstrates the usage of signals:

signal = your_device.signal_t()

HWQueue().timestamp(signal) \
         .signal(signal, value_to_fire) \
         .submit(your_device)

signal.wait(value_to_fire)
signaled_value = signal.value # should be the same as `value_to_fire`
timestamp = signal.timestamp

Synchronization signals¤

Each HCQ-compatible device must allocate two signals for global synchronization. These signals are passed to the HCQCompiled base class during initialization: an active timeline signal, self.timeline_signal, and a shadow timeline signal, self._shadow_timeline_signal, which helps handle signal value overflow. For more on synchronization, see the Synchronization section.

HCQ Compatible Allocator¤

The HCQAllocator base class simplifies allocator logic by leveraging command queue abstractions. This class efficiently handles copy and transfer operations, leaving only the alloc and free functions to be implemented by individual backends.

HCQAllocator ¤

HCQAllocator(
    dev: DeviceType,
    batch_size: int = 2 << 20,
    batch_cnt: int = 32,
)

Bases: LRUAllocator, Generic[DeviceType]

A base allocator class compatible with the HCQ (Hardware Command Queue) API.

This class implements basic copy operations following the HCQ API, utilizing both types of HWQueue.

Methods:

_alloc ¤

_alloc(size: int, options: BufferSpec) -> HCQBuffer

HCQ Allocator Result Protocol¤

Backends must adhere to the HCQBuffer protocol when returning allocation results.

HCQBuffer ¤

Bases: Protocol

Attributes:

size instance-attribute ¤

size: int

va_addr instance-attribute ¤

va_addr: int
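
Because HCQBuffer is a Protocol, a backend's allocation result only needs to expose the listed attributes and does not have to inherit from a particular class. The sketch below illustrates that structural typing with standard typing tools; BufferLike and ToyBuffer are hypothetical names, and the extra handle field is an assumption standing in for backend-specific data.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class BufferLike(Protocol):
    """Hypothetical protocol mirroring the two documented attributes."""
    va_addr: int
    size: int


@dataclass
class ToyBuffer:
    """Hypothetical backend allocation result; it does not inherit from BufferLike."""
    va_addr: int
    size: int
    handle: str = "backend-specific"  # extra fields are allowed alongside the required ones


buf = ToyBuffer(va_addr=0x1000, size=4096)
print(isinstance(buf, BufferLike))
```

The runtime check only verifies that the attributes exist; a static type checker is what verifies their types.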

HCQ Compatible Program¤

HCQProgram is a base class for defining programs compatible with HCQ-enabled devices. It provides a flexible framework for handling different argument layouts (see HCQArgsState).

HCQProgram ¤

HCQProgram(
    args_state_t: Type[HCQArgsState],
    dev: DeviceType,
    name: str,
    kernargs_alloc_size: int,
)

Bases: Generic[DeviceType]

Methods:

  • __call__

    Enqueues the program for execution with the given arguments and dimensions.

  • fill_kernargs

    Fills arguments for the kernel, optionally allocating space from the device if kernargs_ptr is not provided.

__call__ ¤

__call__(
    *bufs: HCQBuffer,
    global_size: Tuple[int, int, int] = (1, 1, 1),
    local_size: Tuple[int, int, int] = (1, 1, 1),
    vals: Tuple[int, ...] = (),
    wait: bool = False
) -> Optional[float]

Enqueues the program for execution with the given arguments and dimensions.

Parameters:

  • bufs (HCQBuffer, default: () ) –

    Buffer arguments to execute the kernel with.

  • global_size (Tuple[int, int, int], default: (1, 1, 1) ) –

    Specifies the global work size for kernel execution (equivalent to CUDA's grid size).

  • local_size (Tuple[int, int, int], default: (1, 1, 1) ) –

    Specifies the local work size for kernel execution (equivalent to CUDA's block size).

  • vals (Tuple[int, ...], default: () ) –

    Value arguments to execute the kernel with.

  • wait (bool, default: False ) –

    If True, waits for the kernel to complete execution.

Returns:

  • Optional[float]

    Execution time of the kernel if 'wait' is True, otherwise None.

fill_kernargs ¤

fill_kernargs(
    bufs: Tuple[HCQBuffer, ...],
    vals: Tuple[int, ...] = (),
    kernargs_ptr: Optional[int] = None,
) -> HCQArgsState

Fills arguments for the kernel, optionally allocating space from the device if kernargs_ptr is not provided.

Parameters:

  • bufs (Tuple[HCQBuffer, ...]) –

    Buffers to be written to kernel arguments.

  • vals (Tuple[int, ...], default: () ) –

    Values to be written to kernel arguments.

  • kernargs_ptr (Optional[int], default: None ) –

    Optional pointer to pre-allocated kernel arguments memory.

Returns:

  • HCQArgsState

    Arguments state with the given buffers and values set for the program.

Arguments State¤

HCQArgsState is a base class for managing the argument state for HCQ programs. Backend implementations should create a subclass of HCQArgsState to manage arguments for the given program.

HCQArgsState ¤

HCQArgsState(
    ptr: int,
    prg: ProgramType,
    bufs: Tuple[HCQBuffer, ...],
    vals: Tuple[int, ...] = (),
)

Bases: Generic[ProgramType]

Methods:

update_buffer ¤

update_buffer(index: int, buf: HCQBuffer)

update_var ¤

update_var(index: int, val: int)

Lifetime: The HCQArgsState is passed to HWQueue.exec and is guaranteed not to be freed until HWQueue.submit for the same queue is called.
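
As an illustration of what a backend subclass might do, the sketch below writes buffer addresses and values into a block of kernel-argument memory and patches individual slots in place. Everything about it is an assumption made for the example: ToyArgsState is hypothetical, the memory is a bytearray rather than device memory, the layout (64-bit addresses followed by 32-bit values) is invented, and update_buffer takes a raw address instead of an HCQBuffer.

```python
import struct
from typing import Tuple


class ToyArgsState:
    """Toy args state: 8-byte buffer addresses first, then 4-byte values (hypothetical layout)."""

    def __init__(self, mem: bytearray, ptr: int, buf_addrs: Tuple[int, ...], vals: Tuple[int, ...] = ()) -> None:
        self.mem, self.ptr, self.n_bufs = mem, ptr, len(buf_addrs)
        for i, addr in enumerate(buf_addrs):
            self.update_buffer(i, addr)
        for i, val in enumerate(vals):
            self.update_var(i, val)

    def update_buffer(self, index: int, addr: int) -> None:
        # Overwrite only the slot holding the address of buffer `index`.
        struct.pack_into("<Q", self.mem, self.ptr + 8 * index, addr)

    def update_var(self, index: int, val: int) -> None:
        # Values are stored after all buffer addresses.
        struct.pack_into("<i", self.mem, self.ptr + 8 * self.n_bufs + 4 * index, val)


kernargs = bytearray(64)
state = ToyArgsState(kernargs, ptr=16, buf_addrs=(0x1000, 0x2000), vals=(7,))
state.update_buffer(1, 0x3000)  # swap the second buffer without rewriting the rest
print(struct.unpack_from("<QQi", kernargs, 16))
```

Patching single slots is what makes it cheap to reuse the same argument memory across runs with different buffers or values.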

Synchronization¤

HCQ-compatible devices use a global timeline signal to synchronize all operations. This mechanism ensures proper ordering and completion of tasks across the device. By convention, self.timeline_value holds the next value to signal, so to wait for all previous operations on the device to complete, wait for the value self.timeline_value - 1. The following Python code shows the typical pattern for ordering an execution after earlier operations on the device:

HWQueue().wait(your_device.timeline_signal, your_device.timeline_value - 1) \
         .exec(...) \
         .signal(your_device.timeline_signal, your_device.timeline_value) \
         .submit(your_device)
your_device.timeline_value += 1

# Optionally wait for execution
your_device.timeline_signal.wait(your_device.timeline_value - 1)
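
The timeline convention can be traced with a toy model in which the signal is a plain integer and work completes synchronously. ToyDevice and submit_work are hypothetical, and the starting values (signal at 0, next value 1) are an assumption chosen to match the convention described above.

```python
from typing import List


class ToyDevice:
    def __init__(self) -> None:
        self.timeline_signal = 0  # last value signalled by completed work (assumed start)
        self.timeline_value = 1   # next value to signal (assumed start)


def submit_work(dev: ToyDevice, name: str, log: List[str]) -> None:
    # wait: everything submitted earlier must have signalled timeline_value - 1
    assert dev.timeline_signal >= dev.timeline_value - 1
    log.append(name)                           # exec: the work itself
    dev.timeline_signal = dev.timeline_value   # signal: mark this submission as done
    dev.timeline_value += 1                    # host side: advance to the next value


dev, log = ToyDevice(), []
submit_work(dev, "kernel_a", log)
submit_work(dev, "kernel_b", log)
print(dev.timeline_signal, dev.timeline_value)
```

After each submission the signal equals timeline_value - 1, which is why waiting on that value is enough to know all earlier work has finished.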

HCQGraph¤

HCQGraph is a core feature that implements GraphRunner for HCQ-compatible devices. HCQGraph builds static HWQueues for all operations on each device. To reduce enqueue time, each run updates only the necessary parts of the queues through the queues' update APIs instead of rebuilding them. Optionally, queues can implement the bind API, which allows further optimization by eliminating the need to copy the queues into the device ring.
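
The build-once, update-per-run idea can be sketched with a toy queue whose commands are dictionaries. This is not how HCQGraph is implemented: ToyComputeQueue and its builds counter are hypothetical, and the exec signature is simplified by dropping args_state.

```python
from typing import List, Optional, Tuple

Dims = Tuple[int, int, int]


class ToyComputeQueue:
    def __init__(self) -> None:
        self.cmds: List[dict] = []
        self.builds = 0  # counts how many commands were encoded from scratch

    def exec(self, prg: str, global_size: Dims, local_size: Dims) -> "ToyComputeQueue":
        self.cmds.append({"prg": prg, "global_size": global_size, "local_size": local_size})
        self.builds += 1
        return self

    def update_exec(self, cmd_idx: int, global_size: Optional[Dims] = None,
                    local_size: Optional[Dims] = None) -> "ToyComputeQueue":
        cmd = self.cmds[cmd_idx]
        if global_size is not None:
            cmd["global_size"] = global_size  # None means "keep the original"
        if local_size is not None:
            cmd["local_size"] = local_size
        return self


# Build the static queue once.
q = ToyComputeQueue().exec("k0", (16, 1, 1), (32, 1, 1)).exec("k1", (8, 1, 1), (32, 1, 1))

# On each later run, patch only the command whose launch dimensions changed.
for n in (4, 2):
    q.update_exec(1, global_size=(n, 1, 1))

print(q.builds, q.cmds[1]["global_size"])
```

Two runs with different dimensions touch one field of one command, while the number of commands built stays at two.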