Runtime Overview

Overview

A typical runtime consists of the following parts:

Compiled

The Compiled class is responsible for initializing and managing a device.

Compiled

Compiled(
    device: str,
    allocator: Allocator,
    renderer: Optional[Renderer],
    compiler: Optional[Compiler],
    runtime,
    graph=None,
)

Methods:

  • synchronize

    Synchronize all pending operations on the device.

synchronize

synchronize()

Synchronize all pending operations on the device.

This method ensures that all previously queued operations on the device have been completed before proceeding.
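To make the contract of synchronize concrete, here is a minimal sketch using a toy device that queues operations and runs them only when synchronized. ToyDevice, enqueue, and pending are hypothetical names invented for this illustration; they are not part of tinygrad, and a real backend would wait on a hardware queue rather than run Python callables.

```python
from collections import deque
from typing import Callable

class ToyDevice:
  """Hypothetical stand-in for a Compiled subclass; not tinygrad code."""
  def __init__(self, name: str):
    self.name = name
    self.pending: deque[Callable[[], None]] = deque()  # queued, not-yet-run operations

  def enqueue(self, op: Callable[[], None]) -> None:
    # A real backend would submit work to a hardware queue here.
    self.pending.append(op)

  def synchronize(self) -> None:
    # Return only once every previously queued operation has completed.
    while self.pending:
      self.pending.popleft()()

results: list[int] = []
dev = ToyDevice("TOY")
dev.enqueue(lambda: results.append(1))
dev.enqueue(lambda: results.append(2))
before = list(results)  # snapshot taken before synchronizing
dev.synchronize()
```

In this sketch, results stays empty until synchronize is called, which mirrors the guarantee described above: after the call returns, no queued work is outstanding.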

Allocator

The Allocator class is responsible for managing memory on the device. A subclass, LRUAllocator, caches allocated buffers so they can be reused, which optimizes performance.

Allocator

Methods:

_alloc

_alloc(size: int, options: BufferSpec)

_copyin

_copyin(dest, src: memoryview)

_copyout

_copyout(dest: memoryview, src)

_free

_free(opaque, options: BufferSpec)

alloc

alloc(size: int, options: Optional[BufferSpec] = None)

free

free(
    opaque, size: int, options: Optional[BufferSpec] = None
)
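The method list above suggests a split between public entry points (alloc, free) and backend hooks (_alloc, _free, _copyin, _copyout). The following self-contained sketch illustrates that shape with a host-memory backend. SketchAllocator and BytearrayAllocator are hypothetical classes written for this example, the way alloc delegates to _alloc is an assumption based on the naming, and options is typed as Any because BufferSpec is not reproduced here.

```python
from typing import Any, Optional

class SketchAllocator:
  """Hypothetical base mirroring the method names listed above; not tinygrad's class."""
  def alloc(self, size: int, options: Optional[Any] = None):
    assert size > 0, f"allocation size must be positive, got {size}"
    return self._alloc(size, options)
  def free(self, opaque, size: int, options: Optional[Any] = None):
    self._free(opaque, options)
  # Backend hooks that a concrete allocator overrides.
  def _alloc(self, size: int, options): raise NotImplementedError
  def _free(self, opaque, options): pass
  def _copyin(self, dest, src: memoryview): raise NotImplementedError
  def _copyout(self, dest: memoryview, src): raise NotImplementedError

class BytearrayAllocator(SketchAllocator):
  """Host-memory backend: the opaque handle is simply a bytearray."""
  def _alloc(self, size: int, options): return bytearray(size)
  def _copyin(self, dest, src: memoryview): dest[:len(src)] = src       # host -> "device"
  def _copyout(self, dest: memoryview, src): dest[:] = src[:len(dest)]  # "device" -> host

allocator = BytearrayAllocator()
buf = allocator.alloc(4)
allocator._copyin(buf, memoryview(b"\x01\x02\x03\x04"))
out = bytearray(4)
allocator._copyout(memoryview(out), buf)
allocator.free(buf, 4)
```

After the round trip, out holds the same four bytes that were copied in. A real device backend would replace the bytearray with a device pointer or handle and perform the copies through its driver API.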

LRUAllocator

LRUAllocator()

Bases: Allocator

The LRUAllocator is responsible for caching buffers. It ensures that buffers are not freed until absolutely necessary, which optimizes performance.

Attributes:

cache instance-attribute

cache: dict[tuple[int, Optional[BufferSpec]], Any] = (
    defaultdict(list)
)

alloc

alloc(size: int, options: Optional[BufferSpec] = None)

free

free(
    opaque: Any,
    size: int,
    options: Optional[BufferSpec] = None,
)

free_cache

free_cache()
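The caching idea behind the cache attribute and the alloc, free, and free_cache methods can be sketched as follows. This is a simplified, self-contained illustration and not tinygrad's implementation: SketchLRUAllocator and the backend_allocs counter are hypothetical, the backing store is a plain bytearray, and options is typed as Any in place of BufferSpec.

```python
from collections import defaultdict
from typing import Any, Optional

class SketchLRUAllocator:
  """Hypothetical, self-contained illustration of the caching idea; not tinygrad's code."""
  def __init__(self):
    # Freed buffers, keyed by (size, options), wait here for reuse.
    self.cache: dict[tuple[int, Optional[Any]], list] = defaultdict(list)
    self.backend_allocs = 0  # counts real allocations, for demonstration only

  def _alloc(self, size: int, options):
    self.backend_allocs += 1
    return bytearray(size)

  def alloc(self, size: int, options: Optional[Any] = None):
    cached = self.cache[(size, options)]
    if cached: return cached.pop()    # reuse a previously freed buffer
    return self._alloc(size, options)

  def free(self, opaque: Any, size: int, options: Optional[Any] = None):
    # Do not release the buffer; park it in the cache instead.
    self.cache[(size, options)].append(opaque)

  def free_cache(self):
    # Drop every parked buffer so the memory can actually be released.
    for opaques in self.cache.values(): opaques.clear()

lru = SketchLRUAllocator()
a = lru.alloc(16)
lru.free(a, 16)
b = lru.alloc(16)   # served from the cache, no new backend allocation
```

Here the second alloc returns the very same object that was freed, so only one backend allocation takes place. Calling free_cache empties the cache, after which the next alloc of that size goes to the backend again.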

Program

A Program is created for each loaded program and is responsible for executing it on the device. As an example, here is a CPUProgram implementation that loads a program and runs it.

CPUProgram

CPUProgram(name: str, lib: bytes)

Source code in tinygrad/device.py
def __init__(self, name:str, lib:bytes):
  if sys.platform == "win32":
    PAGE_EXECUTE_READWRITE = 0x40
    MEM_COMMIT =  0x1000
    MEM_RESERVE = 0x2000
    ctypes.windll.kernel32.VirtualAlloc.restype = ctypes.c_void_p
    self.mem = ctypes.windll.kernel32.VirtualAlloc(ctypes.c_void_p(0), ctypes.c_size_t(len(lib)), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE)
    ctypes.memmove(self.mem, lib, len(lib))
    ctypes.windll.kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    proc = ctypes.windll.kernel32.GetCurrentProcess()
    ctypes.windll.kernel32.FlushInstructionCache(ctypes.c_void_p(proc), ctypes.c_void_p(self.mem), ctypes.c_size_t(len(lib)))
    self.fxn = ctypes.CFUNCTYPE(None)(self.mem)
  else:
    from mmap import mmap, PROT_READ, PROT_WRITE, PROT_EXEC, MAP_ANON, MAP_PRIVATE
    # On apple silicon with SPRR enabled (it always is in macos) RWX pages are unrepresentable: https://blog.svenpeter.dev/posts/m1_sprr_gxf/
    # MAP_JIT allows us to easily flip pages from RW- to R-X and vice versa. It is a noop on intel cpus. (man pthread_jit_write_protect_np)
    self.mem = mmap(-1, len(lib), MAP_ANON | MAP_PRIVATE | (MAP_JIT if OSX else 0), PROT_READ | PROT_WRITE | PROT_EXEC)

    if OSX: CPUProgram.rt_lib.pthread_jit_write_protect_np(False)
    self.mem.write(lib)
    if OSX: CPUProgram.rt_lib.pthread_jit_write_protect_np(True)

    # __clear_cache isn't a normal libc function, but a compiler support routine found in libgcc_s for gcc and compiler-rt for clang.
    # libgcc_s comes as shared library but compiler-rt is only a bunch of static library archives which we can't directly load, but fortunately
    # it somehow found its way into libSystem on macos (likely because it used __builtin_clear_cache) and libgcc_s is ~always present on linux
    # Using ["name"] instead of .name because otherwise name is getting mangled: https://docs.python.org/3.12/reference/expressions.html#index-5
    CPUProgram.rt_lib["__clear_cache"](ctypes.c_void_p(mv_address(self.mem)), ctypes.c_void_p(mv_address(self.mem) + len(lib)))

    self.fxn = ctypes.CFUNCTYPE(None)(mv_address(self.mem))

atomic_lib class-attribute instance-attribute

atomic_lib = (
    CDLL(find_library("atomic"))
    if platform == "linux"
    else None
)

fxn instance-attribute

fxn = CFUNCTYPE(None)(mem)

mem instance-attribute

mem = VirtualAlloc(
    c_void_p(0),
    c_size_t(len(lib)),
    MEM_COMMIT | MEM_RESERVE,
    PAGE_EXECUTE_READWRITE,
)

rt_lib class-attribute instance-attribute

rt_lib = CDLL(
    find_library("System" if OSX else "kernel32")
    if OSX or platform == "win32"
    else "libgcc_s.so.1"
)

__call__

__call__(*bufs, vals=(), wait=False)
Source code in tinygrad/device.py
def __call__(self, *bufs, vals=(), wait=False):
  args = list(bufs) + list(vals)
  # NOTE: replace this by --target={host's triple}-elf in clang args once we only support macos sequoia and later.
  # Apple relaxes abi requirement for stack arguments to always be at least 8 byte aligned on arm64
  # https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
  # This hack is required because clang/llvm bug doesn't allow us to just use {host's triple}+'-elf' (relocation failures)
  # The bug was fixed in https://github.com/llvm/llvm-project/commit/454cc36630296262cdb6360b60f90a64a97f7f1a but was only backported to xcode 16+
  if platform.machine() == "arm64" and OSX: args = args[:8] + [ctypes.c_int64(a) if isinstance(a, int) else a for a in args[8:]]
  return cpu_time_execution(lambda: self.fxn(*args), enable=wait)
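The argument handling on arm64 macOS in __call__ can be looked at in isolation. The sketch below extracts the list expression into a hypothetical helper, widen_stack_ints, which is not part of tinygrad. One reading of the comments above is that the first eight arguments are left unchanged because they are passed in registers, while later plain Python ints are wrapped in ctypes.c_int64 so that each one is passed as a full 8-byte value on the stack; treat that explanation as an interpretation rather than a statement from the source.

```python
import ctypes

def widen_stack_ints(args: list, num_reg_args: int = 8) -> list:
  """Wrap plain Python ints beyond the first `num_reg_args` arguments in c_int64.

  Hypothetical helper that isolates the list expression used in __call__ above.
  """
  head, tail = args[:num_reg_args], args[num_reg_args:]
  return head + [ctypes.c_int64(a) if isinstance(a, int) else a for a in tail]

args = list(range(10))
widened = widen_stack_ints(args)
```

With ten integer arguments, the first eight stay plain ints and the last two become c_int64 instances holding the same values. Arguments that are not ints, such as buffer pointers, pass through untouched.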

__del__

__del__()
Source code in tinygrad/device.py
def __del__(self):
  if sys.platform == 'win32': ctypes.windll.kernel32.VirtualFree(ctypes.c_void_p(self.mem), ctypes.c_size_t(0), 0x8000) #0x8000 - MEM_RELEASE

Compiler

The Compiler class compiles the output of the Renderer into a device-specific format.

Compiler

Compiler(cachekey: Optional[str] = None)

Source code in tinygrad/device.py
def __init__(self, cachekey:Optional[str]=None): self.cachekey = None if getenv("DISABLE_COMPILER_CACHE") else cachekey

cachekey instance-attribute

cachekey = (
    None if getenv("DISABLE_COMPILER_CACHE") else cachekey
)

compile

compile(src: str) -> bytes
Source code in tinygrad/device.py
def compile(self, src:str) -> bytes: return src.encode()   # NOTE: empty compiler is the default

compile_cached

compile_cached(src: str) -> bytes
Source code in tinygrad/device.py
def compile_cached(self, src:str) -> bytes:
  if self.cachekey is None or (lib := diskcache_get(self.cachekey, src)) is None:
    assert not getenv("ASSERT_COMPILE"), f"tried to compile with ASSERT_COMPILE set\n{src}"
    lib = self.compile(src)
    if self.cachekey is not None: diskcache_put(self.cachekey, src, lib)
  return lib
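The control flow of compile_cached can be tried out without tinygrad by swapping the disk cache for an in-memory dictionary. In the sketch below, SketchCompiler, its _cache dictionary, and the compile_calls counter are hypothetical stand-ins: the dictionary replaces diskcache_get and diskcache_put, and the ASSERT_COMPILE and DISABLE_COMPILER_CACHE checks are left out for brevity.

```python
from typing import Optional

class SketchCompiler:
  """Hypothetical stand-in; an in-memory dict replaces diskcache_get/diskcache_put."""
  _cache: dict[tuple[str, str], bytes] = {}

  def __init__(self, cachekey: Optional[str] = None):
    self.cachekey = cachekey
    self.compile_calls = 0  # demonstration counter

  def compile(self, src: str) -> bytes:
    self.compile_calls += 1
    return src.encode()  # same pass-through behavior as the default compile above

  def compile_cached(self, src: str) -> bytes:
    # Compile when caching is disabled (cachekey is None) or on a cache miss.
    if self.cachekey is None or (lib := self._cache.get((self.cachekey, src))) is None:
      lib = self.compile(src)
      if self.cachekey is not None: self._cache[(self.cachekey, src)] = lib
    return lib

cached = SketchCompiler(cachekey="sketch_backend")
first = cached.compile_cached("void k() {}")
second = cached.compile_cached("void k() {}")   # cache hit, compile() is not called again

uncached = SketchCompiler()                      # cachekey=None disables caching
uncached.compile_cached("void k() {}")
uncached.compile_cached("void k() {}")
```

With a cachekey set, the second call is served from the cache and compile runs once. Without a cachekey, every call compiles again, which matches the short-circuit in the original condition.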

disassemble

disassemble(lib: bytes)
Source code in tinygrad/device.py
def disassemble(self, lib:bytes): pass