
Intro

The tinygrad framework has four pieces

  • a PyTorch-like frontend.
  • a scheduler which breaks the compute into kernels.
  • a lowering engine which converts ASTs into code that can run on the accelerator.
  • an execution engine which can run that code.

There is a good set of tutorials by Di Zhu that go over tinygrad internals.

There's also a doc describing speed.

Frontend¤

Everything in Tensor is syntactic sugar around constructing a graph of UOps.

The UOp graph specifies the compute in terms of low-level tinygrad ops. Not all UOps will actually become realized. There are two types of UOp, base and view: a base computes into a contiguous buffer, while a view is a reinterpretation of another buffer's data. The inputs to a base can be either base or view; the input to a view can only be a single base.
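The base/view invariant can be pictured with a toy model (all names here are invented for illustration; the real UOp class is considerably richer):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyUOp:
    """Toy stand-in for a UOp: a 'base' owns compute into a contiguous
    buffer, a 'view' reinterprets exactly one base."""
    kind: str                       # "base" or "view"
    src: tuple["ToyUOp", ...] = ()

    def __post_init__(self):
        if self.kind == "view":
            # a view's only input is a single base
            assert len(self.src) == 1 and self.src[0].kind == "base"

# a base may take bases or views as inputs; a view wraps one base
a = ToyUOp("base")
v = ToyUOp("view", (a,))            # e.g. a reshape/permute of a
b = ToyUOp("base", (a, v))          # compute consuming both
```

Constructing a view of a view violates the invariant and is rejected in this sketch, mirroring the rule stated above.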

Scheduling¤

The scheduler converts the graph of UOps into a list of ExecItem. One ExecItem corresponds to one kernel on the GPU, and the scheduler is responsible for breaking the large compute graph into subgraphs that each fit in a single kernel. The ast field specifies what compute to run, and bufs specifies what buffers to run it on.
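The splitting idea can be sketched as a graph walk that fuses ops into the current kernel until it hits a node whose result must land in its own buffer (the real logic, with far more cases, lives in the scheduler; all names below are invented):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    src: tuple["Node", ...] = ()
    realized: bool = False          # result must get its own buffer

def split_into_kernels(out: Node) -> list[list[str]]:
    """Partition the graph ending at `out` into per-kernel op lists."""
    kernels: list[list[str]] = []
    seen: set[str] = set()
    def fuse(n: Node, cur: list[str]):
        for s in n.src:
            if s.realized: schedule(s)   # boundary: its own kernel
            else: fuse(s, cur)           # fused into the current kernel
        cur.append(n.name)
    def schedule(n: Node):
        if n.name in seen: return
        seen.add(n.name)
        cur: list[str] = []
        fuse(n, cur)
        kernels.append(cur)
    schedule(out)
    return kernels

x = Node("x", realized=True)
relu = Node("relu", (Node("mul", (x,)),))
out = Node("sum", (relu,), realized=True)
```

Here `mul`, `relu`, and `sum` fuse into one kernel, while the realized input `x` forms its own boundary.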

ExecItem dataclass ¤

ExecItem(
    ast: UOp,
    bufs: list[Buffer | None] = list(),
    metadata: tuple[Metadata, ...] = (),
    fixedvars: dict[str, int] = dict(),
    prg: Runner | None = None,
)

Methods:

  • lower

    Populate self.prg by lowering the AST.

lower ¤

lower()

Populate self.prg by lowering the AST.

Source code in tinygrad/engine/realize.py
def lower(self):
  """Populate self.prg by lowering the AST."""
  if self.prg is not None: return self
  try: self.prg = cast(Runner, si_lowerer.rewrite(self.ast, self.bufs))
  except Exception as e:
    if DEBUG >= 2:
      print(f"error lowering {self.ast.op}")
      print("tensor operations:")
      pprint.pprint(self.metadata, indent=2)
    raise e
  return self

Lowering¤

The code in realize lowers an ExecItem by populating its prg field with a Runner.

run_schedule ¤

run_schedule(
    schedule: list[ExecItem],
    var_vals: dict[str, int] | None = None,
    do_update_stats=True,
)
Source code in tinygrad/engine/realize.py
def run_schedule(schedule:list[ExecItem], var_vals:dict[str, int]|None=None, do_update_stats=True):
  while len(schedule):
    ei = schedule.pop(0).lower()
    if len(capturing) and CAPTURING: capturing[0].add(ei)
    if VALIDATE_WITH_CPU and ei.ast.op is Ops.SINK:
      # copy in allocated buffers from the GPU
      bufs = [b for b in ei.bufs if b is not None]
      nb: list[Buffer|None] = [Buffer("CPU", b.size, b.dtype) for b in bufs]
      for cpu_b, gpu_b in zip(nb, bufs):
        if cpu_b is not None and gpu_b.is_allocated(): cpu_b.ensure_allocated().copyin(gpu_b.as_memoryview())

      # run on GPU
      ei.run(var_vals, do_update_stats=do_update_stats)

      # validate the output buffers match (NOTE: this is assuming the output is buffer 0)
      with Context(BEAM=0): ExecItem(ei.ast, nb, ei.metadata, ei.fixedvars).run(var_vals, do_update_stats=do_update_stats)
      import numpy as np
      assert nb[0] is not None
      np.testing.assert_allclose(bufs[0].numpy(), nb[0].numpy(), rtol=1e-3, atol=1e-3)
    else:
      ei.run(var_vals, do_update_stats=do_update_stats)

There's a ton of complexity hidden behind this, see the codegen/ directory.

First we lower the AST to UOps, which is a linear list of the compute to be run. This is where the BEAM search happens.

Then we render the UOps into code with a Renderer, then we compile the code to binary with a Compiler.
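The render step can be illustrated with a toy renderer that turns a linearized list of elementwise ops into C-like source, which a Compiler would then turn into a binary (everything here is invented for illustration; the real Renderer subclasses are far more involved):

```python
def render(name: str, uops: list[tuple[str, str, str, str]]) -> str:
    """Render a linear op list into C-like source.
    Each uop is (dest, operator, lhs, rhs) over float arrays indexed by gid."""
    body = "\n".join(f"  {d}[gid] = {l}[gid] {op} {r}[gid];" for d, op, l, r in uops)
    args = sorted({n for d, _, l, r in uops for n in (d, l, r)})
    sig = ", ".join(f"float* {a}" for a in args)
    return f"void {name}({sig}, int gid) {{\n{body}\n}}"

# render out = (a + b) * c as two linearized ops
src = render("E_add_mul", [("t0", "+", "a", "b"), ("out", "*", "t0", "c")])
```

The key structural point is the same as in tinygrad: by the time rendering happens, the compute is a flat list of ops, so emitting code is a straightforward translation.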

Execution¤

Execution creates an ExecItem, which has a run method.

ExecItem dataclass ¤

ExecItem(
    ast: UOp,
    bufs: list[Buffer | None] = list(),
    metadata: tuple[Metadata, ...] = (),
    fixedvars: dict[str, int] = dict(),
    prg: Runner | None = None,
)

Methods:

  • lower

    Populate self.prg by lowering the AST.

  • run

Attributes:

ast instance-attribute ¤

ast: UOp

bufs class-attribute instance-attribute ¤

bufs: list[Buffer | None] = field(default_factory=list)

fixedvars class-attribute instance-attribute ¤

fixedvars: dict[str, int] = field(default_factory=dict)

metadata class-attribute instance-attribute ¤

metadata: tuple[Metadata, ...] = ()

prg class-attribute instance-attribute ¤

prg: Runner | None = None

lower ¤

lower()

Populate self.prg by lowering the AST.

Source code in tinygrad/engine/realize.py
def lower(self):
  """Populate self.prg by lowering the AST."""
  if self.prg is not None: return self
  try: self.prg = cast(Runner, si_lowerer.rewrite(self.ast, self.bufs))
  except Exception as e:
    if DEBUG >= 2:
      print(f"error lowering {self.ast.op}")
      print("tensor operations:")
      pprint.pprint(self.metadata, indent=2)
    raise e
  return self

run ¤

run(
    _var_vals: dict[str, int] | None = None,
    wait=False,
    jit=False,
    do_update_stats=True,
) -> float | None
Source code in tinygrad/engine/realize.py
def run(self, _var_vals:dict[str, int]|None=None, wait=False, jit=False, do_update_stats=True) -> float|None:
  if self.prg is None: self.lower()
  assert self.prg is not None
  var_vals = self.fixedvars if _var_vals is None else (_var_vals|self.fixedvars)
  # reorder bufs to match program globals if needed
  _bufs = [self.bufs[i] for i in self.prg.p.globals] if isinstance(self.prg, CompiledRunner) else self.bufs
  bufs = [unwrap(x) for x in _bufs] if jit else [unwrap(x).ensure_allocated() for x in _bufs]
  if PROFILE:
    payload = {"metadata":self.metadata, "var_vals":var_vals, "bufs":[b.trace_num for b in bufs], "name":self.prg.display_name}
    payload["outputs"], payload["inputs"] = (self.prg.p.outs, self.prg.p.ins) if isinstance(self.prg, CompiledRunner) else ([0], [1])
    cpu_events.append(ProfilePointEvent(self.prg.device, "exec", len(cpu_events), payload))
  et = self.prg(bufs, var_vals, wait=wait or DEBUG >= 2)
  if do_update_stats:
    GlobalCounters.kernel_count += 1
    GlobalCounters.global_ops += (op_est:=sym_infer(self.prg.estimates.ops, var_vals))
    GlobalCounters.global_mem += (mem_est:=sym_infer(self.prg.estimates.mem, var_vals))
    if et is not None: GlobalCounters.time_sum_s += et
    if DEBUG >= 2:
      lds_est = sym_infer(self.prg.estimates.lds, var_vals)
      mem_est = min(mem_est, lds_est)   # there can't be more memory accessed than loads/stores. remove this when symbolic is fixed
      header_color = 'magenta' if jit else ('green' if self.prg.first_run else None)
      ptm = colored(time_to_str(et, w=9), "yellow" if et > 0.01 else None) if et is not None else ""
      flops, membw, ldsbw = op_est/(et or 1e-20), mem_est/(et or 1e-20), lds_est/(et or 1e-20)
      flops_str = f"{flops*1e-9:7.0f} GFLOPS" if flops < 1e14 else colored(f"{flops*1e-12:7.0f} TFLOPS", 'green')
      mem_str = f"{membw*1e-9:4.0f}|{ldsbw*1e-9:<6.0f} GB/s" if membw < 1e13 and ldsbw < 1e15 else \
        colored(f"{membw*1e-12:4.0f}|{ldsbw*1e-12:<6.0f} TB/s", 'green')
      print(f"{colored(f'*** {self.prg.device[:7]:7s} {GlobalCounters.kernel_count:4d}', header_color)}"+
        f" {self.prg.display_name+' '*(46-ansilen(self.prg.display_name))} arg {len(bufs):2d} mem {GlobalCounters.mem_used/1e9:6.2f} GB"+
        ("" if et is None else f" tm {ptm}/{GlobalCounters.time_sum_s*1e3:9.2f}ms ({flops_str} {mem_str})")+
        f" {[repr(m) if TRACEMETA >= 2 else str(m) for m in self.metadata] if self.metadata else ''}")
    self.prg.first_run = False
  return et
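One detail worth noting in run above: the merge `_var_vals | self.fixedvars` uses Python's dict union, where the right-hand side wins on key collisions, so an ExecItem's fixedvars always override caller-supplied values:

```python
# In `a | b`, values from b take precedence on duplicate keys,
# so fixedvars override any caller-supplied var_vals.
_var_vals = {"i": 4, "n": 10}
fixedvars = {"n": 32}
var_vals = _var_vals | fixedvars
assert var_vals == {"i": 4, "n": 32}
```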

Lists of ExecItem can be condensed into a single ExecItem with the Graph API (rename to Queue?)
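The condensation idea can be sketched as folding several items into one item whose run replays them all, like a pre-recorded command queue submitted in a single call (classes below are invented; the real mechanism is the Graph API used by the JIT):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    run: Callable[[], None]

def condense(items: list[Item]) -> Item:
    """Fold many items into one whose run() replays them in order."""
    def run_all():
        for it in items:
            it.run()
    return Item(run_all)

log: list[int] = []
batched = condense([Item(lambda i=i: log.append(i)) for i in range(3)])
batched.run()                       # one call executes all three
```

This is why condensation pays off: one dispatch replaces N dispatches, which matters when per-launch overhead dominates small kernels.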

Runtime¤

Runtimes are responsible for device-specific interactions. They handle tasks such as initializing devices, allocating memory, loading/launching programs, and more. You can find more information about the runtimes API on the runtime overview page.

All runtime implementations can be found in the runtime directory.
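The responsibilities listed above map onto a small surface area, sketched here with invented names and an in-memory "device" (the real interfaces in the runtime directory and device code are more detailed):

```python
# Minimal sketch of what a runtime provides: device init, memory
# allocation/transfer, compilation, and program launch. All names are
# invented; this is not tinygrad's actual interface.
class ToyAllocator:
    def alloc(self, size: int) -> bytearray:
        return bytearray(size)                    # "device" memory
    def copyin(self, dst: bytearray, src: bytes):
        dst[:len(src)] = src                      # host -> device
    def copyout(self, dst: bytearray, src: bytearray):
        dst[:] = src                              # device -> host

class ToyDevice:
    def __init__(self):
        self.allocator = ToyAllocator()           # device initialization
    def compile(self, src: str) -> bytes:
        return src.encode()                       # stand-in for a binary
    def launch(self, prg: bytes, bufs: list[bytearray]):
        pass  # a real runtime hands prg and bufs to the hardware

dev = ToyDevice()
buf = dev.allocator.alloc(4)
dev.allocator.copyin(buf, b"\x01\x02\x03\x04")
```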

HCQ Compatible Runtimes¤

The HCQ API is a lower-level API for defining runtimes. Interaction with HCQ-compatible devices occurs at a lower level, with commands issued directly to hardware queues. Some examples of such backends are NV and AMD, which are userspace drivers for NVIDIA and AMD devices respectively. You can find more information about the API on the HCQ overview page.
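The queue-based model can be sketched as commands recorded into a queue object and then submitted to the device in one batch (names invented; real HCQ backends write these commands into ring buffers the hardware consumes):

```python
# Toy model of an HCQ-style hardware queue: record commands, then
# submit them in a single batch. All names are invented for illustration.
class ToyHWQueue:
    def __init__(self):
        self.cmds: list[tuple] = []
    def exec(self, prg, bufs):
        self.cmds.append(("exec", prg, bufs))     # record a kernel launch
        return self
    def signal(self, sig, value):
        self.cmds.append(("signal", sig, value))  # record a sync signal
        return self
    def submit(self, device: list):
        device.extend(self.cmds)                  # "hand off" to hardware
        self.cmds = []
        return self

device_ring: list = []
ToyHWQueue().exec("kernel0", ["a", "b"]).signal("done", 1).submit(device_ring)
```

The chaining style mirrors how command queues are typically built up: several launches and synchronization points are enqueued before a single submission touches the device.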