## Intro
The tinygrad framework has four pieces:

- a PyTorch-like frontend.
- a scheduler which breaks the compute into kernels.
- a lowering engine which converts ASTs into code that can run on the accelerator.
- an execution engine which can run that code.
There is a good set of tutorials by Di Zhu that goes over tinygrad internals. There's also a doc describing speed.
## Frontend
Everything in `Tensor` is syntactic sugar around constructing a graph of UOps.

The UOp graph specifies the compute in terms of low-level tinygrad ops. Not all UOps will actually be realized. There are two types of UOps, *base* and *view*: a *base* computes into a contiguous buffer, and a *view* is a view into one. The inputs to a *base* can be either *base* or *view*; the input to a *view* can only be a single *base*.
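The base/view split can be pictured with a toy graph. This is an illustrative sketch only; the `Base` and `View` classes below are hypothetical stand-ins, not tinygrad's real `UOp` class:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Union

# Hypothetical stand-ins for the two kinds of UOp described above.
@dataclass
class Base:
    op: str
    # a base's inputs may be bases or views
    srcs: list[Union["Base", "View"]] = field(default_factory=list)

@dataclass
class View:
    # a view's only input is a single base
    src: Base

# an ADD base that reads one buffer directly and a view of another
x = Base("BUFFER")
y = View(Base("BUFFER"))
add = Base("ADD", srcs=[x, y])
```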
## Scheduling
The scheduler converts the graph of UOps into a list of `ExecItem`. One `ExecItem` is one kernel on the GPU, and the scheduler is responsible for breaking the large compute graph into subgraphs, each of which can fit in a single kernel. The `ast` specifies what compute to run, and `bufs` specifies what buffers to run it on.
### ExecItem (dataclass)

```python
ExecItem(
    ast: UOp,
    bufs: list[Buffer | None] = list(),
    metadata: tuple[Metadata, ...] = (),
    fixedvars: dict[str, int] = dict(),
    prg: Runner | None = None,
)
```

Methods:

- `lower()` – Populate `self.prg` by lowering the AST.

Source code in `tinygrad/engine/realize.py`.
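To make the "one `ExecItem` per kernel" idea concrete, here is a toy sketch of a schedule being executed in order. The `ToyExecItem` class and the string op names are made up for illustration; a real `ast` is a UOp graph, not a string:

```python
from dataclasses import dataclass

@dataclass
class ToyExecItem:
    ast: str    # what compute to run (a real ast is a UOp graph)
    bufs: list  # what buffers to run it on

def run_toy_schedule(schedule):
    # the execution engine walks the list in order, one "kernel" per item
    for ei in schedule:
        if ei.ast == "FILL_ONES":
            ei.bufs[0][:] = bytes([1] * len(ei.bufs[0]))
        elif ei.ast == "ADD":
            out, a, b = ei.bufs
            out[:] = bytes(x + y for x, y in zip(a, b))

a, b, out = bytearray(4), bytearray(4), bytearray(4)
run_toy_schedule([
    ToyExecItem("FILL_ONES", [a]),
    ToyExecItem("FILL_ONES", [b]),
    ToyExecItem("ADD", [out, a, b]),
])
```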
## Lowering
The code in `realize` lowers an `ExecItem` by populating its `prg` field with a `Runner`.

### run_schedule

```python
run_schedule(
    schedule: list[ExecItem],
    var_vals: dict[str, int] | None = None,
    do_update_stats=True,
)
```

Source code in `tinygrad/engine/realize.py`.
There's a ton of complexity hidden behind this; see the `codegen/` directory.

First we lower the AST to UOps, a linear list of the compute to be run. This is where the BEAM search happens.

Then we render the UOps into code with a `Renderer`, and compile the code to binary with a `Compiler`.
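The render/compile step can be sketched with Python standing in for the accelerator. The `render` and `compile_src` helpers below are hypothetical, not tinygrad's `Renderer`/`Compiler` API; the point is only the shape of the pipeline (AST in, runnable binary out):

```python
def render(name: str, expr: str) -> str:
    # "Renderer": turn a (toy, string-based) AST into source code
    return f"def {name}(a, b):\n    return {expr}\n"

def compile_src(src: str, name: str):
    # "Compiler": compile the rendered source into something runnable
    ns: dict = {}
    exec(compile(src, "<kernel>", "exec"), ns)
    return ns[name]

# lower -> render -> compile -> a callable "kernel"
kernel = compile_src(render("add_mul", "(a + b) * 2"), "add_mul")
```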
## Execution
Execution creates an `ExecItem`, which has a `run` method.
### run

```python
run(
    _var_vals: dict[str, int] | None = None,
    wait=False,
    jit=False,
    do_update_stats=True,
) -> float | None
```

Source code in `tinygrad/engine/realize.py`.
Lists of `ExecItem` can be condensed into a single `ExecItem` with the Graph API (rename to Queue?).
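The condensing idea can be sketched as replaying a recorded list of launches in one call. This is illustrative only; the real Graph API batches device launches, not Python callables:

```python
def condense(items):
    # fuse a list of exec items into a single item that replays them in order
    def combined():
        for it in items:
            it()
    return combined

log = []
batched = condense([lambda: log.append("k1"), lambda: log.append("k2")])
batched()  # one call replays both recorded "kernels"
```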
## Runtime
Runtimes are responsible for device-specific interactions. They handle tasks such as initializing devices, allocating memory, loading/launching programs, and more. You can find more information about the runtimes API on the runtime overview page.
All runtime implementations can be found in the runtime directory.
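As a rough sketch of those responsibilities, here is a toy runtime with bytearrays standing in for device memory. The `ToyRuntime` class and its method names are made up for illustration; see the runtime overview page for the real interface:

```python
class ToyRuntime:
    # made-up runtime: bytearrays stand in for device memory
    def alloc(self, size: int) -> bytearray:
        return bytearray(size)          # allocate "device" memory
    def copyin(self, buf: bytearray, data: bytes) -> None:
        buf[:len(data)] = data          # host -> device copy
    def copyout(self, buf: bytearray) -> bytes:
        return bytes(buf)               # device -> host copy
    def launch(self, prg, *bufs) -> None:
        prg(*bufs)                      # "launch" a loaded program

def double(out: bytearray, src: bytearray) -> None:
    out[:] = bytes(v * 2 for v in src)  # a toy "program"

rt = ToyRuntime()
a, out = rt.alloc(3), rt.alloc(3)
rt.copyin(a, b"\x01\x02\x03")
rt.launch(double, out, a)
```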
### HCQ Compatible Runtimes
The HCQ API is a lower-level API for defining runtimes. Interaction with HCQ-compatible devices occurs at a lower level, with commands issued directly to hardware queues. Some examples of such backends are NV and AMD, which are userspace drivers for NVIDIA and AMD devices respectively. You can find more information about the API on the HCQ overview page.
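The hardware-queue idea can be sketched like this: commands are recorded into a queue first, then the whole queue is submitted to the device at once. `ToyHWQueue` is made up for illustration; real HCQ queues record actual GPU commands:

```python
class ToyHWQueue:
    def __init__(self):
        self.cmds = []  # recorded, not yet executed
    def memcpy(self, dst, src):
        self.cmds.append(("memcpy", dst, src)); return self
    def exec(self, fn, buf):
        self.cmds.append(("exec", fn, buf)); return self
    def submit(self):
        # "submit" the whole queue to the device in one go
        for cmd in self.cmds:
            if cmd[0] == "memcpy":
                cmd[1][:len(cmd[2])] = cmd[2]
            else:
                cmd[1](cmd[2])

buf = bytearray(2)
q = ToyHWQueue().memcpy(buf, b"\x05\x06").exec(lambda b: b.reverse(), buf)
q.submit()
```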