Hierarchical Bayesian Inference with Transformers: Approximation Theory and Learned Representations

Presenter
January 16, 2026
Abstract
Transformers trained on sequential prediction tasks exhibit "in-context learning": the ability to adapt to new tasks at inference time given only a sequence of examples. While recent work suggests these models can simulate specific learning algorithms, the precise mechanisms remain opaque. In this talk, I will investigate this phenomenon in a controlled setting where the training data is generated by a Hierarchical Gaussian Process (HGP). In this regime, the ideal in-context learner is the posterior predictive functional Psi, which maps the context dataset and a query point to the predictive density. First, I will discuss a theoretical framework for bounding the approximation error between a Transformer and the target functional Psi. I will outline how spectral properties of the kernel family and a covering number for the hyperparameter space govern the required network capacity, measured in architectural parameters such as depth, width, and number of attention heads. Second, I will present preliminary empirical evidence that Transformers trained with a prequential objective naturally recover structures aligned with these theoretical constructions. We analyze architectures in which context inputs are pre-encoded by an MLP and evaluate the trained encoder's ability to represent the underlying kernel family across the hyperparameter space. By minimizing the approximation error between the true kernel matrix and a linear reconstruction based on the encoded features, and evaluating a range of approximation metrics, we observe settings in which the encoder learns a feature map that linearly represents the kernel uniformly over the hyperparameter space.
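
To make the target of in-context learning concrete, here is a minimal sketch of the setting described in the abstract. It assumes, purely for illustration, an RBF kernel whose lengthscale is drawn from a log-uniform hyperprior plus Gaussian observation noise, and it approximates the ideal in-context learner Psi by marginalizing the lengthscale over a small grid weighted by the marginal likelihood. All function names, priors, and ranges below are illustrative, not taken from the talk.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_hgp_task(n_context, lengthscale_range=(0.1, 2.0), noise=0.1, rng=None):
    """Draw one task: sample a lengthscale from a log-uniform hyperprior,
    then sample noisy GP observations at random inputs (illustrative ranges)."""
    rng = np.random.default_rng() if rng is None else rng
    low, high = np.log(lengthscale_range[0]), np.log(lengthscale_range[1])
    ell = np.exp(rng.uniform(low, high))
    x = rng.uniform(-2.0, 2.0, size=n_context)
    K = rbf_kernel(x, x, ell) + 1e-8 * np.eye(n_context)
    y = rng.multivariate_normal(np.zeros(n_context), K) + noise * rng.standard_normal(n_context)
    return x, y, ell

def posterior_predictive(x, y, x_star, lengthscales, noise=0.1):
    """Ideal in-context learner Psi: predictive mean/variance at x_star,
    marginalizing the lengthscale over a grid weighted by its marginal likelihood."""
    log_ml, means, variances = [], [], []
    for ell in lengthscales:
        K = rbf_kernel(x, x, ell) + noise ** 2 * np.eye(len(x))
        k_star = rbf_kernel(x_star, x, ell)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        v = np.linalg.solve(L, k_star.T)
        # Gaussian log marginal likelihood log p(y | x, ell)
        log_ml.append(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                      - 0.5 * len(x) * np.log(2 * np.pi))
        means.append(k_star @ alpha)
        variances.append(1.0 - (v ** 2).sum(axis=0) + noise ** 2)  # RBF has unit prior variance
    log_ml, means, variances = map(np.array, (log_ml, means, variances))
    w = np.exp(log_ml - log_ml.max())
    w /= w.sum()                                   # posterior weights over the lengthscale grid
    mean = (w[:, None] * means).sum(axis=0)
    # Mixture variance: weighted within-component variance plus between-component spread.
    var = (w[:, None] * (variances + means ** 2)).sum(axis=0) - mean ** 2
    return mean, var

# Usage: draw one context set and query the marginalized posterior predictive.
rng = np.random.default_rng(0)
x, y, true_ell = sample_hgp_task(n_context=20, rng=rng)
x_star = np.linspace(-2.0, 2.0, 5)
mu, var = posterior_predictive(x, y, x_star, lengthscales=np.geomspace(0.1, 2.0, 16))
```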
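
The kernel-reconstruction diagnostic mentioned at the end of the abstract can be sketched in the same spirit. The snippet below substitutes a hypothetical `encode` function for the trained MLP encoder and, for each lengthscale on a grid, computes the best linear reconstruction of the true kernel matrix from the encoded features (the two-sided projection onto the feature span) together with its relative Frobenius error. Uniformly small errors would indicate that the features linearly represent the kernel family over the hyperparameter space; again, names and ranges are assumptions, not details from the talk.

```python
import numpy as np

def encode(x, d_feat=16, rng=None):
    """Hypothetical stand-in for the trained MLP context encoder: a random
    single-hidden-layer feature map phi(x) of dimension d_feat."""
    rng = np.random.default_rng(0) if rng is None else rng
    W = rng.standard_normal((1, d_feat))
    b = rng.standard_normal(d_feat)
    return np.tanh(x[:, None] @ W + b)

def rbf_kernel(x, lengthscale):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def linear_reconstruction_errors(x, lengthscales):
    """Relative Frobenius error of the best linear kernel reconstruction
    Phi @ A @ Phi.T for each lengthscale on the grid."""
    Phi = encode(x)                       # (n, d_feat)
    P = Phi @ np.linalg.pinv(Phi)         # orthogonal projector onto the feature span
    errors = {}
    for ell in lengthscales:
        K = rbf_kernel(x, ell)
        # The Frobenius-optimal A is pinv(Phi) @ K @ pinv(Phi).T, so the best
        # reconstruction is the two-sided projection P @ K @ P.
        K_hat = P @ K @ P
        errors[ell] = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
    return errors

# Usage: small, uniform errors across the grid suggest the encoded features
# linearly represent the kernel family over the hyperparameter space.
x = np.linspace(-2.0, 2.0, 100)
errs = linear_reconstruction_errors(x, lengthscales=np.geomspace(0.1, 2.0, 8))
```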