Recent autoregressive generation models such as Sora, GPT-4, and LLaVA-NeXT have made significant headway in modeling short-range co-occurrence statistics in language, at the level of tokens or “word pieces”, and in vision at the pixel/region resolution. Empirically, these models are trained and evaluated by querying the model with a partial token sequence $t_1, \dots, t_{k-1}$, sampling the conditional distribution $P(t_k \mid t_1, \dots, t_{k-1})$ over multiple turns, and finally judging the quality (called the “Semantics”) of the resulting generated sequence of tokens $t_k, t_{k+1}, \dots$ in the context of the original prompt. Such setups have been shown to approach (and, in some cases, exceed) human performance on pre-existing NLP tasks such as Sentiment/Topic Classification, Sentence-level Similarity, Question Answering, and Short Story Completion. As the length demanded of these token-by-token generations grows, the corresponding cost of training and deploying these models grows steeply: the sheer volume of training data required, and the computational demand of jointly conditioning on many input prompt tokens at once, which scales quadratically with context length in standard attention-based architectures. Today, many models exceed 100B parameters, cannot be hosted locally, and draw significant power. Thus, many efforts today seek alternative means by which to extend the range of Semantic consistency of these generation models.
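Written out, the sampling procedure described above corresponds to the standard chain-rule factorization of the continuation's probability (the continuation length $n$ is introduced here purely for illustration):
\[
P(t_k, t_{k+1}, \dots, t_{k+n} \mid t_1, \dots, t_{k-1}) \;=\; \prod_{j=k}^{k+n} P(t_j \mid t_1, \dots, t_{j-1}),
\]
with each factor sampled in turn and the drawn token appended to the conditioning context before the next draw.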
This thesis presents a method to address this problem by asking the following question: what if, instead of attempting to model long-range Semantics directly, we identify the Semantic information for shorter prompts and then stow away that Semantics as a representation beyond simple tokens? From there, these intermediate representations can be manipulated, aggregated, and extended within an augmented representation space of lower complexity: a “Workspace”. The estimation of such a Workspace can be formulated as a state-space model: the intermediate representations of the Semantics constitute samples from an underlying latent state process (designed to satisfy Markovian assumptions), and the generated text segments constitute the observable short contexts. Over a series of works (chapters) spanning applications as diverse as discovering author biases and intentions across internet-scale social media by piecing together clues in individual posts, and DNA sequence alignment and assembly of short reads against a much larger reference genome, we develop the cognitive theory of the Workspace, present a computational implementation of such a Workspace for the text modality, and demonstrate empirical success on existing and newly constructed tasks.
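To make the state-space formulation concrete, a minimal notational sketch follows; the symbols $z_i$, $x_i$, and $n$ are introduced here for illustration only and do not fix the notation of the later chapters. Let $z_1, \dots, z_n$ denote the latent Workspace states and $x_1, \dots, x_n$ the corresponding observed short text segments. Under the Markovian assumption, the joint distribution factorizes as
\[
P(x_1, \dots, x_n, z_1, \dots, z_n) \;=\; P(z_1)\, P(x_1 \mid z_1) \prod_{i=2}^{n} P(z_i \mid z_{i-1})\, P(x_i \mid z_i),
\]
so that estimating the Workspace amounts to inferring the latent trajectory $z_{1:n}$ from the observed segments $x_{1:n}$, rather than modeling long-range dependencies directly at the token level.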