Jake Ryland Williams

Drexel University

How To Train Your Own Transformer From Scratch

Few researchers have access to the resources needed to train the state-of-the-art language models (LMs) used in cutting-edge technologies. Processing 'big data' over computational frameworks and expensive GPUs also carries substantial environmental implications: in 2019, one team of researchers estimated that 626,000 pounds of carbon dioxide were emitted through the costs associated with producing one model's parameters (GPT-2's), roughly the lifetime emissions of five cars. Its developer, OpenAI, reported in 2018 that 'since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time'. After OpenAI released GPT-3 in 2020, a report estimated that, even with '10,000 GPUs and 400 gigabits per second of network connectivity per server', the months required to process '45 Terabytes of text data from all over the internet' meant that 'GPT-3 could have easily cost 10 or 20 million dollars to train'. Staring down this trend in 2018, OpenAI even suggested that 'it's worth preparing for the implications of systems far outside today's capabilities'.

We will demonstrate a system intended to fill this profound need for accessible LM development: a hyper-efficient, closed-form NLP framework that relieves the costs of developing NLP tools by eliminating the need for backpropagation, and resolves model opacity via interpretable procedures for dimensionality reduction and positional encoding, as well as for pre-training and fine-tuning. By abstracting the salient features of a modern transformer, together with methods for parallelized pre-computation of zeroth-order models, our key achievement has been the elimination of backpropagation from training processes: we compute the points towards which gradients descend. Demonstrating this, our prototype, It's a Machine and Natural Language Model (IaMaN-LM), applies the closed-form solution to the naïve Bayesian model of co-occurrence underlying Word2Vec's softmax-optimized skip-gram objective. Just as a nuclear engineer may set the dimensions of a charge, our proposed theory sets the parameters of an LM without testing every 'bomb' between a random guess and the target performance. By demonstrating tools released to perform hyper-efficient NLP, we hope to enable developers to leverage limited resources and train their own transformers from scratch, with both sharper resolution and smaller resource requirements. Software will be released in October 2022 here: https://github.com/jakerylandwilliams/IaMaN/
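For intuition only, the short Python sketch below shows the general flavor of such a closed-form approach: conditional co-occurrence probabilities are computed directly from smoothed counts, rather than being fit by backpropagation against a softmax skip-gram objective. This is a minimal sketch under our own simplifying assumptions; the function name, window size, and smoothing constant are hypothetical, and the actual IaMaN-LM procedures are those in the repository linked above.

    # Illustrative sketch (not the IaMaN-LM implementation): a closed-form,
    # naive-Bayes-style co-occurrence model estimated directly from counts,
    # with no gradient descent or backpropagation.
    import numpy as np

    def closed_form_cooccurrence_model(sentences, window=2, alpha=1.0):
        """Return log P(context | target) for every word pair, computed in
        closed form from additively smoothed co-occurrence counts."""
        vocab = sorted({w for s in sentences for w in s})
        index = {w: i for i, w in enumerate(vocab)}
        counts = np.full((len(vocab), len(vocab)), alpha)  # additive smoothing

        for s in sentences:
            for i, target in enumerate(s):
                lo, hi = max(0, i - window), min(len(s), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[index[target], index[s[j]]] += 1

        # Row-normalize and take logs: each row is a simple representation
        # of its target word, obtained without any iterative optimization.
        log_probs = np.log(counts / counts.sum(axis=1, keepdims=True))
        return vocab, log_probs

    # Toy usage on two short sentences.
    sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    vocab, vectors = closed_form_cooccurrence_model(sentences)
    print(vocab)          # ['cat', 'dog', 'sat', 'the']
    print(vectors.shape)  # (4, 4) matrix of log conditional probabilities

The point of the sketch is the design choice it isolates: every parameter is a deterministic function of corpus statistics, so the model's values can be pre-computed in parallel rather than approached by gradient descent.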

Bio: Jake Ryland Williams (Ph.D., Mathematical Sciences) is an Associate Professor of Information Science at Drexel University and a developer of its Graduate Data Science Program. Dr. Williams is PI of the Computational Open Data Exploration and Design (CODED) laboratory, which engineers openly available, web-based data sets of high scientific value, alongside work on information-theoretic foundations that advance the development of machine learning algorithms.