Build an LLM from Scratch
The training of large-scale language models (10B+ parameters) was once reserved for AI researchers. However, building an LLM from scratch may make sense for businesses developing their own custom models for security or privacy reasons.
How much does it cost?
Meta's Llama 2 models required about 180,000 GPU hours to train the 7 billion parameter model and about 1,700,000 GPU hours to train the 70 billion parameter model. Roughly speaking, a ~10B parameter model can take around 100,000 GPU hours to train, while a ~100B parameter model takes around 1,000,000 GPU hours.
Translating this into cloud computing costs: an Nvidia A100 GPU rents for around $1–2 per GPU per hour. That means a ~10B parameter model costs about $150,000 to train, and a ~100B parameter model costs about $1,500,000.
Alternatively, you can buy the hardware: an A100 GPU costs about $10,000, so a 1,000-GPU cluster comes to roughly $10,000,000. Energy adds to this: at about $100 per megawatt hour, and with roughly 1,000 megawatt hours needed to train a ~100B parameter model, the energy cost is about $100,000 per training run.
These costs do not include funding a team of ML engineers, data engineers, data scientists, and others needed for model development.
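To make the arithmetic concrete, here is a back-of-envelope sketch in Python using the round numbers above; the hourly rate, GPU price, and energy figures are this article's rough estimates, not quoted prices.

```python
# Back-of-envelope LLM training cost estimates, using the rough figures above.

def rented_gpu_cost(gpu_hours: float, dollars_per_gpu_hour: float = 1.5) -> float:
    """Cloud cost of a training run at a given hourly A100 rental rate."""
    return gpu_hours * dollars_per_gpu_hour

def owned_cluster_cost(num_gpus: int, dollars_per_gpu: float = 10_000) -> float:
    """Upfront hardware cost of buying the GPUs outright."""
    return num_gpus * dollars_per_gpu

def energy_cost(megawatt_hours: float, dollars_per_mwh: float = 100) -> float:
    """Electricity cost of a training run."""
    return megawatt_hours * dollars_per_mwh

print(rented_gpu_cost(100_000))    # ~10B model:  $150,000
print(rented_gpu_cost(1_000_000))  # ~100B model: $1,500,000
print(owned_cluster_cost(1_000))   # 1,000-GPU cluster: $10,000,000
print(energy_cost(1_000))          # ~100B model energy: $100,000
```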
How Do You Do It?
If you still want to build an LLM from scratch, the process breaks down into four key steps.
- Data Curation
- Model Architecture
- Training at Scale
- Evaluation
Step 1: Data Curation
Machine learning models are a product of their training data (i.e. “garbage in, garbage out”).
This presents a major challenge for LLMs due to the tremendous scale of data required. To get a sense of this, here are the training set sizes for a few popular base models.
GPT-3 175B: 0.5T tokens (T = trillion)
Llama 2 70B: 2T tokens
Falcon 180B: 3.5T tokens
For a sense of scale, a trillion words of text is roughly 1,000,000 novels or 1,000,000,000 news articles.
The internet is the most common LLM data mine, encompassing countless text sources such as webpages, books, scientific articles, codebases, and conversational data. There are many open datasets to draw from, such as Common Crawl, and repositories like the Hugging Face dataset hub.
An alternative is to generate synthetic data, using an existing LLM to produce the training text. Either way, diversity is a key aspect of a good training dataset.
How do we prepare the data? Gathering a mountain of text data is only half the battle. The next stage of data curation is ensuring training data quality.
Quality filtering — this aims to remove "low-quality" text from the dataset.
De-duplication — another key preprocessing step is removing duplicate (and near-duplicate) text, so the model does not over-weight repeated content.
Privacy redaction — when scraping text from the internet, there is a risk of capturing sensitive and confidential information, which should be identified and redacted before training.
Tokenization — Language models (i.e. neural networks) do not “understand” text; they can only work with numbers. Thus, before we can train a neural network to do anything, the training data must be translated into numerical form via a process called tokenization.
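To make this concrete, here is a minimal tokenization sketch using the Hugging Face transformers library; the GPT-2 tokenizer is just an example choice.

```python
# Minimal tokenization example with a pre-trained tokenizer
# (pip install transformers).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Language models work with numbers, not text."
token_ids = tokenizer.encode(text)                   # text -> integer IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)  # the subword pieces

print(token_ids)                    # a list of integers the model can consume
print(tokens)                       # the corresponding subword strings
print(tokenizer.decode(token_ids))  # round-trips back to the original text
```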
Step 2: Model Architecture
Transformers are the state-of-the-art approach for language modeling. A transformer is a neural network architecture that uses attention mechanisms to generate mappings between inputs and outputs. An attention mechanism learns dependencies between different elements of a sequence based on their content and position. A Transformer consists of two key modules, an encoder and a decoder, which can be used standalone or combined, enabling three types of Transformers.
Encoder-only — an encoder translates tokens into a semantically meaningful numerical representation (i.e. embeddings) using self-attention. Embeddings take context into account. Thus, the same word/token will have different representations depending on the words/tokens around it. These transformers work well for tasks requiring input understanding, such as text classification or sentiment analysis. A popular encoder-only model is Google’s BERT.
Decoder-only — a decoder, like an encoder, translates tokens into a semantically meaningful numerical representation. The key difference, however, is that a decoder does not allow self-attention with future elements in a sequence (aka masked self-attention). Another term for this is causal language modeling, reflecting the asymmetry between future and past tokens. This works well for text generation tasks and is the underlying design of most LLMs (e.g. GPT-3, Llama, Falcon, and many more).
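To make masked self-attention concrete, here is a minimal single-head sketch in PyTorch; real implementations add learned query/key/value projections and multiple heads, which are omitted here for clarity.

```python
# Masked (causal) self-attention: each position may attend only to itself
# and earlier positions, never to future tokens.
import torch
import torch.nn.functional as F

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d_model). Single head, no learned projections."""
    seq_len, d = x.size(1), x.size(2)
    q, k, v = x, x, x  # real models use learned projections here
    scores = q @ k.transpose(-2, -1) / d**0.5           # (batch, seq, seq)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    return F.softmax(scores, dim=-1) @ v

out = causal_self_attention(torch.randn(2, 5, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```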
Encoder-Decoder — we can combine the encoder and decoder modules to create an encoder-decoder transformer. This was the architecture proposed in the original “Attention is all you need” paper.
There is an important balance between training time, dataset size, and model size. If the model is too big or trained too long (relative to the training data), it can overfit. If too small or not trained long enough, it may underperform. Hoffmann et al. present an analysis of optimal LLM size based on compute budget and token count, and recommend a scaling schedule relating all three factors. Roughly, they recommend 20 tokens per model parameter (i.e. a 10B parameter model should be trained on 200B tokens), with a 100x increase in FLOPs for each 10x increase in model parameters.
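The rule of thumb is easy to compute. Below is a small sketch that also uses the standard scaling-law approximation of ~6 FLOPs per parameter per token for training cost (an assumption, not a figure quoted above).

```python
# Chinchilla-style rule of thumb: ~20 training tokens per model parameter.
def optimal_tokens(n_params: float) -> float:
    return 20 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens  # common approximation: 6 * N * D

n = 10e9               # 10B parameters
d = optimal_tokens(n)  # -> 200B tokens
print(f"{d:.1e} tokens, {training_flops(n, d):.1e} FLOPs")

# With D = 20 * N, FLOPs ~ 120 * N**2, so a 10x increase in parameters
# implies the ~100x increase in FLOPs mentioned above.
```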
Step 3: Training at Scale
Large language models (LLMs) are trained via self-supervised learning. What this typically looks like (i.e. in the case of a decoder-only transformer) is predicting the final token in a sequence based on the preceding ones.
While this is conceptually straightforward, the central challenge emerges in scaling up model training to ~10–100B parameters. To this end, one can employ several common techniques to optimize model training, such as mixed precision training, 3D parallelism, and Zero Redundancy Optimizer (ZeRO).
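In code, the objective is just cross-entropy between the model's output and the same sequence shifted by one position. A minimal PyTorch sketch with random stand-in tensors:

```python
# Next-token prediction loss: position t's logits are scored against token t+1.
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 4, 16
token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in batch
logits = torch.randn(batch, seq_len, vocab_size)            # stand-in model output

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    token_ids[:, 1:].reshape(-1),            # labels are the tokens at 1..T-1
)
print(loss.item())
```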
Training Techniques
Mixed precision training is a common strategy to reduce the computational cost of model development.
Parallelization distributes training across multiple computational resources (i.e. CPUs or GPUs or both).
Pipeline parallelism — distributes transformer layers across multiple GPUs and reduces the communication volume during distributed training by loading consecutive layers on the same GPU.
These three training techniques (and many more) are implemented by DeepSpeed, a Python library for deep learning optimization.
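As an illustration, here is a minimal mixed precision training step using PyTorch's built-in AMP, assuming a CUDA GPU; DeepSpeed provides the same capability (plus ZeRO and parallelism) through its configuration file. The tiny linear model is a stand-in.

```python
# Mixed precision: forward pass in float16 where safe, with loss scaling to
# avoid float16 gradient underflow.
import torch

model = torch.nn.Linear(32, 1).cuda()  # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                     # toy training steps
    x = torch.randn(8, 32, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()   # forward pass in mixed precision
    scaler.scale(loss).backward()       # scale the loss before backward
    scaler.step(optimizer)              # unscales grads, then steps
    scaler.update()
```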
Beyond computational costs, scaling up LLM training presents challenges in training stability, i.e. the smooth decrease of the training loss toward a minimum value. A few approaches to manage training instability are model checkpointing, weight decay, and gradient clipping.
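All three are one-liners in PyTorch. A minimal sketch with a toy stand-in model (the specific values are illustrative, not recommendations):

```python
# Training stability levers: weight decay (via AdamW), gradient clipping,
# and periodic model checkpointing.
import torch

model = torch.nn.Linear(32, 1)                  # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              weight_decay=0.1)  # weight decay

for step in range(1000):
    loss = model(torch.randn(8, 32)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    if step % 100 == 0:                                      # checkpointing
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, f"ckpt_{step}.pt")
```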
Hyperparameters are settings that control model training. While these are not specific to LLMs, key hyperparameters to tune include the batch size, learning rate (and its schedule), optimizer, and dropout.
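For concreteness, a hypothetical configuration might look like the following; none of these values come from the article, and any real run would need to tune them.

```python
# Illustrative (not recommended) hyperparameter settings for a training run.
hyperparameters = {
    "batch_size": 256,
    "learning_rate": 3e-4,
    "lr_schedule": "cosine with linear warmup",
    "optimizer": "AdamW",
    "weight_decay": 0.1,
    "gradient_clip_norm": 1.0,
    "dropout": 0.1,
}
```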
Step 4: Evaluation
A key part of this iterative process is model evaluation, which examines model performance on a set of tasks. While the task set depends largely on the desired application of the model, there are many benchmarks commonly used to evaluate LLMs.
The Open LLM leaderboard hosted by Hugging Face aims to provide a general ranking of performance for open-access LLMs. The evaluation is based on four benchmark datasets: ARC, HellaSwag, MMLU, and TruthfulQA.
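These benchmarks are largely multiple-choice, and a common way to score a causal LLM on them is to compare the log-likelihood it assigns to each candidate answer. A simplified sketch using transformers, with GPT-2 as a small stand-in model and a made-up question:

```python
# Score multiple-choice options by the total log-probability the model
# assigns to each candidate continuation of the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_likelihood(prompt: str, option: str) -> float:
    prompt_ids = tokenizer.encode(prompt)
    option_ids = tokenizer.encode(option)
    input_ids = torch.tensor([prompt_ids + option_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # logits at position p predict the token at position p + 1
    return sum(log_probs[0, len(prompt_ids) + i - 1, tok].item()
               for i, tok in enumerate(option_ids))

question = "Q: What is the capital of France?\nA:"
options = [" Paris", " London", " Berlin"]
scores = {opt: option_log_likelihood(question, opt) for opt in options}
print(max(scores, key=scores.get))  # the model's answer is the top-scoring option
```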
Fine-Tuning a Pre-trained LLM
Many LLM-powered applications can be built with prompt engineering alone. Sometimes, however, a more sophisticated solution is needed: model fine-tuning. Fine-tuning takes a pre-trained model and trains at least one internal model parameter (i.e. weights). The key upside of this approach is that models can achieve better performance. For example, compare the base GPT-3 model with text-davinci-003, a fine-tuned version of it: the fine-tuning used for text-davinci-003 results in responses that are more helpful, honest, and harmless.
Fine-tuning not only improves the performance of a base model; a smaller (fine-tuned) model can often outperform larger models. There are three generic ways to fine-tune a model: self-supervised, supervised, and reinforcement learning. These are not mutuallyexclusive, in that any combination of the three approaches can be used in succession to fine-tune a single model.
- Self-supervised learning consists of training the model on a corpus of text in the same way it was pre-trained: given a sequence of words, predict the next word.
- Supervised learning involves training the model on input-output pairs for a particular task, such as answering questions or responding to user prompts.
- Reinforcement learning (RL) uses a reward model to guide the training of the base model. The reward model can be combined with a reinforcement learning algorithm to fine-tune the pre-trained model.
Supervised Fine-tuning Steps
- Choose a fine-tuning task
- Prepare the training dataset of input-output pairs and preprocess the data
- Choose a base model
- Fine-tune the model via supervised learning
- Evaluate model performance
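Here is a minimal sketch of steps 2 through 4 using the Hugging Face Trainer API; the two toy examples, the prompt format, and all hyperparameter values are placeholders, and GPT-2 stands in for whatever base model you choose.

```python
# Supervised fine-tuning sketch (pip install transformers datasets).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 2: input-output pairs formatted as single training strings.
texts = [
    "### Input: What does len() do?\n### Output: Returns an object's length.",
    "### Input: What does abs() do?\n### Output: Returns the absolute value.",
]
dataset = Dataset.from_dict({"text": texts})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True),
                      batched=True)

# Step 4: fine-tune via supervised (next-token) learning.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # step 5, evaluation, would follow on a held-out set
```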
Costs
When it comes to fine-tuning a model with ~100M-100B parameters, one needs to be thoughtful of computational costs.
- The first option is to train all internal model parameters (called full parameter tuning).
- Transfer learning preserves the useful representations/features the model has learned from past training when applying it to a new task, typically by freezing most of the network and training only the final layers.
- Parameter Efficient Fine-tuning (PEFT) augments a base model with a relatively small number of trainable parameters. The LoRA (Low-Rank Adaptation) method picks a subset of layers in an existing model and modifies their weights as W' = W0 + BA, where W0 is the frozen pre-trained weight matrix and B and A are low-rank matrices (of rank r, much smaller than the dimensions of W0) that are the only parameters trained.
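In practice you rarely implement LoRA by hand; the Hugging Face peft library provides it. A minimal sketch, where the rank, scaling factor, and target module are illustrative choices for GPT-2:

```python
# LoRA via peft (pip install peft transformers): freeze the base weights W0
# and train only the low-rank update BA.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the update matrices B and A
    lora_alpha=16,              # scaling factor applied to the BA update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the small A and B matrices train
```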
Software Requirements
Training a Large Language Model (LLM) is an advanced machine learning task that requires specific tools and know-how. One excellent resource for fine-tuning is Lightning.AI.
1. Python Programming Environment
You'll need a programming environment like Anaconda or PyCharm to write and execute Python code.
2. LLM Library
The Hugging Face Transformers library is a popular choice for working with pre-trained language models.
3. Deep Learning Framework
You'll need a deep learning framework like PyTorch or TensorFlow to train the model.
These software tools can be downloaded for free from their respective websites.
Data Requirements
1. A Large Dataset of Python Code
This will provide examples for the model to learn from.
2. A Dataset of Human-Written Descriptions of Python Code
These descriptions will act as prompts for the Python code.
You can find such datasets on public repositories such as the Hugging Face dataset hub.
Tool Requirements
1. Text Editor or IDE
Sublime Text or Visual Studio Code are popular choices.
2. Command-Line Interface (CLI)
A necessary tool for running commands.
3. Cloud Computing Platform
Google Cloud Platform and Amazon Web Services can provide the necessary computing resources.
These tools are available for free from their respective websites.
Additional Tips
- Use a large dataset: more data leads to better learning.
- Use a powerful computer: training an LLM is computationally demanding.
- Be patient: training takes time and effort, and results may not be immediate.
Conclusion
Training an LLM from scratch is a complex task that requires careful preparation and execution. By following this guide, obtaining the necessary software, data, and tools, and applying a consistent, iterative approach, you can create a powerful tool that generates Python code from text prompts. Remember, patience and persistence are key, and the rewards of a well-trained LLM can be significant in automating code creation and understanding.