HuggingFace: LLM Course

0. SETUP

Commonly used libraries

  • transformers
  • datasets
  • torch
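
These can be installed with pip:

pip install transformers datasets torch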

1. TRANSFORMER MODELS

pipeline() Function

The pipeline() function in the 🤗 Transformers library simplifies using models by integrating preprocessing and postprocessing steps.

from transformers import pipeline

text = "Huggingface is awesome!"

# sentiment analysis:
e2e_model = pipeline("sentiment-analysis")
e2e_model(text)
Tip: we can pass several sentences in one go!

e2e_model([
	"I've been waiting for a HuggingFace course my whole life.", 
	"I hate this so much!"
])
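
Under the hood, pipeline() chains three steps: preprocessing with a tokenizer, a model forward pass, and postprocessing of the logits. A rough equivalent for the sentiment-analysis example above (the checkpoint name is an assumption: it is the pipeline's default at the time of writing, and defaults can change):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed default checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

inputs = tokenizer(text, return_tensors="pt")    # preprocessing
with torch.no_grad():
	logits = model(**inputs).logits              # model forward pass
probs = torch.softmax(logits, dim=-1)            # postprocessing
print(model.config.id2label[int(probs.argmax())], float(probs.max()))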

Tasks and pipeline() Compatibility

  • NLP Pipelines
Task | Description | pipeline()
feature-extraction | Extract vector representations of text. | ✓
fill-mask | Fill in masked tokens in a text. | ✓
question-answering | Retrieve the answer to a question from a given text. | ✓
sentence-similarity | Determine how similar two texts are. | ✗
summarization | Create a shorter version of a text while preserving key information. | ✓
table-question-answering | Answer a question about information in a given table. | ✓
text-classification | Classify text into predefined categories. | ✓
text-generation | Generate text from a prompt. | ✓
text-ranking | Rank a set of texts by their relevance to a query. | ✗
token-classification | Assign a label to individual tokens in a text (e.g. NER). | ✓
translation | Convert text from one language to another. | ✓
zero-shot-classification | Classify text without prior training on the specific labels. | ✓
  • Vision pipelines
Task | Description | pipeline()
depth-estimation | Estimate the depth of objects in an image. | ✓
image-classification | Identify objects in an image. | ✓
image-feature-extraction | Extract semantically meaningful features from an image. | ✓
image-segmentation | Divide an image into segments, mapping each pixel to an object. | ✓
image-to-image | Transform an image (e.g. inpainting, colorization, super-resolution). | ✓
image-to-text | Generate text descriptions of images. | ✓
image-to-video | Generate a video from an input image. | ✗
keypoint-detection | Identify meaningful, distinctive points or features in an image. | ✗
mask-generation | Generate masks that identify a specific object or region of interest in an image. | ✓
object-detection | Locate and identify objects in images. | ✓
video-classification | Assign a label or class to an entire video. | ✓
text-to-image | Generate images from input text. | ✗
text-to-video | Generate a consistent sequence of images (a video) from text. | ✗
unconditional-image-generation | Generate images without any conditioning input. | ✗
video-to-video | Transform an input video into a new video with altered visual style, motion, or content. | ✗
zero-shot-image-classification | Classify images into classes unseen during training. | ✓
zero-shot-object-detection | Detect objects and their classes in images without prior training on those classes. | ✓
text-to-3d | Take text input and produce 3D output. | ✗
image-to-3d | Take image input and produce 3D output. | ✗
  • Audio pipelines
Task | Description | pipeline()
audio-classification | Classify audio into categories. | ✓
audio-to-audio | A family of tasks where the input is audio and the output is one or more generated audios (e.g. speech enhancement, source separation). | ✗
automatic-speech-recognition | Convert speech to text. | ✓
text-to-audio | Convert text to spoken audio. | ✓
  • Multimodal pipelines
Task | Description | pipeline()
any-to-any | Understand two or more modalities and output two or more modalities. | ✗
audio-text-to-text | Generate textual responses or summaries from both audio input and text prompts. | ✗
document-question-answering | Take a (document, question) pair as input and return an answer in natural language. | ✗
visual-document-retrieval | Search for relevant image-based documents (e.g. PDFs) given a text query. | ✓
image-text-to-text | Take an image and a text prompt and output text. | ✓
video-text-to-text | Take a video and a text prompt and output text. | ✗
visual-question-answering | Answer open-ended questions about an image, guided by a text prompt. | ✓
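
Any task marked ✓ above can be instantiated by name. For example, zero-shot-classification lets you supply the candidate labels at inference time:

from transformers import pipeline

clf = pipeline("zero-shot-classification")
clf(
	"This is a course about the Transformers library",
	candidate_labels=["education", "politics", "business"],
)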

Classification of LLMs

1: Encoder Based (Auto-Encoding Transformers)

  • Focus: Understanding context, generating embeddings.
  • Mechanism: Bidirectional attention (sees past & future tokens).
  • Training: Masked Language Modeling (MLM).
  • Use Cases: Text classification, sentiment analysis, Named Entity Recognition (NER), question answering (understanding).
  • Examples: BERT, RoBERTa.
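
A quick way to see MLM-style understanding in action is the fill-mask pipeline; bert-base-uncased is one illustrative checkpoint among many:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
unmasker("The capital of France is [MASK].")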

2: Decoder Based (Auto-Regressive Transformers)

  • Focus: Generating new text by predicting the next token.
  • Mechanism: Unidirectional attention (sees only past tokens).
  • Training: Predicts next word in a sequence.
  • Use Cases: Text generation, summarization, chatbots, code generation.
  • Examples: GPT series, LLaMA, Claude.
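
Auto-regressive generation via the text-generation pipeline, with gpt2 as an illustrative checkpoint:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
generator("In this course, we will teach you how to", max_new_tokens=30)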

3: Encoder + Decoder Based Transformers

  • Focus: Transforming one sequence into another.
  • Mechanism: Encoder-Decoder architecture: the encoder processes the input; the decoder generates output conditioned on the encoder's representation.
  • Training: Maps input sequence to output sequence.
  • Use Cases: Machine translation, abstractive summarization, text style transfer.
  • Examples: T5, BART, NMT models.
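
Sequence-to-sequence in action via the translation pipeline; t5-small is a small checkpoint that supports English→French out of the box:

from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
translator("HuggingFace is awesome!")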

Auto-Encoding Models

Model: BaseAutoEncodingModel

# Model: BaseAutoEncodingModel
BaseAutoEncodingModel(
	'embedder': BaseAutoEncodingModelEmbedder(...),
	'encoder': BaseAutoEncodingModelEncoder(...),
	'pooler': BaseAutoEncodingModelPooler(...)
)

Module: BaseAutoEncodingModelEmbedder

# Module: BaseAutoEncodingModelEmbedder
BaseAutoEncodingModelEmbedder(
    'word_emb': WordEmbedder(...), 
    'pos_emb': PositionEmbedder(...), 
    'tok_type_emb': TokenTypeEmbedder(...), 
    'layer_norm': LayerNorm(...), 
    'dropout': Dropout(...)
)

Module: BaseAutoEncodingModelEncoder

# Module: BaseAutoEncodingModelEncoder
BaseAutoEncodingModelEncoder(
    'layers': ModuleList(
		'layer': N x BaseAutoEncodingModelEncoderLayer(
		    'attention': BaseAutoEncodingModelAttention(...), 
	        'intermediate': BaseAutoEncodingModelIntermediate(...), 
	        'output': BaseAutoEncodingModelOutput(...)
		)
	)
)

# SubModule: BaseAutoEncodingModelAttention
BaseAutoEncodingModelAttention(
	'self_attention': BaseAutoEncodingModelSelfAttention(
		'Q': Linear(...), 
		'K': Linear(...), 
		'V': Linear(...), 
		'dropout': Dropout(...)
	),
	'self_output': BaseAutoEncodingModelSelfOutput(
		'dense': Linear(...), 
		'layer_norm': LayerNorm(...), 
		'dropout': Dropout(...)
	),
)

# SubModule: BaseAutoEncodingModelIntermediate
BaseAutoEncodingModelIntermediate(
	'dense': Linear(...), 
	'activation': Activation(...)
)

# SubModule: BaseAutoEncodingModelOutput
BaseAutoEncodingModelOutput(
	'dense': Linear(...), 
	'layer_norm': LayerNorm(...), 
	'dropout': Dropout(...) 
)

Module: BaseAutoEncodingModelPooler

# Module: BaseAutoEncodingModelPooler
BaseAutoEncodingModelPooler(
	'dense': Linear(...), 
	'activation': Activation(...)
)
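
The same three-part layout (embeddings → encoder stack → pooler) shows up with concrete names when you print a real checkpoint; a quick sketch using bert-base-uncased:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(model)  # embeddings, encoder with 12 layers, pooler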

Train a Sequence-to-Sequence Model (Hindi → English)

import torch.nn.functional as F
import torch.nn as nn
import transformers
import tokenizers
import torch
import os
import math
from torch.utils.data import TensorDataset, DataLoader
from tqdm import tqdm
import glob

if os.path.exists("tokenizer.json"):
    tokenizer = tokenizers.Tokenizer.from_file("tokenizer.json")
else:
    tokenizer = tokenizers.SentencePieceUnigramTokenizer()

    tokenizer.train(
        files=['./notebooks/processed.hi', './notebooks/processed.en'],
        vocab_size=8000,
        show_progress=True,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    )
    tokenizer.save("tokenizer.json")

unk_token_id = tokenizer.token_to_id("[UNK]")
pad_token_id = tokenizer.token_to_id("[PAD]")
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
mask_token_id = tokenizer.token_to_id("[MASK]")
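
# Optional sanity check: round-trip an arbitrary sentence through the trained
# tokenizer (any line from the corpus works just as well).
sample = tokenizer.encode("This is a test sentence.")
print(sample.tokens)
print(tokenizer.decode(sample.ids))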

class TranslationModelConfig(transformers.PretrainedConfig):
    model_type = "translation-hi2en"

    def __init__(
            self,
            vocab_size=8000,
            d_model=512,
            num_encoder_layers=6,
            num_decoder_layers=6,
            num_heads=8,
            dim_feedforward=512,
            dropout=0.1,
            pad_token_id=pad_token_id,
            eos_token_id=sep_token_id,
            decoder_start_token_id=cls_token_id,
            initializer_range=0.02,
            max_position_embeddings=512,
            **kwargs):
        super().__init__(pad_token_id=pad_token_id, eos_token_id=eos_token_id, **kwargs)
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.num_heads = num_heads
        self.dim_feedforward = dim_feedforward
        self.dropout = dropout
        self.decoder_start_token_id = decoder_start_token_id
        self.initializer_range = initializer_range
        self.max_position_embeddings = max_position_embeddings


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout = 0.1, max_len = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Standard sinusoidal positional encoding:
        #   PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        #   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        # div_term is the 10000^(-2i/d_model) factor, computed in log space
        # for numerical stability.
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Registered as a buffer: saved with the module, but not a learned parameter.
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return self.dropout(x)


class TranslationModel(transformers.PreTrainedModel):
    config_class = TranslationModelConfig

    def __init__(self, config):
        super().__init__(config)

        self.cfg = config

        self.src_embedding = nn.Embedding(config.vocab_size, config.d_model, padding_idx=config.pad_token_id)
        self.tgt_embedding = nn.Embedding(config.vocab_size, config.d_model, padding_idx=config.pad_token_id)

        self.positional_encoding = PositionalEncoding(config.d_model, config.dropout, config.max_position_embeddings)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=config.d_model,
            nhead=config.num_heads,
            dim_feedforward=config.dim_feedforward,
            dropout=config.dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=config.num_encoder_layers)

        decoder_layer = nn.TransformerDecoderLayer(
            d_model=config.d_model,
            nhead=config.num_heads,
            dim_feedforward=config.dim_feedforward,
            dropout=config.dropout,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=config.num_decoder_layers)

        self.output_layer = nn.Linear(config.d_model, config.vocab_size)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.cfg.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.cfg.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        max_len = self.cfg.max_position_embeddings
        if src.size(1) > max_len:
            src = src[:, :max_len]
            if src_mask is not None:
                src_mask = src_mask[:, :max_len]
        if tgt.size(1) > max_len:
            tgt = tgt[:, :max_len]
            if tgt_mask is not None and tgt_mask.size(0) > max_len:
                tgt_mask = tgt_mask[:max_len, :max_len]

        src_emb = self.positional_encoding(self.src_embedding(src) * math.sqrt(self.cfg.d_model))
        tgt_emb = self.positional_encoding(self.tgt_embedding(tgt) * math.sqrt(self.cfg.d_model))

        memory = self.encoder(src_emb, src_key_padding_mask=src_mask)

        output = self.decoder(tgt_emb, memory, tgt_mask=tgt_mask, memory_key_padding_mask=src_mask)

        logits = self.output_layer(output)

        return logits

model = TranslationModel(TranslationModelConfig())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Ignore [PAD] positions when computing the loss.
criterion = nn.CrossEntropyLoss(ignore_index=pad_token_id)

batch_size = 32

class PairedDataset(torch.utils.data.Dataset):
    def __init__(self, src_data, tgt_data):
        self.tokenizer = tokenizer
        assert len(src_data) == len(tgt_data), "Source and target data must have the same length."
        self.src_data = src_data
        self.tgt_data = tgt_data
        self.max_len = 512

    def __len__(self):
        return len(self.src_data)

    def __getitem__(self, idx):
        src_ids = self.tokenizer.encode(self.src_data[idx]).ids[:self.max_len]
        # Wrap the target in [CLS] ... [SEP] so the decoder has a start token and an
        # end-of-sequence token, matching decoder_start_token_id / eos_token_id above.
        tgt_ids = self.tokenizer.encode(self.tgt_data[idx]).ids[:self.max_len - 2]
        tgt_ids = [cls_token_id] + tgt_ids + [sep_token_id]
        return src_ids, tgt_ids

dataset = PairedDataset(
    open('./notebooks/processed.hi', 'r').readlines(),
    open('./notebooks/processed.en', 'r').readlines()
)

train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])

def collate_batch(batch):
    # Pad every sequence in the batch to the longest one, using [PAD].
    src = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(item[0]) for item in batch], batch_first=True, padding_value=pad_token_id)
    tgt = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(item[1]) for item in batch], batch_first=True, padding_value=pad_token_id)
    return src, tgt

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)

num_epochs = 500

checkpoint_dir = "checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

def checkpoint_epoch(path):
    # Sort numerically by epoch: a plain lexicographic sort would put
    # "checkpoint_epoch_10.pt" before "checkpoint_epoch_2.pt".
    return int(path.rsplit("_", 1)[-1].split(".")[0])

latest_checkpoint = None
checkpoints = sorted(glob.glob(os.path.join(checkpoint_dir, "checkpoint_epoch_*.pt")), key=checkpoint_epoch)
if checkpoints:
    latest_checkpoint = checkpoints[-1]

start_epoch = 0
if latest_checkpoint is not None:
    print(f"Loading checkpoint from {latest_checkpoint} to resume training...")
    checkpoint = torch.load(latest_checkpoint, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch']
    print(f"Resuming from epoch {start_epoch}")

print("\nStarting training loop on actual data...")
for epoch in range(start_epoch, num_epochs):
    model.train()
    total_loss = 0

    for batch_idx, (src_batch, tgt_batch) in enumerate(tqdm(train_dataloader, ncols=80)):
        src_batch, tgt_batch = src_batch.to(device), tgt_batch.to(device)

        # Key-padding mask: True marks [PAD] positions for attention to ignore.
        src_mask = (src_batch == pad_token_id)

        # Teacher forcing: the decoder sees tokens [0..n-1] and learns to predict tokens [1..n].
        decoder_input = tgt_batch[:, :-1]
        target_labels = tgt_batch[:, 1:]

        decoder_input_seq_len = decoder_input.size(1)
        # Causal mask: True above the diagonal blocks attention to future positions.
        causal_tgt_mask = torch.triu(torch.ones(decoder_input_seq_len, decoder_input_seq_len, device=device), diagonal=1).bool()

        output_logits = model(src_batch, decoder_input, src_mask=src_mask, tgt_mask=causal_tgt_mask)

        loss = criterion(output_logits.view(-1, model.cfg.vocab_size), target_labels.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

    checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_epoch_{epoch+1}.pt")
    torch.save({
        'epoch': epoch + 1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': avg_loss
    }, checkpoint_path)

    checkpoints = sorted(glob.glob(os.path.join(checkpoint_dir, "checkpoint_epoch_*.pt")), key=checkpoint_epoch)
    if len(checkpoints) > 2:
        os.remove(checkpoints[0])  # keep only the two most recent checkpoints

print("\nTraining loop on actual data finished.")
print("If the 'Average Loss' is decreasing over epochs, your model is training!")