1. Transformer[1]
I started by following the video[1] from Andrej Karpathy and checking the code from the repo. In the video, he shows step by step how to build a GPT from scratch, which also illustrates at a high level how to build an NLP model. I summarize these steps below and add some useful tips mentioned in the video.
1.1 Data Preprocessing
Tokenizer: the video uses a simple character-level tokenizer that maps every distinct character in the corpus to an integer ID and back (production GPTs use subword tokenizers such as BPE instead).
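A minimal sketch of such a character-level encode/decode pair (the `input.txt` file name is an assumption for illustration; the lecture uses the tiny Shakespeare text):

```python
# read the raw training text (file name assumed; the lecture uses tiny Shakespeare)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# vocabulary = all distinct characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> char
encode = lambda s: [stoi[c] for c in s]            # string -> list of ints
decode = lambda l: ''.join([itos[i] for i in l])   # list of ints -> string

print(decode(encode("hii there")))  # round-trips back to "hii there"
```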
1.2 Bigram Model
The code is here. The high-level steps to train a bigram model are:
- A mapping from characters to integers
- Split the data into train and validation sets
- Data loading
- Estimate the loss (data loading and loss estimation are sketched right after this list)
- Define the bigram language model
- Optimizer: AdamW
- Train and generate outputs
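A rough sketch of the data-loading and loss-estimation steps. The hyperparameter values are assumptions that roughly follow the lecture's script, and `encode`/`text` come from the tokenizer sketch above:

```python
import torch

# assumed hyperparameters, roughly following the lecture's script
batch_size = 64      # sequences per batch
block_size = 256     # maximum context length
eval_iters = 200     # batches to average when estimating the loss
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# encode the full corpus and split it 90/10 into train and validation sets
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    # sample a random batch of contexts x and next-token targets y
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model):
    # average the loss over several batches for a less noisy estimate
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```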
Bigram model: also called a 2-gram model, meaning the prediction of the next word depends only on the single preceding word, i.e. it models $P(w_t \mid w_{t-1})$. In an $N$-gram model, the $N$-th word is predicted from the preceding $N-1$ words.
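A minimal sketch of the bigram language model itself, following the lecture: a single `(vocab_size, vocab_size)` embedding table whose rows are read off directly as the logits for the next token:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)   # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)   # sample one token
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```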
1.3 Transformer
Some example snippets.
Head/Self-Attention
```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
```
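These snippets rely on a handful of module-level hyperparameters that the full script defines up front. The values below are assumptions that roughly match the lecture's character-level config; the small shape check just confirms what each module returns:

```python
import torch

# hyperparameters assumed by the snippets in this section
# (values roughly follow the lecture's character-level Shakespeare config)
n_embd = 384
n_head = 6
n_layer = 6
block_size = 256
dropout = 0.2
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# sanity-check the output shapes of a single head and the multi-head wrapper
x = torch.randn(4, block_size, n_embd)                              # (B, T, C)
head = Head(head_size=n_embd // n_head)
print(head(x).shape)                                                # (4, 256, 64)
mha = MultiHeadAttention(num_heads=n_head, head_size=n_embd // n_head)
print(mha(x).shape)                                                 # (4, 256, 384)
```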
Block
```python
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```
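One detail worth noting in Block.forward: LayerNorm is applied before each sub-layer and the result is added back through a residual connection (the pre-norm formulation). As mentioned in the video, this deviates slightly from the post-norm arrangement of the original Transformer paper and tends to make deep stacks easier to train.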
Model
```python
class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
```
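To cover the remaining steps from the bigram list (AdamW optimizer, then train and generate), here is a rough sketch of the training loop. It reuses vocab_size, get_batch, estimate_loss, and decode from the earlier sketches; learning_rate, max_iters, and eval_interval are assumed values that roughly follow the lecture's script:

```python
# assumed training hyperparameters
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500

model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    # periodically report the smoothed train/val loss
    if step % eval_interval == 0:
        losses = estimate_loss(model)
        print(f"step {step}: train {losses['train']:.4f}, val {losses['val']:.4f}")

    xb, yb = get_batch('train')           # sample a batch of inputs/targets
    logits, loss = model(xb, yb)          # forward pass and loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate: start from a single token with index 0 and sample 500 new tokens
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
```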
2. ChatGPT
2.1 Key ideas about how ChatGPT works[4]
The steps to train ChatGPT today (Feb-16-2023):
- Pre-training: the objective is to predict the next word in a sentence; the dataset comes from Internet text.
- Fine-tuning: the dataset comes from reviewers' instructions.
- InstructGPT: Aligning Language Models to Follow Instructions. (link)
- Addressing biases: guidelines. Plans for future systems:
- Improve default behavior.
- Define your AI’s values, within broad bounds.
- Public input on defaults and hard bounds.
2.2 Train a GPT-2[1]
I forked the original repo to store my personal updates here: https://github.com/C0ldstudy/nanoGPT.
ChatGPT Function Cheat Sheet
Some useful materials for using ChatGPT:
- https://platform.openai.com/docs/guides/gpt-best-practices
1. NLP Tasks
- Text Generation
- Summarization
- Open Domain Question Answering
- Paraphrasing
- Sentiment Analysis (few-shot or zero-shot)
- Table to Text
- Text to Table
- Token Classification
- Prompt: classify the named entities in this text: George Washington and his troops crossed the Delaware River on December 25, 1776 during the American Revolutionary War.
- Dataset Generation
- Prompt: generate more datapoints from this text:
- Language Translation
2. Code
- Code Generation:
- Prompt: show me how to make an http request in Python
- Code Explanation:
- Prompt: explain this python code
- Docstring Generation
- Prompt: write a docstring description for this function
- Programming Language Conversion
- Prompt: convert this code from Python to Javascript
- Data Object Conversions (JSON, XML, CSV etc.)
- Knowledge Graph Generation
- Prompt: convert this text into nodes and edges
- HTML to Text (Web Scraping)
- List
- Prompt: give me a list of 5 citrus fruits
- Numbered List
- Headings and Subheadings
- Prompt: convert this text into headings and subheadings
- Tables
- Narrative Modes (1st, 2nd, or 3rd person)
- Prompt: write a paragraph on how to make brownies in the 1st person
- Formal/Informal
- Personas
- Prompt: write a paragraph on the topic of cellular automata in the style of a social media influencer
- Custom Text Manipulation
- Write Social Media Posts
- Prompt: write a tweet on futurism
- Write Blogs/Emails/Poems/Songs/Resumes/Cover Letters
6. Meta ChatGPT
- Ask ChatGPT About Its Own Capabilities
- Prompt: what ways can you structure text output?
- Correct ChatGPT on Its Knowledge
- Ask ChatGPT to Expand on Answers
References: