
Transformer Related Implementation

1. Transformer[1]

I started by following Andrej Karpathy's video[1] and reading the code in the accompanying repo. In the video he builds a GPT from scratch, step by step, which also shows at a high level how to build an NLP model. Below I summarize these steps and add some useful tips mentioned in the video.

1.1 Data Preprocessing

Tokenizer
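
The video uses a simple character-level tokenizer rather than a subword tokenizer such as BPE. A minimal sketch, assuming the training corpus has already been read into a Python string text:

# character-level tokenizer: every unique character becomes one token
chars = sorted(list(set(text)))
vocab_size = len(chars)

# build the two lookup tables
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

encode = lambda s: [stoi[c] for c in s]            # string -> list of ints
decode = lambda l: ''.join([itos[i] for i in l])   # list of ints -> string

# round-trip check: decode(encode("hi")) == "hi"

GPT-2 itself uses a BPE subword tokenizer, but the character-level version keeps the vocabulary tiny while leaving the rest of the pipeline unchanged.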

1.2 Bigram Model

The code is here. At a high level, training a bigram model involves the following steps (the data-loading and loss-estimation helpers are sketched right after the list):

  • A mapping from characters to integers
  • Split train and test
  • Data Loading
  • Estimate loss
  • Define the Bigram Language Model
  • Optimizer: AdamW
  • Train and generate outputs
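
The data-loading and loss-estimation steps can be sketched as below (a condensed version of the helpers in the repo; train_data, val_data, batch_size, block_size, eval_iters, device, and model are assumed to be defined elsewhere in the script):

import torch

def get_batch(split):
    # sample a random batch of contexts and their next-token targets
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss():
    # average the loss over several batches to get a less noisy estimate
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out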

Bigram model: also called a 2-gram model, it predicts the next word from only the single preceding word. More generally, in an $N$-gram model the $N$-th word is predicted from the preceding $N-1$ words.
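
A minimal sketch of the bigram language model itself, condensed from the version in the video (vocab_size comes from the tokenizer above):

import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :]                            # last time step, (B, C)
            probs = F.softmax(logits, dim=-1)                    # (B, C)
            idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)              # (B, T+1)
        return idx

Training then simply alternates get_batch, a forward pass, and an optimizer step with AdamW (e.g. torch.optim.AdamW(model.parameters(), lr=1e-3)).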

1.3 Transformer

Some example snippets from the code are shown below.

Head/Self-Attention

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out        
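
A quick shape check of the attention module (a hypothetical usage example; the script's globals n_embd, block_size, and dropout are assumed to be defined, e.g. n_embd=32, block_size=8):

x = torch.randn(4, block_size, n_embd)                        # (B, T, C) dummy input
sa = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)
out = sa(x)
print(out.shape)  # (4, block_size, n_embd): concatenated heads projected back to n_embd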

Block

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

Model

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx                
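
Putting the pieces together, the training loop in the repo is roughly the following (hyperparameters such as max_iters, eval_interval, and learning_rate are assumed to be defined as in the script, along with get_batch, estimate_loss, and decode from earlier):

model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    # periodically evaluate the loss on the train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch, compute the loss, and take one optimizer step
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the trained model, starting from a single zero token
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))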

2. ChatGPT

2.1 Key ideas about how ChatGPT works[4]

The steps used to train ChatGPT as of today (Feb 16, 2023):

  • Pre-training: the training objective is to predict the next word in a sentence; the dataset consists of text from the Internet.
  • Fine-tuning: the dataset comes from human reviewers who follow instructions provided by OpenAI.
  • Improve default behavior.
  • Define your AI’s values, within broad bounds.
  • Public input on defaults and hard bounds.

2.2 Train a GPT-2[1]

I forked the original repo to keep my personal updates here: https://github.com/C0ldstudy/nanoGPT.

ChatGPT Function List Cheat Sheet

Some useful materials for using ChatGPT:

  • https://platform.openai.com/docs/guides/gpt-best-practices

1. NLP Tasks

  • Text Generation
  • Summarization
  • Open Domain Question Answering
  • Paraphrasing
  • Sentiment Analysis (few-shot or zero-shot)
  • Table to Text
  • Text to Table
  • Token Classification
    • Prompt: classify the named entities in this text: George Washington and his troops crossed the Delaware River on December 25, 1776 during the American Revolutionary War.
  • Dataset Generation
    • Prompt: generate more datapoints from this text:
  • Language Translation

2. Code

  • Code Generation:
    • Prompt: show me how to make an http request in Python
  • Code Explanation:
    • Prompt: explain this python code
  • Docstring Generation
    • Prompt: write a docstring description for this function
  • Programming Language Conversion
    • Prompt: convert this code from Python to JavaScript
  • Data Object Conversions (JSON, XML, CSV etc.)
  • Knowledge Graph Generation
    • Prompt: convert this text into nodes and edges
  • HTML to Text (Web Scraping)
    • Prompt: convert this HTML to text

3. Structured Output Styles

  • List
    • Prompt: give me a list of 5 citrus fruits
  • Numbered List
  • Headings and Subheadings
    • Prompt: convert this text into headings and subheadings
  • Tables
    • Prompt: create a table from this list: Oranges, Lemons, Limes, Grapefruit, Tangerines

4. Unstructured Output Styles

  • Narrative Modes (1st, 2nd or in the 3rd person)
    • Prompt: write a paragraph on how to make brownies in the 1st person
  • Formal/Informal
  • Personas
    • Prompt: write a paragraph on the topic of cellular automata in the style of a social media influencer
  • Custom Text Manipulation
    • Prompt: write a paragraph on the history of the calculator, include emojis at the end of every sentence, and do not capitalize the first word in each sentence

5. Media Types

  • Write Social Media Posts
    • Prompt: write a tweet on futurism
  • Write Blogs/Emails/Poems/Songs/Resumes/Cover Letters

6. Meta ChatGPT

  • Ask ChatGPT About Its Own Capabilities
    • Prompt: what ways can you structure text output?
  • Correct ChatGPT on Its Knowledge
  • Ask ChatGPT to Expand on Answers

References:

  1. Let’s build GPT: from scratch, in code, spelled out, by Andrej Karpathy.
  2. Hugging Face.
  3. Transformers-Tutorials.
  4. How should AI systems behave, and who should decide? OpenAI.