1. Transformer[1]
I started by following the video[1] from Andrej Karpathy and checking the code from the repo. In the video, he shows step by step how to build a GPT from scratch, which also illustrates at a high level how to build an NLP model. I summarize these steps below and add some useful tips mentioned in the video.
1.1 Data Preprocessing
Tokenizer: the video uses a simple character-level tokenizer that maps every distinct character in the corpus to an integer ID and back (production GPTs use subword tokenizers such as BPE instead).
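A minimal sketch of such a character-level encode/decode pair (the `input.txt` file name is an assumption for illustration; the lecture uses the tiny Shakespeare text):

```python
# read the raw training text (file name assumed; the lecture uses tiny Shakespeare)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# vocabulary = all distinct characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> char
encode = lambda s: [stoi[c] for c in s]            # string -> list of ints
decode = lambda l: ''.join([itos[i] for i in l])   # list of ints -> string

print(decode(encode("hii there")))  # round-trips back to "hii there"
```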
1.2 Bigram Model
The code is here. The high-level steps to train a bigram model are:
- A mapping from characters to integers
- Split the data into train and validation sets
- Data loading
- Estimate the loss (data loading and loss estimation are sketched right after this list)
- Define the bigram language model
- Optimizer: AdamW
- Train and generate outputs
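A rough sketch of the data-loading and loss-estimation steps. The hyperparameter values are assumptions that roughly follow the lecture's script, and `encode`/`text` come from the tokenizer sketch above:

```python
import torch

# assumed hyperparameters, roughly following the lecture's script
batch_size = 64      # sequences per batch
block_size = 256     # maximum context length
eval_iters = 200     # batches to average when estimating the loss
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# encode the full corpus and split it 90/10 into train and validation sets
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    # sample a random batch of contexts x and next-token targets y
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model):
    # average the loss over several batches for a less noisy estimate
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```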
Bigram model: also called a 2-gram model, meaning the prediction of the next word depends only on the single preceding word, i.e. it models $P(w_t \mid w_{t-1})$. In an $N$-gram model, the $N$-th word is predicted from the preceding $N-1$ words.
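A minimal sketch of the bigram language model itself, following the lecture: a single `(vocab_size, vocab_size)` embedding table whose rows are read off directly as the logits for the next token:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)   # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)   # sample one token
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```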
1.3 Transformer
Some example snippets.
Head/Self-Attention
```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
```
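These snippets rely on a handful of module-level hyperparameters that the full script defines up front. The values below are assumptions that roughly match the lecture's character-level config; the small shape check just confirms what each module returns:

```python
import torch

# hyperparameters assumed by the snippets in this section
# (values roughly follow the lecture's character-level Shakespeare config)
n_embd = 384
n_head = 6
n_layer = 6
block_size = 256
dropout = 0.2
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# sanity-check the output shapes of a single head and the multi-head wrapper
x = torch.randn(4, block_size, n_embd)                              # (B, T, C)
head = Head(head_size=n_embd // n_head)
print(head(x).shape)                                                # (4, 256, 64)
mha = MultiHeadAttention(num_heads=n_head, head_size=n_embd // n_head)
print(mha(x).shape)                                                 # (4, 256, 384)
```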
Block
```python
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```
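One detail worth noting in Block.forward: LayerNorm is applied before each sub-layer and the result is added back through a residual connection (the pre-norm formulation). As mentioned in the video, this deviates slightly from the post-norm arrangement of the original Transformer paper and tends to make deep stacks easier to train.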
Model
```python
class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
```
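To cover the remaining steps from the bigram list (AdamW optimizer, then train and generate), here is a rough sketch of the training loop. It reuses vocab_size, get_batch, estimate_loss, and decode from the earlier sketches; learning_rate, max_iters, and eval_interval are assumed values that roughly follow the lecture's script:

```python
# assumed training hyperparameters
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500

model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    # periodically report the smoothed train/val loss
    if step % eval_interval == 0:
        losses = estimate_loss(model)
        print(f"step {step}: train {losses['train']:.4f}, val {losses['val']:.4f}")

    xb, yb = get_batch('train')           # sample a batch of inputs/targets
    logits, loss = model(xb, yb)          # forward pass and loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate: start from a single token with index 0 and sample 500 new tokens
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
```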
2. ChatGPT
2.1 Key ideas about how ChatGPT works[4]
The steps to train ChatGPT today (Feb-16-2023):
- Pre-training: the objective is to predict the next word in a sentence; the dataset comes from Internet text.
- Fine-tuning: the dataset comes from reviewers' instructions.
- InstructGPT: Aligning Language Models to Follow Instructions. (link)
- Addressing biases: guidelines. Plans for future systems:
- Improve default behavior.
- Define your AI’s values, within broad bounds.
- Public input on defaults and hard bounds.
2.2 Train a GPT-2[1]
I forked the original repo to store my personal updates here: https://github.com/C0ldstudy/nanoGPT.
ChatGPT Function Cheat Sheet
Some useful materials for using ChatGPT:
- https://platform.openai.com/docs/guides/gpt-best-practices
1. NLP Tasks
- Text Generation
- Summarization
- Open Domain Question Answering
- Paraphrasing
- Sentiment Analysis (few-shot or zero-shot)
- Table to Text
- Text to Table
- Token Classification
- Prompt: classify the named entities in this text: George Washington and his troops crossed the Delaware River on December 25, 1776 during the American Revolutionary War.
- Dataset Generation
- Prompt: generate more datapoints from this text:
- Language Translation
2. Code
- Code Generation:
- Prompt: show me how to make an http request in Python
- Code Explanation:
- Prompt: explain this python code
- Docstring Generation
- Prompt: write a docstring description for this function
- Programming Language Conversion
- Prompt: convert this code from Python to Javascript
- Data Object Conversions (JSON, XML, CSV etc.)
- Knowledge Graph Generation
- Prompt: convert this text into nodes and edges
- HTML to Text (Web Scraping)
- List
- Prompt: give me a list of 5 citrus fruits
- Numbered List
- Headings and Subheadings
- Prompt: convert this text into headings and subheadings
- Tables
- Narrative Modes (1st, 2nd, or 3rd person)
- Prompt: write a paragraph on how to make brownies in the 1st person
- Formal/Informal
- Personas
- Prompt: write a paragraph on the topic of cellular automata in the style of a social media influencer
- Custom Text Manipulation
- Write Social Media Posts
- Prompt: write a tweet on futurism
- Write Blogs/Emails/Poems/Songs/Resumes/Cover Letters
6. Meta ChatGPT
- Ask ChatGPT About Its Own Capabilities
- Prompt: what ways can you structure text output?
- Correct ChatGPT on Its Knowledge
- Ask ChatGPT to Expand on Answers
References: