Tokenization - why do we need it?
To convert string data into integers of some form. Here we will use character-level encoding and decoding.
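The mapping below relies on a chars variable that has to exist first. A minimal sketch, assuming the dataset has already been read into a string called text:

chars = sorted(list(set(text)))   # every unique character that appears in the text
vocab_size = len(chars)           # size of the character vocabulary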
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
print(encode("hii there")) #output - [46, 47, 47, 1, 58, 46, 43, 56, 43]
print(decode(encode("hii there"))) # output - hii there
The above is a simple example of encoding; there are more efficient methods as well (e.g. subword tokenizers), but we will stick with character-level encoding for this project.
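For contrast, here is a quick sketch of a subword tokenizer using OpenAI's tiktoken package (an assumption here, it is not used in this project): it trades a much larger vocabulary for much shorter sequences.

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                          # 50257 subword tokens in the GPT-2 vocabulary
print(enc.encode("hii there"))              # a few subword ids instead of one id per character
print(enc.decode(enc.encode("hii there")))  # hii there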
Next, we use this encoder to encode the entire input text into a list of integers, basically encoding the whole text dataset, and wrap it in a tensor:

import torch

data = torch.tensor(encode(text), dtype=torch.long)

This converts the long text data into a single 1-dimensional tensor of token ids.
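One thing to note: get_batch() further down reads from train_data and val_data, which are not defined in these notes. A minimal sketch, assuming a 90/10 train/validation split (the exact ratio is an assumption):

n = int(0.9 * len(data))   # assumed 90/10 split between training and validation
train_data = data[:n]
val_data = data[n:]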
We never feed the entire text into the model at once; that would be computationally expensive. Instead, we divide the training data into chunks and feed those chunks into the model.
block_size = 8
train_data[:block_size+1] # output - tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])
Here block_size indicates the size of the chunk of data we consider at a time.
Note that tensor([18, 47, 56, 57, 58, 1, 15, 47, 58]) has multiple examples packed into it, since each character follows from the ones before it: if the input is [18], the target is 47; in the context of [18, 47] the target is 56; and so on, all the way up to the context [18, 47, 56, 57, 58, 1, 15, 47], whose target is 58.
So in one pass we train on all 8 examples, with context lengths from 1 up to block_size (here 8), so that the model learns to predict from every context size in between. Later, when generating, the model can predict with up to block_size tokens of context, after which we have to start truncating, because the model will never receive an input longer than block_size. The sketch below makes these packed examples explicit.
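A small sketch of those packed examples (assuming the train_data split from above):

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context.tolist()} the target is {target.item()}")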
Batch dimension: for efficiency, we process multiple chunks in parallel at the same time, but the chunks are completely independent of each other.
How get_batch() works:

batch_size = 4
block_size = 8

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y
Note that ix = torch.randint(len(data) - block_size, (batch_size,)) generates batch_size random offsets (so 4 numbers between 0 and len(data) - block_size).
Then x = torch.stack([data[i:i+block_size] for i in ix]) takes the block_size characters starting at each offset i, and y takes the same chunks shifted one position to the right (the next characters).
torch.stack takes these 1-dimensional tensors and stacks them into a 4 x 8 tensor (batch_size x block_size).
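A quick usage check (the seed value is arbitrary, just to make the random offsets reproducible):

torch.manual_seed(1337)
xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([4, 8]) torch.Size([4, 8])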
Now for the simplest possible language model, the bigram model:

import torch.nn as nn

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B,T,C)
        return logits
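A quick usage sketch, assuming the vocab_size and get_batch() from above; note that forward accepts targets but does not use them yet, so here the cross-entropy loss is computed outside the model:

import torch.nn.functional as F

model = BigramLanguageModel(vocab_size)
xb, yb = get_batch('train')
logits = model(xb, yb)            # (B, T, C) = (4, 8, vocab_size)
B, T, C = logits.shape
loss = F.cross_entropy(logits.view(B*T, C), yb.view(B*T))
print(logits.shape, loss.item())  # before training, the loss should be roughly -ln(1/vocab_size)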