Tokenization - why do we need it?

We need it to convert raw string data into integers (one integer per token) that a model can work with.

Here we will use character-level encoding and decoding: every unique character gets its own integer.

chars = sorted(set(text))  # assumed: the sorted list of unique characters in the input text
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
  
print(encode("hii there"))  #output - [46, 47, 47, 1, 58, 46, 43, 56, 43]
print(decode(encode("hii there"))) # output - hii there

The above is a simple example of encoding; there are other, more efficient methods as well (e.g. subword tokenizers), but we will stick with character-level encoding for this project.

Next, we will use this encoder to encode all of the input text into a list of integers, i.e. we encode the entire text dataset.

import torch

data = torch.tensor(encode(text), dtype=torch.long)

This converts the entire encoded text into a single 1-D tensor of 64-bit integers.
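Below, train_data and val_data refer to a train/validation split of this tensor. A minimal sketch, assuming a 90/10 split:

# split the encoded data into train and validation sets
# (the 90/10 ratio here is an assumption; any held-out fraction works)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]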

Chunks:

We never feed the entire text into the model at once; that would be computationally expensive. Instead, we sample chunks of the training data and feed those to the model.

block_size = 8
train_data[:block_size+1] # output - tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

Here block_size is the maximum context length, i.e. the size of the chunk of data we consider at a time.

Note that the tensor([18, 47, 56, 57, 58, 1, 15, 47, 58]) has multiple training examples packed into it, because every character is a target for the characters that precede it: if the input is [18], the target is 47; in the context of [18, 47], the target is 56; and so on, all the way up to the context [18, 47, 56, 57, 58, 1, 15, 47], whose target is 58.

So from one chunk we train on all 8 examples, with context lengths ranging from 1 up to block_size (here 8), so that the model learns to predict from every context length in between. At generation time the model can therefore start from a single character, but once the context grows beyond block_size we have to start truncating it, because the model never receives an input longer than block_size.
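To make this concrete, here is a small sketch that walks over one chunk and prints every (context, target) pair packed into it:

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]   # everything up to and including position t
    target = y[t]       # the character that follows that context
    print(f"when input is {context.tolist()} the target is {target.item()}")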

Batch dimension: for efficiency we process multiple chunks in parallel at the same time, but the chunks within a batch are completely independent of each other.

How get_batch() works:

batch_size = 4
block_size = 8
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

Note that ix = torch.randint(len(data) - block_size, (batch_size,)) generates batch_size random offsets (here, 4 integers between 0 and len(data) - block_size, exclusive of the upper bound).

Now x = torch.stack([data[i:i+block_size] for i in ix]) takes the block_size characters starting at each offset i, and y takes the next characters, i.e. the same sequences shifted one position to the right.

torch.stack takes these 1-D tensors and stacks them as rows into a 4 x 8 tensor (batch_size x block_size).
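As a quick sanity check (a sketch; the actual values depend on the random offsets), calling get_batch returns two (batch_size, block_size) tensors:

xb, yb = get_batch('train')
print(xb.shape)  # torch.Size([4, 8])
print(yb.shape)  # torch.Size([4, 8])
# each row of yb is the matching row of xb shifted one character to the right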

Bigram model:

import torch.nn as nn

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx) # (B,T,C), where C = vocab_size
        return logits
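A quick sketch of running the model on a batch (here vocab_size is assumed to be len(chars) from the tokenization step above):

vocab_size = len(chars)           # assumption: vocabulary built from the character set above
m = BigramLanguageModel(vocab_size)
logits = m(xb, yb)                # xb, yb from get_batch('train')
print(logits.shape)               # torch.Size([4, 8, vocab_size]) -> (B, T, C)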