Text Tokenizer
Preview how text splits into tokens in one click — perfect for sizing LLM prompts and embedding chunks.
How to Use This Text Tokenizer
Step 1
Paste your prompt or chunk
Step 2
See each token and the total count
Step 3
Compare to your model's context window
Step 4
Hit Copy and resize as needed
What Is Text Tokenizer?
LLM costs and context windows are driven by tokens, not characters or words. Misjudging tokenization leads to truncated prompts and surprise bills.
Paste your prompt draft and see the token count instantly.
If you're engineering prompts, you stay under context limits. If you're chunking for embeddings, you size for the model's input window. If you're estimating costs, you forecast accurately before sending.
Frequently Asked Questions
Does this match GPT or Claude exactly?
It approximates GPT-style byte-pair encoding (BPE). Exact counts vary by model, since each ships its own tokenizer and vocabulary.
Tip: Use tiktoken for exact GPT-4 counts.
Why are my prompts more tokens than words?
Tokens are subword units, usually smaller than words. A longer word like 'Unbelievable' might split into 3-4 tokens.
Tip: Common short words are 1 token; rare words and code can be 4-10.
How are punctuation and spaces tokenized?
Punctuation usually becomes its own token. Leading spaces attach to the next word.
Tip: This is why ' hello' and 'hello' differ in BPE.
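To illustrate the space-attachment rule, here is a toy GPT-2-style pre-tokenizer. It is only a sketch of the first pre-splitting stage (a real BPE tokenizer then merges these pieces against a learned vocabulary), and the regex is a simplification of the real one:

```python
import re

# Simplified GPT-2-style pre-tokenizer: a leading space sticks to the
# word that follows it, so ' hello' and 'hello' become different pieces.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pretokenize(text: str) -> list[str]:
    return PRETOKEN.findall(text)

print(pretokenize("Say hello, world!"))
# pieces: ['Say', ' hello', ',', ' world', '!']
```

Note that ' hello' and 'hello' come out as distinct pieces, which is why they map to different token IDs in real BPE vocabularies.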
Safe chunk size for embeddings?
Stay 200-300 tokens below the model max for instruction headroom.
Tip: 500-800 token chunks work well for retrieval quality.
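The sizing advice above can be sketched as a greedy chunker. This uses the rough heuristic of ~4 characters per English token rather than a real tokenizer, and all names here are illustrative:

```python
# Sketch: pack words into chunks under an estimated token budget,
# using the ~4 chars/token rule of thumb (not a real tokenizer).
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def chunk_by_tokens(text: str, max_tokens: int = 600) -> list[str]:
    """Greedily pack words until the estimated token budget is reached."""
    chunks, current, used = [], [], 0
    for word in text.split():
        cost = estimate_tokens(word)
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "lorem ipsum " * 2000
pieces = chunk_by_tokens(doc, max_tokens=600)
print(len(pieces))  # number of chunks produced
```

For production use, swap the heuristic for an exact count from your model's tokenizer; the greedy packing logic stays the same.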
Code tokenization vs prose?
Code usually produces more tokens per character than prose, because brackets, operators, and semicolons each tend to become separate tokens.
Tip: A code prompt can use roughly 1.5x the tokens of prose of the same length.
Does CJK tokenize differently?
Yes. Chinese, Japanese, and Korean text typically splits into per-character or small subword tokens.
Tip: CJK is far more token-dense than English.