Text Tokenizer
Preview how text splits into tokens in one click — perfect for sizing LLM prompts and embedding chunks.
How to Use This Text Tokenizer
Step 1
Paste your prompt or chunk
Step 2
See each token and the total count
Step 3
Compare to your model's context window
Step 4
Hit Copy and resize as needed
What Is Text Tokenizer?
LLM costs and context windows are driven by tokens, not characters or words. Misjudging tokenization leads to truncated prompts and surprise bills.
Paste your prompt draft and see the token count instantly.
If you're engineering prompts, you stay under context limits. If you're chunking for embeddings, you size for the model's input window. If you're estimating costs, you forecast accurately before sending.
Frequently Asked Questions
Does this match GPT or Claude exactly?
It approximates GPT-style byte-pair encoding (BPE). Exact counts vary by model, since each ships its own tokenizer and vocabulary.
Tip: Use tiktoken for exact GPT-4 counts.
Why are my prompts more tokens than words?
Tokens are subword units, usually smaller than words. A longer word like 'Unbelievable' might split into 3-4 tokens.
Tip: Common short words are 1 token; rare words and code can be 4-10.
How are punctuation and spaces tokenized?
Punctuation usually becomes its own token. Leading spaces attach to the next word.
Tip: This is why ' hello' and 'hello' differ in BPE.
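To illustrate the space-attachment rule, here is a toy GPT-2-style pre-tokenizer. It is only a sketch of the first pre-splitting stage (a real BPE tokenizer then merges these pieces against a learned vocabulary), and the regex is a simplification of the real one:

```python
import re

# Simplified GPT-2-style pre-tokenizer: a leading space sticks to the
# word that follows it, so ' hello' and 'hello' become different pieces.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pretokenize(text: str) -> list[str]:
    return PRETOKEN.findall(text)

print(pretokenize("Say hello, world!"))
# pieces: ['Say', ' hello', ',', ' world', '!']
```

Note that ' hello' and 'hello' come out as distinct pieces, which is why they map to different token IDs in real BPE vocabularies.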
Safe chunk size for embeddings?
Stay 200-300 tokens below the model max for instruction headroom.
Tip: 500-800 token chunks work well for retrieval quality.
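The sizing advice above can be sketched as a greedy chunker. This uses the rough heuristic of ~4 characters per English token rather than a real tokenizer, and all names here are illustrative:

```python
# Sketch: pack words into chunks under an estimated token budget,
# using the ~4 chars/token rule of thumb (not a real tokenizer).
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def chunk_by_tokens(text: str, max_tokens: int = 600) -> list[str]:
    """Greedily pack words until the estimated token budget is reached."""
    chunks, current, used = [], [], 0
    for word in text.split():
        cost = estimate_tokens(word)
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "lorem ipsum " * 2000
pieces = chunk_by_tokens(doc, max_tokens=600)
print(len(pieces))  # number of chunks produced
```

For production use, swap the heuristic for an exact count from your model's tokenizer; the greedy packing logic stays the same.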
Code tokenization vs prose?
Code usually produces more tokens per character than prose, because brackets, operators, and semicolons each tend to become separate tokens.
Tip: A code prompt can use roughly 1.5x the tokens of prose of the same length.
Does CJK tokenize differently?
Yes. Chinese, Japanese, and Korean text typically splits into per-character or small subword tokens.
Tip: CJK is far more token-dense than English.