Package smile.llm.tokenizer
Class Tiktoken
java.lang.Object
smile.llm.tokenizer.Tiktoken
- All Implemented Interfaces:
Tokenizer
- Direct Known Subclasses:
Tokenizer
tiktoken is a fast BPE tokenizer by OpenAI.
-
Method Summary
Modifier and Type, Method, Description
void allowSpecialTokens(boolean allowSpecialTokens) - Sets how special tokens will be encoded.
decode(int[] tokens) - Decodes a list of token IDs into a string.
int[] encode - Encodes a string into a list of token IDs.
int[] encode - Encodes a string into a list of token IDs.
boolean isSpecialTokenAllowed() - Returns how special tokens will be encoded.
load - Loads a tiktoken model file.
String[] tokenize - Segments text into tokens.
-
Field Details
-
ranks
Token -> Rank
-
specialTokens
Special Token -> Rank
-
-
Constructor Details
-
Tiktoken
public Tiktoken(Pattern pattern, Map<Bytes, Integer> ranks, String bos, String eos, String... specialTokens)
Constructor.
- Parameters:
pattern - The regex pattern to split the input text into tokens.
ranks - The token to rank map.
bos - The beginning of sequence token.
eos - The end of sequence token.
specialTokens - Optional special tokens.
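To illustrate how the pattern and ranks arguments typically work together in a tiktoken-style tokenizer, here is a minimal, self-contained sketch. It is not the SMILE implementation. It assumes the usual byte-pair-encoding scheme: the regex splits the text into pieces, each piece starts out as single bytes, and the adjacent pair whose merged form has the lowest rank is merged until no pair is in the vocabulary. The class BpeSketch, the String-keyed rank map (ISO-8859-1 strings standing in for byte sequences), the pattern, and the toy vocabulary are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of rank-based BPE encoding; not the SMILE code. */
final class BpeSketch {
    /** Encodes text by regex pre-splitting, then lowest-rank-first merging. */
    static int[] encode(Pattern pattern, Map<String, Integer> ranks, String text) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            // Start from single bytes; each char below stands for one byte.
            List<String> parts = new ArrayList<>();
            for (byte b : m.group().getBytes(StandardCharsets.UTF_8)) {
                parts.add(new String(new byte[] {b}, StandardCharsets.ISO_8859_1));
            }
            while (parts.size() > 1) {
                int best = -1;
                int bestRank = Integer.MAX_VALUE;
                // Find the adjacent pair whose merged token has the lowest rank.
                for (int i = 0; i + 1 < parts.size(); i++) {
                    Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                    if (rank != null && rank < bestRank) {
                        best = i;
                        bestRank = rank;
                    }
                }
                if (best < 0) break; // no adjacent pair is in the vocabulary
                parts.set(best, parts.get(best) + parts.get(best + 1));
                parts.remove(best + 1);
            }
            for (String part : parts) {
                Integer id = ranks.get(part);
                if (id == null) {
                    throw new IllegalArgumentException("Token not in vocabulary");
                }
                ids.add(id); // in this scheme the rank doubles as the token ID
            }
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    /** A toy vocabulary made up for this example. */
    static Map<String, Integer> toyRanks() {
        return Map.of("a", 0, "b", 1, "c", 2, " ", 3, "ab", 4, "abc", 5);
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\s?\\w+"); // illustrative pattern only
        int[] ids = encode(pattern, toyRanks(), "abc ab");
        System.out.println(Arrays.toString(ids)); // prints [5, 3, 4]
    }
}
```

In the toy run, "abc" merges to "ab" + "c" and then to "abc" (ID 5), while " ab" stops at " " + "ab" (IDs 3 and 4) because " ab" is not in the vocabulary.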
-
-
Method Details
-
allowSpecialTokens
public void allowSpecialTokens(boolean allowSpecialTokens)
Sets how special tokens will be encoded.
- Parameters:
allowSpecialTokens - If false, special tokens will be encoded as natural text. Otherwise, they will be encoded as special tokens.
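The difference between the two modes can be pictured with a small standalone sketch. It only shows the segmentation step under an assumed design, and does not reproduce how this class handles special tokens internally: when special tokens are allowed, each occurrence is isolated as its own segment (to be mapped directly to its reserved ID), and otherwise the whole input is left as ordinary text for the regular encoder. The class SpecialTokenSketch, the split method, and the example token string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of special-token segmentation; not the SMILE code. */
final class SpecialTokenSketch {
    /**
     * Splits text into segments. With allowSpecial set, every occurrence of a
     * special token becomes a separate segment; otherwise the text is returned
     * whole so that it would be encoded as natural text.
     */
    static List<String> split(String text, List<String> specialTokens, boolean allowSpecial) {
        List<String> segments = new ArrayList<>();
        if (!allowSpecial || specialTokens.isEmpty()) {
            if (!text.isEmpty()) segments.add(text);
            return segments;
        }
        // Build an alternation of the literal (quoted) special tokens.
        StringBuilder regex = new StringBuilder();
        for (String token : specialTokens) {
            if (regex.length() > 0) regex.append('|');
            regex.append(Pattern.quote(token));
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) segments.add(text.substring(start, m.start()));
            segments.add(m.group()); // the special token itself
            start = m.end();
        }
        if (start < text.length()) segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        List<String> special = List.of("<|endoftext|>");
        String text = "Hello<|endoftext|>World";
        System.out.println(split(text, special, true));  // prints [Hello, <|endoftext|>, World]
        System.out.println(split(text, special, false)); // prints [Hello<|endoftext|>World]
    }
}
```

Encoding special tokens as natural text is the safer choice for untrusted input, since a user cannot then inject control tokens by typing their textual form.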
-
isSpecialTokenAllowed
public boolean isSpecialTokenAllowed()
Returns how special tokens will be encoded.
- Returns:
false if special tokens will be encoded as natural text; true if they will be encoded as special tokens.
-
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
-
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
-
decode
Description copied from interface: Tokenizer
Decodes a list of token IDs into a string.
-
tokenize
Description copied from interface: Tokenizer
Segments text into tokens.
-
load
Loads a tiktoken model file.
- Parameters:
path - The tiktoken model file path.
- Returns:
the token -> rank map.
- Throws:
IOException - if the model fails to load.
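For orientation, tiktoken vocabulary files as distributed by OpenAI are plain text with one entry per line: the base64-encoded token bytes, a space, and the integer rank. The sketch below parses that layout from a Reader. It is an assumption that load expects exactly this layout, and the sketch is not the SMILE implementation. The class TiktokenFileSketch, the parse method, and the String keys (ISO-8859-1 strings standing in for raw bytes) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of parsing a tiktoken-style rank file. */
final class TiktokenFileSketch {
    /** Parses lines of the form "<base64 token> <rank>" into a token -> rank map. */
    static Map<String, Integer> parse(Reader reader) throws IOException {
        Map<String, Integer> ranks = new LinkedHashMap<>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue; // skip blank lines
            int space = line.indexOf(' ');
            if (space < 0) throw new IOException("Malformed line: " + line);
            byte[] token = Base64.getDecoder().decode(line.substring(0, space));
            int rank = Integer.parseInt(line.substring(space + 1).trim());
            // Each char of the key stands for one raw token byte.
            ranks.put(new String(token, StandardCharsets.ISO_8859_1), rank);
        }
        return ranks;
    }

    public static void main(String[] args) throws IOException {
        // "YQ==" is base64 for "a" and "YWI=" is base64 for "ab".
        String file = "YQ== 0\nYWI= 1\n";
        System.out.println(parse(new StringReader(file))); // prints {a=0, ab=1}
    }
}
```

Tokens are stored as base64 because they are arbitrary byte sequences and need not be valid UTF-8 on their own.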
-