Welcome to JTokkit, a Java tokenizer library designed for use with OpenAI models.
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
assertEquals("hello world", enc.decode(enc.encode("hello world")));
// Or get the tokenizer corresponding to a specific OpenAI model
enc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
To get started quickly, see our documentation.
JTokkit aims to be a fast and efficient tokenizer for natural language processing tasks with OpenAI models. It provides an easy-to-use interface for tokenizing input text, for example for counting the number of tokens required before sending a request to the GPT-3.5 model. This library grew out of the need for capabilities in the JVM ecosystem similar to those the tiktoken library provides for Python.
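For example, a minimal sketch of the token-counting use case (the prompt string and the choice of ModelType.GPT_3_5_TURBO are only illustrative):

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);

// Count the tokens of a prompt before sending the request
int promptTokens = enc.countTokens("Translate the following sentence to French: Hello, world!");
System.out.println("The prompt uses " + promptTokens + " tokens");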
✅ Implements encoding and decoding via r50k_base, p50k_base, p50k_edit, cl100k_base and o200k_base
✅ Easy-to-use API
✅ Easy extensibility for custom encoding algorithms
✅ Zero Dependencies
✅ Supports Java 8 and above
✅ Fast and efficient performance
JTokkit is between two and three times faster than a comparable tokenizer.
For details on the benchmark, see the benchmark directory.
You can install JTokkit by adding the following dependency to your Maven project:
<dependency>
<groupId>com.knuddels</groupId>
<artifactId>jtokkit</artifactId>
<version>1.1.0</version>
</dependency>
Or alternatively using Gradle:
dependencies {
implementation 'com.knuddels:jtokkit:1.1.0'
}
To use JTokkit, simply create a new EncodingRegistry and use getEncoding to retrieve the encoding you want to use. You can then use the encode and decode methods to encode and decode text.
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
IntArrayList encoded = enc.encode("This is a sample sentence.");
// encoded = [2028, 374, 264, 6205, 11914, 13]
String decoded = enc.decode(encoded);
// decoded = "This is a sample sentence."
// Or get the tokenizer based on the model type
Encoding secondEnc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
// enc == secondEnc
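If you only need a bounded number of tokens, there is also an encode overload that takes a maximum token count; the sketch below assumes it returns an EncodingResult exposing getTokens() and isTruncated():

// Encode with an upper bound on the number of tokens
EncodingResult result = enc.encode("This is a sample sentence.", 5);
// result.getTokens() = [2028, 374, 264, 6205, 11914]
// result.isTruncated() = true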
The EncodingRegistry and Encoding classes are thread-safe and can be freely shared among components.
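For example, a single Encoding instance can be handed to several worker threads without additional synchronization (a sketch using java.util.concurrent; the thread pool setup and document strings are only placeholders):

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

ExecutorService executor = Executors.newFixedThreadPool(4);
for (String document : Arrays.asList("first document", "second document", "third document")) {
    // The shared Encoding instance is safe to use from all worker threads
    executor.submit(() -> System.out.println(document + ": " + enc.countTokens(document) + " tokens"));
}
executor.shutdown();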
You may want to extend JTokkit to support custom encodings. To do so, you have two options:
- Implement the Encoding interface and register it with the EncodingRegistry
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
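// CustomEncoding stands for your own class implementing the Encoding interface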
Encoding customEncoding = new CustomEncoding();
registry.registerEncoding(customEncoding);
- Add new parameters for use with the existing BPE algorithm
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
GptBytePairEncodingParams params = new GptBytePairEncodingParams(
"custom-name",
Pattern.compile("some custom pattern"),
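// encodingMap: token byte sequences mapped to their integer ranks,
// specialTokenEncodingMap: special token strings mapped to their token ids
// (both assumed to be built or loaded elsewhere, e.g. from a vocabulary file)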
encodingMap,
specialTokenEncodingMap
);
registry.registerGptBytePairEncoding(params);
Afterwards you can use the custom encodings alongside the default ones and access them via registry.getEncoding("custom-name"). See the JavaDoc for more details.
JTokkit is licensed under the MIT License. See the LICENSE file for more information.