Documentation
knowmatic provides tiny ML models you embed directly in your application. They run locally and return structured predictions your agent can use to make routing decisions: which LLM to call, how much reasoning budget to allocate, and what language context to inject.
These are not API endpoints. You load the models into your app (browser via ONNX Runtime Web, or server via ONNX Runtime Node) and call them locally. The output is a JSON object you use however you want.
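A minimal loading sketch with ONNX Runtime Web (swap the import for onnxruntime-node on the server). The model path and the assumption that each model ships as a plain .onnx file are illustrative, not knowmatic's actual packaging:

import * as ort from "onnxruntime-web";

// Create the session once and reuse it for every prediction.
// The path is a placeholder; point it at wherever you host the model file.
async function loadClassifier(modelUrl: string): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create(modelUrl);
}

const session = await loadClassifier("/models/difficulty_classifier.onnx");
// inputNames/outputNames tell you which tensors the model expects and produces.
console.log(session.inputNames, session.outputNames);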
Model: difficulty_classifier
Takes a prompt string and predicts its difficulty level. Use this to decide which model tier your agent should route to: send easy prompts to a cheap, fast model and reserve expensive models for hard problems.
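A sketch of tier routing on this classifier's output. The model-tier names are placeholders, and InferenceResult mirrors the shape documented at the bottom of this page:

interface InferenceResult {
  predictions: { label: string; score: number }[];
  latencyMs: number;
}

// Route on the top prediction; fall back to a mid tier when confidence is low.
function pickModelTier(result: InferenceResult): string {
  const top = result.predictions[0]; // predictions are sorted, highest score first
  if (top.score < 0.6) return "mid-tier-model";        // low confidence: play it safe
  if (top.label === "Easy") return "small-fast-model"; // cheap and fast
  if (top.label === "Hard") return "large-reasoning-model";
  return "mid-tier-model";                             // Medium
}

Example output: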
{
  "predictions": [
    { "label": "Easy", "score": 0.993 },
    { "label": "Hard", "score": 0.004 },
    { "label": "Medium", "score": 0.003 }
  ],
  "latencyMs": 68.9
}
Model: reasoning_effort
Predicts how much reasoning a prompt requires. Use this to set the thinking budget on models that support it (e.g. Claude's extended thinking), saving tokens and latency on prompts that don't need deep reasoning.
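A sketch of mapping the predicted label to a thinking-token budget. The budget numbers are placeholders; pass the result to whatever extended-thinking parameter your LLM provider exposes:

type Effort = "low" | "medium" | "high";

// Example budgets: skip extended thinking on easy prompts, spend freely on hard ones.
const THINKING_BUDGETS: Record<Effort, number> = {
  low: 0,
  medium: 4000,
  high: 16000,
};

function thinkingBudget(label: string): number {
  return THINKING_BUDGETS[label as Effort] ?? THINKING_BUDGETS.medium;
}

Example output: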
{
  "predictions": [
    { "label": "low", "score": 0.935 },
    { "label": "medium", "score": 0.062 },
    { "label": "high", "score": 0.003 }
  ],
  "latencyMs": 82.7
}
Model: code_classifier
Detects the programming language of code snippets. Use this to inject language-specific system prompts, select the right linter or formatter context, or route to a model fine-tuned for that language.
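A sketch of turning the top prediction into language-specific system-prompt context. The guidance strings are placeholders for whatever conventions your agent actually enforces:

// Map the predicted language to extra system-prompt context.
const LANGUAGE_CONTEXT: Record<string, string> = {
  Python: "Follow PEP 8 and prefer type hints.",
  TypeScript: "Assume strict mode; prefer explicit return types.",
  CSS: "Prefer flexbox/grid over float-based layout.",
};

function languageContext(label: string): string {
  return LANGUAGE_CONTEXT[label] ?? "";
}

Example output: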
{
  "predictions": [
    { "label": "Python", "score": 0.994 },
    { "label": "TypeScript", "score": 0.006 },
    { "label": "CSS", "score": 0.000 }
  ],
  "latencyMs": 39.6
}
Model: sft
A small generative transformer that produces inline text completions. Generates tokens one at a time with configurable temperature, top-k sampling, and a minimum confidence threshold. Stops when confidence drops below 90% or the EOS token is produced.
Unlike the classifiers, this model streams tokens. Each token is emitted as it's generated. The output is raw text you append to the user's input as a suggestion.
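A sketch of consuming the stream. The event shape matches the example below; receiving it as an AsyncIterable is an assumption for illustration, not knowmatic's actual streaming interface:

type SftEvent =
  | { type: "token"; token: string; tokenId: number }
  | { type: "done" };

// Accumulate streamed tokens into a single suggestion string.
async function collectSuggestion(events: AsyncIterable<SftEvent>): Promise<string> {
  let suggestion = "";
  for await (const event of events) {
    if (event.type === "done") break;
    suggestion += event.token; // append each token as it arrives
  }
  return suggestion; // render this as ghost text after the user's input
}

Example stream: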
{ "type": "token", "token": " calculates", "tokenId": 4821 }
{ "type": "token", "token": " the", "tokenId": 290 }
{ "type": "token", "token": " sum", "tokenId": 1564 }
{ "type": "done" }
All classifier models (difficulty, reasoning effort, code) return the same structure. Predictions are sorted by score, highest first. Use the top prediction for routing or inspect multiple predictions for confidence-based logic.
interface InferenceResult {
  predictions: {
    label: string;  // The predicted class label
    score: number;  // Probability (0-1); all scores sum to 1
  }[];
  latencyMs: number; // Inference time in milliseconds
}
How to use this in your agent: call the classifier before your LLM API request. Read predictions[0].label for the top prediction, then use it to select the model, set the thinking budget, or inject context. The entire classification runs locally in under 100ms.
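An end-to-end sketch of that flow, using the InferenceResult interface above. runClassifier is a hypothetical wrapper around the locally loaded sessions, and the request fields are placeholders for your provider's API:

// Hypothetical wrapper around the local ONNX sessions.
declare function runClassifier(
  model: "difficulty_classifier" | "reasoning_effort",
  input: string
): Promise<InferenceResult>;

async function buildRequest(prompt: string) {
  // Both classifiers run locally, so running them in parallel keeps total latency low.
  const [difficulty, effort] = await Promise.all([
    runClassifier("difficulty_classifier", prompt),
    runClassifier("reasoning_effort", prompt),
  ]);

  return {
    model: difficulty.predictions[0].label === "Hard" ? "large-model" : "small-fast-model",
    thinkingBudget: effort.predictions[0].label === "high" ? 16000 : 0,
    prompt,
  };
}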