Documentation
knowmatic provides tiny ML models you embed directly in your application. They run locally and return structured predictions your agent can use to make routing decisions: which LLM to call, how much reasoning budget to allocate, and what language context to inject.
These are not API endpoints. You load the models into your app (browser via ONNX Runtime Web, or server via ONNX Runtime Node) and call them locally. The output is a JSON object you use however you want.
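A minimal loading sketch with ONNX Runtime Web (swap the import for onnxruntime-node on the server). The model path and the assumption that each model ships as a plain .onnx file are illustrative, not knowmatic's actual packaging:

import * as ort from "onnxruntime-web";

// Create the session once and reuse it for every prediction.
// The path is a placeholder; point it at wherever you host the model file.
async function loadClassifier(modelUrl: string): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create(modelUrl);
}

const session = await loadClassifier("/models/difficulty_classifier.onnx");
// inputNames/outputNames tell you which tensors the model expects and produces.
console.log(session.inputNames, session.outputNames);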
Model: difficulty_classifier
Takes a prompt string and predicts its difficulty level. Use this to decide which model tier your agent should route to: send easy prompts to a cheap, fast model and reserve expensive models for hard problems.
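A sketch of tier routing on this classifier's output. The model-tier names are placeholders, and InferenceResult mirrors the shape documented at the bottom of this page:

interface InferenceResult {
  predictions: { label: string; score: number }[];
  latencyMs: number;
}

// Route on the top prediction; fall back to a mid tier when confidence is low.
function pickModelTier(result: InferenceResult): string {
  const top = result.predictions[0]; // predictions are sorted, highest score first
  if (top.score < 0.6) return "mid-tier-model";        // low confidence: play it safe
  if (top.label === "Easy") return "small-fast-model"; // cheap and fast
  if (top.label === "Hard") return "large-reasoning-model";
  return "mid-tier-model";                             // Medium
}

Example output: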
{
  "predictions": [
    { "label": "Easy", "score": 0.993 },
    { "label": "Hard", "score": 0.004 },
    { "label": "Medium", "score": 0.003 }
  ],
  "latencyMs": 68.9
}
Model: reasoning_effort
Predicts how much reasoning a prompt requires. Use this to set the thinking budget on models that support it (e.g. Claude's extended thinking), saving tokens and latency on prompts that don't need deep reasoning.
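A sketch of mapping the predicted label to a thinking-token budget. The budget numbers are placeholders; pass the result to whatever extended-thinking parameter your LLM provider exposes:

type Effort = "low" | "medium" | "high";

// Example budgets: skip extended thinking on easy prompts, spend freely on hard ones.
const THINKING_BUDGETS: Record<Effort, number> = {
  low: 0,
  medium: 4000,
  high: 16000,
};

function thinkingBudget(label: string): number {
  return THINKING_BUDGETS[label as Effort] ?? THINKING_BUDGETS.medium;
}

Example output: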
{
  "predictions": [
    { "label": "low", "score": 0.935 },
    { "label": "medium", "score": 0.062 },
    { "label": "high", "score": 0.003 }
  ],
  "latencyMs": 82.7
}
Model: code_classifier
Detects the programming language of code snippets. Use this to inject language-specific system prompts, select the right linter or formatter context, or route to a model fine-tuned for that language.
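A sketch of turning the top prediction into language-specific system-prompt context. The guidance strings are placeholders for whatever conventions your agent actually enforces:

// Map the predicted language to extra system-prompt context.
const LANGUAGE_CONTEXT: Record<string, string> = {
  Python: "Follow PEP 8 and prefer type hints.",
  TypeScript: "Assume strict mode; prefer explicit return types.",
  CSS: "Prefer flexbox/grid over float-based layout.",
};

function languageContext(label: string): string {
  return LANGUAGE_CONTEXT[label] ?? "";
}

Example output: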
{
  "predictions": [
    { "label": "Python", "score": 0.994 },
    { "label": "TypeScript", "score": 0.006 },
    { "label": "CSS", "score": 0.000 }
  ],
  "latencyMs": 39.6
}
Model: sft
A small generative transformer that produces inline text completions. Generates tokens one at a time with configurable temperature, top-k sampling, and a minimum confidence threshold. Stops when confidence drops below 90% or the EOS token is produced.
Unlike the classifiers, this model streams tokens. Each token is emitted as it's generated. The output is raw text you append to the user's input as a suggestion.
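A sketch of consuming the stream. The event shape matches the example below; receiving it as an AsyncIterable is an assumption for illustration, not knowmatic's actual streaming interface:

type SftEvent =
  | { type: "token"; token: string; tokenId: number }
  | { type: "done" };

// Accumulate streamed tokens into a single suggestion string.
async function collectSuggestion(events: AsyncIterable<SftEvent>): Promise<string> {
  let suggestion = "";
  for await (const event of events) {
    if (event.type === "done") break;
    suggestion += event.token; // append each token as it arrives
  }
  return suggestion; // render this as ghost text after the user's input
}

Example stream: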
{ "type": "token", "token": " calculates", "tokenId": 4821 }
{ "type": "token", "token": " the", "tokenId": 290 }
{ "type": "token", "token": " sum", "tokenId": 1564 }
{ "type": "done" }
All classifier models (difficulty, reasoning effort, code) return the same structure. Predictions are sorted by score, highest first. Use the top prediction for routing or inspect multiple predictions for confidence-based logic.
interface InferenceResult {
  predictions: {
    label: string;  // The predicted class label
    score: number;  // Probability (0-1); all scores sum to 1
  }[];
  latencyMs: number; // Inference time in milliseconds
}
How to use this in your agent: call the classifier before your LLM API request. Read predictions[0].label for the top prediction, then use it to select the model, set the thinking budget, or inject context. The entire classification runs locally in under 100ms.
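An end-to-end sketch of that flow, using the InferenceResult interface above. runClassifier is a hypothetical wrapper around the locally loaded sessions, and the request fields are placeholders for your provider's API:

// Hypothetical wrapper around the local ONNX sessions.
declare function runClassifier(
  model: "difficulty_classifier" | "reasoning_effort",
  input: string
): Promise<InferenceResult>;

async function buildRequest(prompt: string) {
  // Both classifiers run locally, so running them in parallel keeps total latency low.
  const [difficulty, effort] = await Promise.all([
    runClassifier("difficulty_classifier", prompt),
    runClassifier("reasoning_effort", prompt),
  ]);

  return {
    model: difficulty.predictions[0].label === "Hard" ? "large-model" : "small-fast-model",
    thinkingBudget: effort.predictions[0].label === "high" ? 16000 : 0,
    prompt,
  };
}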