
Model Outputs

knowmatic provides tiny ML models you embed directly in your application. They run locally and return structured predictions your agent can use to make routing decisions: which LLM to call, how much reasoning budget to allocate, and what language context to inject.

These are not API endpoints. You load the models into your app (browser via ONNX Runtime Web, or server via ONNX Runtime Node) and call them locally. The output is a JSON object you use however you want.
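
For example, loading and running one of the classifiers in the browser might look like the sketch below. The model path, tensor names, and preprocessing are assumptions for illustration; use whatever your downloaded model actually ships with.

import * as ort from "onnxruntime-web";

// Load the quantized model once and reuse the session across calls.
// "./difficulty_classifier.onnx" is a placeholder path.
const session = await ort.InferenceSession.create("./difficulty_classifier.onnx");

async function run(inputIds: bigint[]) {
  // One sequence of int64 token IDs, shape [1, seqLen]. How you tokenize
  // depends on the tokenizer bundled with the model.
  const input = new ort.Tensor("int64", BigInt64Array.from(inputIds), [1, inputIds.length]);

  // "input_ids" and "logits" are hypothetical tensor names; check
  // session.inputNames / session.outputNames for the real ones.
  const results = await session.run({ input_ids: input });
  return results["logits"].data;
}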


Difficulty Classifier

Model: difficulty_classifier

Takes a prompt string and predicts its difficulty level. Use this to decide which model tier your agent should route to: send easy prompts to a cheap, fast model and reserve expensive models for hard problems.

Output Labels

Label    ID   Suggested Routing
Easy     0    Haiku ($): fast, cheap
Medium   1    Sonnet ($$): balanced
Hard     2    Opus ($$$): most capable

Example Output

{
  "predictions": [
    { "label": "Easy", "score": 0.993 },
    { "label": "Hard", "score": 0.004 },
    { "label": "Medium", "score": 0.003 }
  ],
  "latencyMs": 68.9
}
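
To act on this, read the top label and map it to a model. A minimal sketch: classifyDifficulty is a hypothetical wrapper around your local session, the route targets are placeholders based on the table above, and InferenceResult is the interface defined under Common Output Structure below.

// Hypothetical wrapper around the local difficulty_classifier session.
declare function classifyDifficulty(prompt: string): Promise<InferenceResult>;

const ROUTES: Record<string, string> = {
  // Placeholder model IDs; substitute the real ones you call.
  Easy: "fast-cheap-model",
  Medium: "balanced-model",
  Hard: "most-capable-model",
};

async function pickModel(prompt: string): Promise<string> {
  const result = await classifyDifficulty(prompt);
  // Predictions arrive sorted by score, highest first.
  return ROUTES[result.predictions[0].label];
}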

Model Details

3 classes · 16K vocab · 512 max tokens · Quantized ONNX

Reasoning Effort

Model: reasoning_effort

Predicts how much reasoning a prompt requires. Use this to set the thinking budget on models that support it (e.g. Claude's extended thinking) and save tokens and latency on prompts that don't need deep reasoning.

Output Labels

Label    ID   Suggested Use
low      0    Minimal thinking budget, direct answer
medium   1    Moderate thinking, some reasoning steps
high     2    Full thinking budget, complex reasoning

Example Output

{
  "predictions": [
    { "label": "low", "score": 0.935 },
    { "label": "medium", "score": 0.062 },
    { "label": "high", "score": 0.003 }
  ],
  "latencyMs": 82.7
}
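
Mapped onto a thinking budget, that might look like the sketch below; classifyEffort is a hypothetical wrapper around the local session, and the token budgets are example values to tune.

// Hypothetical wrapper around the local reasoning_effort session.
declare function classifyEffort(prompt: string): Promise<InferenceResult>;

// Example budgets only; tune to your cost and latency targets.
const BUDGETS: Record<string, number> = { low: 0, medium: 4000, high: 16000 };

async function thinkingBudget(prompt: string): Promise<number> {
  const result = await classifyEffort(prompt);
  // On APIs with extended thinking, pass this as the thinking token budget
  // (or disable thinking entirely when it is 0).
  return BUDGETS[result.predictions[0].label];
}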

Model Details

3 classes · 16K vocab · 512 max tokens · Quantized ONNX

Code Classifier

Model: code_classifier

Detects the programming language of code snippets. Use this to inject language-specific system prompts, select the right linter or formatter context, or route to a model fine-tuned for that language.

Output Labels (30 languages)

Assembly · Batchfile · C · C# · C++ · CMake · CSS · Dockerfile · FORTRAN · GO · HTML · Haskell · Java · JavaScript · Julia · Lua · Makefile · Markdown · PHP · Perl · PowerShell · Python · Ruby · Rust · SQL · Scala · Shell · TeX · TypeScript · Visual Basic

Example Output

{
  "predictions": [
    { "label": "Python", "score": 0.994 },
    { "label": "TypeScript", "score": 0.006 },
    { "label": "CSS", "score": 0.000 }
  ],
  "latencyMs": 39.6
}
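
For example, a sketch that builds a language-specific system prompt fragment; classifyCode is a hypothetical wrapper around the local session, and the confidence threshold is a guess to tune.

// Hypothetical wrapper around the local code_classifier session.
declare function classifyCode(snippet: string): Promise<InferenceResult>;

async function languageContext(snippet: string): Promise<string> {
  const top = (await classifyCode(snippet)).predictions[0];
  // Skip injection when the classifier is unsure.
  if (top.score < 0.8) return "";
  return `The following code is ${top.label}. Follow idiomatic ${top.label} conventions.`;
}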

Model Details

30 classes · 16K vocab · 512 max tokens · Quantized ONNX

Autocomplete (SFT)

Model: sft

A small generative transformer that produces inline text completions. It generates tokens one at a time with configurable temperature, top-k sampling, and a minimum confidence threshold. Generation stops when the top token's probability drops below the minConfidence threshold or the EOS token is produced.

Output Format

Unlike the classifiers, this model streams tokens. Each token is emitted as it's generated. The output is raw text you append to the user's input as a suggestion.

{ "type": "token", "token": " calculates", "tokenId": 4821 }
{ "type": "token", "token": " the", "tokenId": 290 }
{ "type": "token", "token": " sum", "tokenId": 1564 }
{ "type": "done" }

Generation Parameters

Parameter           Default   Description
maxNewTokens        15        Max tokens to generate per request
temperature         0.1       Sampling temperature (lower = more deterministic)
topK                5         Top-k filtering for token sampling
minConfidence       0.55      Stop if top token probability drops below this
repetitionPenalty   1.5       Penalize repeated tokens
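
These map naturally onto an options object. A sketch under the assumption that generation accepts an options bag; the type and the override values are illustrations mirroring the table above.

interface GenerationOptions {
  maxNewTokens?: number;      // default 15
  temperature?: number;       // default 0.1
  topK?: number;              // default 5
  minConfidence?: number;     // default 0.55
  repetitionPenalty?: number; // default 1.5
}

// Example overrides for a code editor: longer, more deterministic completions.
const editorOpts: GenerationOptions = {
  maxNewTokens: 30,
  temperature: 0,
  minConfidence: 0.7,
};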

Model Details

8 layers · 6 heads · 384 dim · 16K vocab · Quantized ONNX

Common Output Structure

All classifier models (difficulty, reasoning effort, code) return the same structure. Predictions are sorted by score, highest first. Use the top prediction for routing or inspect multiple predictions for confidence-based logic.

interface InferenceResult {
  predictions: {
    label: string;   // The predicted class label
    score: number;   // Probability (0-1), all scores sum to 1
  }[];
  latencyMs: number; // Inference time in milliseconds
}

How to use this in your agent: Call the classifier before your LLM API request. Read predictions[0].label for the top prediction, then use it to select the model, set the thinking budget, or inject context. The entire classification runs locally in under 100ms.
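
Putting the pieces together, a pre-flight step might look like the sketch below; classify is a hypothetical helper over the local sessions, and the 0.6 fallback threshold is an example of the confidence-based logic mentioned above.

// Hypothetical helper that runs a named local model on a prompt.
declare function classify(model: string, prompt: string): Promise<InferenceResult>;

async function preflight(prompt: string) {
  const [difficulty, effort] = await Promise.all([
    classify("difficulty_classifier", prompt),
    classify("reasoning_effort", prompt),
  ]);

  const top = difficulty.predictions[0];
  // When the classifier is unsure, route up a tier rather than risk a weak answer.
  const tier = top.score >= 0.6 ? top.label : "Hard";

  return { tier, effort: effort.predictions[0].label };
}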
