Glossary · April 23, 2026 · By IncoreSoft Team

AI Inference

AI inference is the process of running a trained machine learning model on new input data to produce a prediction — in video analytics, that means running a face recognition, object detection, or other model on a live camera frame to detect what's in it.

How It Works

Inference is the production-time counterpart to training:

  1. A model is trained once on a large dataset, producing a set of learned weights.
  2. The weights are exported to an interchange format such as ONNX, or compiled for an optimized inference runtime like TensorRT or OpenVINO.
  3. At runtime, each new frame from a camera is passed through the model, which outputs predictions (bounding boxes, identities, alerts).

Unlike training — which needs huge compute and can take days — inference must be fast, often under 50 ms per frame, and runs continuously 24/7.
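
As a rough illustration of steps 2 and 3, the sketch below loads an exported model with ONNX Runtime and runs it on frames from a live camera via OpenCV. The file name detector.onnx, the 640×640 input size, and the NCHW layout are assumptions made for the sketch, not properties of any particular VEZHA model.

```python
import cv2
import numpy as np
import onnxruntime as ort

# Step 2 output: weights exported to ONNX, loaded into an optimized runtime session.
# "detector.onnx" is a placeholder file name for this sketch.
session = ort.InferenceSession("detector.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

cap = cv2.VideoCapture(0)  # live camera stream (device 0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Preprocess: resize to the assumed 640x640 input, scale to [0, 1], HWC -> NCHW.
    blob = cv2.resize(frame, (640, 640)).astype(np.float32) / 255.0
    blob = np.transpose(blob, (2, 0, 1))[np.newaxis, ...]
    # Step 3: one forward pass per frame; the raw outputs (boxes, scores, classes)
    # are what downstream logic turns into detections and alerts.
    outputs = session.run(None, {input_name: blob})
cap.release()
```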

Why It Matters

Inference performance determines whether an AI system is practical:

  • Latency — real-time use cases (gun detection, fire alerts) require sub-second response.
  • Throughput — one server often handles 20–100 camera streams simultaneously (see the back-of-envelope sketch after this list).
  • Cost — efficient inference reduces GPU requirements and energy consumption.
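
To put the throughput figure in perspective, here is a back-of-envelope sketch; the per-frame time and analyzed frame rate below are illustrative assumptions, not measured VEZHA figures.

```python
# Illustrative capacity estimate: how many streams one server keeps up with
# if each frame takes `per_frame_ms` and each camera is analyzed at `analyzed_fps`.
per_frame_ms = 10      # assumed average inference time per frame
analyzed_fps = 5       # assumed frames per second sent to the model per camera

capacity_fps = 1000 / per_frame_ms         # 100 frames per second of compute
max_streams = capacity_fps / analyzed_fps  # ~20 streams
print(f"~{max_streams:.0f} streams at {analyzed_fps} FPS each")
```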

IncoreSoft's VEZHA platform is engineered for fast inference, with sub-50 ms latency on its face recognition and ALPR modules, running across edge, cloud, and VMS-integrated deployments.

Use Cases

  • Real-time alerts — weapon, fire, or fall detection triggering instant notifications
  • Live access control — face recognition unlocking doors in under a second
  • High-density video — running analytics on hundreds of simultaneous streams
  • Edge deployment — inference on cameras or nearby appliances for low latency

Frequently Asked Questions

What's the difference between training and inference?

Training creates the model from labeled data — a compute-heavy, one-time process. Inference applies the trained model to new data in production — lightweight and continuous.

Does inference require a GPU?

For complex models at high frame rates, yes. For smaller models or lower frame rates, modern CPUs and edge accelerators (NPUs, VPUs) handle inference efficiently.
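
As a small example of how that choice shows up in practice, ONNX Runtime lets you pick an execution provider when the session is created; the sketch below falls back to CPU when no CUDA-capable build is installed (detector.onnx is again a placeholder file name).

```python
import onnxruntime as ort

# Use the GPU execution provider when the installed build supports it,
# otherwise fall back to plain CPU inference.
available = ort.get_available_providers()
providers = (["CUDAExecutionProvider"]
             if "CUDAExecutionProvider" in available
             else ["CPUExecutionProvider"])
session = ort.InferenceSession("detector.onnx", providers=providers)
print("Running on:", session.get_providers())
```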

How do you reduce inference latency?

Techniques include model quantization (reducing precision from 32-bit floating point to 8-bit integer), pruning (removing redundant parameters), distillation (training a smaller model to mimic a larger one), and specialized runtimes such as TensorRT.
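
A minimal sketch of one of these techniques, dynamic quantization in PyTorch, applied to a toy stand-in model; the layer sizes are arbitrary, and a real convolutional detector would typically need calibration-based static quantization instead.

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model; dynamic quantization converts its Linear layers
# from 32-bit floats to 8-bit integers, shrinking the model and speeding up inference.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape from the smaller, faster model
```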

