Computer Vision & Vision Models

Deploy custom vision AI for image recognition, object detection, visual inspection, and video analytics tailored to your business needs.

  • open-weight
  • you own the weights
  • self-hostable
  • SFT + LoRA

  • Open: open-weight families. Access the leading open-weight models from the Qwen, Kimi, and GLM families, fine-tuned on your data.
  • ~30 days to first fine-tune. From your data to a model running in production, then improved from real usage.
  • Yours: weights + pipeline. You own the trained weights, adapters, and the retraining pipeline. Self-hostable.

02 / The catalog

Open-weight models, fine-tuned and yours

One place for the models worth building on. Access the leading open-weight families, tune them to your data, and keep the weights.

8 open-weight bases
  • Qwen3.7-7B-Instruct (Language): Fast, low-cost base for chat, extraction, and classification.
    Qwen · 7B · 128K ctx · open-weight
  • Qwen3.7-32B-Instruct (Language): Balanced accuracy and cost for most production fine-tunes.
    Qwen · 32B · 128K ctx · open-weight
  • Qwen3.7-72B-Instruct (Language): Frontier accuracy for the hardest reasoning tasks.
    Qwen · 72B · 128K ctx · open-weight
  • Qwen3.7-VL-7B (Vision): Reads images, scans, and document layouts.
    Qwen · 7B · 32K ctx · open-weight
  • Qwen3.7-VL-32B (Vision): Higher-fidelity visual understanding for inspection and OCR.
    Qwen · 32B · 32K ctx · open-weight
  • Kimi (Language): Very long context for whole-document and full-history reasoning.
    Moonshot · MoE · 256K ctx · open-weight
  • GLM (Language): Strong bilingual performance and tool use.
    Zhipu · 32B · 128K ctx · open-weight
  • GLM-V (Vision): Vision-language model for multimodal workflows.
    Zhipu · VLM · 64K ctx · open-weight

03 / Fine-tune

Configure a model, then watch it train

Set the shape of the build and run an illustrative fine-tune right here: the loss falls, the eval climbs, and the log streams. Every number is an estimate, not a promise. When it fits, book a build for that exact spec.

Recommended approach: LoRA adapter. A LoRA adapter trains fast and cheap, and you can swap it per task without retraining the base model.

step 60/60 · loss 0.611 · eval 0.81 · tok/s 1.8k

Compute band: ~1 to 2 GPU-h. Illustrative: params x examples x 3 epochs.
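
To make the "fast and cheap" claim for LoRA concrete, here is a minimal sketch of the parameter arithmetic for a single weight matrix. The 4096 x 4096 layer shape is a hypothetical example, not a figure from any model in the catalog; the rank of 16 matches `lora_r` in the sample config on this page. A real adapter spans many layers, so treat this as an order-of-magnitude illustration only.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA freezes that matrix and trains two thin factors instead:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical 4096 x 4096 projection; rank 16 mirrors lora_r in the sample config.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
fraction = lora / full

print(f"full fine-tune: {full:,} trainable parameters")
print(f"LoRA adapter:   {lora:,} trainable parameters ({fraction:.2%} of full)")
```

Under these assumptions the adapter trains under 1% of the parameters of the layer it wraps, which is also why adapters are small enough to keep one per task and swap at load time.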

train_config.yaml
base_model: Qwen3.7-7B-Instruct
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
sequence_len: 8192
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
datasets:
  - path: ./data/your-dataset.jsonl
    type: chat_template
val_set_size: 0.05
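
A few quantities follow directly from the values in this config. The sketch below derives them in plain Python. The dataset size of 1,000 examples and the single-GPU setting are assumptions for illustration, and the step count ignores sample packing and any trainer-specific rounding, so a real run may report different numbers.

```python
import math

# Values copied from train_config.yaml above.
config = {
    "lora_r": 16,
    "lora_alpha": 32,
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_epochs": 3,
    "val_set_size": 0.05,
}


def derive(config: dict, num_examples: int, num_gpus: int = 1) -> dict:
    """Derive batch size, LoRA scaling, and step counts from the config."""
    val_examples = round(num_examples * config["val_set_size"])
    train_examples = num_examples - val_examples
    # Examples consumed per optimizer update.
    effective_batch = (
        config["micro_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    return {
        "train_examples": train_examples,
        "val_examples": val_examples,
        "effective_batch": effective_batch,
        # Standard LoRA scaling factor applied to the low-rank update.
        "lora_scale": config["lora_alpha"] / config["lora_r"],
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * config["num_epochs"],
    }


# Assumed dataset size: 1,000 examples on one GPU (hypothetical).
summary = derive(config, num_examples=1000)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Raising `gradient_accumulation_steps` grows the effective batch without using more memory per step, which is the usual reason to keep `micro_batch_size` small at an 8192-token sequence length.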

Base: Qwen3.7-7B-Instruct. Open-weight, trained on your data, owned by you.

Book a build for this spec

04 / What it changes

What the build is designed to do

  1. Automate visual inspection with accuracy that exceeds human capabilities
  2. Monitor operations in real time with intelligent video analysis
  3. Build custom image recognition tailored to your specific products and environment
  4. Process thousands of images per hour without fatigue or inconsistency
  5. Enable 24/7 visual surveillance with automated anomaly detection
  6. Combine vision AI with other data sources for multimodal intelligence

07 / Proof

Computer Vision & Vision Models in the real world

Real builds where this service did the work. See the setup, the rollout, and the results.

08 / FAQs

Computer Vision & Vision Models questions

How is this different from your existing Computer Vision Solutions service?

Our Computer Vision Solutions service deploys complete vision-powered business solutions end to end. This Computer Vision & Vision Models service is specifically about building and training custom vision AI models tailored to your visual domain. Think of it as the model-building expertise behind the solutions. Clients come to this service when they need custom-trained models for novel use cases, detection fine-tuned to their specific products, or multimodal AI that combines vision with language.

How many training images do I need for a custom vision model?

Thanks to modern transfer learning and foundation models, you need far fewer images than you might expect. For many object detection and classification tasks, 200 to 1,000 labeled images per category produce strong results. For more nuanced tasks like defect detection with subtle visual differences, 500 to 2,000 examples may be needed. We assist with data collection strategies, annotation, and augmentation techniques that maximize model performance from limited data. A pilot with sample data helps us establish exact requirements for your use case.
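
As a rough way to check a labeled set against the lower bound mentioned above, the sketch below counts images per category and reports how many more each under-filled category would need. The category names and counts are made up for illustration, and the 200-image minimum is only the low end of the range quoted in this answer, not a guarantee of model quality.

```python
from collections import Counter


def underfilled_categories(labels: list[str], minimum: int = 200) -> dict[str, int]:
    """Map each category with fewer than `minimum` labels to its shortfall."""
    counts = Counter(labels)
    return {
        name: minimum - count
        for name, count in sorted(counts.items())
        if count < minimum
    }


# Hypothetical label list: one entry per annotated image.
labels = ["scratch"] * 250 + ["dent"] * 120 + ["ok"] * 900

shortfall = underfilled_categories(labels)
for name, missing in shortfall.items():
    print(f"{name}: collect or augment about {missing} more images")
```

A check like this is typically the first step of a pilot: it shows which categories need more collection or augmentation before any training is run.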

Can vision models work in challenging lighting or environmental conditions?

Yes. We train and test models under the real-world conditions they will encounter in your environment. This includes varying lighting, camera angles, weather effects for outdoor applications, motion blur from moving conveyor belts, and occlusion from overlapping objects. We use data augmentation during training to make models robust to these variations, and we can recommend environmental adjustments, such as supplemental lighting, that improve performance cost-effectively.
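
To illustrate what augmentation for lighting and viewpoint variation means in practice, here is a minimal dependency-free sketch that operates on a tiny grayscale image stored as nested lists. The brightness range of 0.6 to 1.4 and the 50% flip probability are assumptions chosen for the example; a production pipeline would use an image library and a broader set of transforms such as blur and occlusion.

```python
import random

Image = list[list[int]]  # grayscale pixels in the range 0..255


def adjust_brightness(image: Image, factor: float) -> Image:
    """Scale every pixel by `factor`, clamping the result to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image: Image, rng: random.Random) -> Image:
    """Return one randomly perturbed copy of `image`."""
    out = adjust_brightness(image, rng.uniform(0.6, 1.4))  # simulated lighting change
    if rng.random() < 0.5:  # simulated change in camera orientation
        out = horizontal_flip(out)
    return out


image = [[10, 100, 200], [50, 150, 250]]
rng = random.Random(0)  # fixed seed so runs are repeatable
variants = [augment(image, rng) for _ in range(4)]

for variant in variants:
    print(variant)
```

Each variant keeps the original label, so a small labeled set yields many training examples that differ in the ways the deployed camera feed is expected to differ.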

Do vision models require powerful hardware to run?

Hardware requirements depend on the application. Many vision models run efficiently on affordable edge devices like NVIDIA Jetson, or even on optimized mobile processors, processing video feeds in real time at low power consumption. More demanding applications, like analyzing multiple high-resolution camera feeds simultaneously, may require dedicated GPU servers. We optimize model architectures for your deployment target, balancing accuracy against computational efficiency so the model runs reliably on hardware that fits your budget.
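
The sizing question above comes down to simple arithmetic: frames arriving per second versus frames a device can process per second. The sketch below works through one case. Every number in it (4 cameras, 15 frames per second, 25 ms per inference) is a hypothetical placeholder, not a benchmark of any specific device or model; measured latency on the target hardware should replace it.

```python
import math


def required_fps(cameras: int, fps_per_camera: float) -> float:
    """Total frames per second the deployment must process."""
    return cameras * fps_per_camera


def device_capacity_fps(latency_ms: float, batch_size: int = 1) -> float:
    """Frames per second one device sustains at the given per-batch latency."""
    return batch_size * 1000 / latency_ms


def devices_needed(cameras: int, fps_per_camera: float, latency_ms: float) -> int:
    """Smallest number of devices whose combined capacity covers the load."""
    load = required_fps(cameras, fps_per_camera)
    return math.ceil(load / device_capacity_fps(latency_ms))


# Hypothetical deployment: 4 cameras at 15 fps, 25 ms per inference.
load = required_fps(cameras=4, fps_per_camera=15)
capacity = device_capacity_fps(latency_ms=25)
count = devices_needed(cameras=4, fps_per_camera=15, latency_ms=25)

print(f"load: {load} fps, capacity per device: {capacity} fps, devices: {count}")
```

If the count comes out higher than the budget allows, the usual levers are a smaller model, a lower frame rate per camera, or batching frames, each of which trades some accuracy or latency for throughput.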

Turn Computer Vision & Vision Models into something your team actually uses.

Name the work you want this to handle. We will map the build, show what is worth doing first, and what it costs. If there is no fit, we will say so.