AI/ML engineering
Move models from notebook to production. Engineers who own the full pipeline.
A fine-tuned model that never leaves a notebook is an expensive experiment, not a product feature. We staff engineers who version data and training with DVC, promote models through gated registries, serve inference with batching and autoscaling, and watch production inputs for drift. You get reproducible training, controlled promotion, and inference costs that finance can reason about.
Scope your ML roadmap

    import torch
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
    from peft import LoraConfig, get_peft_model
    import mlflow

    # MODEL_ID and dataset are assumed to be defined elsewhere.
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16),
        LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
    )

    args = TrainingArguments(
        output_dir="checkpoints/invoice-llm",
        per_device_train_batch_size=4,
        bf16=True,
    )

    with mlflow.start_run(run_name="invoice-llm-v3"):
        trainer = Trainer(model=model, args=args, train_dataset=dataset)
        # Resumes from the latest checkpoint in output_dir; one must already exist.
        trainer.train(resume_from_checkpoint=True)
        # best_metric is only set when evaluation and metric_for_best_model are configured.
        mlflow.log_metrics({"eval_loss": trainer.state.best_metric})

Core stack
- PyTorch & training
- DVC & MLOps pipelines
- MLflow & Weights & Biases
- Triton & FastAPI serving
- Hugging Face & LLMs
- RAG & retrieval
5+
Average years in applied ML
Engineers who've shipped models, not just Kaggle notebooks or coursework projects.
Deep-Dive Tech Stack
Production ML needs the same rigor from training through serving and monitoring. We match on the MLOps stack you run so experiments, registries, and endpoints stay linked when models and data change after launch.
PyTorch & training
Custom training, distributed jobs on multi-GPU nodes, and export to ONNX or TorchScript. We guard against two common failures: train-serve skew from preprocessing drift, handled by freezing preprocessing pipelines and versioning them alongside the weights, and checkpoints corrupted by preempted jobs, handled by idempotent, resumable training runs.
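One way to keep a preempted job from corrupting a checkpoint is to write to a temporary file and rename it into place. The sketch below is an illustration under stated assumptions, not a description of any specific engagement: `save_checkpoint_atomic`, `load_checkpoint`, and `run_training` are hypothetical names, and JSON stands in for real model weights.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        # Readers see either the old checkpoint or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the last complete checkpoint, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def run_training(path: str, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Resume from `path` and run until `total_steps`; `stop_after` simulates preemption."""
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break
        state["step"] += 1  # placeholder for one real optimizer step
        save_checkpoint_atomic(state, path)
        steps_this_run += 1
    return state
```

Running `run_training` twice against the same path continues from the last saved step instead of restarting, which is the idempotent behavior described above.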
DVC & MLOps pipelines
Data and model versioning with pipeline DAGs tying datasets to configs to artifacts. Pinned dependencies, hashed datasets, and CI that fails on pipeline drift replace "works on my laptop" with audit-ready reproducibility.
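The "CI that fails on pipeline drift" idea can be sketched as a hash comparison. This is a simplified stand-in rather than DVC's own mechanism, which records content hashes in its lock files; `sha256_of_file`, `find_drift`, and the `pinned` mapping are hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(pinned: Dict[str, str], root: Path) -> List[str]:
    """Return relative paths that are missing or whose hash differs from the pinned one."""
    drifted = []
    for rel_path, expected in sorted(pinned.items()):
        target = root / rel_path
        if not target.exists() or sha256_of_file(target) != expected:
            drifted.append(rel_path)
    return drifted
```

A CI step could call `find_drift` and exit non-zero when the returned list is not empty.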
MLflow & Weights & Biases
Experiment tracking, hyperparameter sweeps, and registry workflows with approval gates before production. Runs are compared on business metrics, and each promoted config is traceable so rollback is a registry pointer change, not a frantic retrain.
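To make "rollback is a registry pointer change" concrete, here is a small in-memory sketch. It is not MLflow's or W&B's API; `ModelRegistry` and the artifact URIs in the usage note are hypothetical, and a real registry would persist this state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ModelRegistry:
    artifacts: Dict[int, str] = field(default_factory=dict)  # version -> artifact URI
    approved: Set[int] = field(default_factory=set)          # versions past the gate
    promotions: List[int] = field(default_factory=list)      # promotion order, latest last

    def register(self, version: int, artifact_uri: str) -> None:
        self.artifacts[version] = artifact_uri

    def approve(self, version: int) -> None:
        if version not in self.artifacts:
            raise KeyError(f"unknown version {version}")
        self.approved.add(version)

    def promote(self, version: int) -> None:
        # The approval gate: unapproved versions cannot reach production.
        if version not in self.approved:
            raise PermissionError(f"version {version} has not passed the approval gate")
        self.promotions.append(version)

    def rollback(self) -> int:
        # Rollback only moves the pointer; no artifact is rebuilt or retrained.
        if len(self.promotions) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.promotions.pop()
        return self.promotions[-1]

    @property
    def production_uri(self) -> str:
        return self.artifacts[self.promotions[-1]]
```

After promoting versions 1 and 2 in order, `rollback()` returns 1 and `production_uri` points at the version 1 artifact again.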
Triton & FastAPI serving
NVIDIA Triton for dynamic batching on GPUs, or FastAPI for CPU models and LLM endpoints. We tune concurrency, warm-up, and autoscaling so that batching and right-sized instances replace always-on, oversized GPUs, which lowers both p99 latency and inference cost.
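The batching policy behind that trade-off can be simulated in a few lines. This is a toy model of the idea, not Triton's scheduler, which is configured per model rather than written in application code; `form_batches` and its parameters are hypothetical names.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch_size: int, max_delay_ms: float) -> List[List[int]]:
    """Group request indices into batches.

    A batch closes when it is full, or when the next request arrives after
    the deadline set by the oldest request in the batch.
    `arrivals_ms` must be sorted in ascending order.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    deadline = 0.0
    for index, arrival in enumerate(arrivals_ms):
        if current and (len(current) >= max_batch_size or arrival > deadline):
            batches.append(current)
            current = []
        if not current:
            # The first request in a batch sets how long the batch may wait.
            deadline = arrival + max_delay_ms
        current.append(index)
    if current:
        batches.append(current)
    return batches
```

A larger `max_delay_ms` fills batches more fully at the price of queueing time, which is the knob that trades throughput against tail latency.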
Hugging Face & LLMs
Transformer fine-tuning with LoRA or QLoRA, hardened tokenizer pipelines, and eval harnesses for hallucination and safety regressions. Our engineers plan for context limits, token cost at scale, and the point at which retrieval beats a larger model.
RAG & retrieval
LangChain or LlamaIndex pipelines with chunking, embedding selection, and retrieval evaluation tied to answer quality. Prompts and index schemas are versioned so a bad re-embed does not silently degrade production answers.
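Retrieval evaluation usually starts with a labeled set of queries and the chunks that should come back for each. The sketch below shows two common metrics in plain Python; it does not use LangChain or LlamaIndex, and the function names and chunk IDs are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[List[str], Set[str]]], k: int) -> Dict[str, float]:
    """Average both metrics over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs),
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs),
    }
```

Running the same evaluation before and after a re-embed gives a number to compare, so a degraded index is caught before it reaches production answers.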
Production monitoring
Drift detection on input features, latency SLOs, and business KPIs linked to model versions. Shadow deployments and canary routes limit blast radius when a new model underperforms after promotion.
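A canary route can be as simple as a deterministic hash of the request ID. The sketch below assumes each request carries a stable identifier; `route` and the `"canary"` and `"stable"` labels are hypothetical names for illustration.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` of requests to the canary model.

    Hashing keeps the decision stable: the same request ID always lands
    on the same side, which makes incidents easier to reproduce.
    """
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Setting `canary_percent` to 0 turns the canary off without a redeploy, which is what keeps the blast radius small when a new model underperforms.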
Feast & feature stores
Online and offline feature consistency for training and inference, point-in-time correct joins, and versioned feature definitions. Train-serve skew from ad hoc SQL in notebooks drops when serving reads the same feature view the model was trained on.
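A point-in-time correct join means each training row only sees feature values that existed at the moment of the event. The sketch below shows the lookup with the standard library; it is not Feast's implementation, and `value_as_of` and `build_training_rows` are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_as_of(history: List[Tuple[int, float]], event_ts: int) -> Optional[float]:
    """Return the latest feature value recorded at or before `event_ts`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, which avoids leaking the future.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(
    events: List[Tuple[int, int]], history: List[Tuple[int, float]]
) -> List[Tuple[int, int, Optional[float]]]:
    """Attach the as-of feature value to each (timestamp, label) event."""
    return [(ts, label, value_as_of(history, ts)) for ts, label in events]
```

Joining on the latest value instead of the as-of value is the usual source of leakage in ad hoc notebook SQL.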
ONNX & model export
Export paths from PyTorch or scikit-learn to ONNX for CPU-optimized inference and cross-runtime deployment. Quantization and graph optimization reduce latency and cost when a GPU is unnecessary for the model size and traffic profile.
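The arithmetic behind quantization is small enough to show directly. This is a minimal sketch of symmetric int8 quantization on a flat list of weights, assuming a single scale for the whole tensor; production toolchains typically add calibration data and per-channel scales. `quantize_int8` and `dequantize` are hypothetical names.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the error per weight is at most scale / 2."""
    return [q * scale for q in quantized]
```

Each weight shrinks from 32 bits to 8, and the rounding error is bounded by half the scale, which is why the accuracy impact has to be checked against an eval set rather than assumed.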
Metrics ML leads actually track
- Average years in applied ML: 5+
  Engineers who've shipped models, not just Kaggle notebooks or coursework projects.
- Inference cost reduction potential: 60%+
  Through quantization, batching, and right-sized GPU instances on past engagements.
- Fine-tune to staging deployment: 2–4 wks
  For scoped LLM or classical ML projects with clear evaluation criteria.
- Reproducible experiment tracking: 100%
  Every training run logged with data version, seed, and config. Audit-ready from day one.
ML staffing: no hype, just process
How do you handle time-zone crossovers?
Training jobs run async; sync time covers standups, eval reviews, and deployment windows. We block 3–4 hours of overlap with your product and platform teams so decisions don't stall waiting for someone to wake up.
Do your engineers fine-tune models on our data?
Yes, in your environment or a dedicated tenant you control. Data stays under your policies. We sign NDAs and follow your data handling requirements before any access is granted.
What is your code review process for ML code?
Reviews cover reproducibility (seeds, data hashes), eval methodology, and inference safety. We catch data leakage in splits and silent metric regressions before merge, not after a bad deploy.
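One leakage pattern reviewers look for is the same customer or document appearing in both train and test. A minimal sketch of a group-aware, deterministic split follows, assuming rows are dictionaries with a grouping key; `assign_split`, `split_rows`, `leaked_groups`, and the `salt` value are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def assign_split(group_id: str, test_percent: int = 20, salt: str = "split-v1") -> str:
    """Assign a whole group to train or test based on a salted hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{group_id}".encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_percent else "train"


def split_rows(rows: List[Dict], group_key: str, test_percent: int = 20) -> Tuple[List[Dict], List[Dict]]:
    """Split rows so that every row of a given group lands on the same side."""
    train, test = [], []
    for row in rows:
        side = assign_split(str(row[group_key]), test_percent)
        (test if side == "test" else train).append(row)
    return train, test


def leaked_groups(train: List[Dict], test: List[Dict], group_key: str) -> Set[str]:
    """Return group IDs present on both sides; a review check expects an empty set."""
    return {str(r[group_key]) for r in train} & {str(r[group_key]) for r in test}
```

Because the assignment depends only on the group ID and the salt, the split is reproducible without storing a random seed alongside the data.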
Can you integrate with our existing MLOps stack?
We work inside your MLflow, W&B, SageMaker, or Vertex setup. We don't require you to migrate to a proprietary platform before we staff engineers.
How do you handle model drift in production?
We set up monitoring on input distributions, latency, and business KPIs, not just accuracy on a static holdout set. Alert thresholds and retrain triggers are documented upfront.
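For input distributions, one widely used drift score is the Population Stability Index (PSI). The sketch below is a plain-Python illustration for a single numeric feature, assuming equal-width bins taken from the training sample; the 0.25 default threshold is a commonly cited rule of thumb, not a value from any specific engagement, and `psi` and `should_alert` are hypothetical names.

```python
import math
from typing import List


def psi(expected: List[float], actual: List[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant feature: every value falls into one bin

    def fractions(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_alert(score: float, threshold: float = 0.25) -> bool:
    """Compare a drift score against the documented alert threshold."""
    return score >= threshold
```

Documenting the threshold next to the retrain trigger means the alert and the response are agreed before the first incident, not during it.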