MLOps Engineer - AWS-Focused ML Infrastructure

Keysight Technologies

Software Engineering, Other Engineering, Data Science
Singapore
Posted on Jan 7, 2026
Overview

Keysight is at the forefront of technology innovation, delivering breakthroughs and trusted insights in electronic design, simulation, prototyping, test, manufacturing, and optimization. Our ~15,000 employees create world-class solutions in communications, 5G, automotive, energy, quantum, aerospace, defense, and semiconductor markets for customers in over 100 countries.

Our award-winning culture embraces a bold vision of where technology can take us and a passion for tackling challenging problems with industry-first solutions. We believe that when people feel a sense of belonging, they can be more creative and innovative, and thrive at all points in their careers.


Responsibilities

We are expanding our engineering team with a dedicated MLOps Engineer specializing in AWS to support the deployment, scaling, and operationalization of machine learning solutions across our manufacturing and semiconductor analytics platforms. This role will serve as a critical bridge between our Machine Learning Engineers—focused on Generative AI and classical ML—and production environments, ensuring seamless, reliable, and efficient ML workflows.

You will collaborate closely with the Senior Machine Learning Engineer (GenAI Platform) and the Machine Learning Engineer (Classical ML and Predictive Analytics) to automate pipelines, monitor model performance, and manage infrastructure for high-stakes applications like test plan generation, anomaly detection, predictive maintenance, and market intelligence. In our AWS-centric ecosystem, you will leverage best-in-class tools to enable rapid iteration while maintaining compliance, security, and cost efficiency in regulated industrial settings.

This position is ideal for a mid-level professional who is passionate about DevOps in ML contexts and excels at turning complex models into robust, production-ready systems.

  • Design, implement, and maintain end-to-end MLOps pipelines on AWS, including CI/CD automation for model training, validation, deployment, and retraining, using services like SageMaker, CodePipeline, CodeBuild, and Step Functions (a minimal deployment sketch follows this list).
  • Support the Generative AI platform by operationalizing AWS Bedrock workflows, including RAG pipelines, vector databases (e.g., via OpenSearch or Pinecone integrations), Lambda functions, and agentic systems—ensuring scalability for large-scale data processing like historical test plans and news article summarization.
  • Enable classical ML initiatives by deploying and monitoring models built with XGBoost, Scikit-learn, and NLP architectures (e.g., RNNs/LSTMs) on AWS infrastructure, incorporating drift detection for anomaly tracking in sensor data and competitor pricing monitoring.
  • Manage infrastructure as code (IaC) using Terraform or CloudFormation to provision and optimize AWS resources, such as EC2 instances, S3 buckets, EMR for Apache Spark-based processing (supporting our PMA product), and ECS/EKS for containerized deployments.
  • Implement comprehensive monitoring, logging, and alerting systems with CloudWatch, X-Ray, and third-party tools (e.g., Prometheus/Grafana integrations) to track model performance, detect anomalies, handle concept drift, and ensure high availability for customer-facing tools like Q&A chatbots and predictive maintenance advisors (see the alarm sketch after this list).
  • Collaborate in an Agile environment with ML engineers, data scientists, and SRE teams to conduct A/B testing, version models, automate rollbacks, and optimize costs through auto-scaling and spot instances.
  • Enforce security and compliance best practices, including IAM roles, VPC configurations, data encryption, and audit logging, to safeguard sensitive manufacturing data and meet industry standards.
  • Troubleshoot production issues, perform root-cause analysis, and drive continuous improvements in ML operations, staying ahead of AWS innovations to enhance platform reliability and efficiency.
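
To make the pipeline work above concrete, the sketch below shows in rough outline how a validated model artifact might be promoted to a real-time SageMaker endpoint with Boto3. It is a minimal illustration under assumed conventions, not a prescribed workflow; all names (the model, the ECR image URI, the S3 path, the IAM role, the endpoint) are hypothetical placeholders.

    import boto3

    sm = boto3.client("sagemaker")

    # Register the model: a hypothetical custom inference image in ECR
    # plus a hypothetical model artifact in S3.
    sm.create_model(
        ModelName="demo-model-v1",
        PrimaryContainer={
            "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-inference:1.0",
            "ModelDataUrl": "s3://demo-ml-artifacts/models/demo/model.tar.gz",
        },
        ExecutionRoleArn="arn:aws:iam::123456789012:role/DemoSageMakerExecutionRole",
    )

    # Endpoint config: starting at two instances supports the multi-AZ
    # availability expectation noted later in this posting.
    sm.create_endpoint_config(
        EndpointConfigName="demo-endpoint-config-v1",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": "demo-model-v1",
                "InstanceType": "ml.g5.xlarge",
                "InitialInstanceCount": 2,
            }
        ],
    )

    # Create the real-time endpoint; in a CI/CD pipeline this step would
    # typically be gated by validation and followed by safe updates to an
    # existing endpoint rather than a fresh create.
    sm.create_endpoint(
        EndpointName="demo-endpoint",
        EndpointConfigName="demo-endpoint-config-v1",
    )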
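
In the same spirit, endpoint monitoring can start with a plain CloudWatch alarm. The sketch below, again with hypothetical names, alarms on p90 model latency for the variant created above; a production setup would add invocation-error and Model Monitor data-quality alarms alongside it.

    import boto3

    cw = boto3.client("cloudwatch")

    # Alarm when p90 ModelLatency exceeds 500 ms (the AWS/SageMaker metric
    # is reported in microseconds) for five consecutive one-minute periods.
    cw.put_metric_alarm(
        AlarmName="demo-endpoint-latency-p90",
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": "demo-endpoint"},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        ExtendedStatistic="p90",
        Period=60,
        EvaluationPeriods=5,
        Threshold=500000.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:demo-ml-alerts"],
    )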

Qualifications

Must-have qualifications

  • Bachelor's or Master's degree in Computer Science, Engineering, Information Systems, or a related technical field.
  • 3–5 years of experience in MLOps, DevOps, or cloud engineering roles, with a proven track record of deploying and managing ML models in production environments.
  • Deep expertise in AWS services for ML and data workflows, including SageMaker (real-time endpoints, inference components, multi-instance/multi-variant deployments), Bedrock (provisioned throughput, cross-Region inference profiles for scaling & resilience), EMR (for Spark-based PMA workloads), Lambda, S3, ECR, and orchestration tools like Step Functions or Airflow.
  • Proven experience with Amazon Elastic Container Registry (ECR): building, scanning for vulnerabilities, tagging, versioning, and pushing custom Docker images for inference containers (including Bring-Your-Own-Container patterns for custom ML frameworks, vLLM, or deep learning environments); managing ECR lifecycle policies, replication across regions, and secure access via IAM roles.
  • Strong proficiency in EC2-based ML deployments and infrastructure: selecting optimal instance types (e.g., ml.g family for GPU-heavy GenAI inference, g5/g6 for newer accelerators), configuring Auto Scaling Groups, managing spot instances for cost optimization, and handling EC2 fleets for custom hosting when SageMaker/Bedrock abstractions are insufficient.
  • Expertise in load balancing & scaling for ML inference: configuring and troubleshooting Application Load Balancers (ALB) or Network Load Balancers (NLB) integrated with SageMaker endpoints or ECS/EKS tasks; implementing SageMaker's built-in routing strategies (e.g., least outstanding requests for latency optimization); setting up auto-scaling policies (target tracking on CPU utilization, invocations per instance, or custom CloudWatch metrics); using cross-Region inference profiles in Bedrock for burst handling and global resilience; and ensuring high availability through multi-AZ deployments with minimum instance counts ≥2 (a minimal auto-scaling sketch follows this list).
  • Demonstrated ability to resolve common deployment issues in production ML environments, including: cold-start latency in serverless/containerized inference, container pull failures from ECR, IAM permission misconfigurations causing access denied errors, model artifact corruption or version mismatches post-deployment, endpoint update failures without downtime (using blue/green or canary strategies), drift/throttling in high-concurrency scenarios (e.g., 429 errors in Bedrock), unhealthy instance recovery, and debugging via CloudWatch Logs, X-Ray traces, and SageMaker Model Monitor alerts.
  • Proficiency in IaC tools such as Terraform or CloudFormation to provision and optimize AWS resources (e.g., ECR repositories, EC2 fleets, ALBs, SageMaker endpoints, and auto-scaling configurations) in a repeatable, auditable manner.
  • Strong scripting and programming skills in Python (with libraries like Boto3), along with experience in CI/CD pipelines using Jenkins, GitHub Actions, or AWS CodePipeline — with specific focus on automated ECR image builds, model artifact promotion, and safe endpoint updates.
  • Familiarity with monitoring and observability stacks (e.g., CloudWatch, ELK Stack) and ML-specific tools for versioning (e.g., MLflow) and experiment tracking.
  • Experience in Agile methodologies, with hands-on participation in sprints, code reviews, and cross-functional problem-solving.
  • Solid understanding of ML concepts, including model drift, bias detection, and serving patterns, to effectively support both GenAI and classical ML teams.
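
As one concrete illustration of the scaling expectations above: registering a SageMaker variant with Application Auto Scaling and attaching a target-tracking policy on invocations per instance might look roughly like the sketch below. Endpoint and policy names are hypothetical, and the target value would be tuned per workload.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Make the endpoint variant scalable; MinCapacity=2 echoes the
    # multi-AZ, minimum-two-instances availability requirement.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId="endpoint/demo-endpoint/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=2,
        MaxCapacity=6,
    )

    # Target tracking on the predefined invocations-per-instance metric:
    # scale out quickly on bursts, scale in cautiously.
    autoscaling.put_scaling_policy(
        PolicyName="demo-invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId="endpoint/demo-endpoint/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    )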

Strongly preferred

  • Prior exposure to manufacturing, semiconductor, or industrial IoT domains, where data reliability and low-latency inference are critical.
  • Certifications such as AWS Certified Machine Learning – Specialty, AWS Certified DevOps Engineer, or equivalent.
  • Experience with hybrid ML setups, integrating on-premises data with cloud services, or handling large-scale NLP/Numerical data pipelines.
  • Knowledge of security frameworks like SOC 2 or ISO 27001, and tools for automated testing of ML infrastructure.
  • Prior experience troubleshooting and optimizing SageMaker multi-instance/multi-variant endpoints (including traffic shifting, shadow testing, and A/B deployments) and Bedrock inference profiles (Priority/Flex tiers, cross-Region routing for throughput and cost balancing).
  • Hands-on work with EC2 Auto Scaling in ML contexts, including handling GPU instance availability constraints, spot interruption recovery, and cost-effective scaling for bursty inference workloads.
  • Familiarity with advanced deployment patterns such as blue/green deployments, canary rollouts, and rollback automation to minimize production impact during model updates (a minimal blue/green update sketch follows this list).
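
For reference, the deployment patterns in the last bullet map onto SageMaker's built-in blue/green update support. The sketch below, with hypothetical names once more, shifts 10% of capacity to a new endpoint config as a canary and rolls back automatically if a CloudWatch alarm fires during the update.

    import boto3

    sm = boto3.client("sagemaker")

    # Blue/green update with a 10% canary: SageMaker provisions the new
    # fleet, shifts a slice of traffic, waits, then completes the shift
    # or rolls back.
    sm.update_endpoint(
        EndpointName="demo-endpoint",
        EndpointConfigName="demo-endpoint-config-v2",
        DeploymentConfig={
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": {
                    "Type": "CANARY",
                    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                    "WaitIntervalInSeconds": 600,
                },
                "TerminationWaitInSeconds": 300,
                "MaximumExecutionTimeoutInSeconds": 3600,
            },
            # Roll back automatically if the hypothetical latency alarm
            # defined earlier fires during the rollout.
            "AutoRollbackConfiguration": {
                "Alarms": [{"AlarmName": "demo-endpoint-latency-p90"}]
            },
        },
    )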

If you are a pragmatic, AWS-savvy engineer excited about operationalizing cutting-edge ML in mission-critical industries, this role offers the opportunity to build resilient systems that directly impact our company's innovation and customer outcomes. Join a dynamic team committed to excellence, with ample room for growth and technical leadership.

Keysight is an Equal Opportunity Employer.