Note: This is a remote position open to candidates in the USA. Flexential is a company focused on building and operating IT platforms, and it is seeking a Senior Platform Engineer to join its platform development team. This role involves hands-on engineering responsibilities for developing and managing critical platform subsystems, ensuring high availability and operational resiliency while utilizing native-AI capabilities.
Responsibilities
- Design, develop and operationally manage automated, resilient, highly available, self-healing, secure platforms with native-AI capabilities for IT needs, serving both internal and customer business capabilities
- Develop and manage the Observability OpenTelemetry Central Backend Stack: Grafana Enterprise, Mimir, Loki, Tempo, and Alertmanager on Kubernetes/RKE2 via Helm and GitLab CI/CD
- Build and manage IaC and CI/CD for automated provisioning and deployment, including Terraform modules for Infra/VM/storage provisioning, Ansible AWX playbooks for OS/App bootstrap, ArgoCD and Helm for Kubernetes configuration
- Develop and manage OpenTelemetry Prometheus scrape profile library including SNMP exporters, REST API exporters, and cloud provider exporters (CloudWatch, Azure Monitor, GCP) for multiple device classes
- Develop AIOps capabilities on platforms, e.g., for observability use cases: anomaly detection integrations, event correlation rules in Alertmanager, and synthetic monitoring patterns to reduce alert noise
- Configure and maintain Zabbix auto-discovery: network range scanning, device classification, and Prometheus service discovery integration
- Build and harden Edge Stack deployments (Prometheus + OTel collector) per data center site using GitOps templates
- Integrate Alertmanager with ServiceNow: webhook routing, ticket enrichment, auto-close logic, and escalation policy configuration
- Maintain platform security: Conjur/CyberArk secret injection at runtime, mTLS between stack components, RBAC in Grafana Enterprise
- Author and maintain Grafana dashboards in JSON/GitLab: facility overview, network health, RED metrics, application telemetry
- Mentor mid-level engineers, lead code reviews, and establish engineering standards for the team. Represent platform engineering in cross-functional architecture reviews and executive-level program updates
- Perform other duties as required and assigned
Skills
- DevOps / Automation - 5+ years in a production environment, Kubernetes (RKE2/k3s), Helm chart deployment, system services, Docker/containers
- LGTM Stack Development and Configuration - 4+ years: Grafana, Mimir, Loki, Tempo configuration, tuning, dashboarding and production operations; Prometheus required
- Senior-level Python / Scripting frameworks - 5+ years, Automation scripts, exporter development, GitLab pipeline scripting, REST API integrations
- GitOps / CI/CD - 5+ years, GitLab CI/CD pipeline authoring; Terraform and Ansible as primary IaC tools; ArgoCD or Flux preferred
- AIOps / Observability Engineering - 2+ years, Alertmanager rule authoring, anomaly detection integration, event correlation, noise reduction techniques
- Working infrastructure (Linux/VM) management knowledge - 5+ years, Linux administration, VMware vCenter/VCF experience, NetApp storage management, network fundamentals (SNMP, TCP/IP)
- Secrets Management - 2+ years, CyberArk/Conjur, HashiCorp Vault, or equivalent — runtime secret injection patterns
- Minimal travel may be required
- Experience and/or knowledge of ITSM processes and workflow automation, e.g., Incident & Response Management (IRM), release management, ServiceNow ITSM integration, alert routing, escalation policy design, SLA-driven on-call workflows
- Hands-on experience or working knowledge of Boomi integration platform as a service (iPaaS) technologies
- Experience working with BAS / BMS systems in a data center / OT environment
- Hands-on experience working with AWS products in a Well-Architected Framework and multi-account model to develop various compute, storage, and network IaaS and PaaS services for IT applications
Benefits
- Medical, Telehealth, Dental and Vision
- 401(k)
- Health Savings Accounts (HSA) and Flexible Spending Accounts (FSA)
- Life and AD&D
- Short-Term and Long-Term Disability
- Flex Paid Time Off (PTO)
- Leave of Absence
- Employee Assistance Program
- Wellness Program
- Rewards and Recognition Program
Company Overview
Flexential provides IT solutions including integrated colocation, interconnection, cloud, data protection, and professional services. It was founded in 2000, and is headquartered in Charlotte, North Carolina, USA, with a workforce of 501-1000 employees. Its website is https://www.flexential.com/.