Jalisco, Mexico all on-site
Job Description
Software Engineer – Cloud Infrastructure & FinOps
Fully remote
Nearshore only
Are you looking for a career that makes a positive difference in your life and in the lives of learners and educators across the globe? Do you want to work with fun and social people in a positive and engaged virtual office environment?
We are hiring a Software Engineer – Cloud Infrastructure & FinOps who will build and support reliable, high-capacity, and well-performing systems in support of our mission to reimagine learning for millions of students and learners worldwide. We call this work "site reliability engineering".
As a site reliability engineer, you will work in a small team accountable for cost optimization, telemetry, security, performance, and reliability in AWS infrastructure. You will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that increase predictability and shorten time to market while reducing cost. If you love to build developer tools and automation, know AWS services inside out, have experience with complex distributed systems, and like engineering software solutions to cloud-related problems with a focus on cost efficiency, then you will thrive in this position.
Our Stack
1. Code: Python, JavaScript, PHP, Node.js, YAML, Bash
2. RDBMS: PostgreSQL, MySQL
3. Cache: ElastiCache (Redis/Memcached), DynamoDB
4. Containers: ECS & Docker
5. Cloud: Amazon Web Services (AWS)
6. Telemetry: New Relic (preferred), Datadog, CloudWatch, Cost Explorer
7. Build: GitHub Actions (preferred), Jenkins (nice to have), GitHub Enterprise, and more
8. Run: PagerDuty, Exigence
9. Config management and provisioning: Puppet, Ansible (nice to have)
10. Web: Apache httpd, NGINX
11. Infrastructure as code: Terraform (preferred), Serverless, CloudFormation
Your Contributions
FinOps (Cloud Financial Management)
1. Drive cloud cost optimization strategies by analyzing usage patterns and implementing right-sizing recommendations.
2. Partner with engineering and product teams to design cost-aware architectures and establish cost accountability.
3. Develop and maintain dashboards to provide visibility into cloud spend.
4. Proactively identify opportunities for Reserved Instances, Savings Plans, and other AWS cost-saving mechanisms.
5. Educate teams on cloud cost best practices and foster a culture of financial responsibility in cloud usage.
Cloud Engineering
1. Hands-on design, analysis, development, and troubleshooting of highly distributed, large-scale production systems and event-driven, cloud-based services.
2. Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code).
3. Participate in continual learning of the AWS ecosystem, game day scenarios, and professional conferences.
4. Collaborate with development teams on solutions for enterprise applications built on our software stack.
Observability Engineering
1. Own reliability, uptime, system security, cost, operations, capacity, and resiliency, and the performance analysis thereof.
2. Define, monitor, and report on service level indicators for application workloads.
3. Support on-call rotations for operational duties that have not been addressed with automation, with an eye for correcting issues that result in on-call alarms.
4. Maintain telemetry that improves visibility into our applications' performance and business metrics and keeps operational workload in check.
5. Develop, communicate, collaborate on, and monitor standard processes to promote the long-term health and sustainability of operational development tasks.
DevSecOps
1. Support healthy software development practices, including complying with Agile software development methodology and building standards for code reviews, work packaging, and continuous delivery.
2. Partner with cybersecurity to develop plans and automation to respond to new risks and vulnerabilities.
Systems Engineering
1. Collaborate with systems admins to coordinate middleware, network, storage, database, Windows, Linux, and VMware maintenance.
2. Automate legacy on-prem system maintenance and migrate to cloud via thoughtful redesign.
Resiliency Engineering
1. Collaborate with dev teams to identify failure points and blast radius of systems.
2. Validate effectiveness of monitoring and observability configurations.
3. Coordinate failure injection testing.
4. Observe and document steady-state production levels and growth patterns.
5. Plan and forecast for seasonal growth, communicate trend lines to leadership, and enhance infrastructure scaling plans to accommodate 2x planned load.
6. Coordinate improvements of existing software and infrastructure to meet resiliency goals.
Performance Engineering
1. Improve performance, availability, and scalability by troubleshooting servers, network, hardware, and capacity.
2. Plan, execute, and report on performance tests.
3. Tune systems for low latency, high performance, and scalability.
4. Support load testing with open-source tools like k6.io, JMeter, etc.
Qualifications
1. Experience as a software engineer, with practical experience developing, debugging, and deploying enterprise applications.
2. Experience reporting on cloud infrastructure costs using AWS Cost Explorer.
3. Experience with infrastructure automation technologies like Terraform.
4. Expertise in container/container-fleet-orchestration technologies like ECS or Kubernetes.
5. Versatility with troubleshooting diverse sets of hosting technologies: web server platforms, application platforms, operating systems, network components, virtualization technologies, storage, and database platforms.
6. Expertise with continuous deployment based software development lifecycles (e.g. CI/CD).
7. Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora).
8. Experience with application caching strategies and high-concurrency workloads.
9. Expertise with lean/agile deployment processes (blue/green, ZDT, canary, load balancer/DNS strategies).
10. Familiarity with telemetry SaaS systems like New Relic.
11. Strong problem-solving, root cause analysis, and systems engineering skills.
12. Excellent communication skills.
13. Ability to design and manage escalation response plans from monitoring: react, respond, remediate, and retrospect in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways.
14. Demonstrated expertise building and managing highly scaled production infrastructure in the cloud.
Nice to Have
1. Being able to translate between development, operations, security, product, and management dialects is a highly sought skill.
2. MHE is a polyglot organization. Being "conversational" in JavaScript/TypeScript, Python, PHP, Ruby, Golang, Java, Bash, Markdown, reStructuredText, HCL, JSON, YAML, and TOML would be valuable. Being fluent in 2-3 of them would be a huge plus.
3. BS degree in computer science (or related technical field and/or equivalent industry experience) preferred.