Oracle Principal Site Reliability Engineer-DevOPS in Cambridge, Massachusetts
Analyze, design, develop, troubleshoot, and debug software programs for commercial or end-user applications. Write code, complete programming, and perform testing and debugging of applications.
As a member of the software engineering division, you will analyze and integrate external customer specifications. Specify, design and implement modest changes to existing software architecture. Build new products and development tools. Build and execute unit tests and unit test plans. Review integration and regression test plans created by QA. Communicate with QA and porting engineering to discuss major changes to functionality.
Work is non-routine and very complex, involving the application of advanced technical/business skills in area of specialization. Leading contributor individually and as a team member, providing direction and mentoring to others. BS or MS degree or equivalent experience relevant to functional area. 7 years of software engineering or related experience.
This is a remote/office based position which may be performed anywhere in the United States except for within the state of Colorado.
Oracle is an Affirmative Action-Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veterans status, age, or any other characteristic protected by law.
Principal Site Reliability Engineer-DevOPS
Can be located anywhere in the US
Must be US citizens
Are you interested in building large-scale distributed infrastructure for the cloud? The Pandemic Response Systems – Infra. and Ops. team is building new Software-as-a-Service technologies that operate at high scale in a broadly distributed, multi-tenant cloud environment. Oracle’s extensive enterprise customer base is looking for rock-solid cloud solutions that provide the same reliability and effectiveness that they have come to expect from Oracle.
Our customers run their businesses on our cloud, and our mission is to provide them with best in class, foundational cloud services. Oracle's Cloud team is being built with an entrepreneurial spirit that promotes an energetic, creative, and collaborative environment; while ensuring that employees are supported in their career goals and have opportunities for training and education. We appreciate and value commitment to family and enthusiastically encourage work / life balance.
What we’re looking for:
We are looking for a Site Reliability Engineer to join the Pandemic Response Systems – Infra. and Ops. team. The ideal candidate is technically strong and able to persevere through complexity and ambiguity. They’ve worked directly on services that are highly available, scalable, and redundant. Automation is a core tenet of everything they do. They understand that simple systems are easier to operate and troubleshoot. They can balance speed with iteration and incremental improvements. They’ve made life easier for other developers and have motivated their teams to make both process and service improvements.
If you are passionate about taking ownership of big technical challenges and producing software solutions that have broad, significant impacts - come join our team!
Candidates should have broad working knowledge across multiple domains, but we love to see specialization as well. The basics we expect are: Networking, Linux Systems Engineering, Software Engineering/Automation, Database Services (big data technologies) and Distributed Systems.
In this role, you will:
As a Site Reliability Engineer (SRE) within the Pandemic Response Systems – Infra. and Ops. team, you will assist in designing and maintaining operational processes for hosting, processing, transforming, and analyzing data. Your first mission will be to work closely with our software developers and data architects to define a sustainable operational model for PRS-IO services. This includes mechanisms to scale the systems by way of easy-to-use tooling and automation. You will work in concert with developers to evolve systems and products for better scalability and reliability and to enable developer velocity. You will also author and maintain operational run books to help reduce mean Time of Incidents (TOI), and you will be responsible for managing and triaging operational tickets pertaining to the data platform services. An emphasis on driving prioritization and execution of work based on business impact is a must.
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence.
Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services.
Develop designs, architectures, standards, and methods for large-scale distributed systems.
Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Work with other engineers within the Pandemic Response Systems – Infra. and Ops. team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services.
Articulate technical characteristics of services and technology areas and guide development teams to engineer and add capabilities to internal Oracle services.
Act as ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
Utilize a deep understanding of service topology and the dependencies required to troubleshoot issues and define mitigations.
Understand and explain the effect of product architecture decisions on distributed systems.
Serve as part of a 24x7 on-call rotation in support of the Pandemic Response Systems – Infra. and Ops. team.
Professional curiosity and a desire to develop a deep understanding of services and technologies.
Bachelor’s or Master’s degree in Computer Science or equivalent related field experience
Experience with Java, Python, or C including Object Oriented programming
Experience working with fault tolerant, highly available, high throughput, distributed, scalable systems
Aptitude to be a good team player and the desire to learn and implement new Cloud technologies as needed
Excellent organizational, verbal, and written communication skills
5 years of experience in two or more of the following
Software development / operations
Developing/operating large scale distributed services / applications
System Administration including Linux internals, TCP/IP, DNS, Load balancing technologies
Container administration and development utilizing Kubernetes, Docker, Mesos, or similar
Infrastructure automation through Terraform, Chef, Ansible, Puppet or similar
Big Data Infrastructure including Hadoop, Spark, NoSQL, Object Storage, or similar
Experience with TCP/IP and socket programming
Knowledge of cloud compute technologies, network monitoring, data processing and analytics
Experience with CI/CD pipelines, including VCS (git, svn, etc.)
Job: Product Development
Title: Principal Site Reliability Engineer-DevOPS
Location: United States
Requisition ID: 210008RL