One place for hosting & domains

      Engineering

      Scaling Up Engineering: Managing People, Process, and Technology at Each Growth Stage


      Video

      “The significant problems we face cannot be solved at the same level of thinking we were at when we created them.”

      About the Talk

      Al Sene, VP of Engineering at DigitalOcean, shares how you can avoid the pitfalls of a rapidly growing engineering team and anticipate key inflection points when scaling. Get tips on managing resources across people, process, and technology at each growth stage.

      Resources

      About the Presenter

      Al Sene is VP of Engineering at DigitalOcean. He leads DigitalOcean’s Product Development teams, including Compute, Networking, Storage, Developer Productivity, and Data Platforms & Sciences. He is responsible for driving innovation and velocity in key services and products to meet the evolving demands of DigitalOcean customers. Prior to DigitalOcean, he held VP of Engineering roles at DDN and NexGen Storage – which was acquired by Fusion-io and SanDisk. He has also held several technical and management roles at Hewlett-Packard. He studied Computer Science at St. Cloud State University and received his MBA from Colorado State University.



      Source link

      Pain-Free Quarterly Planning for Product & Engineering Teams


      How to Join

      This Tech Talk is free and open to everyone. Register on Eventbrite here to receive a link to join on Tuesday, November 17, 2020, 4:30–5:15 p.m. ET.

      About the Talk

      Quarterly planning shouldn’t be like a visit to the dentist! Learn from the DigitalOcean Marketplace team about how they’re able to align and excite the team around goal setting with an easy-to-follow process they’ll teach you, live.

      What You’ll Learn

      • Tools you need to engage and align your entire team in your quarterly goal-setting and planning process, instead of it just being a “managers” or “product” thing.
      • Specific steps that take you from brainstorming to decision making and final goal selection, in less than two weeks.

      This Talk is Designed For

      • Product managers
      • Engineering managers
      • Anyone leading the product function, especially startup founders

      About the Presenters

      Antonio Rosales (@a_webtone) and John Gannon (@johnmgannon) have worked together on the DigitalOcean Marketplace as an Engineering Manager/Product Manager tandem from inception through its scale-up beyond 150+ open source and commercial apps.

      To join the live Tech Talk, register here.





      Source link

      A Guide to Evaluating Cloud Vendor Lock-In with Site Reliability Engineering


      Adoption of cloud services from the Big 3—AWS, Azure and Google—has outpaced new platforms and smaller niche vendors, prompting evolving discussion around the threat of “lock-in.” Thorough evaluation using common tenets can help determine the risk of adopting a new cloud service.

      Vendor lock-in is a service delivery method that makes a customer reliant on the vendor and limited third-party vendor partners. A critical aspect of vendor lock-in revolves around the risk a cloud service will inhibit the velocity of teams building upon cloud platforms. Vendor lock-in can immobilize a team, and changing vendors comes with substantial switching costs. That’s why even when a customer does commit to a vendor, they’re increasingly demanding the spend portability to switch infrastructure solutions within that vendor’s portfolio.

      There are common traits and tradeoffs to cloud vendor lock-ins, but how the services are consumed by an organization or team dictates how risky a lock-in may be. Just like evaluating the risks and rewards of any critical infrastructure choice, the evaluation should be structured around the team’s principles to make it more likely that the selection and adoption of a cloud service will be successful.

      The Ethos of Site Reliability Engineering

      Site Reliability Engineering (SRE), as both an ethos and job role, has gained a following across startups and enterprises. While teams don’t have to strictly adhere to it as a framework, the lessons and approaches documented in Google’s Site Reliability Engineering book provide an incredible common language for the ownership and operation of production services in modern cloud infrastructure platforms.

      Product Life Cycle
      The lifecycle of a typical service. The PRR can be started at any point of the service lifecycle, but the stages at which SRE engagement is applied have expanded over time. Learn more.

      The SRE ethos can provide significant guidance for the adoption of new platforms and infrastructure, particularly in its description of the Product Readiness Review (PRR). While the PRR as described in the Google SRE Book focuses on the onboarding of internally developed services for adoption of support by the SRE role, its steps and lessons can equally apply to an externally provided service.

      The SRE ethos also provides a shared language for understanding the principles of configuration and maintenance as code, or at least automation. If “Infrastructure as Code” is an operational tenant of running production systems on Cloud platforms, then a code “commit” can be thought of as a logical unit of change required to enact an operational change to a production or planned service. The number of commits required to enact adoption of a service can be considered a measurement of the amount of actions required to adopt a given cloud service, functionally representing a measurement of lock-in required for adoption of the cloud service.

      Combining the principles or SRE and its PRR model with the measurement of commits is a relatively simple and transparent way to evaluate the degrees of lock-in adoption the organization and team will be exposed to during adoption of a cloud service.

      Eliminating toil, a key tenet of SRE, should also be considered. Toil is a necessary but undesirable artifact of managing and supporting production infrastructure and has real cost to deployment and ongoing operations. Any evaluation or readiness review of a service must stick to the tenants of the SRE framework and acknowledge toil. If not, the reality of supporting that service will not be realized and may prevent a successful adoption.

      Putting Site Reliability Engineering and Product Readiness Review to Work

      Let’s create a hypothetical organization composed of engineers looking to release a new SaaS product feature as a set of newly developed services. The team is part of an engineering organization that has a relatively modern and capable toolset:

      • Systems infrastructure is provisioned and managed by Change Management backed by version control
      • Application Services are developed and deployed as containers
      • CI/CD is trusted and used by the organization to manage code release lifecycles
      • Everything and anything can be instrumented for monitoring and on-call alarms are sane
      • All underlying infrastructure is “reliable,” and all data sets are “protected” according to their requirements

      The team can deploy the new application feature as a set of services through its traditional toolset. But this release represents a unique opportunity to evaluate new infrastructure platforms that may potentially leverage new features or paradigms of that platform. For example, a hosted container service might simplify horizontal scaling or improve self-healing behaviors over the existing toolsets and infrastructure, and because the new services are greenfield, they can be developed to best fit the platform of choice.

      However, the adoption of a new platform exposes the organization to substantial risk, both from the inevitable growing pains in adopting any new service, and from exposing the organization to unforeseen and unacceptable amounts of vendor lock-in given its tradeoffs. This necessitates a risk discovery process as part of the other operational concerns of the PRR.

      Risk Discovery During a PRR

      Using the SRE PRR model and code commits as a measurement of change, we should have everything we need to evaluate the viability of a service along with its exposure to lock-in. The PRR model can be used to discover and map specifics of the service to the SRE practices and principles.

      Let’s again consider the hypothetical hosted container service that has unique and advantageous auto-scaling qualities. While it may simplify the operational burden of scaling, the new service presents a substantial new barrier to application delivery lifecycles.

      Each of the following items represent commits discovered during a PRR of the container service:

      • The in-place change management toolset may not have the library or features to provision and manage the new container service, and several commits may be required to augment change management to reflect that.
      • More than just eliminating toil, those change management commits are required in order to include the new container service into a responsible part of the CI/CD pipeline.
      • The new container service likely has new constraints for network security or API authorization and RBAC, which needs to be accounted for in order to minimize security risk to the organization. Representing these as infrastructure as code can require a non-trivial effort represented by more commits associated with the service.
      • The new container service may have network boundaries that present a barrier to log aggregation or access to other internal services required as a dependency for the new feature. Circumventing the boundaries or mirroring dependencies will likely require commits to implement and can be critical to the tenet of minimizing toil.
      • Persisting and accessing data in a reliable and performant way could be significantly more effort than any other task in the evaluation. The SRE book has a chapter on Data Integrity that is uniquely suited to cloud services, highlighting the amount of paradigms and effort required to make data available reliably in cloud services. Adoption of those paradigms and associated effort would have to be represented as commits to the overall adoption, continuing the exposure of risk.
      • Monitoring and instrumenting, though often overlooked, are anchor tenants of Site Reliability Engineering. Monitoring of distributed systems comes with a steep learning curve and can present a significant risk to the rest of the organization if not appropriately architected and committed to the monitoring toolset.
      • Finally, how will the new service and infrastructure be decommissioned? No production service lives forever, and eventually a feature service will either be re-engineered or potentially removed. Not only does the team need to understand how to decommission the platform, but they must also have the tooling and procedures to remove sensitive data, de-couple dependencies or make other efforts that are likely to require commits to enact.

      Takeaways from a PRR Using the SRE Framework

      While certainly not a complete list of things to discover during a PRR, the list above highlights key areas where changes must be made to the organization’s existing toolset and infrastructure in order to accommodate the adoption of a new container hosting service. The engineering time, as measured by commits, can evaluate the amount of work required. Since the work is specific to the container service, it functionally represents measurable vendor lock-in.

      Another hosting service with less elegant features may have APIs or tooling that are less effort for adoption, minimizing lock-in. Or the amount of potential lock-in is so significant that other solutions, such as a self-hosted container service, represent more value for less effort and lock-in risk than the adoption of an outside service. The evaluation can only be made by the team responsible for the decision. The right mode of evaluation, however, can make that decision a whole lot easier.

      Dan Lotterman


      READ MORE



      Source link