Software Engineering at Google Chapter #25 - Compute as a Service (2 of 3)

  • Google has refined their processes to a point where entire data centers can be automatically spun up with almost no human intervention and very little risk
  • Engineers need to "design for failure" and assume the VMs or containers that run their software will fail. How will their software react to and recover from these failures?
  • Work should be broken up into chunks, and the results for those chunks sent back to a master, so that if there is a failure the entire process does not need to be restarted; only the chunk that failed needs to be reprocessed (see the first sketch after this list)
  • Instead of a human monitoring VMs / containers, the data center scheduler monitors them, and if one dies it starts a new one (pets vs. cattle)
  • Pet servers are unique and you need humans to feed and care for them
  • Cattle servers are all the same and if one dies you can spin up a new one
  • Also consider planned maintenance vs. unplanned outages. For planned maintenance, the data center scheduler should drain and redirect traffic rather than simply killing the container or VM (see the drain sketch after this list)
  • In general there are two types of jobs that run on compute: batch and serving
  • Batch jobs care about throughput and are short-lived
  • Serving jobs care about the latency of individual requests and are long-lived (restarted when new code is pushed)
  • Serving jobs are easier to reason about when it comes to failure because all of an application's serving nodes are interchangeable; it's easy to spin up a new one when one dies, and no retry logic is needed
  • Serving jobs should be over-provisioned to serve traffic spikes without significant latency increases
  • Do not be tempted to use this over-provisioned capacity to run batch jobs, as that defeats the purpose of over-provisioning (what happens when traffic rises quickly and the serving job cannot expand because a batch job is occupying the headroom?)
  • If you do run batch jobs alongside over-provisioned serving jobs on the same system, be certain to have a way to automatically kill the batch jobs when there is a traffic spike (see the preemption sketch after this list)
  • Google has this capability, so most of its batch workloads run for "free" in otherwise idle capacity and are quickly killed when there's a traffic spike
  • A serving job such as a cluster master or leader will have application state in memory and/or on disk, so extra precautions must be taken (usually in the form of redundancy, backup masters, etc.)
  • Another case to consider is a sharded serving job, where each machine serves part of the data. If a machine fails it must be restarted with the proper shard, and that data may be unavailable to users while the replacement machine is being spun up
  • One way to deal with stateful data is to keep it on persistent storage that is external to the VM or container, such as a RAID array or NAS (see the last sketch after this list)
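
A minimal sketch of the chunked-work idea above, in Python: work is split into chunks, results are collected by a coordinating process, and only the chunks that failed are retried. The process_chunk() and run_job() names are illustrative, not from the book.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_chunk(chunk):
    # Stand-in for real work done on a worker VM; may raise if that worker fails.
    return sum(chunk)

def run_job(chunks, max_attempts=3):
    results = {}                        # chunk index -> result
    pending = dict(enumerate(chunks))   # chunk index -> chunk data still unfinished
    for _ in range(max_attempts):
        if not pending:
            break
        failed = {}
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = {pool.submit(process_chunk, data): idx
                       for idx, data in pending.items()}
            for future in as_completed(futures):
                idx = futures[future]
                try:
                    results[idx] = future.result()
                except Exception:
                    # Only the failed chunk is re-queued; completed chunks keep their results.
                    failed[idx] = pending[idx]
        pending = failed
    if pending:
        raise RuntimeError(f"chunks still failing after {max_attempts} attempts: {sorted(pending)}")
    return [results[i] for i in range(len(chunks))]
```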
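
For planned maintenance, one common drain pattern (a sketch, assuming the scheduler sends SIGTERM and waits out a grace period before evicting the container) is to stop accepting new requests while letting in-flight ones finish:

```python
import signal
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # The scheduler has asked us to go away: stop taking new work,
    # but let requests already in progress complete.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def do_work(request):
    # Placeholder for the actual request handling.
    return "ok"

def handle_request(request):
    if draining.is_set():
        # 503 tells the load balancer to retry the request on another replica.
        return 503, "draining"
    return 200, do_work(request)
```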
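
A preemption sketch for batch jobs sharing over-provisioned serving capacity. serving_utilization and the task handles below are hypothetical stand-ins for whatever interface the real cluster scheduler exposes; the point is only that batch work is killed the moment serving jobs need their headroom back:

```python
import time

SPIKE_THRESHOLD = 0.8  # fraction of provisioned serving capacity in use

def reclaim_capacity(serving_utilization, running_batch_tasks):
    """serving_utilization: () -> float; running_batch_tasks: () -> list of killable tasks."""
    while True:
        if serving_utilization() > SPIKE_THRESHOLD:
            for task in running_batch_tasks():
                # Batch jobs care about throughput, not latency, so killing and
                # re-running them later is acceptable; serving latency is not negotiable.
                task.kill()
        time.sleep(1)
```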
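
Finally, a sketch of keeping state off the ephemeral local disk so a replacement container can pick up where the old one left off. The /mnt/state path is a hypothetical externally mounted persistent volume:

```python
import json
import os

STATE_PATH = "/mnt/state/app_state.json"  # external volume; survives container replacement

def load_state():
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH) as f:
            return json.load(f)
    return {}

def save_state(state):
    # Write-then-rename so a crash mid-write never leaves a corrupt state file.
    tmp = STATE_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_PATH)
```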
Thank you for your time and attention.
Apply what you've learned here.
Enjoy it all.