Software Engineering at Google Chapter #25 - Compute as a Service (3 of 3)

  • Google learned that it's best to provision your cache to meet your latency goals, but provision the core application for total load, so the system survives even when the cache is cold or fails
  • Tradeoffs in these realms are ones of cost: more redundancy means more cost
  • Applications can pull in data from a persistent source in order to "warm up" before they start serving requests
  • Hard-coded hostnames are pets, not cattle
  • Instead, have applications connect to a service through a service discovery system such as etcd or Consul
  • Clients need to retry and handle failures gracefully
  • Designing for idempotency is important: a machine may temporarily go offline, the scheduler declares it dead and starts a replacement, and then the old machine comes back. Now two tasks are processing the same data. How will your application handle this?
  • There is a tradeoff of time and resources when developers run one-off jobs on their laptops versus distributed on the compute infrastructure
  • Compute resources are much cheaper than developers' time
  • Containers provide abstractions including the file system, named resources, network ports, and more
  • Batch jobs can be killed without warning and restarted elsewhere
  • Serving jobs need to have their restarts throttled and staggered so that the remaining tasks are not overwhelmed (e.g., if you kill 50% of a serving job at once, the surviving 50% receives double the load and the entire application may crash)
  • Your organization's choice of compute service is important because it quickly becomes "locked in" by the tooling and processes built around it
  • Serverless architectures require that your code be truly stateless
  • Google uses very little serverless because Borg already does a good enough job with containers
  • Beware public cloud vendor lock-in
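The "no hard-coded hostnames" bullet can be sketched in code. This is a minimal, hedged example that picks an endpoint from a Consul catalog response rather than a fixed hostname; the function name and the sample data are my own, and in a real deployment you would fetch this JSON from the local Consul agent (e.g. `GET http://localhost:8500/v1/catalog/service/<name>`) rather than pass it in directly.

```python
import json
import random

def pick_endpoint(catalog_json):
    """Choose a (host, port) for a service from a Consul catalog response.

    The point: the caller never hard-codes a hostname; it asks the
    discovery system which instances currently exist and picks one.
    """
    instances = json.loads(catalog_json)
    if not instances:
        raise LookupError("no instances registered for this service")
    inst = random.choice(instances)  # naive load spreading across instances
    host = inst.get("ServiceAddress") or inst["Address"]
    return host, inst["ServicePort"]

# Sample (hypothetical) catalog response with one registered instance:
sample = '[{"Node": "n1", "Address": "10.0.0.5", "ServiceAddress": "10.0.0.5", "ServicePort": 8080}]'
print(pick_endpoint(sample))
```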
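The "retry and handle failures gracefully" bullet is usually implemented as exponential backoff with jitter. A minimal sketch, assuming the caller wraps its network call in a function; the helper name and parameters are illustrative, not from the book.

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn(), retrying transient failures with capped exponential backoff.

    Full jitter (random sleep up to the backoff cap) avoids retry storms
    where every client hammers the recovering service in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Only transient errors (here `ConnectionError`) are retried; permanent errors should propagate immediately rather than waste the retry budget.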
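The idempotency bullet (two copies of a task processing the same data after a false-dead restart) is commonly handled by deduplicating on a stable record key. A minimal sketch with an in-memory set standing in for what would need to be a durable, shared store in production; the class and field names are my own.

```python
class IdempotentProcessor:
    """Apply each record at most once, keyed by a stable record id.

    If a presumed-dead replica comes back and replays records the new
    replica already handled, the duplicates are detected and skipped
    instead of being double-applied.
    """

    def __init__(self):
        self._applied = set()  # stand-in for a durable store in real use
        self.total = 0

    def process(self, record_id, amount):
        if record_id in self._applied:
            return False  # duplicate delivery: no effect
        self._applied.add(record_id)
        self.total += amount
        return True
```

With this shape, "process the same record twice" and "process it once" leave identical state, which is exactly the property the resurrected-machine scenario requires.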



Thank you for your time and attention.
Apply what you've learned here.
Enjoy it all.