Resilience Archives - markwilson.it

This content is 6 years old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

I’ve written previously about the couple of days I spent at ExCeL in February, learning about Microsoft’s latest developments at the Ignite Tour and, a few weeks later, I found myself back at the same venue, this time focusing on Amazon Web Services (AWS) at the London AWS Summit (four years since my last visit).

Even with a predominantly Microsoft-focused client base, there are situations where a multi-cloud solution is required and so, it makes sense for me to expand my knowledge to include Amazon’s cloud offerings. I may not have the detail and experience that I have with Microsoft Azure, but certainly enough to make an informed choice within my Architect role.

One of the first things I noticed is that, for Amazon, it’s all about the numbers. The AWS Summit had a lot of attendees – 12000+ were claimed, for more than 60 technical sessions supported by 98 sponsoring partners. Frankly, it felt to me that there were a few too many people there at times…

When 4000 people all head for the same 10m wide door (next up will be queue for the gents!) #AWSSummit #ConferenceLogistics pic.twitter.com/sKIbkaOZbs
— Mark Wilson (@markwilsonit) May 8, 2019

AWS is clearly growing – citing 41% growth comparing Q1 2019 with Q1 2018. And, whilst the comparisons with the industrial revolution and the LSE research showing that 95% of today’s startups would find traditional IT models limiting were all good and valid, the keynote soon switched to focus on AWS claims of “more”. More services. More depth. More breadth.

I’m tired of rhetoric from cloud providers trying to demonstrate one-upmanship. Real customers don’t care about number of services: they want to know how to save money, increase agility whilst remaining secure and compliant. They can do that on AWS or Azure… select best fit…
— Mark Wilson (@markwilsonit) May 8, 2019

There were some good customer slots in the keynote: Sainsbury’s Group CIO Phil Jordan and Group Digital Officer Clodagh Moriarty spoke about improving online experiences, integrating brands such as Nectar and Sainsbury’s, and using machine learning to re-plan retail space and to plan online deliveries. Ministry of Justice CDIO Tom Read talked about how the MOJ is moving to a microservice-based application architecture.

After the keynote, I immersed myself in technical sessions. In fact, I avoided the vendor booths completely because the room was absolutely packed when I tried to get near. My afternoon consisted of:

Driving digital transformation using artificial intelligence by Steven Bryen (@Steven_Bryen) and Bjoern Reinke.
AWS networking fundamentals by Perry Wald and Tom Adamski.
Creating resilience through destruction by Adrian Hornsby (@adhorn).
How to build an Alexa Skill in 30 minutes by Andrew Muttoni (@muttonia).

All of these were great technical sessions – and probably too much for a single blog post but, here goes anyway…

Driving digital transformation using artificial intelligence

Amazon thinks that driving better customer experience requires Artificial Intelligence (AI), specifically Machine Learning (ML). Using an old picture of London Underground workers sorting through used tickets in the 1950s to identify the most popular journeys, Steven Bryen suggested that more data leads to better analytics and better outcomes that can be applied in more ways (in a cyclical manner).

The term “artificial intelligence” has been used since John McCarthy coined it in 1955. The AWS view is that AI is taking off because of:

Algorithms.
Data (specifically the ability to capture and store it at scale).
GPUs and acceleration.
Cloud computing.

Citing research from PwC [which I can’t find on the Internet], AWS claim that world GDP was $80Tn in 2018 and is expected to be $112Tn in 2030 ($15.7Tn of which can be attributed to AI).

Data science, artificial intelligence, machine learning and deep learning can be thought of as a series of concentric rings.

Machine learning can be supervised learning (getting better at finding targets); unsupervised (assume nothing and question everything); or reinforcement learning (rewarding high performing behaviour).

Amazon claims extensive AI experience through its own ML experience:

Recommendations Engine
Prime Air
Alexa
Go (checkoutless stores)
Robotic warehouses – taking trolleys to packer to scan and pack (using an IoT wristband to make sure robots avoid maintenance engineers).

Every day Amazon applies new AI/ML-based improvements to its business, at a global scale through AWS.

Challenges for organisations are that:

ML expertise is rare
plus: Building and scaling ML technology is hard
plus: Deploying and operating models in production is time-consuming and expensive
equals: a lack of cost-effective easy-to-use and scalable ML services

Most time is spent getting data ready to get intelligence from it. Customers need a complete end-to-end ML stack and AWS provides that with edge technologies such as Greengrass for offline inference and modelling in SageMaker. The AWS view is that ML prediction becomes a RESTful API call.

With the scene set, Steven Bryen handed over to Bjoern Reinke, Drax Retail’s Director of Smart Metering.

Drax has converted former coal-fired power stations to use biomass: capturing carbon into biomass pellets, which are burned to create steam that drives turbines – representing 15% of the UK’s renewable energy.

Drax uses a systems thinking approach with systems of record, intelligence and engagement.

Systems of intelligence need:

Trusted data.
Insight everywhere.
Enterprise automation.

Customers expect tailoring: efficiency; security; safety; and competitive advantage.

Systems of intelligence can be applied to team leaders, front line agents (so they already know that a customer has just been online looking for a new tariff), leaders (for reliable data sources), and assistant-enabled recommendations (which are no longer futuristic).

Fragmented/conflicting data is pumped into a data lake from where ETL and data warehousing technologies are used for reporting and visualisation. But Drax also pull from the data lake to run analytics for data science (using Inawisdom technology).

The data science applications can monitor usage and see base load, holidays, etc. Then, they can look for anomalies – a deviation from an established time series. This might help to detect changes in tenants, etc. and the information can be surfaced to operations teams.
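To make “a deviation from an established time series” a little more concrete, here is a minimal sketch of my own (it is not Drax’s or Inawisdom’s method, which wasn’t shown in any detail). It flags readings that sit several standard deviations away from a trailing baseline. The function name, window size, threshold and sample figures are all assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=7, threshold=3.0):
    """Return indexes of readings that deviate strongly from the trailing window.

    Hypothetical helper: `window` and `threshold` are illustrative defaults,
    not values from the talk.
    """
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            if readings[i] != mu:
                anomalies.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily consumption figures (kWh) with one obvious spike.
daily_kwh = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 10.0, 25.0, 10.2, 9.9]
print(find_anomalies(daily_kwh))  # [8]
```

A real system would also need to account for seasonality (holidays, base load and so on), which a simple trailing window like this ignores.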

Björn Reinke from @DraxBiomass speaking about analysing energy consumption and looking for anomalies to improve systems of intelligence #MachineLearning #AWSSummit pic.twitter.com/mQOP298kjQ
— Mark Wilson (@markwilsonit) May 8, 2019

AWS networking fundamentals

After hearing how AWS can be used to drive insight into customer activities, the next session was back to pure tech. Not just tech but infrastructure (albeit as a service). The following notes cover off some AWS IaaS concepts and fundamentals.

Customers deploy into virtual private cloud (VPC) environments within AWS:

For demonstration purposes, a private address range (CIDR) was used – 172.31.0.0/16 (a private IP range from RFC 1918). Importantly, AWS ranges should be selected to avoid potential conflicts with on-premises infrastructure. Amazon recommends using /16 (65536 addresses) but network teams may suggest something smaller.
AWS is dual-stack (IPv4 and IPv6) so even if an IPv6 CIDR is used, infrastructure will have both IPv4 and IPv6 addresses.
Each VPC should be broken into availability zones (AZs), which are risk domains on different power grids/flood profiles and a subnet placed in each (e.g. 172.31.0.0/24, 172.31.1.0/24, 172.31.2.0/24).
Each VPC has a default routing table but an administrator can create and assign different routing tables to different subnets.
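The address arithmetic above is easy to check with Python’s standard `ipaddress` module. This is just a sketch to show the sums: the AZ names are placeholders of mine, not anything from the session.

```python
import ipaddress

vpc = ipaddress.ip_network("172.31.0.0/16")
print(vpc.is_private)      # True: 172.16.0.0/12 is one of the RFC 1918 ranges
print(vpc.num_addresses)   # 65536

# Carve the VPC range into /24 subnets and take the first three, one per AZ.
# The AZ names are placeholders, not taken from the session.
az_names = ["az-a", "az-b", "az-c"]
subnets = dict(zip(az_names, vpc.subnets(new_prefix=24)))
for az, subnet in subnets.items():
    print(az, subnet)
```

The same module can be used to test a proposed range against on-premises networks before committing to it.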

To connect to the Internet you will need a connection, a route and a public address:

Create a public subnet (one with public and private IP addresses).
Then, create an Internet Gateway (IGW).
Finally, create a route so that the default gateway is the IGW (172.31.0.0/16 local and 0.0.0.0/0 igw_id).
Alternatively, create a private subnet and use a NAT gateway for outbound only traffic and direct responses (172.31.0.0/16 local and 0.0.0.0/0 nat_gw_id).
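The two-entry route table works because the most specific matching route wins. Here is a rough sketch of that lookup, assuming simple longest-prefix matching; `next_hop` is a hypothetical helper of mine and `igw_id` is the same placeholder used in the notes above.

```python
import ipaddress

# Route table for the public subnet described above; "igw_id" is the same
# placeholder target used in the notes, not a real gateway ID.
PUBLIC_ROUTES = {
    "172.31.0.0/16": "local",
    "0.0.0.0/0": "igw_id",
}


def next_hop(destination, routes):
    """Pick the target of the most specific route that contains `destination`."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins, so the /16 beats the /0 for addresses inside the VPC.
    network, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


print(next_hop("172.31.1.25", PUBLIC_ROUTES))    # local
print(next_hop("93.184.216.34", PUBLIC_ROUTES))  # igw_id
```

Swapping `igw_id` for `nat_gw_id` gives the private subnet variant: the lookup is identical, only the default target changes.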

Moving on to network security:

Security groups provide a stateful distributed firewall so a request from one direction automatically sets up permissions for a response from the other (avoiding the need to set up separate rules for inbound and outbound traffic).
- Using an example VPC with 4 web servers and 3 back end servers:
  - Group into 2 security groups
  - Allow web traffic from anywhere to web servers (port 80 and source 0.0.0.0/0)
  - Only allow web servers to talk to back end servers (port 2345 and source security group ID)
Network Access Control Lists (NACLs) are stateless – they are just lists and need to be explicit to allow both directions.
Flow logs work at network interface, subnet or VPC level and write output to S3 buckets or CloudWatch Logs. They can be used for:
- Visibility
- Troubleshooting
- Analysing traffic flow (no payload, just metadata)
  - Network interface
  - Source IP and port
  - Destination IP and port
  - Bytes
  - Condition (accept/reject)
DNS in a VPC is switched on by default for resolution and assigning hostnames (rather than just using IP addresses).
- AWS also has the Route 53 service for customers who would like to manage their own DNS.
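The stateful/stateless distinction is the part I find easiest to get wrong, so here is a toy model of it. To be clear, this is a simplification for illustration and not how AWS implements either feature: the class names are made up, and real rules also cover protocols, sources and rule ordering.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulFirewall:
    """Toy security group: inbound rules only, replies are tracked automatically."""
    allowed_inbound_ports: set
    _tracked: set = field(default_factory=set)

    def inbound(self, src, src_port, dst_port):
        if dst_port in self.allowed_inbound_ports:
            # Remember the flow so the reply needs no rule of its own.
            self._tracked.add((src, src_port, dst_port))
            return True
        return False

    def outbound_reply(self, dst, dst_port, src_port):
        return (dst, dst_port, src_port) in self._tracked


@dataclass
class StatelessAcl:
    """Toy NACL: each direction is checked against its own explicit list."""
    inbound_ports: set
    outbound_ports: set

    def inbound(self, dst_port):
        return dst_port in self.inbound_ports

    def outbound(self, dst_port):
        return dst_port in self.outbound_ports


web_sg = StatefulFirewall(allowed_inbound_ports={80})
print(web_sg.inbound("203.0.113.10", 50000, 80))         # True
print(web_sg.outbound_reply("203.0.113.10", 50000, 80))  # True, no outbound rule needed

nacl = StatelessAcl(inbound_ports={80}, outbound_ports=set())
print(nacl.inbound(80))      # True
print(nacl.outbound(50000))  # False until the return port is allowed explicitly
```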

Finally, connectivity options include:

Peering for private communication between VPCs
- Peering is 1:1 and can be in different regions but the CIDR must not overlap
- Each VPC owner can send a request which is accepted by the owner on the other side. Then, update the routing tables on the other side.
- Peering can get complex if there are many VPCs. There is also a limit of 125 peerings so a Transit Gateway can be used to act as a central point but there are some limitations around regions.
- Each Transit Gateway can support up to 5000 connections.
AWS can be connected to on-premises infrastructure using a VPN or with AWS Direct Connect
- A VPN is established with a customer gateway and a virtual private gateway is created on the VPC side of the connection.
  - Each connection has 2 tunnels (2 endpoints in different AZs).
  - Update the routing table to define how to reach on-premises networks.
- Direct Connect
  - AWS services on public address space are outside the VPC.
  - Direct Connect locations have a customer or partner cage and an AWS cage.
  - Create a private virtual interface (VLAN) and a public virtual interface (VLAN) for access to VPC and to other AWS services.
  - A Direct Connect Gateway is used to connect to each VPC
- Before Transit Gateway customers needed a VPN per VPC.
  - Now they can consolidate on-premises connectivity
  - For Direct Connect it’s possible to have a single tunnel with a Transit Gateway between the customer gateway and AWS.
Route 53 Resolver service can be used for DNS forwarding on-premises to AWS and vice versa.
VPC Sharing provides separation of resources with:
- An Owner account to set up infrastructure/networking.
- Subnets shared with other AWS accounts so they can deploy into the subnet.
Interface endpoints make an API look as if it’s part of an organisation’s VPC.
- They override the public domain name for service.
- Using PrivateLink, an organisation can expose only a specific service port, control the direction of communications, and no longer needs to care about IP addresses.
AWS Global Accelerator brings traffic onto the AWS backbone close to end users and then uses that backbone to provide access to services.
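Going back to the peering rule that CIDRs must not overlap, that’s something worth checking before sending a peering request. A minimal sketch, using made-up VPC names and ranges chosen only to show one clash:

```python
import ipaddress
from itertools import combinations

# Hypothetical VPC names and ranges, chosen only to show one clash.
vpcs = {
    "vpc-prod": "172.31.0.0/16",
    "vpc-dev": "10.0.0.0/16",
    "vpc-test": "172.31.128.0/20",
}


def overlapping_pairs(cidrs_by_name):
    """Return pairs of VPC names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs_by_name.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


print(overlapping_pairs(vpcs))  # [('vpc-prod', 'vpc-test')]

# Because peering is 1:1, a full mesh of n VPCs needs n * (n - 1) / 2 peerings.
print(len(list(combinations(range(10), 2))))  # 45 peerings for 10 VPCs
```

That second number is why peering “can get complex” and why a central Transit Gateway starts to look attractive as the VPC count grows.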

Creating resilience through destruction

Adrian Hornsby presenting at AWS Summit London

One of the most interesting sessions I saw at the AWS Summit was Adrian Hornsby’s session that talked about deliberately breaking things to create resilience – which is effectively the infrastructure version of test-driven development (TDD), I guess…

Actually, Adrian made the point that it’s not so much the issues that bringing things down causes as the complexity of bringing them back up.

“Failures are a given and everything will eventually fail over time”
Werner Vogels, CTO, Amazon.com

We may break a system into microservices to scale but we also need to think about resilience: the ability for a system to handle and eventually recover from unexpected conditions.

This needs to consider a stack that includes:

People
Application
Network and Data
Infrastructure

And building confidence through testing only takes us so far. Adrian referred to another presentation, by Jesse Robbins, where he talks about creating resilience through destruction.

Firefighters train to build intuition – so they know what to do in the event of a real emergency. In IT, we have the concept of chaos engineering – deliberately injecting failures into an environment:

Start small and build confidence:
- Application level
- Host failure
- Resource attacks (CPU, latency…)
- Network attacks (dependencies, latency…)
- Region attack
- Human attack (remove a key resource)
Then, build resilient systems:
- Steady state
- Hypothesis
- Design and run an experiment
- Verify and learn
- Fix
- (maybe go back to experiment or to start)
And use bulkheads to isolate parts of the system (as in shipping).
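The steady state, hypothesis, experiment and verify loop can be expressed as a tiny harness. This is my own sketch of the idea rather than anything Adrian showed: the `Experiment` class, the `run_experiment` function and the two host names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # introduce the failure
    rollback: Callable[[], None]      # the emergency stop


def run_experiment(experiment):
    """Run one experiment and report whether the hypothesis held."""
    if not experiment.steady_state():
        return "aborted: system not in steady state before injection"
    experiment.inject()
    try:
        held = experiment.steady_state()
    finally:
        # Always undo the failure, even if the check itself blows up.
        experiment.rollback()
    return "hypothesis held" if held else "weakness found: fix, then re-run"


# Toy system: two made-up hosts behind a load balancer.
hosts = {"web-1": True, "web-2": True}

host_failure = Experiment(
    name="lose one web host",
    steady_state=lambda: any(hosts.values()),
    inject=lambda: hosts.update({"web-1": False}),
    rollback=lambda: hosts.update({"web-1": True}),
)
outcome = run_experiment(host_failure)
print(outcome)  # hypothesis held
```

The hypothesis here is “the service survives the loss of one host”. With only one host in the dictionary the same experiment would report a weakness instead.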

Think about:

Software:
- Certificate Expiry
- Memory leaks
- Licences
- Versioning
Infrastructure:
- Redundancy (multi-AZ)
- Use of managed services
- Bulkheads
- Infrastructure as code
Application:
- Timeouts
- Retries with back-offs (not infinite retries)
- Circuit breakers
- Load shedding
- Exception handling
Operations:
- Monitoring and observability
- Incident response
- Measure, measure, measure
- You build it, you run it
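Two of the application-level items, retries with back-offs and circuit breakers, are simple enough to sketch. This is a generic illustration under my own assumptions (names, thresholds and delays are invented), not code from the session.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying with exponential back-off and jitter.

    Gives up (re-raising the last error) after `max_attempts`, so retries
    are never infinite.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: wait between 0 and base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-off period has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-off elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


attempts = []


def flaky():
    """Made-up dependency that fails twice and then recovers."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result, "after", len(attempts), "attempts")  # ok after 3 attempts
```

The two patterns complement each other: back-off stops a client hammering a struggling service, and the breaker stops it trying at all until the service has had time to recover.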

The AWS Well-Architected Framework has been developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications, based on some of these principles.

Adrian then moved on to consider what a steady state looks like:

Normal behaviour of system
Business metric (e.g. pulse of Netflix – multiple clicks on play button if not working)
- Amazon extra 100ms load time led to 1% drop in sales (Greg Linden)
- Google extra 500ms of load time led to 20% fewer searches (Marissa Mayer)
- Yahoo extra 400ms of load time caused 5-9% increase in back clicks (Nicole Sullivan)

He suggests asking questions about “what if?” and following some rules of thumb:

Start very small
As close as possible to production
Minimise the blast radius
Have an emergency stop
- Be careful with state that can’t be rolled back (corrupt or incorrect data)

Use a canary deployment, with A/B testing via DNS or similar, to route a small share of traffic (1%) to the chaos experiment and the rest (99%) to normal service.
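The session talked about doing the 1%/99% split at the DNS layer. As a stand-in, this sketch does the same kind of split in application code by hashing a user identifier, which has the advantage that a given user always lands in the same group. The function name, the `user-N` identifiers and the bucket scheme are my assumptions.

```python
import hashlib


def assign_group(user_id, experiment_percent=1.0):
    """Deterministically place a user in the 'chaos' or 'normal' group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 0..9999 (hundredths of a percent).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "chaos" if bucket < experiment_percent * 100 else "normal"


groups = [assign_group(f"user-{n}") for n in range(100_000)]
share = groups.count("chaos") / len(groups)
print(f"{share:.2%} of users routed to the chaos experiment")
```

Setting `experiment_percent` to 0 acts as the emergency stop: every user goes back to the normal group without a redeploy.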

Adrian then went on to demonstrate his approach to chaos engineering, including:

Fault injection queries for Amazon Aurora (can revert immediately)
- Crash a master instance
- Fail a replica
- Disk failure
- Disk congestion
DDoS yourself
- ~ wrk -t12 -c400 -d30s http://ipaddress/api/health
Add latency to network
- ~ tc qdisc add dev eth0 root netem delay 200ms
https://github.com/Netflix/SimianArmy
- Shut down services randomly
- Slow down performance
- Check conformity
- Break an entire region
- etc.
The chaos toolkit
Gremlin
- Destruction as a service!
Toxiproxy
- Sit between components and add “toxics” to test impact of issues
Kube-monkey project (for Kubernetes)
Pumba (for Docker)
Thundra (for Lambda)
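The `tc` command above adds latency at the network level. The same kind of fault can be injected in application code, which is handy when you can’t touch the host. Here is a rough sketch using a decorator: `inject_latency` and `health_check` are names I’ve made up, and this is not taken from Adrian’s code or from any of the tools listed.

```python
import functools
import random
import time


def inject_latency(delay_seconds, rate=1.0, sleep=time.sleep, rng=random.random):
    """Decorator that delays a call by `delay_seconds` for a fraction `rate` of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical endpoint handler, delayed by 200ms on every call (rate=1.0).
@inject_latency(delay_seconds=0.2, rate=1.0)
def health_check():
    return {"status": "ok"}


started = time.monotonic()
response = health_check()
elapsed = time.monotonic() - started
print(response, f"{elapsed:.2f}s")
```

Lowering `rate` keeps the blast radius small, in line with the rules of thumb earlier in the talk.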

Use post mortems for correction of errors – the 5 whys. Also, understand that there is no isolated “cause” of an accident.

My notes don’t do Adrian’s talk justice – there’s so much more that I could pick up from re-watching his presentation. Adrian tweeted a link to his slides and code – if you’d like to know more, check them out:

And Up! the slides from my talk “Creating Resiliency Through Destruction” are online https://t.co/5PpYEYL5qe and the code https://t.co/03vbnVjlUk #AWSSummit #chaosengineering #AWS pic.twitter.com/HC4v0AqqIl
— Adrian Hornsby (@adhorn) May 8, 2019

How to build an Alexa Skill in 30 minutes

Spoiler: I didn’t have a working Alexa skill at the end of my 30 minutes… nevertheless, here’s some info to get you started!

Amazon’s view is that technology tries to constrain us. Things got better with mobile and voice is the next step forward. With voice, we can express ourselves without having to understand a user interface [except we do, because we have to know how to issue commands in a format that’s understood – that’s the voice UI!].

I get the point being made – to add an item to a to-do list involves several steps:

Find phone
Unlock phone
Find app
Add item
etc.

Or, you could just say (for example) “Alexa, ask Ocado to add tuna to my trolley”.

Alexa is a service in the AWS cloud that understands requests and acts upon them. There are two components:

Alexa voice service – how a device manufacturer adds Alexa to its products.
Alexa Skills Kit – to create skills that make something happen (and there are currently more than 80,000 skills available).

An Alexa-enabled device only needs to know to wake up, then stream some “mumbo jumbo” to the cloud, at which point:

Automatic speech recognition will translate speech to text
Natural language understanding will infer intent (not just text, but understanding…)

Creating skills requires two parts:

Voice user interface:
- developer.amazon.com
Programming logic (Amazon Web Services):
- aws.amazon.com

Alexa-hosted skills use Lambda under the hood and creating the skill involves:

Give the skill a name.
Choose the development model.
Choose a hosting method.
Create a skill.
Test in a simulation environment.
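Since I didn’t leave with a working skill, here is roughly what the Lambda side of a very simple one might look like. The session pointed to the Node.js SDK, whereas this sketch uses plain Python and builds the JSON response by hand. The intent name `AddItemIntent` and its `item` slot are made up for the example and would need to match whatever is defined in the voice user interface.

```python
def build_response(speech_text, end_session=True):
    """Wrap plain text in the JSON envelope that Alexa expects back from a skill."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Entry point that Lambda invokes with the Alexa request as `event`."""
    request = event.get("request", {})
    if request.get("type") == "LaunchRequest":
        return build_response("Welcome. What should I add to your list?", end_session=False)
    if request.get("type") == "IntentRequest":
        intent = request.get("intent", {})
        # "AddItemIntent" and its "item" slot are made-up names for this sketch.
        if intent.get("name") == "AddItemIntent":
            item = intent.get("slots", {}).get("item", {}).get("value", "that")
            return build_response(f"I've added {item} to your list.")
    return build_response("Sorry, I didn't understand that.")


# A cut-down sample request, as might be sent when a user asks to add tuna.
sample_event = {
    "request": {
        "type": "IntentRequest",
        "intent": {"name": "AddItemIntent", "slots": {"item": {"name": "item", "value": "tuna"}}},
    }
}
reply = lambda_handler(sample_event, None)
print(reply["response"]["outputSpeech"]["text"])  # I've added tuna to your list.
```

Calling the handler locally with a sample event like this is a quick way to test the logic before trying it in the simulation environment.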

Finally, some more links that may be useful:

Build and host your skill in the cloud: https://alexa.design/build.
Alexa Skills Kit SDK for Node.js: https://alexa.design/nodesdk.
Alexa CLI documentation: https://alexa.design/cli and https://bit.ly/cli-guide.
AWS promotional credits for Alexa: https://alexa.design/awspromo.

In summary

Looking back, the technical sessions made my visit to the AWS Summit worthwhile but overall, I was a little disappointed, as this tweet suggests:

End of day verdict on #AWSSummit: disappointing keynote (see https://t.co/BOFiQacH5f); good breakout sessions (at least the ones I went to); awful conference app (feedback is painfully slow); OK catering; too many people (so I skipped the expo). Learned lots but hoped for more… pic.twitter.com/QtUcaKqoAS
— Mark Wilson (@markwilsonit) May 8, 2019

Would I recommend the AWS Summit to others? Maybe. Would I watch the keynote from home? No. Would I try to watch some more technical sessions? Absolutely, if they were of the quality I saw on the day. Would I bother to go to ExCeL with 12000 other delegates herded like cattle? Probably not…