Amazon Web Services Archives

Weeknote 2024/06: more playing with NFC; thoughts on QR code uses; and a trip to AWS’ UK HQ

Posted on Friday 9 February 2024 By Mark Wilson

This content is 1 year old. I don't routinely update old blog posts as they are only intended to represent a view at a particular point in time. Please be warned that the information here may be out of date.

Last week’s weeknote taught me one of two things. Either I’m getting boring now; or AI fatigue has reached a level where people just read past anything with ChatGPT in the title. Or maybe it was just that the Clippy meme put people off…

Whilst engagement is always nice, I write these weeknotes for mindful reflection. At least, that’s what I tell myself when I’m writing them. There’s also a part of me that says “you’ve done six weeks now… don’t stop and undo all that work”. Hmm, Sunk Cost Fallacy anyone?

So, let’s get stuck into what’s been happening in week 6 of 2024… there seems to be quite a lot here (or at least it took me a few hours to write!)

This week at work

Even with the input from ChatGPT that I mentioned last week, I’m still struggling to write data sheets. Maybe this is me holding myself back with my own expectations around the output. It’s also become a task that I simply must complete – even in draft – and then hand over to others to critique. Perfection is the enemy of good, and all that!

I’m also preparing to engage with a new client to assist with their strategy and innovation. One challenge is balancing the expectations of key client stakeholders, the Account Director, and the Service Delivery Manager with my own capabilities. In part, this is because expectations have been based on the Technical Architect who is aligned to the account. He’s been great on the technical side but I’m less hands-on and the value I will add is more high-level. And this is a problem of our own making – everyone has a different definition of what an (IT) Architect is. I wrote about this previously:

What’s needed are two things – a really solid Technical Architect with domain expertise, and someone who can act as a client side “CTO”. Those are generally different skillsets.

My work week ended with a day at Amazon Web Services (AWS). I spend a lot of time talking about Microsoft Azure, but my AWS knowledge is more patchy. With a multi-cloud mindset (and not just hybrid with Node4), I wanted to explore what’s happening in the world of AWS. More on that in a bit…

This week in tech

Let’s break this up into sections as we look at a few different subjects…

More fun with NFC tags

A few weeks ago, I wrote about the NFC tags I’d been experimenting with. This week I took it a bit further with:

Programming tags using the NFC Tools app. This means the tag action doesn’t rely on an iOS Shortcut and so isn’t limited to one user/device. Instead, the tag has a record stored in its memory that corresponds to an action – for example it might open a website. I was going to have a tag for guests to automatically connect to the guest Wi-Fi in our house but iOS doesn’t support reading Wi-Fi details from NFC (it’s fine with a QR code though… as I’ll discuss in a moment).
Using a tag and an automation to help me work out which bins to put out each week. Others have said “why not just set a recurring reminder?” and that is what I do behind the scenes. The trouble with reminders is notifications. Instead of the phone reminding me because it’s the right day (but perhaps I’m in the wrong place), I can scan and check which actions are needed this week.
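As a rough illustration of the "which bins this week?" check, here is a minimal sketch in Python. It assumes a hypothetical two-week rotation keyed on whether the ISO week number is odd or even; the bin names and the rotation are made up for the example, and a real collection schedule will differ.

```python
from datetime import date

# Hypothetical two-week rotation; real collection schedules vary by council.
ROTATION = {
    0: ["general waste", "food waste"],               # even ISO weeks (assumed)
    1: ["recycling", "garden waste", "food waste"],   # odd ISO weeks (assumed)
}


def bins_for(day: date) -> list[str]:
    """Return the bins due in the ISO week that contains `day`."""
    week = day.isocalendar()[1]  # ISO week number, 1..53
    return ROTATION[week % 2]


if __name__ == "__main__":
    print(bins_for(date.today()))
```

The point of scanning a tag rather than waiting for a notification is that the lookup happens when I ask for it, which is what calling `bins_for` on demand mimics.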

A breakthrough with the biggest challenge any home owner has to navigate: which bins to put out

“Solved” using an NFC tag in the kitchen and some iOS Reminders… pic.twitter.com/eUSnuOW400
— Mark Wilson (@markwilsonit) February 5, 2024

QR codes are not the answer to sharing every link…

Yesterday, I couldn’t help but notice how many QR codes featured in my day. Unlike most of my recent journeys, my train ticket didn’t have a code. This is because Thameslink (the train operating company for my train from Bedford to London) appears to be stuck on an old technology stack. Their app is pretty useless and sends me to their website to buy tickets, which I then have to collect from a machine at the station. If I need to collect a ticket I might as well buy it on the day from the same machine (there are no Advance discounts available on my journey). So, paper train tickets with magnetic stripes it was.

Then, I was networking with some of the other delegates at the AWS re:Invent re:Cap event and found that people share QR codes from the LinkedIn app now. How did I not know this was a thing? (And to think I am playing with programming NFC tags to do cool things.) To be fair, I haven’t got out much recently – far too much of my post-pandemic work for risual was online. I even have paper business cards in my work bag. I don’t think I’ve given one to anyone in a long time though…

But QR codes were everywhere at AWS. They were in every presentation for links to product information, feedback links, even for the Wi-Fi in the room. And that’s the problem – QR codes are wonderful on a mobile device. But all too often someone creates a code and says “let’s share this – it will be cool”, without thinking of the use case.

A QR code for exchanging details in person. Yep, I get that.
A QR code on physical marketing materials to direct people to find out more. That works.
A QR code on an email. Get real. I’m reading it on one device – do you really want me to get another one to scan the code?
A QR code on the back of a van. Nice in principle but it’s a moving vehicle. Sometimes it won’t work so better to have a URL and phone number too. In which case what purpose does the QR code serve?
Multiple QR codes on a presentation slide. Hmm… tricky now. The camera app’s AI doesn’t know which one to use. What’s wrong with a short URL? Camera apps can usually recognise and scan URLs too.
QR codes for in-room Wi-Fi. Seems great at first, and worked flawlessly on my phone but I couldn’t get them to work on a Windows laptop. Well, I could read them in the camera app but it wouldn’t let me open the URL (or copy it to examine and find the password). For that I needed an app from the Microsoft Store. And I was offline. Catch 22. Luckily, someone wrote the password on a white board. Old skool. That works for me.
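For anyone stuck in the same Wi-Fi catch-22, the text inside a Wi-Fi QR code is usually a plain string in the widely used `WIFI:T:<auth>;S:<ssid>;P:<password>;;` convention, with a backslash escaping special characters. The following is a minimal sketch of a parser for that string, assuming you can get the decoded text out of the camera app somehow; the network name and password in the usage note are made up.

```python
def parse_wifi_qr(payload: str) -> dict[str, str]:
    """Parse a 'WIFI:' QR payload into its fields (e.g. T, S, P, H)."""
    if not payload.startswith("WIFI:"):
        raise ValueError("not a Wi-Fi QR payload")

    fields: dict[str, str] = {}
    current: list[str] = []
    chars = iter(payload[len("WIFI:"):])
    for ch in chars:
        if ch == "\\":
            # A backslash escapes the next character (e.g. '\;' is a literal ';').
            current.append(next(chars, ""))
        elif ch == ";":
            # An unescaped ';' ends a field; an empty field marks the end of the payload.
            if current:
                key, _, value = "".join(current).partition(":")
                fields[key] = value
            current = []
        else:
            current.append(ch)
    return fields
```

Calling `parse_wifi_qr(r"WIFI:T:WPA;S:Guest\;Net;P:s3cr3t;;")` gives a dictionary whose `P` entry is the password, which is all I needed from the whiteboard.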

More of my tech life

I think Apple might have launched a VR headset. This is the meme that keeps on giving…

It was bound to happen pic.twitter.com/MYDxOSOv9O
— Sharat Chander (@Sharat_Chander) February 4, 2024

I learned that Google uses the 1e100.net domain to identify its servers, and the name comes from the scientific notation for 1 googol.
And I wonder how many call centre managers updated IVR system messages this week to remove the “unusually high call volumes” message after Martin Lewis got interested in the issue.
It looks like Google Street View is moving into stations:

Street View in stations. Is that new? pic.twitter.com/g2aChbr7Lv
— Mark Wilson (@markwilsonit) February 8, 2024

That visit to the AWS offices that I mentioned earlier…

On my way to an AWS event today… seems like the right occasion to wear cloudy socks (even if they do say Microsoft Azure around the top!) pic.twitter.com/m7jUkIMowa
— Mark Wilson (@markwilsonit) February 8, 2024

I started writing this on the train home, thinking there’s a lot of information to share. So it’s a brief summary rather than trying to include all the details:

The AWS event I attended was a recap of the big re:Invent conference that took place a few months ago. It took place at AWS’s UK HQ in London (Holborn). I’ve missed events like this. I used to regularly be at Microsoft’s Thames Valley Park (Reading) campus, or at a regional Microsoft TechNet or MSDN event. They were really good, and I knew many of the evangelists personally. These days, I generally can’t get past the waitlist for Microsoft events and it seems much of their budget is for pre-recorded virtual events that have huge audiences (but terrible engagement).
It was a long day – good to remind me why I don’t regularly commute – let alone to London. But it was great to carve out the time and dedicate it to learning.
Most of the day was split into tracks. I could only be in one place at one time so I skipped a lot of the data topics and the dedicated AI/ML ones (though AI is in everything). I focused on the “Every App” track.
A lot of the future looking themes are similar to those I know with Microsoft. GenAI, Quantum. The product names are different, the implementation concepts vary a little. There may be some services that one has and the other doesn’t. But it’s all very relatable. AWS seems a little more mature on the cost control front. But maybe that’s just my perception from what I heard in the keynote.
The session on innovating faster with Generative AI was interesting – if only to understand some of the concepts around choosing models and the pitfalls to avoid.
AWS Step Functions seem useful and I liked the demo, which entertained a friend’s child by getting ChatGPT to write a story then asking Dall-E to illustrate it.
One particularly interesting session for me was about application modernisation for Microsoft workloads. I’m not a developer, but even I could appreciate the challenges (e.g. legacy .NET Framework apps), and the concepts and patterns that can help (e.g. strangler fig to avoid big bang replacement of a monolith). Some of the tools that can help looked pretty cool too.
DeepRacer is something I’d previously ignored – I have enough hobbies without getting into using AI to drive cars. But I get it now. It’s a great way to learn about cloud, data analysis, programming and machine learning through play. (Some people don’t like the idea of “play” at work, so let’s call it “experimentation”).
There’s some new stuff happening in containers. AWS has EKS and ECS. Microsoft has AKS and ACS. Kubernetes (K8s) is an orchestration framework for containers. Yawn. I mean, I get it, and I can see why they are transformative but it seems every time I meet someone who talks about K8s they are evangelical. Sometimes containers are the solution. Sometimes they are not. Many of my clients don’t even have a software development capability. Saying to an ISV “we’re going to containerise your app” is often not entertained. OK, I’ll get off my soapbox now.
One thing AWS has that I’ve never heard Azure folks talk about is the ability to deliberately inject chaos into your app or infrastructure – so the session on the AWS Fault Injection Service was very interesting. I particularly like the ideas of simulating an availability zone outage or a region outage to test how your app will really perform.
Amazon has a contact centre platform called Connect. I did not know that. Now I do. It sounds quite interesting, but I’m unlikely to need to do anything more with it at Node4 – Microsoft Teams and Cisco WebEx are our chosen platforms.
The security recap was… a load of security enhancements. I get it. And they seem to make sense but they are also exactly what I would expect to see.
Amazon Security Lake is an interesting concept, but I had to step out of that session. It did make me wonder if it’s just SIEM (like Microsoft Sentinel). Apparently not. ASL is a data lake/log management system not a SIEM service, so bring your own security analytics.
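To make the fault injection idea from the list above a little more concrete, here is a generic sketch in Python. It does not use the AWS Fault Injection Service API; it only shows the principle of deliberately making a dependency fail some of the time and checking that the caller copes. All of the names (`with_fault_injection`, `call_with_retry`, `fetch_price`) are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real outage during an experiment."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=5):
    """Retry a flaky call a bounded number of times, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_price():
    # Stand-in for a real downstream call (hypothetical).
    return 42


if __name__ == "__main__":
    rng = random.Random(2024)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.5, rng=rng)
    print(call_with_retry(flaky_fetch))
```

Setting `failure_rate` to 1.0 is the small-scale equivalent of simulating a complete outage: the retries run out and the error surfaces, which is exactly the behaviour you want to discover in a test rather than in production.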

In all, it was a really worthwhile investment of a day. I will follow up on some of the concepts in more detail – and I plan to write about them here. But I think the summary above is enough, for now.

This week’s reading, writing, watching and listening

I enjoy Jono Hey’s Sketchplanations. Unfortunately, when I was looking for one to illustrate the Sunk Cost Fallacy at the top of this post, I couldn’t find one. I did see he has a book coming out in a few months’ time though. You can pre-order it at the place that does everything from A-Z.

What I did find though, is a sketch that could help me use less passive voice in these blog posts:

OMG. For every blog post I write, the software tells me I use too much passive voice. This trick could really help. Thank you @sketchplanator! https://t.co/4jH7FM98zo
— Mark Wilson (@markwilsonit) February 7, 2024

Inspired by something I saw on the TV, and after I found my previous notes, some of my thoughts here grew into a post of their own: Anti-social media.

My wife and I finished watching Lessons In Chemistry on Apple TV this week. I commented previously that one of my observations was we still have a long way to go on diversity, inclusion and equality but we’ve come a long way since the 1950s. And then I read this, from the LA Times Archive, reporting on how a woman was jailed for contempt of court after the Judge took offence to her wearing “slacks”, in 1938.

This week in photos

Only one from my Instagram this week:

A post shared by Mark Wilson (@markwilsonuk)

This isn’t mine, but I love it…

While the brilliance of the Citroën 2CV is a foregone conclusion (or, here, a fourgonnette conclusion), I can't help thinking that this is perhaps the most French photo I've ever taken… pic.twitter.com/tW8NIIUaRG
— Dr Jonathan Kershaw (@jeckythump) February 4, 2024

Also:

Mercedes Benz CEO retires after 49 years and BMW released this video the moment it was announced. pic.twitter.com/D7iii6R5PJ
— Historic Vids (@historyinmemes) February 5, 2024

And what about this?

OMG! So many cars from my childhood here. Obvs the VW Kombi would be ace, but there’s a tasty Beemer in there too. Chuckling at the diagonal parking of the Volvo with the L plates! And what’s the story with the couple leaning on the blue Reliant 3-wheeler? https://t.co/iCc3RJwGDy
— Mark Wilson (@markwilsonit) February 7, 2024

This week at home

Putting home (and therefore family) at the end seems wrong, but the blog is about tech first, business second, and my personal life arguably shouldn’t feature so often.

The positive side of trying to be in the office at least a day or two a week is that I can do the school run. I may only have one “child” still at school but he’s learning to drive, so he can drive to school and I’ll continue to drive to work afterwards. He’s also driving to his hockey training and matches so it’s a good way to build experience before his driving test in a few months’ time.

Next week, my adult son (Matt) heads back to Greece for a couple of months’ cycle training. He’s also building new gravel/cyclocross bikes for later in the year, so “bits of bike” keep on appearing in the dining room… including some new wheels from one of the team sponsors, FFWD Wheels.

These have appeared in my dining room… which can only mean one thing… #VeloMatt is preparing for the road season pic.twitter.com/lohzQodIz1
— Mark Wilson (@markwilsonit) February 5, 2024

Meanwhile, my wife is very excited because Matt will be invited to Buckingham Palace to receive his Duke of Edinburgh Gold Award. He can take a guest, hence Mrs W’s excitement. Let’s just hope he’s in the country at the time.

I really should try and use the time whilst he’s away to get out on my own bike as my own fitness is not where it should be.

That’s all for this week. See you all around the same time next week?

Featured image: author’s own.

The 5 or 6 Rs of cloud transformation

Posted on Thursday 9 July 2020 By Mark Wilson

A few years ago, a couple of colleagues showed me something they had been working on – a “5 Rs” approach to classifying applications for cloud transformation. It was adopted for use in client engagements but I decided it needed to be extended – there was no “do nothing” option, so I added “Remain” as a 6th R.

I later discovered that my colleagues were not the first to come up with this model. When challenged, they maintained that it was an original idea (and I was convinced someone had stolen our IP when I saw it used by another IT services organisation!). Research suggests Gartner defined 5Rs in 2010 and both Microsoft and Amazon Web Services have since created their own variations (5Rs in the Microsoft Cloud Adoption Framework and 6Rs in Amazon Web Services’ Application Migration Strategies). I’m sure there are other variations too, but these are the main ones I come across.

For reference, this is the description of the 6Rs that we use where I work, at risual:

Replace (or repurchase) – with an equivalent software as a service (SaaS) application.
Rehost – move to IaaS (lift and shift). This is relatively fast, with minimal modification but won’t take advantage of cloud characteristics like auto-scaling.
Refactor (or replatform/revise) – decouple and move to PaaS. This may provide lower hosting and operational costs together with auto-scaling and high availability by default.
Redesign (or rebuild/rearchitect) – redevelop into a cloud-aware solution. For example, if a legacy application is providing good value but cannot be easily migrated, the application may be modernised by rebuilding it in the cloud. This is the most complicated approach and will involve creating a new architecture to add business value to the core application through the incorporation of additional cloud services.
Remain (or retain/revisit) – for those cases where the “do nothing” approach is appropriate although, even then, there may be optimisations that can be made to the way that the application service is provided.
Retire – for applications that have reached the end of their lifecycle and are no longer required.

Right now, I’m doing some work with a client who is looking at how to transform their IT estate and the 5/6Rs have come into play. To help my client, who is also working with both Microsoft and AWS, I needed to compare our version with Gartner’s, Microsoft’s and AWS’… and this is what I came up with:

risual	Gartner	Microsoft	AWS	Notes
Replace	Replace	Replace	Repurchase	Whilst AWS uses a different term, the approach is broadly similar – look to replace/repurchase existing solutions with a SaaS alternative: e.g. Office 365, Dynamics 365, Salesforce, WorkDay, etc.
Rehost	Rehost	Rehost	Rehost	All are closely aligned in thinking – rehost is the “lift and shift” option – based on infrastructure as a service (IaaS) – which is generally straightforward from a technical perspective but may not deliver the same long term benefits as other cloud transformation methods.
Refactor	Refactor	Refactor	Replatform	Refactoring generally involves the adoption of PaaS – for example making use of particular cloud frameworks, application hosting or database services; however this may be at the expense of portability between clouds. The exception is AWS, which uses refactor in a slightly different context and replatform for what is referred to as “lift, tinker and shift”.
	Revise			Gartner’s revise relates to modifying existing code before refactoring or rehosting. risual, Microsoft and AWS would all consider this as part of the refactoring/replatforming.
Redesign	Rebuild	Rebuild	Refactor/re-architect	Gartner defines rebuilding as moving to PaaS, rebuilding the solution and rearchitecting the application. AWS groups its definition of refactoring and rearchitecting, although the definition of refactor is closer to Microsoft/Gartner’s rebuild – adding features, scale, or performance that would otherwise be difficult to achieve in the application’s existing environment.
		Rearchitect		Microsoft makes the distinction between rebuilding (creating a new cloud-native codebase) and rearchitecting (looking for cost and operational efficiencies in applications that are cloud-capable but not cloud-native) – for example migrating from a monolithic architecture to a serverless architecture.
Remain			Retain/revisit	Perhaps because their application transformation strategies assume that there is always some transformation to be done, Gartner and Microsoft do not have a remain/retain option. This can be seen as the “do nothing” approach but, as AWS highlights, it’s really a revisit as the do nothing is a holding state. Maybe the application will be deprecated soon – or was recently purchased/upgraded and so is not a priority for further investment. It is likely to be addressed by one of the other approaches at some point in future.
Retire			Retire	Sometimes, an application has outlived its usefulness – or just costs more to run than it delivers in value, and should be retired. Neither Gartner nor Microsoft recognise this within their 5Rs.
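When working with a client who talks to more than one vendor, it can help to treat the comparison table above as a simple lookup. The sketch below encodes the rows as Python dictionaries, with `None` where a framework has no equivalent term; the `translate` helper is a hypothetical name for illustration, not part of any vendor tooling.

```python
# Rows of the comparison table above; None marks "no equivalent term".
R_TERMS = [
    {"risual": "Replace", "Gartner": "Replace", "Microsoft": "Replace", "AWS": "Repurchase"},
    {"risual": "Rehost", "Gartner": "Rehost", "Microsoft": "Rehost", "AWS": "Rehost"},
    {"risual": "Refactor", "Gartner": "Refactor", "Microsoft": "Refactor", "AWS": "Replatform"},
    {"risual": None, "Gartner": "Revise", "Microsoft": None, "AWS": None},
    {"risual": "Redesign", "Gartner": "Rebuild", "Microsoft": "Rebuild", "AWS": "Refactor/re-architect"},
    {"risual": None, "Gartner": None, "Microsoft": "Rearchitect", "AWS": None},
    {"risual": "Remain", "Gartner": None, "Microsoft": None, "AWS": "Retain/revisit"},
    {"risual": "Retire", "Gartner": None, "Microsoft": None, "AWS": "Retire"},
]


def translate(term, source, target):
    """Return the `target` framework's term for `term` in the `source` framework.

    Returns None when the target framework has no equivalent, and raises
    KeyError when the term is not one the source framework uses.
    """
    for row in R_TERMS:
        if row[source] == term:
            return row[target]
    raise KeyError(f"{term!r} is not a {source} term")


if __name__ == "__main__":
    print(translate("Refactor", "risual", "AWS"))
    print(translate("Remain", "risual", "Gartner"))
```

The lookup is deliberately literal: it will not resolve the overlap described in the notes column, where AWS uses "refactor" for something closer to everyone else's rebuild.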

Whichever 5 or 6Rs approach you take, it can be a useful way of categorising potential transformation opportunities and I’m often surprised how the exercise exposes services that are consuming resources, long after their usefulness has ended.

“Disaster Recovery” and related thoughts…

Posted on Wednesday 6 May 2020 By Mark Wilson

Backup, Archive, High Availability, Disaster Recovery, Business Continuity. All related. Yet all different.

One of my colleagues was recently faced with needing to run “a DR [disaster recovery] workshop” for a client. My initial impression was:

What disasters are they planning for?
I’ll bet they are thinking about Coronavirus and working remotely. That’s not really DR.
Or are they really thinking about a backup strategy?

So I decided to turn some of my rambling thoughts into a blog post. Each of these topics could be a post in its own right – I’m just scraping the surface here…

Let’s start with backup (and recovery)

Backups (of data) are a fairly simple concept. Anything that would create a problem if it was lost should be backed up. For example, my digital photos are considered to not exist at all unless they are synchronised (or backed up) to at least two other places (some network-attached storage, and the cloud).

In a business context, we run backups in order to be able to recover (restore) our content (configuration or data) within a given window. We may have weekly full backups and daily incremental or differential backups (perhaps with more regular snapshots), then retain parent, grandparent and great-grandparent copies of the full backups (four weeks) and keep each of these as (lunar) monthly backups for a year. That’s just an example – each organisation will have its own backup/retention policies and those backups may be stored on or off-site, on tape or disk.
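The example scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: full backups run on Sundays, every fourth full backup (by ISO week number) is the one promoted to a "lunar monthly" copy, and incrementals are kept for the same four weeks as the fulls they depend on. None of those details come from a real policy.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 6  # Sunday (assumption; Monday is 0)


def backup_type(day: date) -> str:
    """Weekly full backup on the chosen weekday, incrementals otherwise."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def retention(day: date) -> timedelta:
    """How long to keep the backup taken on `day` under the assumed policy."""
    if backup_type(day) == "incremental":
        # Assumed: kept as long as the weekly fulls they build on.
        return timedelta(weeks=4)
    week = day.isocalendar()[1]
    if week % 4 == 0:
        # Every fourth weekly full becomes the "lunar monthly" copy, kept for a year.
        return timedelta(days=365)
    # Parent, grandparent and great-grandparent fulls cover four weeks.
    return timedelta(weeks=4)


if __name__ == "__main__":
    for day in (date(2024, 2, 9), date(2024, 2, 11), date(2024, 2, 25)):
        print(day, backup_type(day), retention(day).days, "days")
```

Writing the policy down this way makes gaps obvious, for example what should happen in a year with 53 ISO weeks, which is the sort of question a real retention policy has to answer.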

In summary, backups are about making sure we have an up to date copy of our important configuration information and data, so we can recover it if the primary copy is lost or damaged.

And for bonus content, some services we might consider in a modern infrastructure context include Azure Backup or AWS Backup.

Backups must be verified and periodically tested in order to have any use.

Archiving information

When I wrote about backups above, I mentioned keeping multiple copies covering various points in time. Whilst some may consider this adequate for archival, archival is the storage of data for long-term preservation with read-only access – for example, documents that must be stored for an extended period of time (say 7, 10, 25 or 99 years). Once that would have been paper documents, in boxes. Now it might be digital files (or database contents) on tape or disk (potentially cloud storage).

Archival might still use backup software and associated retention policies, but we’ll think carefully about the medium we store it on. For very long term physical storage we might need to consider the media formats (paper is bulky and transferred to microfiche, or old magnetic media degrades, so it’s moved to optical storage – but the hardware becomes obsolete, so it’s moved to another format). If storing on disk (on-premises or in the cloud), we can use slower (cheaper) disks and accept that restoration from the archive may take additional time.

In summary, archival is about long-term data storage, generally measured in many years and archives might be stored off-line, or near-line.

Technologies we might use for archival are similar to backups, but we could consider lower-cost storage – e.g. Azure Storage’s Cool or Archive tiers or Amazon S3 Glacier.

Keeping systems highly available

High Availability (HA) is about making sure that our systems are available for as much time as possible – or certainly within a given service level agreement (SLA).

Traditionally, we used technologies like a redundant array of inexpensive disks (RAID) for storage, error-checking memory, or redundant power supplies. We might also have created server clusters or farms. All of these methods have the intention of removing single points of failure (SPOFs).

In the cloud, we leave a lot of the infrastructure considerations to the cloud service provider and we design for failure in other ways.

We assume that virtual machines will fail and create availability sets.
We plan to scale out across multiple hosts for applications that can take advantage of that architecture.
We store data in multiple regions.
We may even consider multiple clouds.

Again, the level of redundancy built into the app and its supporting infrastructure must be designed according to requirements – as defined by the SLA. There may be no point in providing an expensive four nines uptime for an application that’s used once a month by one person, who works normal office hours. But, then again, what if that application is business critical – like payroll? Again, refer to the SLA – and maybe think about business continuity too… more on that in a moment.

Some of my clients have tried to implement Windows Server clusters in Azure. I’ve yet to be convinced and still consider that it’s old-world thinking applied in a contemporary scenario. There are better ways to design a highly available file service in 2020.

In summary, high availability is about ensuring that an application or service is available within the requirements of the associated service level agreement.

Technologies might include some of the hardware considerations I listed earlier, but these days we’re probably thinking more about:

Azure Virtual Machine Availability Sets.
Azure Virtual Machine Scale Sets.
Elastic database pools in Azure SQL Database or Amazon RDS.
Autoscaling and other capabilities in Azure App Service.
Azure Traffic Manager (DNS), Load Balancer (layer 4) or Application Gateway (layer 7)/AWS Elastic Load Balancing (various options).
AWS/Azure Availability Zones/Regions (e.g. for data replication).
Multi-cloud architectures (but think carefully).

Remember to also consider other applications/systems upon which an application relies.

Also, quoting from some of Microsoft’s training materials:

“To achieve four 9’s (99.99%), you probably can’t rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing.
Beyond four 9’s, it is challenging to detect outages quickly enough to meet the SLA.
Think about the time window that your SLA is measured against. The smaller the window, the tighter the tolerances. It probably doesn’t make sense to define your SLA in terms of hourly or daily uptime.”
Microsoft Learn: Design for recoverability and availability in Azure: High Availability
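The point about measurement windows is easier to see with some arithmetic. The sketch below converts an availability percentage into a downtime allowance for a given window; `downtime_budget` is a hypothetical helper name. At four nines, the allowance works out at about 4.3 minutes over a 30-day month, about 8.6 seconds over a day and well under half a second over an hour, which is why hourly or daily windows make little sense.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Maximum downtime allowed in `window` at the given availability (%)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return window * (1 - availability_pct / 100)


if __name__ == "__main__":
    windows = [
        ("hour", timedelta(hours=1)),
        ("day", timedelta(days=1)),
        ("30-day month", timedelta(days=30)),
        ("year", timedelta(days=365)),
    ]
    for label, window in windows:
        seconds = downtime_budget(99.99, window).total_seconds()
        print(f"99.99% over one {label}: {seconds:.2f} seconds of downtime allowed")
```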

Disaster recovery

As the name suggests, Disaster Recovery (DR) is about recovering from a disaster, whatever that might be.

It could be physical damage to a piece of hardware (a switch, a server) that requires replacement or recovery from backup. It could be a whole server room or datacentre that’s been damaged or destroyed. It could be data loss as a result of malicious or accidental actions by an employee.

This is where DR plans come into play: firstly analysing the risks that might lead to disaster (including possible data loss and major downtime scenarios) and then looking at recovery objectives – the application’s recovery point objective (RPO) and recovery time objective (RTO).

Quoting Microsoft’s training materials again:

An illustration showing the duration, in hours, of the recovery point objective and recovery time objective from the time of the disaster.

“Recovery Point Objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume: “30 minutes of data”, “four hours of data”, and so on. RPO is about limiting and recovering from data loss, not data theft.
Recovery Time Objective (RTO): The maximum duration of acceptable downtime, where “downtime” needs to be defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.”
Microsoft Learn: Design for recoverability and availability in Azure: Disaster Recovery

For example, I may have a database that needs to be able to withstand no more than 15 minutes’ data loss and an associated SLA that dictates no more than 4 hours’ downtime in a given period. For that, my RPO is 15 minutes and the RTO is 4 hours. I need to make sure that I take snapshots (e.g. of transaction logs for replay) at least every 15 minutes and that my restoration process to get from offline to fully recovered takes no more than 4 hours (which will, of course, determine the technologies used).
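The database example above boils down to two comparisons, which the following sketch makes explicit. The `RecoveryPlan` class and `check_objectives` function are hypothetical names, and the restore time is whatever estimate you have measured in a test restore, not something the code can work out for you.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    snapshot_interval: timedelta        # how often recoverable copies are taken
    estimated_restore_time: timedelta   # measured time from offline to fully recovered


def check_objectives(plan: RecoveryPlan, rpo: timedelta, rto: timedelta) -> list[str]:
    """Return a list of problems; an empty list means the plan meets both objectives."""
    problems = []
    if plan.snapshot_interval > rpo:
        problems.append(
            f"snapshots every {plan.snapshot_interval} could lose more than the RPO of {rpo}"
        )
    if plan.estimated_restore_time > rto:
        problems.append(
            f"restore takes {plan.estimated_restore_time}, longer than the RTO of {rto}"
        )
    return problems


if __name__ == "__main__":
    plan = RecoveryPlan(timedelta(minutes=15), timedelta(hours=3))
    print(check_objectives(plan, rpo=timedelta(minutes=15), rto=timedelta(hours=4)))
```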

Considerations when creating a DR plan might include:

What are the requirements for each application/service?
How are systems linked – what are the dependencies between applications/services?
How will you recover within the required RPO and RTO constraints?
How can replicated data be switched over?
Are there multiple environments (e.g. dev, test and production)?
How will you recover from logical errors in a database that might impact several generations of backup, or that may have spread through multiple data replicas?
What about cloud services – do you need to backup SaaS data (e.g. Office 365)? (Possibly not, if you’re happy with a retention-period based restoration from a “recycle bin” or similar but what if an administrator deletes some data?)
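The question about dependencies in the list above lends itself to a small worked example: if you record what each service relies on, a topological sort gives an order in which to bring things back. The sketch uses Python's standard `graphlib` module (Python 3.9 or later); the service names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical services: each key depends on the services in its set.
DEPENDENCIES = {
    "directory": set(),
    "database": {"directory"},
    "payroll-app": {"database", "directory"},
    "reporting": {"database"},
}


def recovery_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every service comes after the services it relies on."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(recovery_order(DEPENDENCIES))
```

If two services depend on each other, `static_order` raises `graphlib.CycleError`, which is a useful finding in its own right for a DR plan.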

As can be seen, there are many factors here – more than I can go into in this blog post, but a disaster recovery strategy needs to consider backup/recovery, archive, availability (high or otherwise), technology and service (it may help to think about some of the ITIL service design processes).

In summary, disaster recovery is about having a plan to be able to recover from an event that results in downtime and data loss.

Technologies that might help include Azure Site Recovery. Applications can also be designed with data replication and recovery in mind, for example, using geo-replication capabilities in Azure Storage/Amazon S3, Azure SQL Database/Amazon RDS or using a globally-distributed database such as Azure Cosmos DB. And DR plans must be periodically tested.

Business continuity

Finally, Business Continuity (BC). This is something that many organisations will have had to contend with over the last few weeks and months.

BC is often confused with DR but they are different. Business continuity is about continuing to conduct business when something goes wrong. That may be how to carry on working whilst working on recovering from a disaster. Or it may be how to adapt processes to allow a workforce to continue functioning in compliance with social distancing regulations.

Again, BC needs a plan. But many of those plans will be reconsidered now – if your BC arrangements are that in the event of an office closure, people go to a hosted DR site with some spare equipment that will be made available within an agreed timescale, that might not help in the event of a global pandemic, when everyone else wants to use that facility. Instead, how will your workforce continue to work at home? Which systems are important? How will you provide secure remote access to those systems? (How will you serve customers whilst employees are also looking after children?) The list goes on.

Technology may help with BC, but technology alone will not provide a solution. The use of modern approaches to End User Computing will certainly make secure remote and mobile working a possibility (indeed, organisations that have taken a modern approach will probably already be familiar with those practices) but a lot of the issues will relate to people and process.

In summary, Business Continuity plans may be invoked if there is a disaster but they are about adapting business processes to maintain service in times of disruption.

Wrapping up

As I was writing this post, I thought about many tangents that I could go off and cover. I’m pretty sure the topic could be a book and this post scrapes the surface. Nevertheless, I hope my thoughts are useful and show that disaster recovery cannot be considered in isolation.

Weeknote 15/2020: a cancelled holiday, some new certifications and video conferencing fatigue

Posted on Sunday 12 April 2020 (updated Monday 13 April 2020) By Mark Wilson

Continuing the series of weekly blog posts, providing a brief summary of notable things from my week.

Cancelled holiday #1

I should have been in Snowdonia this week – taking a break with my family. Obviously that didn’t happen, with the UK’s social distancing in full effect but at least we were able to defer our accommodation booking.

It has been interesting though, being forced to be at home has helped me to learn to relax a little… there’s still a never-ending list of things that need to be done, but they can wait a while.

Learning and development

Last week, I mentioned studying for the AWS Cloud Practitioner Essentials Exam and this week saw me completing that training before attempting the exam.

It was my first online-proctored exam and I had some concerns about finding a suitable space. Even in a relatively large home (by UK standards), with a family of four (plus a dog) all at home, it can be difficult to find a room with a guarantee not to be disturbed. I’ve heard of people using the bathroom (and I thought about using my car). In the end, and thanks to some advice from colleagues – principally Steve Rush (@MrSteveRush) and Natalie Dellar (@NatalieDellar) – as well as some help from Twitter, I managed to cover the TV and some boxes in my loft room, banish the family, and successfully pass the test.

With exam 1 under my belt (I’m now an AWS Certified Cloud Practitioner), I decided to squeeze another in before the Easter break and successfully studied for, and passed, the Microsoft Power Platform Fundamentals exam, despite losing half a day to some internal sales training.

In both cases, I used the official study materials from Amazon/Microsoft and, although they were not everything that was needed to pass the exams, the combination of these and my experience from elsewhere helped (for example having already passed the Microsoft Azure Fundamentals exam meant that many of the concepts in the AWS exam were already familiar).

Thoughts on the current remote working situation

These should probably have been in last week’s weeknote (whilst it wasn’t the school holidays so we were trying to educate our children too) but recently it’s become particularly apparent to me that we are not living in times of “working from home” – this is “at home, during a crisis, trying to work”, which is very different:

I think I might have figured out why all of this “work from home” stuff is driving me crazy.

I’ve been a WFH employee for 10 years. But that’s not what’s happening now.

People are at home trying to work during a crisis.

That’s not the same as WFH.
— Chuck Gose (@chuckgose) April 2, 2020

Some other key points I’ve picked up include that:

Personal, physical and mental health is more important than anything else right now. (I was disappointed to find that even the local Police are referring to mythical time limits on allowed exercise here in the UK – and I’m really lucky to be able to get out to cycle/walk in open countryside from my home, unlike so many.)
We should not be trying to make up for lost productivity by working more hours. (This is particularly important for those who are not used to remote working.)
And, if you’re furloughed, use the time wisely. (See above re: learning and development!)

Video conference fatigue

Inspired by Matt Ballantine’s virally-successful flowchart of a few years ago, I tried sketching something. It didn’t catch on in quite the same way, but it does seem to resonate with people.

Video Conferencing Fatigue pic.twitter.com/8JSGWRWEJa
— Mark Wilson (@markwilsonit) April 8, 2020

In spite of my feelings on social video conferencing, I still took part in two virtual pub quizzes this week (James May’s was awful whilst Nick’s Pub Quiz continues to be fun) together with trans-Atlantic family Zooming over the Easter weekend…

Podcast backlog

Not driving and not going out for lunchtime solo dog walks has had a big impact on my podcast-listening…

I now need to schedule some time for catching up on The Archers and the rest of my podcasts!

Just looked in my Podcasts app and realised how long the backlog is… that’ll be the drop off in travel/lunchtime dog-walking taking effect. Nice excuse to relax in the afternoon sun
— Mark Wilson (@markwilsonit) April 11, 2020

Remote Work Survival Kit

In what spare time I’ve had, I’ve also been continuing to edit the Remote Work Survival Kit. It’s become a mammoth task, but there are relatively few updates arriving in the doc now. Some of the team have plans to move things forward, but I have a feeling it’s something that will never be “done”, will always be “good enough” and which I may step away from soon.

Possibly the best action film in the world…

My week finished with a family viewing of the 1988 film, “Die Hard”. I must admit it was “a bit more sweary” than I remembered (although nothing that my teenagers won’t already hear at school) but whilst researching the film classification it was interesting to read how it was changed from an 18 to a 15 with the passage of time…

Weeknote 14/2020: Podcasting, furlough and a socially-distanced birthday

Posted on Sunday 5 April 2020 (updated Tuesday 7 April 2020) By Mark Wilson

We’re living in strange times at the moment, so it seems as good an opportunity as ever to bring back my attempts to blog at least weekly with a brief precis of my week.

In the beginning

The week started as normal. Well, sort of. The new normal. Like everyone else in the UK, I’m living in times of enforced social distancing, with limited reasons to leave the house. Thankfully, I can still exercise once a day – which for me is either a dog walk, a run or a bike ride.

On the work front, I had a couple of conversations around potential client work, but was also grappling with recording Skills Framework for the Information Age (SFIA) skills for my team. Those who’ve known me since my Fujitsu days may know that I’m no fan of SFIA and it was part of the reason I chose to leave that company… but it seems I can’t escape it.

Podcasting

On Monday evening, I stood in for Chris Weston (@ChrisWeston) as a spare “W” on the WB-40 Podcast. Matt Ballantine (@Ballantine70) and I had a chat about the impact of mass remote working, and Matt quizzed me about retro computing. I was terrible in the quiz but I think I managed to sound reasonably coherent in the interview – which was a lot of fun!

This week is a bit different as @WB40Podcast is presented by @ballantine70 and @MarkWilsonIT as @ChrisWeston takes a well-deserved break. We also have a geeky history quiz… https://t.co/IczBBQVO5r pic.twitter.com/NLaz1GMZVI
— Matt Ballantine (@ballantine70) March 30, 2020

Furlough

A few weeks ago, most people in the UK would never have heard of “Furlough Leave”. For many, it’s become common parlance now, as the UK Government’s Job Retention Scheme becomes reality for hundreds of thousands, if not millions of employees. It’s a positive thing – it means that businesses can claim some cash from the Government to keep them afloat whilst staff who are unable to work due to the COVID-19/Coronavirus crisis restrictions are sent home. In theory, with businesses still liquid, we will all have jobs to go back to, once we’re allowed to return to some semblance of normality.

On Tuesday, I was part of a management team drawing up a list of potentially affected staff (including myself), based on strict criteria around individuals’ current workloads. On Wednesday it was confirmed that I would no longer be required to attend work for the next three weeks from that evening. I can’t provide any services for my employer – though I should stay in touch and personal development is encouraged.

Social distancing whilst shopping for immediate and extended family

So, Thursday morning, time to shop for provisions: stock is returning to the supermarket shelves after a relatively small shift in shopping habits completely disrupted the UK’s “just in time” supply chain. It’s hardly surprising as a nation prepared to stay in for a few weeks, with no more eating at school/work, no pubs/cafés/restaurants, and the media fuelling chaos with reports of “panic buying”.

Right now, after our excellent independent traders (like Olney Butchers), the weekly town market is the best place to go with plenty of produce, people keeping their distance, and fresh air. Unfortunately, with a family of four to feed (and elderly relatives to shop for too), it wasn’t enough – which meant trawling through two more supermarkets and a convenience store to find everything – and a whole morning gone. I’m not sure how many people I interacted with but it was probably too many, despite my best efforts.

Learning and development

With some provisions in the house, I spent a chunk of time researching Amazon Web Services certifications, before starting studying for the AWS Cloud Practitioner Essentials Exam. It should be a six hour course but I can’t speed up/slow down the video, so I keep on stopping and taking notes (depending on the presenter) which makes it slow going…

I did do some Googling though, and found that a combination of Soundflower and Google Docs could be used to transcribe the audio!

Struggling with AWS training: talking heads and dull content (plus stopping and starting video to take notes) so I tried Soundflower and Google Docs Voice Typing to create a transcription. It’s not bad for American presenters, less so for other dialects. No punctuation though… pic.twitter.com/uBawtXvROj
— Mark Wilson (@markwilsonit) April 3, 2020

I also dropped into a Microsoft virtual launch event for the latest Microsoft Business Applications (Dynamics 365 and Power Platform) updates. There’s lots of good stuff happening there – hopefully I’ll turn it into a blog post soon…

#NicksPubQuiz

Saturday night was a repeat of the previous week, taking part in “Nick’s Pub Quiz”. For those who haven’t heard of it – Nick Heath (@NickHeathSport) is a sports commentator who, understandably, is a bit light on the work front right now so he’s started running Internet Pub Quizzes, streaming on YouTube, for a suggested £1/person donation. Saturday night was his sixth (and my family’s second) – with over 1500 attendees on the live stream. Just like last week, my friend James and his family also took part (in their house) with us comparing scores on WhatsApp for a bit of competition!

Another COVID19 Saturday Night, another #NicksPubQuiz! https://t.co/gscLst9oK1
— Mark Wilson (@markwilsonit) April 4, 2020

Another year older

Ending the week on a high, Sunday saw my birthday arrive (48). We may not be able to go far, but I did manage a cycle ride with my eldest son, then back home for birthday cake (home-made Battenberg cake), and a family BBQ. And the sun shone. So, all in all, not a bad end to the week.

Birthday Battenberg pic.twitter.com/72U3Wpf9pF
— Mark Wilson (@markwilsonit) April 5, 2020

A logical view on a virtual datacentre services architecture

Posted on Monday 15 July 2019 By Mark Wilson

A couple of years ago, I wrote a post about a logical view of an End-User Computing (EUC) architecture (which provides a platform for Modern Workplace). It’s served me well and the model continues to be developed (although the changes are subtle so it’s not really worth writing a new post for the 2019 version).

Building on the original EUC/Modern Workplace framework, I started to think what it might look like for datacentre services – and this is something I came up with last year that’s starting to take shape.

Just as for the EUC model, I’ve tried to step up a level from the technology – to get back to the logical building blocks of the solution so that I can apply them according to a specific client’s requirements. I know that it’s far from complete – just look at an Azure or AWS feature list and you can come up with many more classifications for cloud services – but I think it provides the basics and a starting point for a conversation:

Logical view of a virtual datacentre environment

Starting at the bottom left of the diagram, I’ll describe each of the main blocks in turn:

Whether hosted on-premises, co-located or making use of public cloud capabilities, Connectivity is a key consideration for datacentre services. This element of the solution includes the WAN connectivity between sites, site-to-site VPN connections to secure access to the datacentre, Internet breakout and network security at the endpoints – specifically the firewalls and other network security appliances in the datacentre.
Whilst many of the SBBs in the virtual datacentre services architecture are equally applicable for co-located or on-premises datacentres, there are some specific Cloud Considerations. Firstly, cloud solutions must be designed for failure – i.e. to design out any elements that may lead to non-availability of services (or at least to fail within agreed service levels). Depending on the organisation(s) consuming the services, there may also be considerations around data location. Finally, and most significantly, the cloud provider(s) must practice trustworthy computing and, ideally, will conform to the UK National Cyber Security Centre (NCSC)’s 14 cloud security principles (or equivalent).
Just as for the EUC/Modern Workplace architecture, Identity and Access is key to the provision of virtual datacentre services. A directory service is at the heart of the solution, combined with a model for limiting the scope of access to resources. Together with Role Based Access Control (RBAC), this allows for fine-grained access permissions to be defined. Some form of remote access is required – both to access services running in the datacentre and for management purposes. Meanwhile, identity integration is concerned with integrating the datacentre directory service with existing (on-premises) identity solutions and providing SSO for applications, both in the virtual datacentre and elsewhere in the cloud (i.e. SaaS applications).
Data Protection takes place throughout the solution – but key considerations include intrusion detection and endpoint security. Just as for end-user devices, endpoint security covers such aspects as firewalls, anti-virus/malware protection and encryption of data at rest.
In the centre of the diagram, the Fabric is based on the US National Institute of Standards and Technology (NIST)’s established definition of essential characteristics for cloud computing.
The NIST guidance referred to above also defines three service models for cloud computing: Infrastructure as a Service (IaaS); Platform as a Service (PaaS) and Software as a Service (SaaS).
In the case of IaaS, there are considerations around the choice of Operating System. Supported operating systems will depend on the cloud service provider.
Many cloud service providers will also provide one or more Marketplaces with both first and third-party (ISV) products ranging from firewalls and security appliances to pre-configured application servers.
Application Services are the real reason that the virtual datacentre services exist, and applications may be web, mobile or API-based. There may also be traditional hosted server applications – especially where IaaS is in use.
The whole stack is wrapped with a suite of Management Tools. These exist to ensure that the cloud services are effectively managed in line with expected practices and cover all of the operational tasks that would be expected for any datacentre including: licensing; resource management; billing; HA and disaster recovery/business continuity; backup and recovery; configuration management; software updates; automation; management policies and monitoring/alerting.

If you have feedback – for example, a glaring hole or suggestions for changes, please feel free to leave a comment below.

Amazon Web Services (AWS) Summit: London Recap

Posted on Monday 24 June 2019 (updated Wednesday 26 June 2019) By Mark Wilson

I’ve written previously about the couple of days I spent at ExCeL in February, learning about Microsoft’s latest developments at the Ignite Tour and, a few weeks later I found myself back at the same venue, this time focusing on Amazon Web Services (AWS) at the London AWS Summit (four years since my last visit).

Even with a predominantly Microsoft-focused client base, there are situations where a multi-cloud solution is required and so, it makes sense for me to expand my knowledge to include Amazon’s cloud offerings. I may not have the detail and experience that I have with Microsoft Azure, but certainly enough to make an informed choice within my Architect role.

One of the first things I noticed is that, for Amazon, it’s all about the numbers. The AWS Summit had a lot of attendees – 12000+ were claimed, for more than 60 technical sessions supported by 98 sponsoring partners. Frankly, it felt to me that there were a few too many people there at times…

When 4000 people all head for the same 10m wide door (next up will be queue for the gents!) #AWSSummit #ConferenceLogistics pic.twitter.com/sKIbkaOZbs
— Mark Wilson (@markwilsonit) May 8, 2019

AWS is clearly growing – citing 41% growth comparing Q1 2019 with Q1 2018. And, whilst the comparisons with the industrial revolution and the LSE research that shows 95% of today’s startups would find traditional IT models limiting today were all good and valid, the keynote soon switched to focus on AWS claims of “more”. More services. More depth. More breadth.

I’m tired of rhetoric from cloud providers trying to demonstrate one-upmanship. Real customers don’t care about number of services: they want to know how to save money, increase agility whilst remaining secure and compliant. They can do that on AWS or Azure… select best fit…
— Mark Wilson (@markwilsonit) May 8, 2019

There were some good customer slots in the keynote: Sainsbury’s Group CIO Phil Jordan and Group Digital Officer Clodagh Moriarty spoke about improving online experiences, integrating brands such as Nectar and Sainsbury’s, and using machine learning to re-plan retail space and to plan online deliveries. Ministry of Justice CDIO Tom Read talked about how the MOJ is moving to a microservice-based application architecture.

After the keynote, I immersed myself in technical sessions. In fact, I avoided the vendor booths completely because the room was absolutely packed when I tried to get near. My afternoon consisted of:

Driving digital transformation using artificial intelligence by Steven Bryen (@Steven_Bryen) and Bjoern Reinke.
AWS networking fundamentals by Perry Wald and Tom Adamski.
Creating resilience through destruction by Adrian Hornsby (@adhorn).
How to build an Alexa Skill in 30 minutes by Andrew Muttoni (@muttonia).

All of these were great technical sessions – and probably too much for a single blog post but, here goes anyway…

Driving digital transformation using artificial intelligence

Amazon thinks that driving better customer experience requires Artificial Intelligence (AI), specifically Machine Learning (ML). Using an old picture of London Underground workers sorting through used tickets in the 1950s to identify the most popular journeys, Steven Bryen suggested that more data leads to better analytics and better outcomes that can be applied in more ways (in a cyclical manner).

The term “artificial intelligence” has been used since John McCarthy coined it in 1955. The AWS view is that AI is taking off because of:

Algorithms.
Data (specifically the ability to capture and store it at scale).
GPUs and acceleration.
Cloud computing.

Citing research from PwC [which I can’t find on the Internet], AWS claim that world GDP was $80Tn in 2018 and is expected to be $112Tn in 2030 ($15.7Tn of which can be attributed to AI).

Data science, artificial intelligence, machine learning and deep learning can be thought of as a series of concentric rings.

Machine learning can be supervised learning (getting better at finding targets); unsupervised (assume nothing and question everything); or reinforcement learning (rewarding high performing behaviour).

Amazon claims extensive AI experience through its own ML experience:

Recommendations Engine
Prime Air
Alexa
Go (checkoutless stores)
Robotic warehouses – taking trolleys to packer to scan and pack (using an IoT wristband to make sure robots avoid maintenance engineers).

Every day Amazon applies new AI/ML-based improvements to its business, at a global scale through AWS.

Challenges for organisations are that:

ML is rare
plus: Building and scaling ML technology is hard
plus: Deploying and operating models in production is time-consuming and expensive
equals: a lack of cost-effective easy-to-use and scalable ML services

Most time is spent getting data ready to get intelligence from it. Customers need a complete end-to-end ML stack and AWS provides that with edge technologies such as Greengrass for offline inference and modelling in SageMaker. The AWS view is that ML prediction becomes a RESTful API call.

With the scene set, Steven Bryen handed over to Bjoern Reinke, Drax Retail’s Director of Smart Metering.

Drax has converted former coal-fired power stations to use biomass: capturing carbon into biomass pellets, which are burned to create steam that drives turbines – representing 15% of the UK’s renewable energy.

Drax uses a systems thinking approach with systems of record, intelligence and engagement.

Systems of intelligence need:

Trusted data.
Insight everywhere.
Enterprise automation.

Customers expect tailoring: efficiency; security; safety; and competitive advantage.

Systems of intelligence can be applied to team leaders, front line agents (so they already know that customer has just been online looking for a new tariff), leaders (for reliable data sources), and assistant-enabled recommendations (which are no longer futuristic).

Fragmented/conflicting data is pumped into a data lake from where ETL and data warehousing technologies are used for reporting and visualisation. But Drax also pull from the data lake to run analytics for data science (using Inawisdom technology).

The data science applications can monitor usage and see base load, holidays, etc. Then, they can look for anomalies – a deviation from an established time series. This might help to detect changes in tenants, etc. and the information can be surfaced to operations teams.
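
To make “a deviation from an established time series” a little more concrete, here is a minimal sketch using a rolling mean and standard deviation. This is my own illustration and not the Drax/Inawisdom approach (which wasn’t described in detail); the function name, window size, threshold and sample readings are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return the indexes of readings that deviate sharply from recent history.

    Each reading is compared with the mean of the previous `window` readings
    and flagged if it is more than `threshold` standard deviations away.
    """
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Skip perfectly flat baselines to avoid dividing the world by zero.
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Made-up half-hourly consumption figures with one obvious spike.
usage = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 25.0, 10.0, 9.7]
print(find_anomalies(usage))  # [7]
```

A real system would also need to account for seasonality (base load, holidays, etc.), which a simple rolling window like this does not.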

Björn Reinke from @DraxBiomass speaking about analysing energy consumption and looking for anomalies to improve systems of intelligence #MachineLearning #AWSSummit pic.twitter.com/mQOP298kjQ
— Mark Wilson (@markwilsonit) May 8, 2019

AWS networking fundamentals

After hearing how AWS can be used to drive insight into customer activities, the next session was back to pure tech. Not just tech but infrastructure (albeit as a service). The following notes cover off some AWS IaaS concepts and fundamentals.

Customers deploy into virtual private cloud (VPC) environments within AWS:

For demonstration purposes, a private address range (CIDR) was used – 172.31.0.0/16 (a private IP range from RFC 1918). Importantly, AWS ranges should be selected to avoid potential conflicts with on-premises infrastructure. Amazon recommends using /16 (65536 addresses) but network teams may suggest something smaller.
AWS is dual-stack (IPv4 and IPv6) so even if an IPv6 CIDR is used, infrastructure will have both IPv4 and IPv6 addresses.
Each VPC should be broken into availability zones (AZs), which are risk domains on different power grids/flood profiles and a subnet placed in each (e.g. 172.31.0.0/24, 172.31.1.0/24, 172.31.2.0/24).
Each VPC has a default routing table but an administrator can create and assign different routing tables to different subnets.
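
The address arithmetic above can be checked with Python’s standard `ipaddress` module. This is just a sketch of the sums, not anything AWS-specific, and the availability zone names are placeholders of my own.

```python
import ipaddress

# The example VPC range from the session (an RFC 1918 private range).
vpc = ipaddress.ip_network("172.31.0.0/16")
print(vpc.num_addresses)  # 65536
print(vpc.is_private)     # True

# Carve the VPC into /24 subnets and place the first three in separate AZs.
# The AZ names are placeholders for illustration only.
az_names = ["az-a", "az-b", "az-c"]
subnets = dict(zip(az_names, vpc.subnets(new_prefix=24)))
for az, subnet in subnets.items():
    print(az, subnet)
```

The loop prints 172.31.0.0/24, 172.31.1.0/24 and 172.31.2.0/24, matching the example subnets above.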

To connect to the Internet you will need a connection, a route and a public address:

Create a public subnet (one with public and private IP addresses).
Then, create an Internet Gateway (IGW).
Finally, Create a route so that the default gateway is the IGW (172.31.0.0/16 local and 0.0.0.0/0 igw_id).
Alternatively, create a private subnet and use a NAT gateway for outbound only traffic and direct responses (172.31.0.0/16 local and 0.0.0.0/0 nat_gw_id).
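
The two route tables above work because the most specific matching route wins. The following is a minimal sketch of that lookup in plain Python; `igw_id` and `nat_gw_id` are the same placeholder targets used in the notes and `next_hop` is a hypothetical helper, not an AWS API.

```python
import ipaddress

# Route tables from the notes: destination CIDR -> target.
public_routes = {"172.31.0.0/16": "local", "0.0.0.0/0": "igw_id"}
private_routes = {"172.31.0.0/16": "local", "0.0.0.0/0": "nat_gw_id"}


def next_hop(route_table, destination):
    """Return the target of the most specific route that contains destination."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: the traffic would be dropped
    # Longest prefix wins, so the /16 local route beats the /0 default route.
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


print(next_hop(public_routes, "172.31.1.10"))  # local
print(next_hop(public_routes, "8.8.8.8"))      # igw_id
print(next_hop(private_routes, "8.8.8.8"))     # nat_gw_id
```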

Moving on to network security:

Security Groups (SGs) provide a stateful distributed firewall so a request from one direction automatically sets up permissions for a response from the other (avoiding the need to set up separate rules for inbound and outbound traffic).
- Using an example VPC with 4 web servers and 3 back end servers:
  - Group into 2 security groups
  - Allow web traffic from anywhere to web servers (port 80 and source 0.0.0.0/0)
  - Only allow web servers to talk to back end servers (port 2345 and source security group ID)
Network Access Control Lists (NACLs) are stateless – they are just lists and need to be explicit to allow both directions.
Flow logs work at instance, subnet or VPC level and write output to S3 buckets or CloudWatch logs. They can be used for:
- Visibility
- Troubleshooting
- Analysing traffic flow (no payload, just metadata)
  - Network interface
  - Source IP and port
  - Destination IP and port
  - Bytes
  - Condition (accept/reject)
DNS in a VPC is switched on by default for resolution and assigning hostnames (rather than just using IP addresses).
- AWS also has the Route 53 service for customers who would like to manage their own DNS.
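
The security group example in the list above (web servers open to the world on port 80, back end servers only reachable from the web servers on port 2345) can be modelled in a few lines. This is a toy sketch of the inbound rule evaluation only: it doesn’t model statefulness, and the group names and `inbound_allowed` helper are my own invention rather than anything from the AWS API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    port: int
    source: str  # either a CIDR or the name of another security group


# Hypothetical groups mirroring the example from the session.
SECURITY_GROUPS = {
    "web": [Rule(port=80, source="0.0.0.0/0")],
    "backend": [Rule(port=2345, source="web")],
}


def inbound_allowed(dest_group, port, source_ip, source_group=None):
    """Check whether any inbound rule on dest_group permits the traffic."""
    for rule in SECURITY_GROUPS[dest_group]:
        if rule.port != port:
            continue
        if rule.source in SECURITY_GROUPS:
            # The rule references another group: match on membership, not IP.
            if rule.source == source_group:
                return True
        elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source):
            return True
    return False


print(inbound_allowed("web", 80, "203.0.113.5"))                            # True
print(inbound_allowed("backend", 2345, "203.0.113.5"))                      # False
print(inbound_allowed("backend", 2345, "172.31.0.10", source_group="web"))  # True
```

Referencing the web group as the source (rather than a list of IP addresses) is what means the rules don’t need to change when web servers are added or removed.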

Finally, connectivity options include:

Peering for private communication between VPCs
- Peering is 1:1 and can be in different regions but the CIDR must not overlap
- Each VPC owner can send a request which is accepted by the owner on the other side. Then, update the routing tables on the other side.
- Peering can get complex if there are many VPCs. There is also a limit of 125 peerings so a Transit Gateway can be used to act as a central point but there are some limitations around regions.
- Each Transit Gateway can support up to 5000 connections.
AWS can be connected to on-premises infrastructure using a VPN or with AWS Direct Connect
- A VPN is established with a customer gateway and a virtual private gateway is created on the VPC side of the connection.
  - Each connection has 2 tunnels (2 endpoints in different AZs).
  - Update the routing table to define how to reach on-premises networks.
- Direct Connect
  - AWS services on public address space are outside the VPC.
  - Direct Connect locations have a customer or partner cage and an AWS cage.
  - Create a private virtual interface (VLAN) and a public virtual interface (VLAN) for access to VPC and to other AWS services.
  - A Direct Connect Gateway is used to connect to each VPC
- Before Transit Gateway customers needed a VPN per VPC.
  - Now they can consolidate on-premises connectivity
  - For Direct Connect it’s possible to have a single tunnel with a Transit Gateway between the customer gateway and AWS.
Route 53 Resolver service can be used for DNS forwarding on-premises to AWS and vice versa.
VPC Sharing provides separation of resources with:
- An Owner account to set up infrastructure/networking.
- Subnets shared with other AWS accounts so they can deploy into the subnet.
Interface endpoints make an API look as if it’s part of an organisation’s VPC.
- They override the public domain name for service.
- Using PrivateLink, it’s possible to expose only a specific service port, control the direction of communications, and no longer care about IP addresses.
Amazon Global Accelerator brings traffic onto the AWS backbone close to end users and then uses that backbone to provide access to services.
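
The point about peered VPCs needing non-overlapping CIDRs is easy to check before requesting a peering. A minimal sketch, assuming some made-up VPC names and address ranges:

```python
import ipaddress
from itertools import combinations

# Hypothetical VPC names and ranges, for illustration only.
vpcs = {
    "vpc-a": "172.31.0.0/16",
    "vpc-b": "10.0.0.0/16",
    "vpc-c": "172.31.128.0/17",
}


def overlapping_pairs(vpc_cidrs):
    """Return every pair of VPCs whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (first, second)
        for first, second in combinations(networks, 2)
        if networks[first].overlaps(networks[second])
    ]


print(overlapping_pairs(vpcs))  # [('vpc-a', 'vpc-c')]
```

Here vpc-a and vpc-c could not be peered with each other, but either could be peered with vpc-b.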

Creating resilience through destruction

Adrian Hornsby presenting at AWS Summit London

One of the most interesting sessions I saw at the AWS Summit was Adrian Hornsby’s session that talked about deliberately breaking things to create resilience – which is effectively the infrastructure version of test-driven development (TDD), I guess…

Actually, Adrian made the point that it’s not so much the issues that bringing things down causes as the complexity of bringing them back up.

“Failures are a given and everything will eventually fail over time”
Werner Vogels, CTO, Amazon.com

We may break a system into microservices to scale but we also need to think about resilience: the ability for a system to handle and eventually recover from unexpected conditions.

This needs to consider a stack that includes:

People
Application
Network and Data
Infrastructure

And building confidence through testing only takes us so far. Adrian referred to another presentation, by Jesse Robbins, where he talks about creating resilience through destruction.

Firefighters train to build intuition – so they know what to do in the event of a real emergency. In IT, we have the concept of chaos engineering – deliberately injecting failures into an environment:

Start small and build confidence:
- Application level
- Host failure
- Resource attacks (CPU, latency…)
- Network attacks (dependencies, latency…)
- Region attack
- Human attack (remove a key resource)
Then, build resilient systems:
- Steady state
- Hypothesis
- Design and run an experiment
- Verify and learn
- Fix
- (maybe go back to experiment or to start)
And use bulkheads to isolate parts of the system (as in shipping).

Think about:

Software:
- Certificate Expiry
- Memory leaks
- Licences
- Versioning
Infrastructure:
- Redundancy (multi-AZ)
- Use of managed services
- Bulkheads
- Infrastructure as code
Application:
- Timeouts
- Retries with back-offs (not infinite retries)
- Circuit breakers
- Load shedding
- Exception handling
Operations:
- Monitoring and observability
- Incident response
- Measure, measure, measure
- You build it, you run it
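
As an illustration of the “retries with back-offs (not infinite retries)” item above, here is a minimal sketch of a bounded retry with exponential back-off and jitter. It’s my own example rather than code from Adrian’s talk; the function name, defaults and the flaky test function are assumptions.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on failure with capped exponential back-off.

    The number of attempts is bounded, so a persistent failure is re-raised
    to the caller rather than being retried forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Double the ceiling each time (capped), then pick a random delay
            # below it so that many clients don't all retry at the same moment.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# A made-up dependency that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky_call))  # ok
```

In practice, this would catch only the exceptions that are known to be transient, and sit behind a timeout and a circuit breaker so that a struggling dependency isn’t hammered further.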

AWS’ Well-Architected Framework has been developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications, based on some of these principles.

Adrian then moved on to consider what a steady state looks like:

Normal behaviour of system
Business metric (e.g. pulse of Netflix – multiple clicks on play button if not working)
- Amazon extra 100ms load time led to 1% drop in sales (Greg Linden)
- Google extra 500ms of load time led to 20% fewer searches (Marissa Mayer)
- Yahoo extra 400ms of load time caused 5-9% increase in back clicks (Nicole Sullivan)

He suggests asking questions about “what if?” and following some rules of thumb:

Start very small
As close as possible to production
Minimise the blast radius
Have an emergency stop
- Be careful with state that can’t be rolled back (corrupt or incorrect data)

Use canary deployment with A-B testing via DNS or similar for chaos experiment (1%) or normal (99%).
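
The 1%/99% split can be sketched as a weighted choice per request. In practice this would be done with weighted DNS records or a load balancer rather than application code, so treat the following as an illustration of the idea only; the function name and the `enabled` flag (standing in for the emergency stop) are my own assumptions.

```python
import random


def choose_variant(rng, chaos_fraction=0.01, enabled=True):
    """Send a small fraction of requests to the chaos experiment.

    Setting enabled=False acts as an emergency stop: all traffic goes back
    to the normal path.
    """
    if enabled and rng.random() < chaos_fraction:
        return "chaos"
    return "normal"


rng = random.Random(42)  # seeded so that runs are repeatable
counts = {"chaos": 0, "normal": 0}
for _ in range(100_000):
    counts[choose_variant(rng)] += 1

print(counts)  # roughly 1% of requests land on the chaos path
```

Keeping the chaos fraction small is what limits the blast radius: if the experiment goes wrong, around 1 in 100 requests is affected rather than all of them.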

Adrian then went on to demonstrate his approach to chaos engineering, including:

Fault injection queries for Amazon Aurora (can revert immediately)
- Crash a master instance
- Fail a replica
- Disk failure
- Disk congestion
DDoS yourself
- ~ wrk -t12 -c400 -d30s http://ipaddress/api/health
Add latency to network
- ~ tc qdisc add dev eth0 root netem delay 200ms
https://github.com/Netflix/SimianArmy
- Shut down services randomly
- Slow down performance
- Check conformity
- Break an entire region
- etc.
The chaos toolkit
Gremlin
- Destruction as a service!
ToxiProxy
- Sit between components and add “toxics” to test impact of issues
Kube-monkey project (for Kubernetes)
Pumba (for Docker)
Thundra (for Lambda)

Use post mortems for correction of errors – the 5 whys. Also, understand that there is no isolated “cause” of an accident.

My notes don’t do Adrian’s talk justice – there’s so much more that I could pick up from re-watching his presentation. Adrian tweeted a link to his slides and code – if you’d like to know more, check them out:

And Up! the slides from my talk “Creating Resiliency Through Destruction” are online https://t.co/5PpYEYL5qe and the code https://t.co/03vbnVjlUk #AWSSummit #chaosengineering #AWS pic.twitter.com/HC4v0AqqIl
— Adrian Hornsby (@adhorn) May 8, 2019

How to build an Alexa Skill in 30 minutes

Spoiler: I didn’t have a working Alexa skill at the end of my 30 minutes… nevertheless, here’s some info to get you started!

Amazon’s view is that technology tries to constrain us. Things got better with mobile and voice is the next step forward. With voice, we can express ourselves without having to understand a user interface [except we do, because we have to know how to issue commands in a format that’s understood – that’s the voice UI!].

I get the point being made – to add an item to a to-do list involves several steps:

Find phone
Unlock phone
Find app
Add item
etc.

Or, you could just say (for example) “Alexa, ask Ocado to add tuna to my trolley”.

Alexa is a service in the AWS cloud that understands requests and acts upon them. There are two components:

Alexa voice service – how a device manufacturer adds Alexa to its products.
Alexa Skills Kit – to create skills that make something happen (and there are currently more than 80,000 skills available).

An Alexa-enabled device only needs to know to wake up, then stream some “mumbo jumbo” to the cloud, at which point:

Automatic speech recognition will translate speech to text
Natural language understanding will infer intent (not just text, but understanding…)

Creating skills requires two parts:

Voice user interface:
- developer.amazon.com
Programming logic (Amazon Web Services):
- aws.amazon.com

Alexa-hosted skills use Lambda under the hood and creating the skill involves:

Give the skill a name.
Choose the development model.
Choose a hosting method.
Create a skill.
Test in a simulation environment.
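For a feel of the programming logic behind a skill, here is a minimal Python sketch of a Lambda handler that reads the Alexa request JSON and returns a spoken response. It deliberately avoids the SDK and builds the response by hand. The intent name `AddItemIntent` and the slot name `item` are hypothetical and would have to match whatever is defined in the voice user interface.

```python
def build_response(text, end_session=True):
    """Wrap plain text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    request = event.get("request", {})
    # The skill was opened without a specific request.
    if request.get("type") == "LaunchRequest":
        return build_response("Welcome. What would you like to add?",
                              end_session=False)
    if request.get("type") == "IntentRequest":
        intent = request.get("intent", {})
        if intent.get("name") == "AddItemIntent":  # hypothetical intent
            item = intent.get("slots", {}).get("item", {}).get("value")
            if item:
                return build_response(f"I've added {item} to your list.")
            return build_response("Sorry, I didn't catch the item.",
                                  end_session=False)
    return build_response("Sorry, I can't help with that yet.")
```

The handler can be tried locally by calling `lambda_handler` with a dictionary shaped like the request before testing it in the simulation environment.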

Finally, some more links that may be useful:

Build and host your skill in the cloud: https://alexa.design/build.
Alexa Skills Kit SDK for Node.js: https://alexa.design/nodesdk.
Alexa CLI documentation: https://alexa.design/cli and https://bit.ly/cli-guide.
AWS promotional credits for Alexa: https://alexa.design/awspromo.

In summary

Looking back, the technical sessions made my visit to the AWS Summit worthwhile but overall, I was a little disappointed, as this tweet suggests:

End of day verdict on #AWSSummit: disappointing keynote (see https://t.co/BOFiQacH5f); good breakout sessions (at least the ones I went to); awful conference app (feedback is painfully slow); OK catering; too many people (so I skipped the expo). Learned lots but hoped for more… pic.twitter.com/QtUcaKqoAS
— Mark Wilson (@markwilsonit) May 8, 2019

Would I recommend the AWS Summit to others? Maybe. Would I watch the keynote from home? No. Would I try to watch some more technical sessions? Absolutely, if they were of the quality I saw on the day. Would I bother to go to ExCeL with 12000 other delegates herded like cattle? Probably not…

Do we need another as-a-service to describe functions?

Posted on Thursday 4 May 2017 By Mark Wilson


Last week saw quarterly earnings reports for major cloud vendors and this tweet caught my eye:

Synergy Research says “Amazon Cloud Growth is Hardly Hampered by the Chasing Pack” AWS in “a league of their own” after latest earnings pic.twitter.com/syqPS9rYRO

— Brandon Butler (@BButlerNWW) April 28, 2017

You see, despite Azure growing by 93%, this suggests that Amazon has the cloud market sewn up. Except I’m not sure they do…

I think it would be interesting to see this separated into infrastructure-, platform- and software-as-a-service (IaaS/PaaS/SaaS). I suggest that would present three very different stories. And I’d expect that Amazon would only really be way out front for IaaS.

My friend and former colleague, Garry Martin (@GarryMartin) questioned the relevance of those “legacy” distinctions but I think they still have value today.

@markwilsonit Depends how relevant you think those are; legacy differentiations and language. AWS isn’t just IaaS; take a look at AWS Lambda for example

— Garry Martin (@GarryMartin) April 28, 2017

In the early days of what we now recognise as cloud computing, every vendor was applying their own brand of cloud-washing. It still happens today, with vendors claiming to offer IaaS when really they have a hosted service and a traditional delivery model.

Back in 2011, the US National Institute of Standards and Technology (NIST) defined cloud computing, including the service models of IaaS, PaaS and SaaS. Those service models, along with the (also abused) deployment models (public cloud, private cloud, etc.) have served us well but are they really legacy?

I don’t think they are. Six years is a long time in IT, let alone the cloud but I think IaaS, PaaS and SaaS are as relevant today as they were when NIST wrote their definition.

When asked how “serverless” technologies like AWS Lambda, Azure Functions or Google Cloud Functions fit in, I say they’re just PaaS. Done right.

Some people want to add another service model/definition for Function-as-a-Service (FaaS). But why? What value does it add? Functions are just PaaS but we’ve finally evolved to a place where we are moving past the point of caring about what the code runs on and letting the cloud manage that for us. That’s what PaaS is supposed to have been doing for years (after all, should I really need to define a number of instances to run my web application – that all sounds a bit like virtual machines to me…)

To my mind, “serverless” is just the ultimate platform as a service and we really don’t need another service model to describe it.

To quote a haiku from Onsi Fakhouri (@onsijoe):

“Here is my code
Run it in the cloud for me
I don’t care how”

Or, as Simon Wardley (@swardley) “fixed” this Cloud Foundry diagram:

@wattersjames FTFY … :-) pic.twitter.com/sy1VcdtUwK

— swardley (@swardley) April 21, 2017

Designing for failure does not necessarily mean multi-cloud

Posted on Thursday 2 March 2017 (updated Friday 3 March 2017) By Mark Wilson

Earlier this week, Amazon Web Services’ S3 storage service suffered an outage that affected many websites (including popular sites to check if a website is down for everyone or just you!).

S3 is experiencing high error rates. We are working hard on recovering.

— Amazon Web Services (@awscloud) February 28, 2017

Unsurprisingly, this led to a lot of discussion about designing for failure – or not, it would seem in many cases, including the architecture behind Amazon’s own status pages:

The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.

— Amazon Web Services (@awscloud) February 28, 2017

The Amazon and Azure models are slightly different but in the past we’ve seen outages to the Azure identity system (for example) impact on other Microsoft services (Office 365). When that happened, Microsoft’s Office 365 status page didn’t update because of a caching/CDN issue. It seems Amazon didn’t learn from Microsoft’s mistakes!

Randy Bias (@RandyBias) is a former Director at OpenStack and a respected expert on many cloud concepts. Randy and I exchanged many tweets on the topic of the AWS outage but, after multiple replies, I thought a blog post might be more appropriate. You see, I hold the view that not all systems need to be highly available. Sometimes, failure is OK. It all comes down to requirements:

@randybias Depends what the system is. Not everything needs to be highly available. There’s a requirements/cost/risk trade-off

— Mark Wilson (@markwilsonit) March 1, 2017

And, as my colleague Tim Siddle highlighted:

@markwilsonit DR/multi-region/multi-cloud is expensive – and it’s always a requirement, until the cost is laid bare…..

— Tim Siddle (@tim_siddle) February 28, 2017

I agree. 100%.

@markwilsonit and of course, much depends on the application architecture itself

— Tim Siddle (@tim_siddle) February 28, 2017

So, what does that architecture look like? Well, it will vary according to the provider:

For AWS we need to think about regions and availability zones. Each region is made up of a number of availability zones (at least two according to the AWS glossary).
For Azure there are more regions and these are paired for availability (for example when using geo-redundant storage). In addition, each region will consist of multiple datacentre facilities.

So, if we want to make sure our application can survive a region failure, there are ways to design around this. Just be ready for the solution we sold to the business based on using commodity cloud services to start to look rather expensive. On-premises we typically have two datacentres with resilient connections, and we’ll want to do the same in the cloud. But, just as not all systems are in all datacentres on-premises, that might also be the case in the cloud. If it’s a service for which some downtime can be tolerated, then we might not need to worry about a multi-region architecture. In cases where we’re not at all concerned about downtime we might not even use an availability set…
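To put some rough numbers on that trade-off, the Python sketch below works out how much downtime a given availability allows and what a second copy buys. The 99.9% figure is purely illustrative (it is not a quoted SLA) and the calculation assumes the copies fail independently, which real regions and zones only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, assuming
    independent failures and that any one copy can serve traffic."""
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


for copies in (1, 2):
    availability = combined_availability(0.999, copies)  # illustrative figure
    minutes = downtime_minutes_per_year(availability)
    print(f"{copies} copy/copies: {availability:.6f} "
          f"(about {minutes:.1f} minutes of downtime a year)")
```

On these assumptions one copy allows a little under nine hours of downtime a year and two copies allow less than a minute, which is the gap the business has to decide whether to pay for.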

Other times – i.e. if the application is a web service for which an outage would cause reputational or financial damage – we may have a requirement for higher availability. That’s where so many of the services impacted by Tuesday’s AWS outage went wrong:

No one claims 100% up time, FOR A REASON

— Jeorry Balasabas (@jeorryb) February 28, 2017

And understand it when designing cloud solutions, still your responsibility to deliver resilience, can’t abdicate that to someone else https://t.co/VGdunBJqSH

— Paul Stringfellow (@techstringy) February 28, 2017

Amazon’s S3 outage is not just a case of getting what you paid for it’s also about getting what you designed for. Availability isn’t cheap.

— Mark Twomey (@Storagezilla) February 28, 2017

Of course, we might spread resources around regions for other reasons too – like placing them closer to users – but that comes back to my point about requirements. If there’s a requirement for fast, low-latency access then we need to design in the dedicated links (e.g. AWS Direct Connect or Azure ExpressRoute) and we’ll probably have more than one of them too, each terminating in a different region, with load balancers and all sorts of other considerations.

Because a cloud provider could be one of those single points of failure, many people are advocating multi-cloud architectures. But, if you think multi-region is expensive, get ready for some seriously complex architecture and associated costs in a multi-cloud environment. Just as in the on-premises world, many enterprises use a single managed services provider (albeit with multiple datacentres), in the cloud many of us will continue to use a single cloud provider. Designing for failure does not necessarily mean multi-cloud.

Of course, a single-cloud solution has its risks. Randy is absolutely spot on in his reply below:

@markwilsonit Public clouds are walled gardens and create significant points of lock-in. Long term AWS is no different than Oracle software.

— Randy Bias (@randybias) March 1, 2017

It could be argued that one man’s “lock-in” is another’s “making the most of our existing technology investments”. If I have a Microsoft Enterprise Agreement, I want to make sure that I use the software and services that I’m paying for. And running a parallel infrastructure on another cloud is probably not doing that. Not unless I can justify to the CFO why I’m running redundant systems just in case one goes down for a few hours.

That doesn’t mean we can avoid designing with the future in mind. We must always have an exit strategy and, where possible, think about designing systems with a level of abstraction to make them cloud-agnostic.

Ultimately though it all comes back to requirements – and the ability to pay. We might like an Aston Martin but if the budget is more BMW then we’ll need to make some compromises – with an associated risk, signed off by senior management, of course.

[Updated 2 March 2017 16:15 to include the Mark Twomey tweet that I missed out in the original edit]

Short takes: Amazon Web Services 101, Adobe Marketing Cloud and Milton Keynes Geek Night (#MKGN)

Posted on Friday 7 December 2012 (updated Wednesday 20 March 2013) By Mark Wilson


What a crazy week. On top of a busy work schedule, I’ve also found myself at some tech events that really deserve a full write-up but, for now, will have to make do with a summary…

Amazon Web Services 101

One of the events I attended this week was a “lunch and learn” session to give an introduction/overview of Amazon Web Services – kind of like a breakfast briefing, but at a more sociable hour of the day!

I already blogged about Amazon’s reference architecture for utility computing but I wanted to mention Ryan Shuttleworth’s (@RyanAWS) explanation of how Amazon Web Services (AWS) came about.

Contrary to popular belief, AWS didn’t grow out of spare capacity in the retail business. It came from building a service-oriented infrastructure for a scalable development environment, initially to provide development services to internal teams and then to expose the Amazon catalogue as a web service. Over time, Amazon found that developers were hungry for more and they moved towards the AWS mission to:

“Enable business and developers to use web services* to build scalable, sophisticated applications”

*What people now call “the cloud”

In fact, far from being the catalyst for AWS, Amazon’s retail business is just another AWS customer.

Adobe Marketing Cloud

Most people will be familiar with Adobe for their design and print products, whether that’s Photoshop, Lightroom, or a humble PDF reader. I was invited to attend an event earlier this week to hear about the Adobe Marketing Cloud, which aims to become for marketers what the Creative Suite has become for design professionals. Whilst the use of “cloud” grates with me as a blatant abuse of a buzzword (if I’m generous, I suppose it is a SaaS suite of products…), Adobe has been acquiring companies (I think I heard $3bn mentioned as the total cost) and integrating technology to create a set of analytics, social, advertising, targeting and web experience management solutions and a real-time dashboard.

Milton Keynes Geek Night

The third event I attended this week was the quarterly Milton Keynes Geek Night (this was the third one) – and this did not disappoint – it was well up to the standard I’ve come to expect from David Hughes (@DavidHughes) and Richard Wiggins (@RichardWiggins).

The evening kicked off with Dave Addey (@DaveAddey) of UK Train Times app fame, talking about what makes a good mobile app. Starting out from a 2010 Sunday Times article about the app gold rush, Dave explained why few people become smartphone app millionaires, and how to test your own idea:

Is your mobile app idea really a good idea? (i.e. is it universal, is it international, and does it have lasting appeal – or, put bluntly, will you sell enough copies to make it worthwhile?)
Is it suitable to become a mobile app? (will it fill “dead time”, does it know where you go and use that to add value, is it “always there”, does it have ongoing use)
And how should you make it? (cross platform framework, native app, HTML, or hybrid?)

Dave’s talk warrants a blog post of its own – and hopefully I’ll return to the subject one day – but, for now, those are the highlights.

Next up were the 5 minute talks, with Matt Clements (@MattClementsUK) talking about empowering business with APIs to:

Increase sales by driving traffic.
Improve your brand awareness by working with others.
Increase innovation, by allowing others to interface with your platform.
Create partnerships, with symbiotic relationships to develop complementary products.
Create satisfied customers – by focusing on the part you’re good at, and let others build on it with their expertise.

Then Adam Onishi (@OnishiWeb) gave a personal, and honest, talk about burnout, its effects, recognising the problem, and learning to deal with it.

And Jo Lankester (@JoSnow) talked about real-world responsive design and the lessons she has learned:

Improve the process – collaborate from the outset.
Don’t forget who you’re designing for – consider the users, in which context they will use a feature, and how they will use it.
Learn to let go – not everything can be perfect.

Then, there were the usual one-minute slots from sponsors and others with a quick message, before the second keynote – from Aral Balkan (@Aral), talking about the high cost of free.

In an entertaining talk, loaded with sarcasm, profanity (used to good effect) but, most of all, intelligent insight, Aral explained the various business models we follow in the world of consumer technology:

Free – with consequential loss of privacy.
Paid – with consequential loss of audience (i.e. niche) and user experience.
Open – with consequential loss of good user experience, and a propensity to allow OEMs and operators to mess things up.

This was another talk that warrants a blog post of its own (although I’m told the session audio was recorded – so hopefully I’ll be able to put up a link soon) but Aral moved on to talk about a real alternative with mainstream consumer appeal that happens to be open. To achieve this, Aral says we need a revolution in open source culture in that open source and great user experience do not have to be mutually exclusive. We must bring design thinking to open source. Design-led open source. Without this, Aral says, we don’t have an alternative to Twitter, Facebook, whatever-the-next-big-platform-is doing what they want to with our data. And that alternative needs to be open. Because if it’s just free, the cost is too high.

The next MK Geek Night will be on 21 March, and the date is already in my diary (just waiting for the Eventbrite notice!)

Photo credit: David Hughes, on Flickr. Used with permission.
