No one can deny the Office 365 momentum that continues to build in the industry as more organizations adopt cloud collaboration solutions. A recent survey by Gartner shows that 78% of organizations are either on Office 365 or are planning on using it in the next 6 months. Couple that with Microsoft saying that over 50,000 organizations adopt Office 365 each month and it’s clear that more users than ever rely on Office 365. So it’s logical that any service failure will impact a larger number of users. Given the timing of the June 30th event that caused mail delays, I’m sure the productivity impact on customers was painful.
What went wrong?
According to the Microsoft EX71674 Post Incident Report, the event was first noticed by customers at 9:18 am Eastern Time and fully resolved at 7:30 pm Eastern Time. That’s a time frame of approximately 10 hours during which inbound and outbound email with external parties was delayed. Microsoft states that inter-company or inter-tenant messages were not delayed. It appears that most of the mail was delayed, causing queues to build that ultimately made customers and Microsoft aware of the problem. Microsoft did confirm that some non-Office 365 users received Non-Delivery Reports (NDRs) for some messages.
Looking at sites like Reddit, you can get a sense of how the queues started to build. One user says, “I'm curious, how many emails does everyone have queued up at their local gateways? I'm at ~200,000.” The reason for the delay in mail flow was a problem with Exchange Online Protection (EOP). EOP is responsible for checking mail for spam and malware and, according to Microsoft, had recently been updated. Unfortunately, the update slowed the EOP message filtering services and messages started to queue. Office 365 engineers addressed the problem by restarting message transport services, routing connectivity to alternate infrastructure, increasing capacity and ultimately making a configuration change to optimize the message filtering services code. The event impacted Office 365 customers across the U.S. but did not impact other regions.
What are the takeaways from this event?
As Tony Redmond writes in his article, Exchange Online Protection Falls Over, EOP is potentially a weak component of Office 365. He also rightly points out that this isn’t the first time we’ve seen an incident like this and provides an Azure Active Directory failure as an example. Office 365 is a broad suite that requires a complex set of infrastructure and services to work together. This complexity makes it difficult to pinpoint and diagnose a problem and can leave end users impacted for multiple hours.
It’s also clear that outages differ in their impact. An outage at 11 pm can impact some users, but given the time, it’s likely only a small pool. In the case of the June 30th outage, it was the last day of the month, the last day of the fiscal quarter for many companies and the middle of the work day for U.S. customers. All organizations need to determine how much downtime is acceptable for their unique requirements and business. Then it’s up to them to have the necessary solutions in place to stay within that limit.
How did Mimecast customers keep email running?
The Mimecast Mailbox Continuity solution is designed to spring into action when there is a problem with Office 365 or an on-premises mail server. The Mimecast service can be used by both administrators and employees, and in this case we saw organizations that used both methods. Mimecast doesn’t just spool email; it lets employees keep sending and receiving mail using Outlook, mobile applications or a web portal that keeps remote users connected when Outlook isn’t an option.
During the June 30th event, it was clear that Mimecast customers relied on our solution to keep their businesses running. There were a number of instances where over 50% of a customer’s employees used the Mimecast portal to stay connected. There were also instances where hundreds of employees turned to Mimecast as Office 365 failed to deliver messages. In addition to the portal, administrators initiated continuity events, allowing employees to keep working right in Outlook. This is completely transparent to the end user, and after the event is over, Mimecast automatically syncs messages and deletes any duplicates, so there are no extra steps for employees or admins.
It should be noted that Mimecast employees are among the now 70 million Office 365 users. It’s a phenomenal service but as with any cloud provider, it will likely have bad days – just like June 30th, a really bad time to have a bad day. Mimecast helps over 18,000 organizations manage the risks of email, including continuity, each day. We’re proud to help our customers make sure this important communication channel remains protected and available.