Recently, we have encountered a number of incidents affecting the website and APIs (including, at one point, some significant downtime), caused by clustering issues on our API servers. Firstly, we would like to apologise to everyone who was affected. We would also like to take this opportunity to explain what the issue was, what we have done to fix it, and what future work we will be undertaking to improve the stability of the platform.
The first time this issue reared its ugly head was around 7pm on a Friday night, just (typically) when the Ops team were about to call it a night. A number of our alarms were triggered within seconds of each other and we all hurriedly got back to our machines to work on identifying the problem.
As the issue recurred a couple of times before the fix was deployed, it could have appeared in several different ways to anyone using the import•io platform.
First, upon total cluster failure, the servers were unable to recover via auto-scaling, even though all instances were reporting as out of service; this resulted in the HTTP 503s and downtime we saw.
Second, a partial cluster failure was also possible. In this case we had to launch a new cluster and transition over to it using the elastic load balancers, resulting in a period of partial instability on the APIs, followed by everyone being logged out when we switched from the old, partially unavailable cluster to the new one.
Finally, in one occurrence, the auto-scaling policy happened to take out the right node at the right time, and the cluster recovered on its own having failed only a few API calls.
What was particularly puzzling about this issue was that it started out of the blue – we had not shipped any release to the API servers at all for a few weeks, and it did not correlate with any particular increase in request or query volume.
In addition to the general platform problems, integrators using a few of our client libraries experienced a bug triggered when our session cache was reset (which is what happened when we transitioned from the old cluster to the new one). The bug caused the client to enter an infinite loop of requests after losing a previously valid auth cookie when using username-and-password authentication.
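For illustration, the fix on the client side amounts to bounding retries and backing off instead of looping forever. The sketch below is hypothetical – the class and method names are ours for this example, not the actual import•io client library API – but it shows the general shape of a bounded retry with exponential backoff and one re-authentication per failed attempt:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of a bounded-retry wrapper; names are illustrative,
// not part of any real import.io client library.
public class BoundedRetry {
    public static class AuthException extends Exception {}

    // Attempts the call up to maxAttempts times. On an auth failure it
    // re-authenticates once and backs off, rather than retrying in a
    // tight, unbounded loop against a stale auth cookie.
    public static <T> T withRetry(Callable<T> call, Runnable reauthenticate,
                                  int maxAttempts) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts < 1");
        Exception last = null;
        long backoffMs = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (AuthException e) {
                last = e;
                reauthenticate.run();   // refresh the session once per failure
                Thread.sleep(backoffMs);
                backoffMs *= 2;         // exponential backoff caps the request rate
            }
        }
        throw last;                     // give up after maxAttempts
    }
}
```

The key property is that a reset session cache can cost at most `maxAttempts` requests per call, instead of an unbounded stream of retries.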
The root cause
Following significant efforts to reproduce the issue in our staging environment (it took some time, and in the end we only managed to reproduce it a few times), in addition to the occurrences in production, we were able to identify the root cause.
The root cause of the failure state was one of the servers running out of PermGen space in the JVM. If you are unfamiliar with JVM memory profiles and wish to find out more about PermGen, I strongly recommend you check out section “3. Generations” in this Oracle article.
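For context, on Java 7 and earlier the permanent generation is sized with its own JVM flags, separate from the main heap, so it can fill up even when plenty of heap remains. A typical set of options looks like the following (the sizes here are illustrative, not our production values):

```
-XX:PermSize=128m                 # initial PermGen size
-XX:MaxPermSize=256m              # hard cap; exceeding it raises OutOfMemoryError: PermGen space
-XX:+HeapDumpOnOutOfMemoryError   # capture a dump for post-mortem analysis
```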
Once a server runs out of PermGen space, the JVM is killed almost instantly. Hazelcast – the clustering technology we use to sync sessions, caches and queues between our API servers – was not responding quickly enough to these nodes being removed, taking up to five minutes to realise a node had disconnected from the cluster. Until then, the lack of responses to Hazelcast requests caused all kinds of API calls and queries on the other nodes in the cluster to fail.
The issue at this point was compounded by incorrect Hazelcast configuration, with some cluster recovery timeouts set too high to work effectively in conjunction with our other health checks and auto-scaling configuration.
Finally, our EC2 auto-scaling configuration could not respond quickly enough to all the servers failing at once to bring the APIs back on its own. Complete cluster failure is a rather unusual error state, but even so, the auto-scaling configuration was neither giving servers enough time to start nor killing failed ones quickly enough.
The fix
Given the scope of the separate problems noted above, which unfortunately combined to bring about our recent issues, we have introduced a number of changes.
Firstly, we have increased the Java PermGen space. We also plan to introduce dedicated monitoring of our PermGen capacity, so that we are alerted if it ever approaches the limit again, and we have added PermGen to the list of Ops configurations that are regularly reviewed.
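Monitoring of this kind can be built on the JVM's own JMX memory-pool beans. The sketch below is generic, not our production monitoring: it looks up a memory pool by name and reports its used/max ratio. (Note that Java 8 and later replaced PermGen with Metaspace, which has no maximum unless one is configured.)

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Illustrative sketch: reading JVM memory-pool usage via JMX so an alert
// can fire before PermGen (or Metaspace on Java 8+) fills up.
public class PermGenMonitor {
    // Returns the used/max ratio for the first pool whose name contains
    // the given fragment and exposes a maximum size, or -1 otherwise.
    public static double usageRatio(String poolNameFragment) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains(poolNameFragment)) {
                MemoryUsage u = pool.getUsage();
                if (u.getMax() > 0) return (double) u.getUsed() / u.getMax();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Perm Gen" on Java 7 and earlier; "Metaspace" on Java 8+.
        for (String name : new String[] {"Perm Gen", "Metaspace"}) {
            double ratio = usageRatio(name);
            if (ratio >= 0) {
                System.out.printf("%s usage: %.0f%%%n", name, ratio * 100);
                if (ratio > 0.8) System.out.println("WARN: " + name + " above 80%");
            }
        }
    }
}
```

In practice a check like this would run on a schedule and feed an alerting system, rather than printing to stdout.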
We have upgraded Hazelcast to the latest version of their 2.x branch (this is what required a full cluster restart; Hazelcast is only minor-version wire-compatible from 3.1 onwards) and reviewed the available documentation on its configuration options. We carefully tuned and tested the settings against a number of failure-case tests in our staging environment, and we now have the most aggressive settings possible to enable rapid recovery from any such issues in the future.
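To give a flavour of the tuning involved: Hazelcast's failure detection is driven by group properties such as the heartbeat interval and the maximum time a member may go without heartbeating before it is evicted from the cluster. The property names below appear in Hazelcast's documentation; the values are purely illustrative, not our production settings:

```xml
<hazelcast>
  <properties>
    <!-- How often members send heartbeats to each other -->
    <property name="hazelcast.heartbeat.interval.seconds">1</property>
    <!-- Evict a member after this long without a heartbeat,
         rather than waiting several minutes (illustrative value) -->
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
  </properties>
</hazelcast>
```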
We have also re-tuned our load balancer configuration to make sure we can cycle servers quickly enough to address this problem in the future, and changed the health checks to include Hazelcast's state when assessing a server's health. This eliminates issues whereby some servers are up and working while others report being alive but are actually experiencing cluster problems and causing API calls to error.
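A health check of this kind can be sketched as a small HTTP endpoint that reports unhealthy unless both the application and the cluster layer are up. This is a hypothetical, self-contained example using the JDK's built-in HTTP server – in a real deployment the `clusterRunning` supplier would wrap something like Hazelcast's lifecycle service rather than a stub:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a load-balancer health endpoint that also
// considers cluster state, so a node whose clustering layer has failed
// is taken out of rotation even though its process is still alive.
public class HealthCheck {
    // Pure decision: healthy only if the app AND the cluster layer are up.
    public static int statusCode(boolean appUp, boolean clusterRunning) {
        return (appUp && clusterRunning) ? 200 : 503;
    }

    public static HttpServer start(int port, BooleanSupplier appUp,
                                   BooleanSupplier clusterRunning) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", (HttpExchange ex) -> {
            int code = statusCode(appUp.getAsBoolean(), clusterRunning.getAsBoolean());
            byte[] body = (code == 200 ? "OK" : "UNHEALTHY").getBytes();
            ex.sendResponseHeaders(code, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }
}
```

The load balancer then simply polls `/health` and removes a node from rotation on any non-200 response.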
Lastly, we are fast-tracking our client library upgrades, which have been in the works for some time, bringing all of the libraries up to the standard that high-volume, complex integrations demand. This is currently scheduled for our next major release, O’Neill.
The future
Going forward, the import•io platform and operations teams are working to increase overall platform reliability by making significant changes to our API server architecture, including splitting our “standard” API servers out from the servers that handle data source queries for clients. This will give us a highly fault-tolerant regular API and website, alongside a rapidly scalable, very high throughput query platform for servicing all client data source queries as quickly as possible.
Once again, we would like to apologise on behalf of the import•io team for any issues you may have experienced as a result of this problem. Our uptime and reliability are typically among our best metrics, and ones we are most proud of, so to have let you down on this count is a blow to all of us. We have treated this incident as a learning opportunity, and we will use what we learned to continue providing you with the most robust, bulletproof platform possible.
– The import•io Platform and Ops Team