Flexiscale Down Time
rossbug
27th August 2008, 14:14
I'm surprised by the lack of talk on the forum regarding the loss of flexiscale today. I'm also surprised that there was no effort to contact flexiscale users considering the magnitude of the problem.
I've talked to customer support who have given no guarantees of when the service will be up. I'm pretty sure that the length of time my servers have been down has exceeded the 0.01% downtime limit specified by the 99.99% SLA.
I've been a great fan of flexiscale since I joined at the start of the year - so much so that I've got numerous clients whose services are being hosted on the flexiscale platform.
I feel totally let down and embarrassed as I have to explain to clients that the platform that I have raved about is down and I have no idea when it will be back up.
Virtualisation always appealed to me because hardware failure is all but removed as a problem. With automatic switching of virtual servers between physical ones, the Achilles heel is the integrity of the data store. So why then has it taken a problem of this magnitude for it to show up? Shouldn't the data store be 100% redundant - to the point that there are two identically mirrored data stores? Or at least a data store with x hours old data ready to bring online in case of data corruption?
I won't start to talk about the random server restarts and full failures of a month ago, or the poor I/O performance, or the very slow response from job requests to start/stop/kill a server...
I hope to hear some good news soon...
redcarrot
27th August 2008, 15:10
Well, I've not been with Flexiscale long. Actually, I am still an Amazon AWS customer, but I am currently giving the Flexiscale service a trial with a view to moving all of our customers' sites over from Amazon.
I guess it is just my bad luck that the system goes down just when I had a spare couple of days to play around.
Ironically, one of my reasons for moving to Flexiscale is the SLA, which Amazon do not provide. I hope that this is part of the 0.01% and not a regular occurrence!
tonylucas
27th August 2008, 15:19
Rossbug/Redcarrot,
I will be sending an e-mail update to all customers shortly.
This will also explain the recent problems we have been having with I/O etc and how these are also being resolved.
We are not now expecting to get the servers up and running until tomorrow, although again I will explain that in the e-mail
I apologise for the obvious problems that this causes.
Tony Lucas
Chief Executive Officer
XCalibre Communications Ltd
http://www.xcalibre.co.uk
rossbug
27th August 2008, 15:22
How can you come out with a quote of "At no point today will individual Virtual server's be available" ?!
I don't understand how you can expect to survive in this market when your product is going to be unavailable for what will be 24 hours. That's already going to bring your SLA for the year to below 99.7%.
I hope that there is going to be a detailed explanation of what's gone seriously wrong today and what's being put in place to stop problems of this magnitude happening again in the future.
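For anyone wanting to check the SLA arithmetic discussed in this thread, here is a rough sketch. It assumes availability is measured over a 365-day year, which the thread does not confirm, and the function names are purely illustrative. Under that assumption a 99.99% target allows a little under an hour of downtime per year, and a single 24-hour outage on its own works out to roughly 99.73% for the year, with anything longer pushing it lower.

```python
# Assumption: the SLA period is a 365-day year (not stated in the thread).
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_minutes(sla_percent: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget in minutes that an availability target permits."""
    return period_hours * 60 * (1 - sla_percent / 100)


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability achieved over the period given total downtime in hours."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    # Yearly downtime budget implied by a 99.99% SLA
    print(f"99.99% SLA budget: {allowed_downtime_minutes(99.99):.2f} minutes/year")
    # Yearly availability after a single 24-hour outage
    print(f"After 24h outage:  {availability_percent(24):.3f}% for the year")
```

If the SLA were measured monthly rather than yearly, passing a different `period_hours` (for example `30 * 24`) would give the corresponding figures.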
gssavage
27th August 2008, 15:45
Tony, when you send the email can you give an indication of when tomorrow the servers will be back? Our board and their investors are starting to get very concerned (our sites have been unavailable for over 28 hours now). I need something concrete to take back to them.
Cheers
- Graham
tonylucas
27th August 2008, 16:30
Graham,
The e-mail I have just sent out to all customers explains the current situation, although I will copy it here in full.
<quote>
Dear Customer,
I felt it important to send you all an e-mail explaining the current problems our FlexiScale service is having, what we are doing about it & when we expect it to be resolved.
As some of you are aware, we have been having issues with I/O (disk speed) in recent weeks. We identified short term and long term measures to eliminate these problems. The short term measures involved reorganising how data was stored across our storage network in a more efficient manner, and the long term measure was to increase the overall I/O capacity of the platform.
As a preparatory step to adding additional capacity one of our engineers was reorganising the data structure on the storage network and whilst cleaning up the snapshots we use as our backup process accidentally deleted one of the main storage volumes. This caused an immediate outage to a large number of our customers.
We immediately took action to take the entire disk structure offline (which caused the remaining customers to be taken offline) as it was the only way to preserve the integrity of the data on the system. Work then commenced with our storage vendor to restore this data.
Although we have now successfully gained read-only access to everyone's data, a bug in the storage platform's operating system has prevented us from providing read-write access to it. This was discovered at 11pm last night, just when we thought we were about to bring the entire disk structure back online.
After consulting with our storage vendor it was agreed the most sensible option would be to copy the entire volume to a new disk structure (still maintaining its integrity and structure), from where we could re-mount it correctly. Unfortunately due to its size we didn't have spare capacity on the platform to create a complete duplicate of it.
An investigation of other ways of restoring the data then was undertaken but all options were considered too risky, and although downtime is a major problem for everyone, we felt the integrity of the data was the most important factor.
The decision was then taken to get additional capacity in from the storage vendor as soon as possible so that we could then increase the capacity to a sufficient level to allow us to copy the volume and successfully restore it. We originally thought we would be able to get this today, but unfortunately it will not arrive until mid-morning tomorrow, although we have done (and will continue to do) everything we can to speed this up.
At this time we are assisting customers who need access to specific files to get this, and we will continue this as long as we can into the night as resources allow.
Tomorrow morning once the storage arrives and is online, we will copy the data across and then begin to restart the entire platform as quickly as possible, but as the system wasn't designed to restart everything at once, this will take time.
We will be offering credits against our SLA, which will be determined once everyone is back up and running, as I'm sure you can appreciate all resources are being focused on that at this moment.
I, and all my staff are well aware of the potential impact this will be causing to you our customers, and we are doing everything we can to help in that respect. We will also be undertaking an investigation to ensure additional safeguards are put in place to prevent this happening again.
Sincerely,
Tony Lucas
Chief Executive Officer
XCalibre/FlexiScale
</quote>
musical
27th August 2008, 16:34
Graham,
You are not alone.
georgema
27th August 2008, 17:17
At this point do you know if any data has been lost, or have the steps you have taken ensured that all the data is safe and simply needs to be made available on the new storage platform?
Problems like this can destroy businesses, both yours and your customers, so I hope for everyone involved that this is resolved quickly. One of the things that would help rebuild my confidence is more information/transparency on the underlying infrastructure, details on the fault tolerance architecture and regular performance metrics, e.g. I/O's, CPU of the host servers and SAN. Although in many ways a 'black box' service more information can only help build confidence.
Anyway, thanks for the update on the issue and I hope that between your operations team and your storage vendor this can be resolved ASAP.
George M-A
musical
27th August 2008, 19:22
I'd like to know who the storage provider is simply to make sure I never buy any of its products for a mission-critical application...
georgema
27th August 2008, 21:33
To be fair, without knowing all the details of the problem it's not fair to single out the storage vendor as the culprit here. In fact getting into that would simply get in the way of resolving the issue.
In terms of system infrastructure the storage subsystem is one of the most complex areas to setup and administer (IMO). There seems to be a dearth of people who really understand storage, and I'll include myself in that list, so storage problems are one of the most common ones that I see on my own customer sites.
With CPU power increasing beyond what most people need, 64 bit architectures and cheaper memory, two of the most common bottlenecks are put to one side. What we have now are complex filesystems with logical layers of abstraction, clustered file systems with shared controlled access to resources, network based storage with block protocols, fibre channel controllers with redundant HBA's and switches. Throw in a bit of RAID and it all adds up to a complex environment which needs managed and monitored. Which is why companies like HP, IBM, EMC all make shedloads of money selling storage and the software to manage it.
Sounds like flexiscale were getting on top of the I/O issues but then good old Murphy stepped in. Reminds me of the times an engineer, not at flexiscale I hasten to add, popped the wrong disk on a failed RAID 5 array which they could have recovered from if stage 1 of the 2 stage backup process they used had ever worked properly (they only ever checked the return status of stage 2).
Sorry I'm rambling here but I/O or storage problems are usually a combination of factors and not simply one piece of hardware.
George M-A
musical
28th August 2008, 07:58
Fair enough. Human error started the chain of events, as is often the case, and storage solutions are complex. But that bug in the storage system's operating system turned a very serious problem into a catastrophe. I don't think much of the storage vendor's UK stock levels either (I know I am guessing but I assume that the reason for the delay is shipping time from either Europe or the US).
As well as details of the new architecture, I want to see new internal SOPs which attempt to eliminate the human error too.
tonylucas
28th August 2008, 10:03
As a brief update we have seen very little data corruption of the data we have examined although we can't rule it out completely.
Significant changes will be made internally to ensure issues like this can't happen again. It would be unwise of me to comment on them at this point though.
I intend to offer as much transparency and openness as I can in the future, within the bounds of security and commercial sensitivity of course.
I understand that customers need FlexiScale to work as a 'black box' but further information can't do any harm.
Regards,
Tony.
georgema
28th August 2008, 11:06
Tony, thanks for the update. The improved openness sounds good but I also understand the need for commercial sensitivity as well. For customers who do need detailed information perhaps you could have some kind of NDA in place.
Musical, I think that any customer running their critical/important business systems on this, or any other kind of cloud/virtual infrastructure, needs to go through a due diligence process whereby they assess the suitability of the service for their business. This would include both the hardware infrastructure and, as you rightly point out, things like SOP's. Something to look at when the systems are available again.
At the moment I'll be looking at how to verify the integrity of my systems through visual inspection of the file system, database verify procedures, etc. So although the systems should be back today I'd budget additional time for verification before releasing back to the business. How long this will take would depend heavily on the size and complexity of the system and associated database.
I'm hoping as well that the new SAN will have given the flexiscale team an opportunity to fix any other I/O performance issues.
Cheers, George.
georgema
28th August 2008, 16:17
Any chance of a status update on the storage issue?
Thanks, George
tonylucas
28th August 2008, 18:00
We're hoping to bring the first customers back online very shortly, we're literally finalising the final issues with the FlexiScale management system before doing so, although it will take until tomorrow probably to bring everyone back online.
Regards,
Tony.
georgema
28th August 2008, 18:37
Thanks Tony,
How do you intend to inform customers when their systems are available? For me it's not that important as they are only test servers. For other customers I don't know if you have a list of productive servers and can prioritise those first.
Thanks,
George
gssavage
29th August 2008, 08:27
Do we need to do anything to get our production servers back online? Do we have to go into the control panel and stop & start them?
georgema
29th August 2008, 11:41
Here's a link to the network status page of the Xcalibre hosting site: -
http://www.xcalibre.co.uk/status/status.php
Last update on the page as of 11:40 today was at 09:35 BST
xcalibrecustomer
29th August 2008, 12:17
My company has a shared package with Xcalibre that uses a server on the Flexiscale platform. We were told that we are not a Flexiscale customer and as such were not worthy of any updates given.
We still have an un-restored website for more than 3 days now.
On the topic in this thread about giving server information out, until this week we were under the impression that we were on a normal shared server and we didn't even know that any past issue with the Flexiscale platform would have affected us.
I appreciate we are not a big customer but we still have clients to appease and this could very well ruin my business.
Dean
30th August 2008, 13:35
Has anybody actually had their server turned back on?
I've chased repeatedly and been told one of my servers (I have four) was on the priority list - but as yet I've heard or seen nothing. All 4 are still down - 6 days later.
The updates also seem to have stopped on both the twitter link and on the status page.
The office is closed and nobody is answering the phones?
What's happening?
Regards
Dean
rossbug
1st September 2008, 09:18
A few of my servers are back on - thanks. But the most important one to me is still off. I've sent in a ticket - number 575196. Can we get this sorted ASAP please, thanks.
georgema
2nd September 2008, 13:52
Any ETA on when the 'stopped' servers will be made available and able to start? Just checked and there's still no start option on the control panel.
Thanks,
George.
johnrigby
2nd September 2008, 16:37
Have just got off the phone to support and been told that the Stop/Start Server option will not be back in the Control Panel for some time - no ETA yet.
If you wish to stop/start the server you have to send an email to support asking them to do this !
They are still busy dealing with the current problems so this may present a delay in a response to the email request to stop/start the server.
Oh well - might have to concede defeat with my first venture into "Virtual Server World".
rossbug
2nd September 2008, 17:59
Got a few windows servers back now but a major one has IIS that is totally corrupt - the WWW publishing won't start. I've removed IIS, I went to re-enable it but the share for the I386's has vanished - xanthos.xcalibre.co.uk isn't accessible at all?
7 days later.....
tonylucas
4th September 2008, 16:34
Apologies for not updating the forum thread, for some weird reason it doesn't always e-mail me when someone else posts.
The updates over the weekend weren't as frequent as they should have been, but we did have staff working constantly to restore servers during that time.
99% of customers were up and running by close of play Monday, but there are still a few isolated servers that have had corruption issues that we are working to resolve as quickly as possible, and if you haven't yet had an update from support on your server in particular, please do e-mail in to flexiscale@xcalibre.co.uk to check on it.
I will be sending a further update out when all known issues are resolved with a summary of the issues and details of what we will be doing in the future to ensure this can't happen again.
Regards,
Tony
adamcharnock
6th September 2008, 10:54
Hey Everyone,
I was hoping to prompt someone at xcalibre to have a look into ticket 577026. It seems my database server has become unresponsive.
On a side note, I still have another server down after 11 days of down time :eek:
Also, why hasn't the status page been updated for 6 days (!) when there are still servers offline?
Adam
georgema
6th September 2008, 14:52
FYI: I had a message back from customer service saying that the Start/Stop functions on the VM menus would be back at the end of next week, i.e. the 12th. If they were re-enabled earlier then a note would be sent to all customers.
I only really need my VM's up for a couple of hours at a time so trying to stop/start via the helpdesk is a bit of a non starter, no pun intended, for me. Looks like I'll be waiting until the end of next week before I get back on-line.
Cheers, George M-A
adamcharnock
7th September 2008, 16:05
Can someone at xcalibre _please_ have a look at ticket 577026? This is getting really urgent. I have tried emailing, phoning and twittering to no avail.
Has there been another SAN failure or something?
tonylucas
10th September 2008, 13:07
George,
Start/stop is back for most customers, you should have received confirmation of this in the e-mail I sent out yesterday.
Adam, I believe your other ticket has now been looked at, my apologies for the delay.
Regards,
Tony.