The 2nd day after this year’s MVP Summit in Redmond, just 14 hours after I got back from the U.S., I took another flight from Beijing to South China for a job. The job turned out to be one of the most challenging ones I’ve ever done, pushing the performance limits of Team Foundation Server.
Figure: with Brian Harry @ MVP Summit 2011
Some background information first: the client is a very large organization that also owns one of the biggest development teams in the world, with over 20,000 developers. The system we are developing has to cope with that volume of requests, sometimes concurrent requests, in a timely fashion. We mainly use the TFS Object Model to implement most of the functionality, and we initially thought TFS would handle the load without problems because Microsoft has been dogfooding TFS for a long time in an even bigger and much more demanding environment, and TFS survived. However, like every story, things never run as you expect; with all the data access, business logic and interfacing implementations on top of the TFS Object Model, the initial end result was kind of disappointing.
| Test Case | Before tuning | Target |
| --- | --- | --- |
| Create Operation | | |
| No concurrency | 1200 ms | 200 ms |
| 10 concurrent requests | 2700 ms | 500 ms |
| 100 concurrent requests | >3000 ms | 1500 ms |
| Query Operation | | |
| No concurrency | 234 ms | 200 ms |
| 10 concurrent requests | 565 ms | 500 ms |
| 100 concurrent requests | 3000 ms | 1500 ms |
Figure: Before tuning, the performance was kind of disappointing
As you can see above, every benchmark missed its target, and some were 5-6 times slower than the target. These results were collected on a SQL Server database (the TFS collection database) with around 500 million data entries.
Shocking, isn’t it? I certainly was shocked when I first saw these numbers. But like every story, there is a happy ending, so here are the results after our tuning.
| Test Case | Target | After tuning |
| --- | --- | --- |
| Create Operation | | |
| No concurrency | 200 ms | 87 ms |
| 10 concurrent requests | 500 ms | 399 ms |
| 100 concurrent requests | 1500 ms | 2500 ms (single server); 1500 ms (NLB) |
| Query Operation | | |
| No concurrency | 200 ms | 28 ms |
| 10 concurrent requests | 500 ms | 32 ms |
| 100 concurrent requests | 1500 ms | 200 ms |
Figure: The client is happy with the results; we met our target in every case. That said, I think we are at the limits of TFS 2010
So, back to the topic, the lessons that we learned:
Lesson 1: Team Work
I put this at the top, as always: software development is never a one-man job, especially when you are dealing with a complex system like TFS. You need to know Windows Server 2008, IIS 7, SQL Server 2008 R2, and the TFS web services and Object Model. You need an expert in each of these areas in order to combine them and achieve your goal.
During the whole project, I got help from Microsoft, my boss and many other MVPs around the world, including Brian Harry (Microsoft Technical Fellow, father of TFS), Aaron Hallberg (TFS DevTeam), Tiago Pascoal (MVP), Adam Cogan (Microsoft Regional Director, MVP, my boss ;) ), Ramesh Rajagopal (DevDiv from MS Shanghai Dev Center), Julia Liuson (Manager of MS Shanghai Dev Center) and Yongming Yi (MS Technical Specialist). The list is long, and I can’t include everyone here.
In short, if you want to do the job right, you need the right people. Having such a great team is essential for the end result.
Lesson 2: Performance testing as early as possible
We adopted Scrum for the whole project and built unit testing and load testing into the very first sprint of our development cycle. However, because of resource limitations, we could not reach that level of concurrency or get enough hardware until the very end of the project.
Although we were not able to do tuning along the way, the test results were fully transparent to the client, so when the first official test results on the staging hardware came out, the client was not too surprised and gave us enough time and understanding. To me, transparency is always the biggest benefit of using agile methodologies; having your client’s support is sometimes the best resource you can have in a project.
Another benefit of starting performance testing early was that I had enough time to contact helpful people and line up enough support beforehand.
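To give a rough idea of what an early load test can look like, here is a minimal Python sketch that measures latency at the same concurrency levels used in the tables above (1, 10 and 100). It is not the harness used on this project, which was built on the TFS Object Model; `fake_create_operation`, `timed_call` and `run_load_test` are hypothetical names, and the operation simply sleeps to stand in for a real server call.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_create_operation():
    """Hypothetical stand-in for a real request; it only sleeps."""
    time.sleep(0.01)  # pretend the server needs about 10 ms


def timed_call(operation):
    """Run one operation and return its latency in milliseconds."""
    start = time.perf_counter()
    operation()
    return (time.perf_counter() - start) * 1000.0


def run_load_test(operation, concurrency, requests_per_worker=5):
    """Issue requests from `concurrency` threads and summarise latencies.

    Latency is measured per call, from the moment a worker thread starts
    the operation, so time spent waiting in the pool queue is excluded.
    """
    total = concurrency * requests_per_worker
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_call(operation), range(total)))
    return {
        "concurrency": concurrency,
        "requests": total,
        "mean_ms": statistics.mean(latencies),
        "max_ms": max(latencies),
    }


if __name__ == "__main__":
    for level in (1, 10, 100):
        result = run_load_test(fake_create_operation, level)
        print(
            f"{result['concurrency']:>3} concurrent: "
            f"mean {result['mean_ms']:.0f} ms, max {result['max_ms']:.0f} ms"
        )
```

Even a small script like this, run every sprint against whatever hardware is available, gives the client a trend line to look at long before the staging environment is ready.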
Lesson 3: Avoid virtualization in demanding production system
To be continued …