An article from 10 years ago (2013; the original at robbinfan.com is now inaccessible, and the earliest trace found is this one on Cnblogs) that still deeply resonates upon re-reading.
By Fan Kai
In the internet industry, Unix/Linux-based website system architecture is undoubtedly the mainstream solution today. This is not only because of Linux's inherent openness, but also due to the vast ecosystem of mature open-source solutions surrounding the traditional Unix/Linux community, covering every aspect of website application scaling.
I recall that during the first wave of the internet boom over a decade ago, large websites using the Windows/.NET architecture were quite common. Today, however, notable websites built on .NET are few and far between. Apart from Microsoft's own sites such as MSN and Hotmail, many large websites using .NET have faced significant architectural scaling issues, making the scalability of .NET a contentious topic:
For instance, the US social networking site MySpace encountered severe performance scaling problems, and the Chinese e-commerce site JD.com has repeatedly struggled with performance bottlenecks caused by its architecture. Consequently, some websites primarily built on .NET have had to consider "moving away from .NET," abandoning it entirely and migrating to a primarily Java-based architecture.
However, migrating the architecture of a large website is a monumental and risky undertaking. Success stories are rare, while failures are abundant. This is because:
The migration is carried out on a live business system; it is not the leisurely development of a new product with ample time for testing and refinement. It is akin to changing an airplane's engine mid-flight: the switch must succeed flawlessly the first time, with no second chance, and even the slightest misstep can lead to catastrophe. Yet we all know that no large system built from scratch is flawless until it has gone live, been iterated on, and matured.
Architecture migration often means phasing out the old R&D team, which, as the company grows, has likely gained significant influence. The new team, meanwhile, has yet to establish a firm foothold. Driven by instinctive self-preservation, political infighting and mutual exclusion between the old and new teams can easily erupt, ultimately leading to the failure of the migration.
Lessons from 5173's "Move Away from .NET"
The 5173 website started with game item trading. During the golden age of client-based online games, it experienced explosive business growth. At that time, the website built on .NET architecture was already buckling under the load. As the existing .NET R&D team had long been unable to resolve the website's performance issues, 5173 decided to completely migrate the website from .NET to a primarily Java-based architecture. To this end, 5173 invested heavily in recruiting engineers from Taobao and Sun Microsystems, forming a Java architecture R&D department of sixty to seventy people.
The new Java R&D department was tasked with developing a new version of the 5173 website, while the old .NET R&D department maintained the existing one. The two departments worked in parallel, and two versions of the website ran simultaneously. This led to intense internal political strife. The newly developed Java version of the 5173 website was never officially launched. Facing a serious existential threat, the old .NET R&D team managed to solve some core availability and stability issues. Meanwhile, as the golden age of PC client games waned, the website's performance problems gradually became less critical.
On the question of whether the new version of the website should officially replace the old one, the various stakeholders fought protracted, inconclusive battles. The newly appointed female CTO, who belonged to no faction, remained noncommittal. Ultimately, the .NET faction prevailed over the Java faction, and the Java R&D department was disbanded.
My Experiences and Lessons with "Moving Away from .NET"
Three years ago, when I first took over CSDN's product and R&D teams, about two-thirds of CSDN's core systems were based on .NET, and one-third on the LAMP stack. The R&D team was very small at the time: only 2 .NET developers, 3 PHP developers, and, later, 1 Ruby developer I brought with me. The plan was to shift the overall website architecture towards Linux, gradually replacing the existing .NET systems. Consequently, we decided not to recruit for or expand the .NET team, leaving the existing .NET developers to maintain the old core systems.
However, both .NET developers left within six months. One left to start a venture with an executive from Microsoft, and the other moved to Baidu as a frontend engineer. The reason was simple: since the company planned to move away from .NET, the .NET engineers worried about their future value in the company once all .NET systems were replaced. Would they not be completely marginalized?
Of course, when I formulated the architecture migration plan, I had considered this. I had mapped out a career path for the more senior .NET engineer to become the company's overall chief architect, and for the other .NET engineer, who was proficient in JavaScript, to become the future leader of the frontend team. However, uncertain promises never outweigh immediate threats. So, I fully understood their reasons for leaving.
At this point, I found myself in a dilemma:
If we continued with the "move away from .NET" plan, even if I recruited and built a new .NET R&D team, it was unlikely to be stable. Since .NET was destined to be replaced, new members would likely realize the situation and leave after a month or two. Meanwhile, the .NET core systems, which constituted a large, highly complex portion of the website, would be left unattended. Any problem would leave us helpless without skilled personnel to handle it. I had already started reviewing the .NET core system code myself, preparing to take over.
If we abandoned the "move away from .NET" plan, it might solve some immediate system maintenance headaches, but it would create many long-term disadvantages for the website's development, such as security issues, architectural scaling problems, and the high cost of fully licensing server-side software. If I didn't firmly commit to moving away from .NET then, doing so later would only become more costly.
My initial idea was to recruit two .NET programmers who were reasonably competent but, crucially, lacked strong ambition, could be content with the status quo, and would faithfully maintain the old .NET core system. Simultaneously, I would recruit and build a Ruby R&D team. Leveraging my past experience with the impressive development speed of Ruby for websites, I aimed to buy time and rewrite the old .NET core systems one by one. However, this approach carried significant risks:
- I hadn't been at CSDN for very long. The online products were numerous and complex—over a hundred—and I was unfamiliar with many of the systems.
- Company leadership was not very supportive of my moving so quickly. They were concerned that drastic website renovations might completely crash the already fragile system.
- I had only brought one Ruby developer with me to Beijing. Building a Ruby team, integrating them, and developing core systems takes time; it's not something that can be rushed.
Fortunately, during my recruitment process, I interviewed two excellent .NET engineers. They showed strong ambition, solid programming fundamentals, and quick learning abilities. Although they didn't fit my initial criterion of finding engineers content with the status quo, I didn't want to miss out on good talent. So, I changed my mind on the spot, hired them, and formed a new .NET team.
To avoid the previous issue of .NET team attrition and to create opportunities for growth for the new team, I decided on a compromise solution: retain and continue using the .NET programming language and framework, but still "move the architecture away from .NET." In summary:
- Data Layer: Abandon SQL Server databases and stored procedures, migrating entirely to MySQL databases on the Linux platform.
- Caching: No longer rely on .NET's own caching mechanisms; migrate to a distributed Redis deployment on the Linux platform.
- Inter-service Communication: Avoid using .NET-specific protocols for service calls; switch to RESTful HTTP Web API calls.
- Static Resources: Stop serving static resource requests directly via IIS; offload them to Nginx running on the Linux platform.
- File System: Redirect file system access to a distributed file system on the Linux platform.
- Deployment: Place Windows servers running .NET code behind LVS, using LVS for load balancing and failover.
In simple terms, this approach limits .NET to being just the application-layer programming language and framework, while everything else is handled by open-source solutions on the Linux platform. When .NET functions purely at the application layer, both the development efficiency of ASP.NET MVC and the runtime efficiency of the .NET CLR are excellent. Currently, a single Windows server can handle millions of dynamic requests without strain, and the application layer is horizontally scalable: if the request load becomes very high, you simply add more Windows servers. In short, we played to .NET's strengths and avoided its weaknesses.
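To make the caching and horizontal-scaling points concrete, here is a minimal sketch in Python (not the .NET code CSDN actually ran). It assumes a hypothetical `SharedCache` interface that a Redis client could implement; `InMemoryCache`, `AppServer`, and `load_from_db` are illustrative names invented for this example, and the in-memory cache only stands in for Redis so the sketch runs without a server. The idea it shows is that once cached state lives outside the application process, every application server is interchangeable, which is what allows adding servers freely.

```python
from typing import Callable, Dict, List, Optional, Protocol


class SharedCache(Protocol):
    """Hypothetical minimal cache interface; a Redis client could sit behind it."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryCache:
    """Stand-in for the external cache so this sketch runs without Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class AppServer:
    """One application-layer node that keeps no cache state of its own."""

    def __init__(self, cache: SharedCache, load_from_db: Callable[[str], str]) -> None:
        self.cache = cache
        self.load_from_db = load_from_db

    def handle(self, article_id: str) -> str:
        # Cache-aside: try the shared cache first, fall back to the database,
        # then store the result so every other node can reuse it.
        key = f"article:{article_id}"
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        value = self.load_from_db(article_id)
        self.cache.set(key, value)
        return value


def demo() -> dict:
    db_loads: List[str] = []  # records every (simulated) database hit

    def load_from_db(article_id: str) -> str:
        db_loads.append(article_id)
        return f"body of {article_id}"

    cache = InMemoryCache()
    # Three interchangeable nodes, as if sitting behind a load balancer.
    servers = [AppServer(cache, load_from_db) for _ in range(3)]
    responses = [server.handle("42") for server in servers]
    return {"responses": responses, "db_loads": len(db_loads)}


if __name__ == "__main__":
    print(demo())
```

In this sketch only the first node touches the database; the other two are served from the shared cache. Had each node kept its own in-process cache, every newly added server would start cold and hit the database again.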
Furthermore, I focused on fostering communication between R&D teams using different programming languages, encouraging .NET engineers to become familiar with the Linux OS and cultivating their overall architectural awareness. Our key .NET core member once told me that the biggest technical growth he felt here was the sudden broadening of his horizons.
Over the following two years of website restructuring, this approach proved quite successful:
- The .NET team remained stable and became one of the highest-performing teams in the entire R&D department.
- The entire system restructuring process was remarkably stable and smooth, encountering minimal risk.
- The impact on website users was minimal; essentially, the entire website product was revamped gradually and imperceptibly.
- It caused no disruption to the company's online business. In fact, as the system was continuously improved, support for the business grew stronger.
Once the website architecture was fully Linux-based and the architectural solutions were unified, the choice of programming language for the application layer became less critical. Currently, our application layer product lines include code written in .NET, PHP, and Ruby, but they all share a uniform architecture. The choice of programming language depends mainly on the resource allocation of the R&D teams.
In conclusion, based on my experience, migrating a website that has long relied heavily on Microsoft solutions to a Linux-based architecture, possibly rewriting it in other languages along the way, is never purely a technical issue. It involves at least the following dimensions:
- How to protect the interests of the old system's R&D team, and how to motivate them to participate in and share the success of the architecture transformation. This is the most important factor and often the fatal problem in architecture migration. If the transformation is destined to sacrifice the old team without considering their interests, I believe it will inevitably lead to brutal political conflict and ultimately fail.
- How to ensure the existing business system continues to operate normally and transitions smoothly. If the architecture transformation affects the business, it will surely be abandoned.
- How to ensure a smooth transition and improvement of the user experience. If the transformation degrades the fundamental user experience, it will certainly be halted.
- Leadership's tolerance for risks during the transformation and their patience with an extended transformation timeline.
A Side Note
I feel there's a negative phenomenon in our internet industry: when some websites crash during promotional events or experience frequent instability, company bosses like to post harsh words on social media, "invite subordinates for tea," or hastily announce a million-dollar annual salary to hire a new CTO. This is akin to a person who maintains poor lifestyle habits, indulges excessively, and never focuses on health, but when they eventually fall ill after years of abuse, they frantically wave a checkbook, shouting, "Which famous doctor can cure me instantly? I'll pay you a million!"
Therefore, when a website encounters severe technical problems, the root cause is often not purely technical, nor can it be solved simply by replacing the CTO. One must reflect on whether the company's culture has ever truly valued R&D. Was the technology team adequately motivated? Were the opinions of architects taken seriously? Was there long-term R&D investment in potential future technical hurdles?
Regarding this phenomenon, I recall a very insightful remark by Fenng: "Technical problems are always overestimated in the short term and underestimated in the long term." I'd like to add: "When technical problems arise, they are never solely caused by technical issues."
From: robbinfan.com