Introduction
To do a good job, one must first sharpen the tools; to propagate righteousness, one must first read the books. Backend development, as a gem in the internet technology field, has always been a peak pursued by developers. This article starts from the technical terms used in backend development and covers system development, architecture design, network communication, and other areas, giving readers a clear, comprehensive, and easy-to-understand overview of backend development. (Author: willlv)
System Development
- High Cohesion / Low Coupling
High cohesion means that a software module is composed of highly relevant code and is responsible for only one task, which is the well-known Single Responsibility Principle. The cohesion of a module reflects the degree of closeness of internal connections within the module.
Coupling measures how tightly different modules are connected to one another. The tighter the connection between modules, the stronger their coupling and the weaker their independence. The degree of coupling between modules depends on the complexity of the interface between them, the calling method, and the information passed. In a complete system, modules should be as independent as possible. Generally, the higher the cohesion within each module in a program structure, the lower the coupling between modules.
- Over-engineering
Over-engineering means designing too much for the future or overcomplicating relatively simple things, excessively pursuing modularity, extensibility, design patterns, etc., adding unnecessary complexity to the system.
- Premature Optimization
Premature does not refer to early in the development process, but rather before understanding the future direction of requirements. Your optimization may not only prevent you from properly implementing new requirements, but your guesses about optimization gains might be wrong, resulting in nothing but added complexity to the code.
The correct approach is to first implement your requirements with quality, write enough test cases, then profile to find performance bottlenecks, and only then optimize.
- Refactoring
Refactoring is the process of adjusting program code, without changing its external behavior, to improve the quality and performance of software, making its design patterns and architecture more reasonable, and enhancing the software's extensibility and maintainability.
- Broken Windows Theory
The broken windows theory comes from criminology. It posits that if disorderly behavior in an environment is left unchecked, it will encourage imitation and even escalation. Take a building with a few broken windows: if those windows are not repaired, vandals might break more of them. Eventually, they may even break into the building, and if it appears unoccupied, they might settle in or set fire to it.
Applied to software engineering, this means never let the seeds of code or architectural design flaws sprout; otherwise, they will worsen over time. Conversely, a high-quality system naturally encourages people to write high-quality code.
- Principle of Mutual Distrust
This means that throughout the entire upstream and downstream chain of program execution, no single point can be guaranteed to be absolutely reliable. Any point may fail or exhibit unpredictable behavior at any time, including machine networks, the service itself, the dependency environment, inputs, and requests. Therefore, defenses must be put in place at every step.
- Persistence
Persistence is the mechanism for converting program data between a transient state and a persistent state. In simple terms, it means converting temporary data (e.g., data in memory, which cannot be stored permanently) into persistent data (e.g., stored in a database or local disk for long-term retention).
- Critical Section
A critical section refers to a shared resource or shared data (strictly speaking, the code that accesses it) that can be used by multiple threads, but by only one thread at a time. Once a critical section resource is occupied, other threads that need this resource must wait.
- Blocking / Non-blocking
Blocking and non-blocking typically describe the mutual influence between multiple threads. For example, if one thread occupies a critical section resource, all other threads needing that resource must wait at the critical section, causing those threads to be suspended. This is blocking. In this case, if the thread occupying the resource does not release it, none of the threads blocked at the critical section can proceed. Non-blocking is the opposite: no thread can hold up the others, and every thread keeps trying to make progress.
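The blocking behavior described above can be sketched with Python's standard `threading.Lock`. This is only a minimal illustration; the names `counter`, `counter_lock`, `add_many`, and `run` are made up for the example.

```python
import threading

counter = 0                      # shared data protected by the lock
counter_lock = threading.Lock()  # hypothetical name for this sketch


def add_many(times: int) -> None:
    """Increment the shared counter `times` times, one thread at a time."""
    global counter
    for _ in range(times):
        # Entering the critical section: any other thread that wants the
        # lock blocks here until the current holder releases it.
        with counter_lock:
            counter += 1


def run(workers: int = 4, times: int = 10_000) -> int:
    """Reset the counter, run all workers to completion, return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every increment happens while holding the lock, `run(4, 10_000)` returns exactly `workers * times`.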
- Synchronous / Asynchronous
Synchronous and asynchronous usually refer to function/method calls.
Synchronous means that when a function call is made, the call does not return until the result is obtained. An asynchronous call returns immediately, but that does not mean the task is complete; the task continues in the background (for example, on another thread), and when it finishes, the caller is notified via a callback or other means.
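As a rough sketch of the difference, the example below uses Python's standard `concurrent.futures` module. The function `slow_square` and the callback `on_done` are hypothetical stand-ins for a real task and a real notification handler.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x: int) -> int:
    """Stand-in for a slow task (hypothetical)."""
    return x * x


results = []


def on_done(future) -> None:
    """Callback invoked once the asynchronous task has finished."""
    results.append(future.result())


# Synchronous call: the caller waits here until the result is available.
sync_result = slow_square(3)

# Asynchronous call: submit() returns a Future immediately; the callback
# runs later, when the task completes.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_square, 4)
    future.add_done_callback(on_done)
# Leaving the `with` block waits for outstanding tasks to finish.
```

A thread pool is only one way to run work in the background; an event loop (for example `asyncio`) achieves the same calling pattern without extra threads.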
- Concurrency / Parallelism
Parallelism
Refers to multiple instructions being executed simultaneously on multiple processors at the same instant. So both microscopically and macroscopically, they are executed together.
Concurrency
Refers to only one instruction executing at any given instant, while the instructions of multiple processes are rapidly interleaved. Macroscopically, the processes appear to execute simultaneously, but microscopically they do not: time is divided into slices so that the processes can quickly take turns executing.
Architecture Design
- High Concurrency
With the advent of distributed systems, high concurrency typically refers to designing a system to simultaneously process many requests in parallel. In simple terms, high concurrency means that at the same point in time, many users are accessing the same API interface or URL address. This often occurs in business scenarios with large active user volumes and high user aggregation.
- High Availability (HA)
High Availability (HA) is a factor that must be considered in distributed system architecture design. It usually means that a system is specially designed to reduce downtime and maintain a high degree of service availability.
- Read/Write Splitting
To keep a database product stable, many databases offer a dual-server hot standby feature. The first database server is the production (primary) server, which handles external writes (inserts, deletes, and updates); the second database server is primarily used for read operations.
- Cold Backup / Hot Backup
Cold backup: Two servers, one running and one kept as a backup without running. Once the running server goes down, the backup server is started. Cold backup is relatively easy to implement, but the disadvantage is that when the primary fails, the standby does not automatically take over; manual service switching is required.
Hot backup: This is typically the active/standby mode, where server data, including database data, is written to two or more servers simultaneously. When the active server fails, software detection (usually through heartbeat) activates the standby machine, ensuring that the application returns to normal within a short time. When one server goes down, it automatically switches to the other standby server.
- Multi-site Active-Active
Multi-site active-active generally means establishing independent data centers in different cities. "Active" is relative to cold backup, where backups store full data but do not support business operations normally; they are only used when the primary data center fails. In contrast, active-active means these data centers also handle traffic and support business operations in daily operations.
- Load Balancing
Load balancing is a service that distributes traffic across multiple servers. It automatically spreads application load among multiple instances and improves availability by eliminating single points of failure, giving the application a higher level of fault tolerance. It provides the capacity needed to distribute application traffic so that the service stays efficient, stable, and secure.
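Round robin is one of the simplest distribution strategies a load balancer can use. The sketch below is illustrative only; the class name `RoundRobinBalancer` and the backend addresses are assumptions, and real balancers also track health and weights.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))

    def pick(self) -> str:
        """Return the backend that should receive the next request."""
        return next(self._cycle)


# Hypothetical backend addresses, for illustration only.
balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
picks = [balancer.pick() for _ in range(4)]
```

After the three backends have each received one request, the fourth request wraps around to the first backend again.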
- Separation of Static and Dynamic Content
Separation of static and dynamic content is an architectural design method in web server architecture that separates static pages from dynamic pages or static content interfaces from dynamic content interfaces, directing them to different systems. This improves overall service access performance and maintainability.
- Cluster
A single server's concurrency capacity is always limited. When a single server's processing capability reaches a performance bottleneck, multiple servers are combined to provide services. This combination is called a cluster. Each server in the cluster is called a "node," and each node provides the same service, thereby multiplying the entire system's concurrent processing capacity.
- Distributed System
A distributed system splits a complete system into many independent subsystems based on business functions. Each subsystem is called a "service." The distributed system sorts and distributes requests to different subsystems, letting different services handle different requests. In a distributed system, subsystems run independently, communicating through network connections to achieve data exchange and combined services.
- CAP Theorem
The CAP theorem states that in a distributed system, Consistency, Availability, and Partition Tolerance cannot all be achieved simultaneously.
Consistency: It requires that at the same point in time, all data replicas in the distributed system are identical or in the same state.
Availability: The system can still correctly respond to user requests even if some nodes in the cluster are down.
Partition Tolerance: The system can tolerate network communication failures between nodes.
Simply put, a distributed system can support at most two of the above three properties. But obviously, since it is distributed, partitioning is inevitable, and we cannot completely avoid partition errors. Therefore, we must choose between consistency and availability.
In distributed systems, we often pursue availability, which is considered more important than consistency. To achieve high availability, there is another theory: the BASE theory, which further extends the CAP theorem.
- BASE Theory
The BASE theory states:
Basically Available
Soft state
Eventually consistent
The BASE theory is a trade-off between consistency and availability in CAP. The core idea is that we cannot achieve strong consistency, but each application can use appropriate methods based on its own business characteristics to achieve eventual consistency.
- Horizontal Scaling / Vertical Scaling
Horizontal Scaling (Scale Out): Increases storage and computing capacity by adding more servers or program instances to distribute the load.
Vertical Scaling (Scale Up): Increases the processing capacity of a single machine.
Vertical scaling can be achieved in two ways:
(1) Enhancing single-machine hardware performance, for example: increasing CPU cores (e.g., 32 cores), upgrading to better network cards (e.g., 10GbE), upgrading to better hard drives (e.g., SSD), expanding hard drive capacity (e.g., 2 TB), expanding system memory (e.g., 128 GB);
(2) Improving single-machine software or architecture performance, for example: using a cache to reduce the number of I/O operations, using asynchrony to increase single-service throughput, using lock-free data structures to reduce response time.
- Horizontal Expansion
Similar to horizontal scaling. Nodes in a cluster server are all peers. When expansion is needed, more nodes can be added to improve the cluster's service capability. Generally, key paths in a server (such as login, payment, core business logic, etc.) need to support dynamic horizontal expansion at runtime.
- Elastic Scaling
Refers to dynamically scaling a deployed cluster online. An elastic scaling system can automatically add more nodes (including storage nodes, compute nodes, and network nodes) according to a certain strategy based on the actual business environment to increase system capacity, improve system performance, or enhance system reliability, or all three simultaneously.
- State Synchronization / Frame Synchronization
State Synchronization: The server is responsible for computing all game logic and broadcasting the results. The client is only responsible for sending player operations and displaying the received game results.
Characteristics: State synchronization has high security, is convenient for logic updates, and supports fast reconnection after disconnection, but development efficiency is low, network traffic increases with game complexity, and the server bears greater pressure.
Frame Synchronization: The server only forwards messages without any logic processing. All clients have the same frames per second, processing the same input data in each frame.
Characteristics: Frame synchronization ensures that the system produces the same output given the same input. Development efficiency is high, traffic consumption is low and stable, and the server load is very low. However, network requirements are high, reconnection after disconnection takes longer, and client-side computing pressure is high.
Network Communication
- Connection Pool
A pre-established connection buffer pool, along with a set of connection usage, allocation, and management strategies, allows connections in the pool to be efficiently and safely reused, avoiding the overhead of frequent connection establishment and closure.
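A minimal sketch of such a pool, built on the standard `queue` module, is shown below. `ConnectionPool` and `fake_connect` are hypothetical names; a real pool would also validate, time out, and replace broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool that hands out and takes back connections."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # connections are created up front

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Blocks for up to `timeout` seconds if every connection is in use;
        # queue.Empty is raised when none becomes available in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it for reuse instead of closing


created = []


def fake_connect():
    """Hypothetical factory standing in for a real database connect call."""
    conn = object()
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
with pool.connection() as first:
    pass
with pool.connection() as second:
    pass
```

However many times `pool.connection()` is used, the factory is called only `size` times, which is the saving the pool exists to provide.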
- Reconnection after Disconnection
Due to network fluctuations, a user may intermittently disconnect from the server. After the network recovers, the server attempts to reconnect the user to the state and data at the time of disconnection.
- Session Persistence
Session persistence is a mechanism on a load balancer that identifies the correlation of interactions between a client and a server. While load balancing, it ensures that a series of related access requests are all directed to the same machine. In plain terms: multiple requests during a single session will land on the same machine.
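One common way to get this behavior is to hash a stable client attribute, such as the source IP, onto the list of backends. The sketch below assumes that approach; the server names and `pick_server` are illustrative, and real load balancers often use cookies or consistent hashing instead.

```python
import hashlib

SERVERS = ["app-1", "app-2", "app-3"]  # hypothetical backend names


def pick_server(client_ip: str, servers=SERVERS) -> str:
    """Map a client IP to a backend; the same IP always maps to the same one."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Note that with plain modulo hashing, adding or removing a server remaps many clients, which is why production systems tend to prefer consistent hashing.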
- Long Connection / Short Connection
Usually refers to TCP long and short connections. A long connection is established and maintained. Typically, heartbeats are sent between parties to confirm existence, and multiple business data transmissions occur in between, generally without actively closing the connection. A short connection is established, performs one transaction (e.g., an HTTP request), and then closes the connection.
- Flow Control / Congestion Control
Flow Control: Prevents the sender from sending too fast and exhausting the receiver's resources, so that the receiver cannot handle the data in time.
Congestion Control: Prevents the sender from sending too fast and congesting the network so that it cannot handle data in time, which degrades the performance of that part of the network or even the entire network and, in severe cases, causes network communication to stall.
- Thundering Herd Problem
Also called the thundering herd effect. In short, it occurs when multiple processes (or threads) are simultaneously blocking waiting for the same event (in a sleeping state). When the event occurs, all waiting processes (or threads) are awakened, but ultimately only one process (or thread) can gain "control" of the event and handle it, while the others fail to gain control and have to re-enter the sleep state. This phenomenon and performance waste is called the thundering herd problem.
- NAT (Network Address Translation)
NAT is the process of replacing address information in IP packet headers. NAT is usually deployed at the network egress of an organization, providing public network reachability and upper-layer protocol connectivity by replacing internal network IP addresses with the egress IP address.
Faults and Anomalies
- Crash / Downtime
Crash generally refers to an unexpected failure of a computer host that results in a shutdown. The term is also used more loosely: a server that has stopped working because of, for example, a database deadlock can be described as crashed, and a server whose services have stopped responding can likewise be said to be down.
- Core Dump
When a program crashes due to an error, the OS stores the current working state of the program in a core dump file. Typically, the core dump file contains the program's runtime memory, register state, stack pointer, memory management information, etc.
- Cache Penetration / Cache Breakdown / Cache Avalanche
Cache Penetration: Cache penetration refers to queries for data that does not exist at all. Such a query misses the cache and falls through to the database; because nothing is found, nothing is written back to the cache. As a result, every request for this non-existent data hits the database, putting pressure on it.
Cache Breakdown: Cache breakdown occurs when a hot key expires at a particular point in time, and at that exact time, a large number of concurrent requests for that key arrive, overwhelming the database.
Cache Avalanche: Cache avalanche occurs when a large batch of data in the cache expires at the same time, and the volume of queries is huge, causing excessive database pressure or even downtime.
The difference from cache breakdown is: cache breakdown involves a single hot key expiring, while cache avalanche involves many keys expiring simultaneously.
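One common mitigation for cache penetration is to cache the "not found" result as well. The sketch below illustrates the idea with plain dictionaries; `CachedStore` and the key names are hypothetical, and a real cache would give such negative entries a short expiry time.

```python
_MISSING = object()  # sentinel meaning "known not to exist"


class CachedStore:
    """Read-through cache that also remembers keys that do not exist."""

    def __init__(self, database: dict):
        self._db = database
        self._cache = {}
        self.db_hits = 0  # counts how often the backing store is queried

    def get(self, key):
        if key in self._cache:
            value = self._cache[key]
            return None if value is _MISSING else value
        self.db_hits += 1
        value = self._db.get(key, _MISSING)
        # Cache the miss too, so repeated lookups of a non-existent key
        # stop reaching the database.
        self._cache[key] = value
        return None if value is _MISSING else value


store = CachedStore({"user:1": "alice"})
for _ in range(3):
    store.get("user:999")  # only the first lookup reaches the database
```

Breakdown and avalanche are usually handled differently, for example with per-key locking around reloads and with randomized expiry times.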
- 500 / 501 / 502 / 503 / 504 / 505
500 Internal Server Error: Generally indicates that the server encountered an unexpected condition that prevented it from fulfilling the request. Possible causes: 1. Program error, e.g., ASP or PHP syntax errors; 2. High concurrency exhausting system resource limits, e.g., the limit on the number of open files.
501 Not Implemented: The server does not understand or support the HTTP request method.
502 Bad Gateway: The server, acting as a gateway or proxy, received an invalid response from the upstream server. With Nginx and PHP, this typically means the requested php-fpm started but did not complete execution for some reason, leading to php-fpm process termination. Possible causes: 1. Insufficient php-cgi processes behind the Nginx server; 2. PHP execution time too long; 3. The php-cgi process died.
503 Service Unavailable: The server is currently unable to handle the request, for example because of overload or maintenance. This is a temporary state. You can contact the server provider.
504 Gateway Timeout: The server, acting as a gateway or proxy, did not receive a response from the upstream server (e.g., php-fpm) in time. This is usually related to the nginx.conf configuration.
505 HTTP Version Not Supported: The server does not support the HTTP protocol version used in the request.
Except for the 500 error which may be a programming language error, the other errors can generally be interpreted as server or server configuration issues.
- Memory Overflow / Memory Leak
Memory Overflow (Out Of Memory): Memory overflow occurs when a program requests memory but there is not enough memory available to satisfy the request, resulting in an OOM error. A related but distinct problem is writing more data than an allocated space can hold, for example storing a long in space sized for an int; this overflows that space rather than exhausting memory.
Memory Leak: A memory leak occurs when dynamically allocated heap memory in a program is not released or cannot be released for some reason, wasting system memory, slowing down program execution, or even crashing the system.
- Handle Leak
A handle leak occurs when a process opens a system file but does not release the opened file handle. Common symptoms after a handle leak are: machine slowdown, CPU spikes, and increased CPU usage of the CGI or server experiencing the leak.
- Deadlock
Deadlock refers to a situation where two or more threads are blocked during execution due to competing for resources or communicating with each other. Without external intervention, they remain in a blocked state and cannot proceed. The system is then said to be in a deadlock state or has a deadlock.
- Soft Interrupt / Hard Interrupt
Hard Interrupt: What we usually call an interrupt refers to a hard interrupt (hardirq).
It is automatically generated by peripherals connected to the system (e.g., network cards, hard drives).
It is mainly used to notify the operating system of changes in the state of peripherals.
Soft Interrupt: 1. Usually raised toward the kernel by a hard interrupt service routine; 2. Exists because, to meet the requirements of real-time systems, interrupt handling should be as fast as possible.
Linux implements this by having hard interrupts handle only the work that can be completed in a short time, while work that takes longer is deferred until after the interrupt and handled by soft interrupts (softirq).
- Spike
At a brief moment, a server performance metric (such as traffic, disk I/O, CPU usage, etc.) is much higher than in the time periods before and after. Spikes indicate uneven and insufficient utilization of server resources and can easily trigger more serious problems.
- Replay Attack
An attacker sends a packet that the destination host has already received to deceive the system, mainly used in authentication processes to compromise the correctness of authentication. It is a type of attack that maliciously or fraudulently repeats a valid data transmission. Replay attacks can be carried out by the originator or by an adversary who intercepts and replays the data. The attacker steals authentication credentials through network eavesdropping or other means and then resends them to the authentication server.
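A typical defense, not described in the text above, is to sign each request together with a one-time nonce and a timestamp, and to reject anything stale or already seen. The sketch below assumes that scheme; `SECRET`, `MAX_AGE_SECONDS`, `sign`, and `accept` are hypothetical names, and the in-memory nonce set would need expiry in a real service.

```python
import hashlib
import hmac

SECRET = b"demo-shared-secret"  # hypothetical key, for illustration only
MAX_AGE_SECONDS = 300
_seen_nonces = set()


def sign(message: str, nonce: str, timestamp: int) -> str:
    """Compute an HMAC-SHA256 signature over message, nonce, and timestamp."""
    payload = f"{message}|{nonce}|{timestamp}".encode("utf-8")
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def accept(message: str, nonce: str, timestamp: int,
           signature: str, now: int) -> bool:
    """Accept a request only if it is authentic, fresh, and not seen before."""
    expected = sign(message, nonce, timestamp)
    if not hmac.compare_digest(expected, signature):
        return False  # forged or tampered with
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False  # too old (or too far in the future)
    if nonce in _seen_nonces:
        return False  # replay of an already accepted request
    _seen_nonces.add(nonce)
    return True
```

A captured request that is resent unchanged fails the nonce check, and one held back for later fails the timestamp check.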
- Network Partition
A network partition refers to a situation in a cluster environment where some machines lose network connectivity with the rest of the cluster and split off into a smaller cluster, which can lead to data inconsistency.
- Data Skew
For a cluster system, caches are usually distributed, meaning different nodes are responsible for certain ranges of cached data. When cache data is not sufficiently dispersed, leading to a large amount of cache data being concentrated on one or a few service nodes, it is called data skew. Data skew is generally caused by poor load balancing implementation.
- Split-brain
Split-brain (also called brain split) occurs in a cluster system when some nodes cannot reach each other over the network, causing the system to split. The resulting smaller clusters each provide services according to their own state, so the original cluster responds inconsistently, resulting in nodes competing for resources, system chaos, and data corruption.
Monitoring and Alerting
- Service Monitoring
The main purpose of service monitoring is to accurately and quickly detect problems when a service is having issues or is about to have issues, in order to minimize the impact. Service monitoring generally uses multiple methods, which can be categorized by level:
System level (CPU, network status, I/O, machine load, etc.)
Application level (process status, error logs, throughput, etc.)
Business level (service/interface error codes, response time)
User level (user behavior, public opinion monitoring, front-end tracking)
- Full Link Monitoring
Service Probing: Service probing is a monitoring method to detect the availability of a service (application). It periodically probes the target service from probing nodes, mainly measured by availability and response time. Probing nodes are usually distributed in multiple locations.
Node Probing: Node probing is a monitoring method used to discover and track network availability and connectivity between different data center nodes. It is mainly measured by response time, packet loss rate, and hop count. Probing methods are typically ping, mtr, or other private protocols.
Alert Filtering: Filtering out certain predictable alerts so they are not included in alert statistics, such as occasional HTTP 500 errors caused by crawler visits, custom exception information from business systems, etc.
Alert Deduplication: Once an alert is sent to the person in charge, they will not receive the same alert again until the alert is resolved.
Alert Suppression: To reduce interference from system jitter, suppression is also needed. For example, a momentary high load on a server may be normal; only a sustained high load requires attention.
Alert Recovery: Developers/operations personnel not only need to receive alert notifications but also need to be notified when the fault is eliminated and the alert is restored to normal.
Alert Merging: Merging multiple identical alerts generated at the same time. For example, if a microservice cluster has multiple sub-services with high load alerts at the same time, they should be merged into one alert.
Alert Convergence: Sometimes when an alert is generated, it is often accompanied by other alerts. In this case, only the alert for the root cause may be sent, while other alerts are converged as sub-alerts and sent together. For example, a CPU load alert on a cloud server is often accompanied by availability alerts for all systems running on it.
Fault Self-healing: Real-time detection of alerts, pre-diagnosis analysis, automatic fault recovery, and integration with surrounding systems to achieve a closed-loop process.
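Suppression, deduplication, and recovery from the list above can be combined in a small state machine. The sketch below is one possible reading of those rules, not a description of any particular monitoring product; `LoadAlerter` and its parameters are hypothetical.

```python
class LoadAlerter:
    """Fires once when load stays above a threshold for N consecutive samples."""

    def __init__(self, threshold: float, sustained_samples: int):
        self.threshold = threshold
        self.sustained_samples = sustained_samples
        self._streak = 0
        self._firing = False

    def observe(self, load: float):
        """Return 'alert', 'recovered', or None for one new sample."""
        if load > self.threshold:
            self._streak += 1
            # Suppression: short bursts are ignored. Deduplication: nothing
            # is re-sent while the alert is already firing.
            if self._streak >= self.sustained_samples and not self._firing:
                self._firing = True
                return "alert"
            return None
        self._streak = 0
        if self._firing:
            self._firing = False
            return "recovered"  # recovery notification
        return None


alerter = LoadAlerter(threshold=0.9, sustained_samples=3)
events = [alerter.observe(x) for x in [0.95, 0.5, 0.95, 0.96, 0.97, 0.98, 0.4]]
```

In this run the isolated first spike is suppressed, the third consecutive high sample raises a single alert, the fourth is deduplicated, and the final low sample produces the recovery notice.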
Service Governance
- Microservices
Microservice architecture is an architectural pattern that advocates dividing a single application into a set of small services. Services coordinate and cooperate with each other to provide ultimate value to users. Each service runs in its own independent process, and services communicate with each other through lightweight communication mechanisms (usually HTTP-based RESTful APIs). Each service is built around a specific business and can be independently deployed to production environments, staging environments, etc.
- Service Discovery
Service discovery uses a registry to record information about all services in a distributed system so that other services can quickly find these registered services. Service discovery is the core module supporting large-scale SOA and microservice architectures and should strive to be highly available.
- Traffic Smoothing / Peak Shaving
If you observe the request monitoring curve of a lottery or flash sale system, you will see a peak during the activity's open period. When the activity is not open, the system's request volume and machine load are generally stable. To save machine resources, it is not possible to always provide the maximum resource capacity to support short-term peak requests. Therefore, technical methods are needed to smooth out instantaneous request peaks, keeping the system's throughput under control during peak loads. Peak shaving can also be used to eliminate spikes, making server resource utilization more balanced and efficient. Common peak shaving strategies include queues, rate limiting, layered filtering, and multi-level caching.
- Version Compatibility
When upgrading versions, it is necessary to consider whether the new data structure can understand and parse old data, and whether the newly modified protocol can understand the old protocol and handle it appropriately as expected. This requires designing services with version compatibility in mind.
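As an illustration of reading old data with new code, the sketch below upgrades records on the fly. The record layout, the `version` field, and `parse_user` are all invented for the example.

```python
import json


def parse_user(raw: str) -> dict:
    """Parse a user record written in either the old (v1) or new (v2) format."""
    data = json.loads(raw)
    version = data.get("version", 1)  # v1 records carry no version field
    if version == 1:
        # Hypothetical change: v1 stored a single "name"; v2 splits it in two.
        first, _, last = data["name"].partition(" ")
        return {"version": 2, "first_name": first, "last_name": last}
    if version == 2:
        return data
    raise ValueError(f"unsupported version: {version}")
```

Treating a missing version field as the oldest format lets records written before versioning existed keep working.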
- Overload Protection
Overload means that the current load has exceeded the system's maximum processing capacity. Overload can cause some services to become unavailable, and if not handled properly, it can easily lead to total service unavailability or even a cascade failure. Overload protection is a measure taken for such abnormal situations to prevent complete service unavailability.
- Circuit Breaker
The circuit breaker function is similar to a household fuse. When a service becomes unavailable or responds with a timeout, in order to prevent a cascade failure of the entire system, calls to that service are temporarily stopped.
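A minimal circuit breaker might look like the sketch below. The thresholds, the `clock` parameter (added so the behavior can be tested without waiting), and the class names are assumptions; production libraries add per-state metrics, half-open call limits, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("downstream call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a service that is known to be unhealthy.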
- Service Degradation
Service degradation is the process of strategically degrading some services and pages based on the current business situation and traffic when the server is under severe pressure, thereby releasing server resources to ensure the normal operation of core tasks. Degradation often specifies different levels, and different handling is performed for different exception levels.
By service method: It can reject service, delay service, or sometimes randomly serve.
By service scope: It can cut off a certain function or cut off certain modules.
In short, service degradation requires different degradation strategies based on different business needs. The guiding idea is that a degraded service is better than no service at all.
- Circuit Breaker vs. Service Degradation
Similarities:
Same goal: both are motivated by availability and reliability and aim to prevent system crashes;
Similar user experience: ultimately, the user finds that some functions are temporarily unavailable.
Differences:
Different triggers: circuit breaking is generally caused by a fault in a specific (downstream) service, while service degradation is generally considered from an overall load perspective.
- Rate Limiting
Rate limiting can be considered a form of service degradation. It restricts the system's input and output traffic in order to protect the system. Generally, a system's throughput can be estimated. To ensure stable operation, once traffic reaches the configured threshold, it must be restricted by measures such as delayed processing, rejection, or partial rejection.
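The token bucket is one widely used rate limiting algorithm; the text above does not name a specific one, so treat this as just one option. In the sketch below, `TokenBucket` and the injectable `clock` parameter are illustrative choices.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is over the limit."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # the caller can reject or delay the request
```

What to do when `allow()` returns `False` is a policy decision: rejecting immediately, queueing for delayed processing, or rejecting only part of the traffic all fit the description above.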
- Fault Isolation / Shielding
Removing faulty machines from the cluster to ensure that new requests are not dispatched to those faulty machines.
Testing Methods
- Black Box / White Box Testing
Black box testing does not consider the internal structure and logic of the program. It is mainly used to test whether the system functions meet the requirements specification. Generally, there is an input value, an output value, and comparison with an expected value.
White box testing is mainly applied in the unit testing phase. It is a code-level test targeting the internal logic structure of the program. Testing methods include: statement coverage, decision coverage, condition coverage, path coverage, condition combination coverage.
- Unit / Integration / System / Acceptance Testing
Software testing is generally divided into 4 stages: unit testing, integration testing, system testing, acceptance testing.
Unit Testing: Unit testing checks and verifies the smallest verifiable unit in the software, such as a module, a procedure, a method, etc. Unit testing has the smallest granularity, usually performed by the development team using white box methods, mainly to test whether the unit conforms to the "design."
Integration Testing: Also called assembly testing. Usually based on unit testing, it involves orderly and incremental testing of all program modules. Integration testing sits between unit testing and system testing, acting as a "bridge." It is usually performed by the development team using both white box and black box methods, verifying both "design" and "requirements."
System Testing: System testing takes the software that has passed integration testing, combines it with other parts of the computer system, and performs a series of strict and effective tests in the actual operating environment to discover potential problems in the software and ensure the system runs normally. System testing has the largest granularity, usually performed by an independent testing group using black box methods, mainly to test whether the system conforms to the "requirements specification."
Acceptance Testing: Also called delivery testing, it is a formal test targeting user requirements and business processes to determine whether the system meets acceptance criteria. The user, client, or other authorized body decides whether to accept the system. Acceptance testing is similar to system testing, the main difference being the testers; acceptance testing is performed by the user.
- Regression Testing
After defects are found and fixed, or new features are added to the software, re-testing is done to check that the defects have been corrected and that the changes have not introduced new problems.
- Smoke Testing
This term originates from the hardware industry. After making a change or repair to a hardware component, power is applied directly. If there is no smoke, the component passes the test. In software, the term "smoke testing" describes the process of verifying code changes before integrating them into the product's source tree.
Smoke testing is a strategy for rapid basic functionality verification of a software version package during the software development process. It is a means of confirming and verifying basic software functions, not an in-depth test of the version package.
For example: smoke testing a login system only requires testing the core function of logging in with the correct username and password. Input fields, special characters, etc., can be tested after the smoke test.
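The login example might look like the following sketch. The in-memory `USERS` table and the `login` function are stand-ins invented for illustration; a real smoke test would call the deployed build.

```python
# Hypothetical in-memory user store standing in for the real system.
USERS = {"alice": "s3cret"}


def login(username: str, password: str) -> bool:
    """Hypothetical login check."""
    return username in USERS and USERS[username] == password


def smoke_test() -> bool:
    # Only the core path is checked: a correct username and password.
    # Special characters, empty fields, etc. are left to later testing.
    return login("alice", "s3cret")


if not smoke_test():
    raise SystemExit("smoke test failed: build rejected")
```

If the smoke test fails, the build is rejected before any deeper testing is attempted.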
- Performance Testing
Performance testing uses automated testing tools to simulate various normal, peak, and abnormal load conditions to test the system's performance indicators. Load testing and stress testing are both types of performance testing and can be combined.
Through load testing, the system's performance under various workloads is determined. The goal is to observe how performance indicators change as the load gradually increases.
Stress testing determines the system's maximum service level by identifying a bottleneck or the point at which performance becomes unacceptable.
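A load test can be sketched as a loop that raises concurrency step by step and records throughput and latency at each level. Everything here is an assumption for illustration: `handle_request` simulates work with a short sleep, and real tests would normally use a dedicated load-testing tool against the actual service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request() -> float:
    """Stand-in for a real request; returns the observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.005)  # simulated work
    return time.perf_counter() - start


def run_load_step(concurrency: int, requests: int) -> dict:
    """Send `requests` requests using `concurrency` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: handle_request(), range(requests)))
    elapsed = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "throughput": requests / elapsed,  # requests per second
        "avg_latency": sum(latencies) / len(latencies),
    }


def load_test(levels=(1, 2, 4, 8), requests_per_level=40) -> list:
    # Gradually increase the load and record how the indicators change.
    return [run_load_step(level, requests_per_level) for level in levels]


if __name__ == "__main__":
    for step in load_test():
        print(step)
```

A stress test would keep raising the levels until throughput stops improving or latency crosses an agreed limit; that point marks the bottleneck.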
- Benchmark Testing
Benchmark testing is also a performance testing method. It measures the peak performance that a machine's hardware actually achieves and quantifies the gains from software optimization. It can also be used to identify CPU or memory inefficiencies in a section of code. Many developers use benchmarks to compare different concurrency patterns or to tune the number of workers in a pool to maximize system throughput.
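As a small sketch, the standard-library `timeit` module can compare two implementations of the same task. The two string-building functions are hypothetical examples, and the measured times will vary by machine.

```python
import timeit


def concat_with_plus(n: int) -> str:
    result = ""
    for i in range(n):
        result += str(i)
    return result


def concat_with_join(n: int) -> str:
    return "".join(str(i) for i in range(n))


def benchmark(func, n: int = 1000, repeat: int = 5, number: int = 100) -> float:
    """Return the best observed time per call, in seconds."""
    times = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    # The minimum is the run least disturbed by other system activity.
    return min(times) / number


if __name__ == "__main__":
    for func in (concat_with_plus, concat_with_join):
        print(f"{func.__name__}: {benchmark(func) * 1e6:.1f} microseconds per call")
```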
- A/B Testing
A/B testing compares two or more randomly assigned sample groups of similar size. If the experimental group and the control group show a statistically significant difference in the target metric, it is reasonable to conclude that the feature given to the experimental group caused the difference, which helps validate hypotheses or make product decisions.
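One common way to check for a statistically significant difference in conversion rates is a two-proportion z-test, sketched below. The choice of test and the sample numbers are assumptions made for illustration; the article does not prescribe a specific method.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Group A is the control group, group B the experimental group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("conversion rates of 0% or 100% give no variance to test")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up sample: 200 of 2000 control users convert, 260 of 2000 experimental.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
significant = p_value < 0.05  # 0.05 is a conventional, not mandatory, cutoff
```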
- Code Coverage Testing
Code coverage is a measure in software testing that describes the proportion and extent to which the source code of a program is tested. The resulting proportion is called the code coverage rate. When doing unit testing, code coverage is often used as a metric for test quality. Sometimes, code coverage is used to assess the completion of testing tasks, for example, requiring code coverage to reach 80% or 90%. As a result, testers spend effort designing test cases to cover the code.
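A coverage requirement such as 80% usually ends up as a simple gate over a coverage report. The sketch below assumes a hypothetical report that maps each file name to a pair of (covered lines, total lines); real reports come from a coverage tool.

```python
def coverage_rate(covered_lines: int, total_lines: int) -> float:
    if total_lines == 0:
        return 1.0  # assumption: nothing to cover counts as fully covered
    return covered_lines / total_lines


def meets_threshold(report: dict[str, tuple[int, int]], threshold: float = 0.8) -> bool:
    """Check the overall line coverage of the whole report against a threshold."""
    covered = sum(c for c, _ in report.values())
    total = sum(t for _, t in report.values())
    return coverage_rate(covered, total) >= threshold


# Made-up numbers: 75 of 100 lines covered overall.
report = {"orders.py": (45, 50), "billing.py": (30, 50)}
passes_gate = meets_threshold(report, threshold=0.8)
```

A high rate only shows that lines were executed, not that their results were checked, so the number is best read alongside the quality of the assertions.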
Release and Deployment
- DEV / PRO / FAT / UAT
DEV (Development environment): Used by developers for debugging. Version changes are frequent.
FAT (Feature Acceptance Test environment): Used by software testers for testing.
UAT (User Acceptance Test environment): Used for functional verification in a production-like environment. Can serve as a pre-release environment.
PRO (Production environment): The formal online environment.
- Canary Release / Gray Release
Gray release (also called canary release) is a version upgrade strategy in which a portion of users, selected through region-based control, whitelist control, or similar means, receive the new product features first while the rest remain on the old version. If the upgraded users report no issues after a period of time, the scope is gradually expanded until all users have the new version. Gray release protects the stability of the overall system: problems can be discovered and corrected during the initial gray phase, which limits their impact.
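One possible way to pick the users for a gray phase is a whitelist combined with a stable hash bucket, as in the sketch below. The whitelist entry, the bucket count of 100, and the function names are all assumptions for illustration.

```python
import hashlib

# Hypothetical whitelist of users who always receive the new version.
WHITELIST = {"internal-tester-1"}


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_version(user_id: str, rollout_percent: int, whitelist=WHITELIST) -> bool:
    if user_id in whitelist:
        return True
    # Users whose bucket falls below the rollout percentage get the new version.
    return bucket_for(user_id) < rollout_percent
```

Because the bucket depends only on the user ID, raising `rollout_percent` from 5 to 20 to 100 widens the audience without moving anyone back to the old version.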
- Rollback
Rollback refers to restoring a program or its data to the last known correct state (or the previous stable version) when a program error or data processing error occurs.
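A minimal sketch of rollback, assuming a hypothetical `Deployment` class that keeps a history of released versions and restores the previous one when a health check fails:

```python
class Deployment:
    """Hypothetical deployment tracker that remembers released versions."""

    def __init__(self, initial_version: str) -> None:
        self._history = [initial_version]

    @property
    def current(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        """Restore the previous version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.current

    def release(self, version: str, health_check) -> bool:
        """Switch to `version`; roll back automatically if it is unhealthy."""
        self._history.append(version)
        if not health_check(version):
            self.rollback()
            return False
        return True


deployment = Deployment("v1.0")
deployment.release("v1.1", health_check=lambda version: True)
deployment.release("v1.2", health_check=lambda version: False)  # fails, so it is rolled back
```

After the failed release the tracker is back on `v1.1`, the last version that passed its health check.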