DEV Community

Jing

Best Practice for Production Traffic Replication in Trip.com

Intro

AREX is an open-source API testing platform developed by Trip.com using the concept of production traffic replication. It focuses on the construction of core linkages for recording and playback, evolving from basic solution architecture to deep implementation verification across the corporation’s core business lines. Through continuous iterations and optimizations amidst the group’s complex business scenarios, AREX has accumulated a vast amount of experience and achieved tangible results. Since its adoption by Trip.com, over 4000 applications have been integrated with AREX, leading to improved delivery rates and a reduction in defects.

This article primarily discusses the range of challenges encountered and the solutions devised during the implementation of AREX within Trip.com, as well as how to utilize AREX to rapidly deploy a one-stop traffic recording and playback solution to reduce integration costs and accelerate implementation.

What is production traffic replication and replay?

Production traffic replication and replay is a technique that captures network traffic in one environment, typically production, and then replays that traffic in another environment.

It shows promise for performance, regression, and automated testing. It gives technology teams a way to cope with complex business logic and system architectures while keeping systems stable and making the R&D process more efficient.

However, deploying technological solutions is not without challenges. Teams often encounter difficulties in infrastructure development, disproportionate initial investment costs relative to early returns, and ambiguities surrounding practical application scenarios.

Sharp tools make good work

The most common open-source approach to production traffic replication is to customize and extend Jvm-Sandbox-Repeater. The core principle is to record real traffic in production and then play it back in the test environment to verify the correctness of the code logic. You may ask: since mature solutions exist, why are we still “reinventing the wheel”?

Firstly, the components supported by JVM Sandbox are limited and cannot meet the needs of the middleware and frameworks widely used within Trip.com. Additionally, its support for JDK internals is incomplete: for example, passing context across asynchronous threads requires other third-party components.

Moreover, while Jvm-Sandbox-Repeater provides basic recording and playback functionality, to build a comprehensive regression testing platform, we also need a robust backend support system responsible for data collection, storage, and comparison tasks.

Lastly, sparse official documentation and low community activity meant we risked not getting timely support for subsequent custom development.

So we decided to independently develop the traffic recording and replay tool:

  1. Support recording and replay for more middleware and components, and simulate a variety of complex business cases, such as local caching, the current time, and so on.
  2. As a comprehensive solution, it also comes equipped with a complete set of supporting facilities, including a frontend interface, playback service, and report analysis, to achieve an all-in-one workflow from traffic collection and playback to comparison verification and report generation (as shown in the figure below).

img

Next, we will delve into the challenges encountered during the implementation process, targeted solutions, and internal application examples within our group, in the hope of providing you with substantial help and guidance.

Challenges Faced in Development

How to capture traffic in cross-thread and asynchronous cases

First, to be clear, the captured traffic we refer to here is all the data involved in an API request: not only the main endpoint but also the internal requests and responses of frameworks such as MyBatis, Redis, Dubbo, and more.

However, many projects in our company use thread pools and asynchronous programming. For example, in a single request, the main process may fork multiple subtasks/threads to work in parallel. Some tasks may query Redis, some may call RPC interfaces, and others may perform database operations to fulfill different business scenarios. This also involves significant thread switching at the underlying level.

To ensure that operations executed in different threads within a single request are all captured, we use trace propagation. This involves decorating various thread pools and asynchronous frameworks so that a recordId is passed between threads, linking their operations into one complete recorded test case. For example, in Java, this can be achieved by decorating CompletableFuture, ThreadPoolExecutor, and ForkJoinPool, as well as third-party thread pools used by Tomcat, Jetty, and Netty, and asynchronous frameworks such as Reactor and RxJava.
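The decoration idea can be sketched in plain Java. This is a minimal illustration, not the AREX implementation: RecordContext, wrap, and TracePropagationDemo are hypothetical names, and the task is wrapped explicitly here, whereas an agent would apply the same wrapping through bytecode instrumentation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical holder for the current request's recordId (not the AREX API).
final class RecordContext {
    private static final ThreadLocal<String> RECORD_ID = new ThreadLocal<>();

    static void set(String recordId) { RECORD_ID.set(recordId); }
    static String get() { return RECORD_ID.get(); }
    static void clear() { RECORD_ID.remove(); }

    // Capture the submitter's recordId now and install it in the worker thread when the task runs.
    static Runnable wrap(Runnable task) {
        final String captured = get();
        return () -> {
            final String previous = get();
            set(captured);
            try {
                task.run();
            } finally {
                // Restore the pooled thread's previous state so ids never leak between requests.
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }
}

final class TracePropagationDemo {
    // Runs a task on a pool thread and returns the recordId that the task observed.
    static String recordIdSeenByWorker(String recordId) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        RecordContext.set(recordId);
        try {
            final String[] seen = new String[1];
            Future<?> done = pool.submit(RecordContext.wrap(() -> seen[0] = RecordContext.get()));
            done.get(); // wait for the worker; also makes its write to seen[0] visible here
            return seen[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            RecordContext.clear();
            pool.shutdown();
        }
    }
}
```

Without the wrap call, the worker thread would see no recordId at all, because a ThreadLocal value does not follow a task into a pooled thread.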

How to avoid writing dirty data to the database during replay?

For example, in critical scenarios such as writing orders to the database or calling third-party payment interfaces, it is necessary to use “mock” data during traffic replay to avoid actual data interactions. This approach prevents the generation of unnecessary data during testing and avoids disrupting normal business processes.

To achieve this, framework calls need to be intercepted and mocked, using recorded data instead of real data requests. This ensures that no actual external interactions, such as database writes or third-party service calls, occur during the testing process, effectively preventing the writing of dirty data during replay testing.
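As a rough sketch of intercepting a call and answering it from recorded data, the example below uses a JDK dynamic proxy in place of the bytecode instrumentation a Java agent performs. OrderRepository and ReplayMock are hypothetical names, and the in-memory map stands in for whatever storage holds the recordings.

```java
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical dependency that would normally write to a database.
interface OrderRepository {
    String save(String orderNo);
}

final class ReplayMock {
    enum Mode { RECORD, REPLAY }

    // Recorded responses keyed by method name plus arguments (an in-memory stand-in for real storage).
    private final Map<String, Object> recorded = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T wrap(Class<T> type, T real, Mode mode) {
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
            (proxy, method, args) -> {
                String key = method.getName() + Arrays.toString(args);
                if (mode == Mode.REPLAY) {
                    // Replay: answer from the recording and never touch the real dependency.
                    return recorded.get(key);
                }
                // Record: call through and remember the response for later replay.
                Object result = method.invoke(real, args);
                recorded.put(key, result);
                return result;
            });
    }
}
```

Wrapping the same repository once in RECORD mode and once in REPLAY mode yields identical return values, while the real save method runs only during recording. A fuller version would also need to handle calls that have no recorded response.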

The AREX Java Agent supports most common open-source frameworks, such as Spring, Dubbo, Redis, and MyBatis. The complete list is shown below:

img

Replay response differs from the recorded one, but not caused by a code bug

1. Login authentication and token expiration

During the actual process of traffic replay, we often encounter an issue: many web applications implement login authentication checks before accessing their interfaces. If the authentication fails or the login token has expired, the interface access will be denied, resulting in a large number of test cases failing during replay, which means replay response differs from the recorded one. While it is possible to address some of these issues by configuring a whitelist, we are seeking a more general solution.

The ideal solution would be to mock authentication frameworks such as Spring Security, Apache Shiro, JWT, etc., during the replay process. This would bypass the authentication and token verification steps, ensuring that the interfaces can be executed smoothly in the replay environment.

2. Time inconsistencies cause payment timeouts

If the current time differs between recording and replay, timeout logic may behave differently. For example, to decide whether an unpaid order has timed out, we often use the condition “currentTime - orderCreateTime > 30 minutes”. If the order had not timed out during recording but replay runs half an hour later, the change in the system’s current time may mistakenly trigger the payment timeout logic.

To address this, we record the current time once during recording. During replay, we mock the classes that supply the current time, such as Date, Calendar, LocalTime, and joda.time, so that the current time seen during replay is the time captured at recording. This keeps time-dependent logic consistent between recording and replay and makes the test results reliable.
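The effect of pinning the clock can be illustrated with java.time.Clock. Note the swapped-in technique: this sketch injects a fixed Clock explicitly, whereas the approach described above mocks the time classes themselves so application code needs no changes. PaymentTimeout and its method names are hypothetical.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

final class PaymentTimeout {
    static final Duration LIMIT = Duration.ofMinutes(30);

    // The clock is a parameter so that replay can pin "now" to the recorded instant.
    static boolean isTimedOut(Instant orderCreateTime, Clock clock) {
        return Duration.between(orderCreateTime, clock.instant()).compareTo(LIMIT) > 0;
    }

    // A clock that always returns the instant captured once during recording.
    static Clock replayClock(Instant recordedNow) {
        return Clock.fixed(recordedNow, ZoneOffset.UTC);
    }
}
```

With an order created at 10:00 and a recorded time of 10:10, isTimedOut stays false under replayClock no matter when the replay actually runs, while a real clock reading 11:00 would flip the result to true.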

3. Local cache

In applications, it is common practice to optimize performance by storing frequently accessed data in local caches for faster retrieval. However, in traffic recording and replay scenarios, the behavior of the cache can introduce variations in the replay results.

During recording, if the requested data is already cached, the system retrieves it directly from the cache, avoiding queries to the database or external interfaces. However, in the replay environment, without preloaded cached data, the same request may trigger database queries or external interface calls, resulting in new calls and causing replay failures.

To address this, we have added support for popular caching frameworks such as Guava Cache and Caffeine Cache, so that cache requests during replay return the results expected from the recorded state and do not trigger unnecessary new calls.

For cases where custom caching frameworks are used, the AREX platform provides flexible configuration options that allow adaptation through dynamic classes. This means that even non-standard caching implementations can be compatible with the AREX platform and correctly support traffic replay.

Challenges Faced in Implementation

1. Installation and deployment should be simple, convenient, and easy to get started with

AREX is a comprehensive solution that includes not only the core recording and replay functionality but also complementary services such as frontend, scheduling, report analysis, and storage. Following the principle of out-of-the-box usability and quick integration, we offer multiple deployment options, including one-click deployment, non-container deployment, and private cloud deployment. Once installed, you only need to configure some basic parameters to automatically capture traffic and replay it to compare differences.

img

img

In addition, AREX also supports a standalone mode, which allows you to quickly get started and experience the platform without the need for installation on your local machine.

2. Complying with the company’s risk control and data security requirements

When recording real production traffic, data masking rules must be applied to sensitive and private data, especially where information security or commercially sensitive data is involved. Masking transforms the data into an altered form that prevents unauthorized access and protects privacy.

img

We have chosen to perform data masking during the data storage process to ensure the security of sensitive information. The specific implementation involves using the SPI (Service Provider Interface) mechanism to load external JAR packages and dynamically load encryption methods.

By utilizing the SPI mechanism, we can extend the functionality of the data storage process by loading custom encryption modules from external JAR packages. These modules can then be dynamically loaded and applied to the sensitive data before it is stored in the database.
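A minimal sketch of the SPI pattern using the JDK's java.util.ServiceLoader. DataMasker, MaskingStorage, and the fallback rule are hypothetical and do not reflect AREX's actual extension interface. An external JAR would provide an implementation and register its class name in a META-INF/services file named after the interface.

```java
import java.util.ServiceLoader;

// Hypothetical extension point implemented by an external JAR.
interface DataMasker {
    String mask(String value);
}

final class MaskingStorage {
    // Fallback used when no provider is registered: keep only the last four characters.
    static final DataMasker KEEP_LAST_FOUR = value ->
        value.length() <= 4
            ? "****"
            : "*".repeat(value.length() - 4) + value.substring(value.length() - 4);

    static DataMasker loadMasker() {
        // ServiceLoader discovers implementations registered on the classpath at runtime.
        for (DataMasker masker : ServiceLoader.load(DataMasker.class)) {
            return masker; // the first registered provider wins in this sketch
        }
        return KEEP_LAST_FOUR;
    }

    // Mask the value before it is handed to the storage layer.
    static String prepareForStorage(String sensitiveValue) {
        return loadMasker().mask(sensitiveValue);
    }
}
```

Because the masker is resolved at runtime, the masking rule can be changed by swapping the external JAR without rebuilding the storage service.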

img

3. Improve user experience and identify issues quickly

In actual usage, there is a huge number of test cases for recording and replaying. To reduce the workload of users when analyzing differences, we have implemented aggregation for test cases with the same differences. This speeds up the process of troubleshooting and issue identification.

img

With call chains, it is possible to quickly identify the scope of the problem and reduce interference by removing noise nodes such as timestamps, UUIDs, and IP addresses. This helps to minimize distractions and focus on the relevant information for issue identification.
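The two ideas in this section, ignoring noise fields and aggregating cases that fail in the same way, can be sketched as follows. This is an assumption-laden illustration: responses are modeled as flat maps, and DiffAggregator and its methods are hypothetical names, not the AREX comparison engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class DiffAggregator {
    // Returns the keys whose values differ, skipping configured noise keys such as timestamps.
    static Set<String> differingKeys(Map<String, ?> recorded, Map<String, ?> replayed, Set<String> noiseKeys) {
        Set<String> allKeys = new TreeSet<>(recorded.keySet());
        allKeys.addAll(replayed.keySet());
        Set<String> diff = new TreeSet<>();
        for (String key : allKeys) {
            if (!noiseKeys.contains(key) && !Objects.equals(recorded.get(key), replayed.get(key))) {
                diff.add(key);
            }
        }
        return diff;
    }

    // Groups failing case ids by their set of differing keys so identical failures are reviewed once.
    static Map<Set<String>, List<String>> groupByDifference(Map<String, Set<String>> diffByCaseId) {
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        diffByCaseId.forEach((caseId, diff) -> {
            if (!diff.isEmpty()) { // cases without differences passed and are not grouped
                groups.computeIfAbsent(diff, k -> new ArrayList<>()).add(caseId);
            }
        });
        return groups;
    }
}
```

If a hundred cases all differ only in a price field, they collapse into a single group, so one investigation covers all of them.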

When production issues in complex business applications are hard to reproduce locally, AREX also supports local debugging to quickly troubleshoot problems.

4. Is AREX mature, secure, and reliable?

AREX is based on Java Agent technology and uses the mature bytecode modification framework ByteBuddy. It is secure, stable, and features code isolation and self-protection mechanisms. It automatically reduces the data collection frequency, or disables collection entirely, when the system is busy. Moreover, it has been running steadily within Ctrip Group for over two years and has been thoroughly validated in production environments.
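The self-protection behavior can be pictured as a load-aware sampling rule. The sketch below is only an illustration: AdaptiveSampler and the 70% and 90% thresholds are assumptions chosen for the example, not values taken from AREX.

```java
final class AdaptiveSampler {
    // Returns how many of every 100 requests to record for a CPU load between 0.0 and 1.0.
    static int recordingsPerHundred(double cpuLoad, int normalRate) {
        if (cpuLoad >= 0.9) {
            return 0;              // very busy: disable recording entirely
        }
        if (cpuLoad >= 0.7) {
            return normalRate / 4; // busy: record far less often
        }
        return normalRate;         // normal load: record at the configured rate
    }
}
```

Keeping the decision in a pure function like this makes the throttling policy easy to test separately from whatever metric source supplies the load figure.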

Best Practice

Currently, the AREX traffic recording and replay platform has been integrated as a standalone option into the company’s CI/CD system.

  1. The initial onboarding process for AREX: When users onboard the traffic recording and replay feature for the first time, they simply need to select the Flight AREX Agent service in the CI Pipeline. This ensures that when the application is packaged into an image, the AREX startup script, arex-agent.sh, is included in the release package.
  2. Deployment and Agent loading: During application deployment, the startup script first pulls the latest arex-agent.jar and then attaches the AREX Agent by modifying the JVM options (-javaagent:/arex-agent.jar).
  3. Version Control and Canary Release: When the startup script runs, it pulls the arex-agent.jar version that corresponds to the application’s AppId, which enables canary release and on-demand loading. For example, only certain specific applications will load the new Agent features.

img

Setting up the first replay is similarly simple:

1. Create a Pipeline: In GitLab or Jenkins, create a pipeline whose ArexTest job script calls the playback URL provided by AREX, and schedule the pipeline to run at regular intervals.

img

2. Automatically triggered traffic replay: Traffic replay is triggered automatically after developers commit code.

3. Push the replay report and control release: After replay, AREX pushes metrics such as the number of test cases, pass rate, and failure rate to the relevant personnel for statistical analysis. The code is allowed to be deployed to the production environment only when the pass rate meets the predetermined criteria.
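The release control in the last step amounts to a simple gate on the pass rate. A minimal sketch, assuming the pipeline already has the passed and total case counts from the replay report. ReplayGate and the threshold parameter are hypothetical names.

```java
final class ReplayGate {
    // Pass rate as a fraction in [0, 1]; an empty run counts as 0 so it cannot pass the gate by accident.
    static double passRate(int passed, int total) {
        if (passed < 0 || total < 0 || passed > total) {
            throw new IllegalArgumentException("invalid counts: passed=" + passed + ", total=" + total);
        }
        return total == 0 ? 0.0 : (double) passed / total;
    }

    // The release may proceed only when the pass rate reaches the required threshold.
    static boolean allowsRelease(int passed, int total, double requiredPassRate) {
        return passRate(passed, total) >= requiredPassRate;
    }
}
```

A pipeline step could call allowsRelease with the report's counts and fail the job when it returns false, which blocks the deployment stage.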

Results

Under the continuous iteration and optimization in complex business scenarios within the group, the AREX platform has accumulated a wealth of experience and achieved visible results. Since its implementation in Trip.com, it has been adopted by over 4000 applications, resulting in improved delivery rates and reduced defect counts.

img

Embracing Open Source

After long-term stable operation and verification of its reliability within Trip.com, we open-sourced the AREX platform in 2023, with the aim of helping more enterprises efficiently and cost-effectively implement traffic recording and replay technology solutions. You can find the open-source project in the AREX GitHub repository.

In the past year, we have also been committed to building an open-source community. Currently, there are thousands of external users who have adopted and are using AREX, and we have received positive feedback from these users.

Conclusion

AREX is dedicated to ensuring quality, reducing costs, and improving efficiency while meeting the demands of rapid iteration. This vision has been validated through the practices of Trip.com and numerous open-source users, bringing significant business value.

Looking ahead, we will continue to rely on the active community to respond to and address user inquiries, continuously optimizing AREX. We sincerely invite every developer to join the community, try it out, and witness the growth and progress of AREX together.

Community⤵️

⭐ Star us on GitHub
🐦 Follow us on Twitter
📝 Join AREX Slack
📧 Join the Mailing List
