How Synthesized Can Help Populate Your Testcontainers Databases

synthesized.io
7 min read · Nov 17, 2022


If you have ever needed to run automated tests, it is likely that at some point you have run into Testcontainers. When this Java library was introduced in 2015 (it is now maintained by AtomicJar), it transformed test automation by allowing developers to run integration tests against real dependencies in a CI/CD pipeline or even on their own machines, sharply reducing the time needed to run tests.

Input data is vital to reliable integration testing. Test data often needs to be prepared before running test cases, and with growing volumes of data and tightening data compliance regulations, using copies of production data becomes expensive and sometimes simply impossible. With DevOps automation, there is a need for API-driven test data generation directly in a test environment as part of a CI/CD pipeline.

By combining Testcontainers with Synthesized’s Testing Data Kit (TDK), developers can populate any Testcontainers database with synthetically generated data, enabling rapid development of tests for logic that interacts with the database, while avoiding the need to develop and maintain large amounts of setup code.

How It Works

To illustrate how the integration of Synthesized TDK and Testcontainers works, we are going to consider a toy backend project built with a common technology stack: Java, Spring Boot, and the Spring JDBC module, with PostgreSQL as the database. Our toy application is going to automate the scheduling of talks and participants for a large tech conference, and hence store information about conferences, speakers, and their talks. The PostgreSQL database has a rather simple schema:
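(The original post shows a schema diagram at this point. The DDL below is a plausible reconstruction from the entities used later in the article; apart from public.talk and its status and feedback columns, the table names, column names, and types are assumptions.)

-- Reconstructed sketch of the demo schema; names and types are assumptions.
CREATE TABLE conference (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE speaker (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE talk (
    id            BIGSERIAL PRIMARY KEY,
    name          TEXT NOT NULL,
    conference_id BIGINT NOT NULL REFERENCES conference (id),
    status        TEXT NOT NULL,
    feedback      TEXT
);

-- Join table linking talks to their speakers.
CREATE TABLE talk_speaker (
    talk_id    BIGINT NOT NULL REFERENCES talk (id),
    speaker_id BIGINT NOT NULL REFERENCES speaker (id),
    PRIMARY KEY (talk_id, speaker_id)
);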

Despite its simplicity, the schema requires careful handling of the insertion order of records in order to satisfy all the foreign key constraints. If, for example, we need one entry in the talk table, we must first insert related entries in all the referenced tables. In real projects, creating test data is even more complex: many developers will recognize the situation where filling in an “Address” field first requires records in the “Street”, “City”, “Region”, and “Country” tables.

What approaches exist for preparing test data for integration tests? Traditionally, there have been two:

  • maintaining an initialization script which pre-populates the database with records, and
  • inserting test-specific records during the setup phase of each test using so-called ‘fixture factories’ (sketched below).
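For concreteness, a fixture factory for our schema might look like the following hypothetical helper (all the DAO methods and constructors here are assumptions, not the demo project’s API). Note how every foreign key forces us to persist a parent record first:

private Talk createTalk(String name, Status status) {
    // Hypothetical fixture factory: the foreign keys on talk force us to
    // create and persist the parent records before the talk itself.
    Conference conference = conferenceDao.insert(new Conference("Sample Conf"));
    Speaker speaker = speakerDao.insert(new Speaker("Jane Doe"));
    Talk talk = talkDao.insert(new Talk(name, conference, status));
    talkDao.addSpeaker(talk, speaker);  // populate the talk_speaker join table
    return talk;
}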

Both approaches require maintaining a significant amount of code, which takes time away from writing the actual tests. And as the database evolves, this code has to evolve with it, requiring yet more time. Now let’s look at how, instead of maintaining an initialization script, Synthesized TDK can populate all the tables in the database with records.

First, add a test dependency via Maven Central:

<dependency>
    <groupId>io.synthesized</groupId>
    <artifactId>tdk-tc</artifactId>
    <version>1.03</version>
    <scope>test</scope>
</dependency>

tdk-tc is a small MIT-licensed library which acts as a wrapper around the freely distributed version of Synthesized TDK running in a Docker container. It provides a simple class, SynthesizedTDK, which requires two JdbcDatabaseContainer instances to prepare a test database. If we are using SpringBootTest, we can make the preparation of the test database part of our @TestConfiguration.

Now we define a configuration to prepare the test database. Given that PostgreSQLContainer<?> input is an empty database with the schema deployed (you can obtain such a container using a simple DDL script or a migrations library such as Flyway or Liquibase) and PostgreSQLContainer<?> output is a completely empty database, you can produce a database pre-filled with random data using SynthesizedTDK (note that both containers are created in the same network):

private PostgreSQLContainer<?> getContainer(String name, boolean initData) {
    ...
}

network = Network.newNetwork();
input = getContainer("input", true);
output = getContainer("output", false);
Startables.deepStart(input, output).join();
new SynthesizedTDK()
    .transform(input, output,
        """
        default_config:
          mode: "GENERATION"
          target_row_number: 10
        tables:
          - table_name_with_schema: "public.talk"
            transformations:
              - columns: [ "status" ]
                params:
                  type: "categorical_generator"
                  categories:
                    type: string
                    values:
                      - "NEW"
                      - "IN_REVIEW"
                      - "ACCEPTED"
                      - "REJECTED"
                    probabilities:
                      - 0.25
                      - 0.25
                      - 0.25
                      - 0.25
        global_seed: 42
        """);

The first two arguments of the transform method are the input and output containers. The third is a YAML string containing the data generation parameters. Synthesized TDK requires a configuration, which is described in detail here. You can set up generation parameters for each specific table and field.

For example, in our scenario the library “doesn’t know” which values are appropriate for the status field, so we have to give it a hint by configuring a categorical generator (about twenty generators and maskers are available in total).

However, if you don’t provide a setup for some of the tables, Synthesized TDK will use reasonable defaults based on column data types and table-level constraints. Most likely, the example above without the tables section would still work for your database schema.
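For instance, under that assumption the whole configuration could shrink to just the defaults; the keys below are the same ones used in the example above:

default_config:
  mode: "GENERATION"
  target_row_number: 10
global_seed: 42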

Two essential parameters that we need to understand are:

  • target_row_number defines the desired number of records generated for each table.
  • global_seed is a seed for the random value generators. The result of generation will be the same each time generation is run with the same seed, schema, and workflow configuration.

Note that in order to have a non-empty test database, we no longer need a full insertion script! Just by choosing a constant RNG seed, we can be sure that the resulting data in the database will be the same each time SynthesizedTDK.transform(...) is executed.

When the output database is ready, we can create a data source to be used in the test, e.g. by declaring a respective @Bean in our @TestConfiguration:

@Bean
public DataSource dataSource() {
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl(output.getJdbcUrl());
    config.setUsername(output.getUsername());
    config.setPassword(output.getPassword());
    return new HikariDataSource(config);
}

Now let’s consider our method under test. As its name implies, it returns information about the talks of a given conference:

public Set<Talk> getTalksByConference(Conference conference)

We need a conference object as an input, and then we need to compare the result with some reference value.

Since our database is non-empty and consistent, we may use our own DAO classes to grab the first conference we come across:

// The object under test
@Autowired
private TalkDao dao;

// The object needed to get a conference
@Autowired
private ConferenceDao conferenceDao;

private Conference conference;

@BeforeEach
void init() throws SQLException {
    conference = conferenceDao.getConferences().iterator().next();
}

This is effectively the whole “arrange” part of the test. Since data generation is deterministic, the conference (and its relations to all the other objects) will be the same across runs of the test.

The “act” part is a one-liner:

@Test
void getTalksByConference() {
    // Act
    Set<Talk> talks = dao.getTalksByConference(conference);
    // Assert
    JsonApprovals.verifyAsJson(talks);
}

What about the “assert” part? As the generation is deterministic, we could figure out the actual properties of the returned Set<Talk> and then add assertions. However, there is a simpler way to do this with approval tests. In short, the Approvals library creates a snapshot of our Set<Talk> serialized as JSON and stores it in a text file in your test code folder. In our case, the output looks like this:

[
  {
    "id": 3,
    "name": "MnkVLBcSGJeelU190EZAwq",
    "conference": {
      "id": 5,
      "name": "9zx3i8oNspCHrkIhneNYG18"
    },
    "status": "NEW",
    "feedback": "RoopSXMfpkPYSNA1W4N",
    "speakers": [
      {
        "id": 2,
        "name": "c"
      }
    ]
  }
]

Judging from this file, we may conclude that our method indeed returns a set of talks with the “conference” and “speakers” properties set. The file should be committed to source control; it will be used each time the test runs to ensure that the result has not changed.

Note that writing the “arrange” and “assert” parts took little to no effort. Using Synthesized TDK, we can therefore significantly cut the time spent writing tests.

Of course, we can assert more on the returned value, e.g. check that the returned talks actually belong to the conference:

for (Talk talk : talks) {
    assertThat(talk.getConference()).isEqualTo(conference);
}

In the example above we just verified the method’s ability to extract data from the database, which doesn’t involve complex business logic.

We can also look at another interesting use case. Imagine that we have a service class TalkService that deals with the statuses of talks, and we want to check that no talk can be moved to the “rejected” status without attached feedback.
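The tests below also reference a service and a talk fixture that the article doesn’t declare; a plausible wiring, reusing the conference obtained earlier (the names here are illustrative):

// Assumed wiring for the status tests below.
@Autowired
private TalkService service;

private Talk talk;

@BeforeEach
void pickTalk() throws SQLException {
    // Generation is deterministic, so this is the same talk on every run.
    talk = dao.getTalksByConference(conference).iterator().next();
}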

For this test scenario we need a talk in a predefined state. We can simply pick any talk from the database and change its state to the desired one. Here we test that a talk with non-empty feedback can be moved to the “rejected” status:

@Test
void rejectInReviewWithFeedback() {
    // Arrange
    dao.updateTalk(talk.withStatus(Status.IN_REVIEW)
            .withFeedback("feedback"));
    // Act
    service.changeStatus(talk.getId(), Status.REJECTED);
    // Assert
    Assertions.assertThat(dao.getTalkById(talk.getId()).getStatus())
            .isEqualTo(Status.REJECTED);
}

In this test we verify that an attempt to reject a talk with empty feedback raises an exception:

@Test
void doNotRejectInReviewWithoutFeedback() {
    // Arrange
    dao.updateTalk(talk.withStatus(Status.IN_REVIEW)
            .withFeedback(""));
    // Act, Assert
    Assertions.assertThatThrownBy(() ->
            service.changeStatus(talk.getId(), Status.REJECTED))
            .hasMessageContaining("feedback");
    Assertions.assertThat(dao.getTalkById(talk.getId()).getStatus())
            .isEqualTo(Status.IN_REVIEW);
}

Conclusion

To try Synthesized TDK with Testcontainers, check out the demo project here. All code samples are taken from that project. Follow the README to run the demo tests.

To use the TDK as a standalone application, read the docs here. You can embed the TDK into your CI/CD pipeline to create test databases based on sample production databases, choosing between a number of techniques to prepare data:

  • Subsetting
  • Masking
  • Generation

A number of masking and generation methods are available; you can find them here.

Originally published at https://www.synthesized.io on November 17, 2022.
