DBSubsetter is a tool I wrote for taking a logically consistent subset of a relational database. It is written in Scala and relies heavily on the excellent Akka Streams library. It is free and open source software, with the source code and documentation available here.
To use DBSubsetter, you provide it with a SQL “where clause” which specifies a set of rows from one or more tables. It then introspects the schema of your database and follows foreign key constraints, recursively fetching the “parents” and, optionally, the “children” of those rows.
This is mainly useful for being able to give each member of a team of developers their own, smaller copy of a production database for them to develop against, although there are some other uses for it as well including data archival, etc.
Consistent subsetting is a problem that is trickier to solve than it seems at first, especially if you want the solution to be highly performant. The following are a couple of interesting points about the project.
In order to achieve a “recursive” algorithm without resorting to actual recursion, DBSubsetter contains a “feedback loop” in Akka Streams. The main challenges of feedback loops are preventing deadlocks due to backpressure and getting the stream to properly terminate when all outstanding work is done. Getting the program to complete was an interesting exercise that turned out to be possible using a combination of
mapConcat graph stages. I wrote more about this here on the Akka User Mailing List.
It attempts to be vendor-agnostic and support MySQL, Postgres, and Microsoft SQL Server out of the box. This turns out to be possible but slightly difficult! One of the interesting parts of this project was using Docker to run the same tests against all three types of database, which often turned up issues present in just one of the three databases that would otherwise have gone unnoticed.