When I moved to big tech the rules against doing this were honestly one of the biggest drivers of reduced velocity I encountered. Many, many bugs and customer issues are very data dependent and can’t easily be reproduced without access to the actual customer data.
Obviously I get why the rules against data access like that exist and yes, many companies have ways to get consent for this access but it tends to be cumbersome and last-resortish. I think it’s under-appreciated how much it slows down the real world progress of fixing customer-reported issues.
Uncontrolled access, inability to comply with "right to be forgotten" legislation, visibility of personal information, including purchases, physical locations, etc etc.
Of course sales, trading, inventory, etc data, even with no customer info is still valuable.
Attempts to anonymise are often incomplete, with various techniques to de-anonymise available.
Database separation, designed to make sure that certain things stay in different domains and can't be combined, also falls apart once both databases end up on your laptop.
Of course, any threat actor will be happy that prod data is available in dev environments, as security is often much lower in dev environments.
Caveat emptor.
I haven’t looked at the code too much (yet). I’d be curious to know how you’re handling some of the hairier edge cases when it comes to following foreign key constraints — things like circular dependencies and complex joins come to mind.
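For the circular-dependency case specifically, one common approach (not necessarily what this tool does — the table names and FK map below are hypothetical) is to treat the schema as a graph and do a breadth-first walk over outbound foreign keys, with a visited set so cycles like `users.best_friend_id -> users.id` terminate:

```python
from collections import deque

# Hypothetical in-memory model: each table maps pk -> row dict, and FKS
# maps (table, fk_column) -> referenced table. A real tool would read
# this from information_schema instead.
TABLES = {
    "users":  {1: {"id": 1, "best_friend_id": 2},
               2: {"id": 2, "best_friend_id": 1}},   # circular reference
    "orders": {10: {"id": 10, "user_id": 1}},
}
FKS = {
    ("users", "best_friend_id"): "users",
    ("orders", "user_id"): "users",
}

def slice_rows(start_table, start_pk):
    """Collect every row reachable via outbound FKs from a seed row.

    The visited set is what breaks circular dependencies: a row is
    expanded at most once, so users 1 <-> 2 doesn't loop forever.
    """
    visited = set()
    queue = deque([(start_table, start_pk)])
    while queue:
        table, pk = queue.popleft()
        if pk is None or (table, pk) in visited:
            continue
        visited.add((table, pk))
        row = TABLES[table][pk]
        for (fk_table, fk_col), ref_table in FKS.items():
            if fk_table == table and fk_col in row:
                queue.append((ref_table, row[fk_col]))
    return visited

# Slicing from a single order pulls in its user and, transitively,
# that user's (circularly linked) best friend.
slice_rows("orders", 10)
```

This only handles outbound references (the rows a seed row needs to be valid); deciding how far to chase *inbound* references is where subsetting tools tend to differ.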
I feel ok posting this because it’s archived, but this problem is basically what we designed for with Neosync [1]. It was probably the hardest feature to fully solve for the customers that needed it the most, which were the ones with the most complex data sets and foreign key dependencies.
It got to the point where it was almost impossible, at least when syncing directly to another Postgres database with everything intact. Meaning: if on the other side you want another pg database that has all of the same constraints, it's difficult to ensure you got the full sliced dataset. At least the way we were thinking about it.
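One trick that helps on the load side (assuming the target's constraints were created `DEFERRABLE`, which is not Postgres's default) is to defer FK checks until commit, so insert order stops mattering even across circular references. A minimal sketch that just emits such a load script — the table and column names are made up for illustration:

```python
def load_script(inserts):
    """Wrap INSERT statements so FK checks run only at COMMIT.

    SET CONSTRAINTS ALL DEFERRED postpones checking of deferrable
    constraints to transaction commit, so rows that reference each
    other circularly can be inserted in any order.
    """
    return "\n".join(
        ["BEGIN;", "SET CONSTRAINTS ALL DEFERRED;"] + inserts + ["COMMIT;"]
    )

# Two mutually referencing rows: neither insert would pass an
# immediate FK check on its own.
script = load_script([
    "INSERT INTO users (id, best_friend_id) VALUES (1, 2);",
    "INSERT INTO users (id, best_friend_id) VALUES (2, 1);",
])
```

It doesn't solve the harder problem described above (making sure the slice itself is complete), but it removes ordering as a failure mode when restoring one.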
I'll definitely give this a try. Thanks for posting it here.