with movieIds as (select id from Movie where title = $1),
actorIds as (select Actor.id from Actor join ActorMovie on ActorMovie.actor = Actor.id
    where ActorMovie.movie in (select id from movieIds)),
alsoActedIn as (select id from ActorMovie where actor in (select id from actorIds)),
movieResults as (select * from Movie where id in (select id from movieIds)),
actorResults as (select * from Actor where id in (select id from actorIds)),
alsoActedInResults as (select Movie.id, Movie.title, ActorMovie.actor
    from Movie join ActorMovie on ActorMovie.movie = Movie.id
    where ActorMovie.id in (select id from alsoActedIn))
select * from movieResults
full outer join actorResults on false
full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:

MovieId,MovieTitle,ActorId,ActorName,MovieId,MovieTitle,ActorMovie.actor
0,"Indiana Jones",null,null,null,null,null
null,null,0,"Harrison Ford",null,null,null
null,null,null,null,0,"Indiana Jones",0
null,null,null,null,1,"Star Wars",0
null,null,null,null,2,"The Fugitive",0
With a bit of client-side logic, it's easy to transform that into this:

[{
id: 0,
title: "Indiana Jones",
actors: [{
id: 0,
name: "Harrison Ford",
alsoActedIn: [
{id: 1, title: "Star Wars"},
{id: 2, title: "The Fugitive"},
]
}]
}]
without the need to use JSON as a middleman. This pattern has saved me from some truly awful query logic.
I'd expect anything in a DB to be described as "structured data". Flexible serialization formats such as JSON or XML are "semi-structured". And something from, e.g., an office document is "unstructured".
This is not a complaint or criticism. I understand the article just fine. It's just interesting how different perspectives can affect what words mean (:
SQL, as a language, is clunky, true. It can be patched here and there, either by PipeSQL or by various ORMs. I agree that it would be wonderful to have standardized tooling for generating JSON like in the post.
Yet with relational databases you can separate the concerns of what your data is and what you want to display. If you use a JSON-like way to store data, it can do the job until you want to change the data or the queries.
Codd was right in that if you want transactional semantics that are both quick and flexible, you'll need to _store_ your data in normalized relations. The system of record is unwieldy otherwise.
The article is right that this idea was taken too far - queries do not need to be restricted to flat relations. In fact the application, for any given view, loves hierarchical organization. It's my opinion that application views have more in common with analytics (OLAP), except perhaps in latency requirements - they need internally consistent snapshots (and ideally the corresponding trx id), but it's the "command" in CQRS that demands the normalized OLTP database (and so long as the view can pass along the trx id as a kind of "lease" version for any causally connected user command, as in git push --force-with-lease, the two together work quite well).
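A hedged sketch of that lease idea, assuming a hypothetical account table whose version column stands in for the trx id:

-- the view read (balance, version); the command only applies if the version still holds
update account
set    balance = balance - 100,
       version = version + 1
where  id = $1
and    version = $2;   -- 0 rows updated = lease lost, like git push --force-with-lease failing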
The issue, of course, is that SQL eschews hierarchical data even in ephemeral queries. It's really unfortunate that we generate jsonb aggregates to do this instead of first-class nested relations a la Dee [1] / "third manifesto" [2]. Jamie Brandon has clearly been thinking about this for a long time and I generally find myself nodding along with the conclusions, but IMO the issue is that SQL poorly expresses nested relations, and this has been the root cause of object-relation impedance since (AFAICT) before either of us was born.
[1] https://github.com/ggaughan/dee [2] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf
Classes/Structs know about their children.
Relations know about their parents. If you want crazier m*n relationships, you use an association table.
Did the author just not know? Or did he not see it as worthy of dismissal?
> Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations.
It really is.
> The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies.
Yes. Different tasks require different hierarchies. One particular way of doing things should not be baked into your 1970s relational model.
This is NOT the case with modern SQL, as it supports JSON. The typical Postgres use-case would be to produce denormalised views for the specific relationships and/or aggregates in JSON form. LATERAL, the correlated sub-query, is really convenient for avoiding double-counting. The rest of the article deals with object-mapping, which is really a poor man's VIEW. Time and time again, software engineers reinvent the data-access wheel out of ignorance of database capabilities.
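For example, a hedged LATERAL sketch against a made-up movie/actor/actor_movie schema, nesting each movie's actors as a JSON array:

select m.title, a.actors
from movie m
left join lateral (
  -- aggregate this movie's actors; coalesce so movies with no actors get []
  select coalesce(json_agg(json_build_object('id', act.id, 'name', act.name)), '[]'::json) as actors
  from actor_movie am
  join actor act on act.id = am.actor
  where am.movie = m.id
) a on true
where m.title = 'Indiana Jones';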
> A relation [table] whose domains [column types] are all simple [not relations] can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.
But non-normalized relations support the kind of nested structure the eminent Dr. Brandon wants, without resorting to JSON or abandoning the strong uniform typing we have with SQL. Darwen & Date's The Third Manifesto https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf includes group and ungroup operations which translate back and forth between normalized and non-normalized relations.
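Postgres arrays of a composite type can at least approximate GROUP/UNGROUP; a sketch using the Movie/Actor/ActorMovie tables from the first comment (actor_t is a made-up type):

create type actor_t as (id int, name text);

with grouped as (
  -- "GROUP": collapse each movie's actors into a nested, array-valued column
  select m.id, m.title, array_agg((a.id, a.name)::actor_t) as actors
  from Movie m
  join ActorMovie am on am.movie = m.id
  join Actor a on a.id = am.actor
  group by m.id, m.title
)
-- "UNGROUP": flatten the nested column back into a flat relation
select g.id, g.title, x.id as actor_id, x.name as actor_name
from grouped g, unnest(g.actors) as x;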
I've been playing around with some of these ideas using some rather horrific atrocities perpetrated against Lua operator overloading in http://canonical.org/~kragen/sw/dev3/cripes.lua. I haven't added joins or ungrouping to it yet, but I think it might be a prototype of something promising for this kind of use case.
"What if an SQL statement returned a database?"
- cte
- recursive
- with (in SQL code, not prose)
- connect (as in Oracle's CONNECT BY)
Nothing. Nothing! SQL deals with hierarchical data just fine. You just have to learn how (recursive CTEs).
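For anyone who hasn't seen one, a minimal recursive-CTE sketch, assuming a hypothetical employee(id, name, manager_id) adjacency list:

with recursive reports as (
  -- anchor: the root rows of the hierarchy
  select id, name, manager_id, 0 as depth
  from employee
  where manager_id is null
  union all
  -- recursive step: everyone who reports to a row already found
  select e.id, e.name, e.manager_id, r.depth + 1
  from employee e
  join reports r on e.manager_id = r.id
)
select * from reports order by depth;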
So one can technically create a projection/view that is tailor-made for a query that needs to display some data. Of course it is not often possible to retrieve all the data with a single select command.
So joins and multiple queries are simply inherent to the complexity of the data we store nowadays.
Anyway, years ago I moved to a DB model where every entity is stored as a blob in a dedicated column and every additional column, besides id, is indexed. So there is no wasted space on a ton of columns that only hold data but are not used for filtering. I can run data-efficient queries that yield a list of ids of blobs to load, or the blobs themselves, and then I extract any necessary data out of those blobs (entities) at the application level. So the database is purely a blob store + a few fast indices.
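A hedged sketch of what that model could look like in Postgres (table and column names are made up):

create table entity (
  id         bigserial primary key,
  payload    bytea not null,       -- the whole entity, serialized; parsed in app code
  kind       text not null,        -- every non-blob column exists only for filtering...
  created_at timestamptz not null
);
create index entity_kind_idx on entity(kind);              -- ...so each one gets an index
create index entity_created_at_idx on entity(created_at);

-- queries filter on the indexed columns and return ids or blobs
select id, payload from entity
where kind = 'movie' and created_at > now() - interval '1 day';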
With its JSON arrow operators and indexed expressions, SQLite makes a really nice JSON document store.
Assume you have a "cars" table with nothing but a "cars" column that is a JSON blob. Then selecting the models of all cars is just:
SELECT cars->>'model' FROM cars
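And the "indexed expressions" part, a sketch assuming SQLite 3.38+ for the arrow operators (cars_model_idx is a made-up name):

CREATE INDEX cars_model_idx ON cars(cars->>'model');

-- lookups on the extracted field can now use the index
SELECT * FROM cars WHERE cars->>'model' = 'Corolla';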
I wish MariaDB would follow suit and also support arrow operators. You can do it in MariaDB, but it becomes more cumbersome:

SELECT JSON_UNQUOTE(JSON_EXTRACT(cars, '$.model')) FROM cars
Another example is UUIDs. Instead of transferring 16 bytes, the libraries deal with a wasteful string representation. I'm sure you can think of other examples.
Nonetheless, for the majority of data, JSON as a DB output format is alright.
Well, LEFT JOIN is basic SQL knowledge; it's surprising the author doesn't know that.
> because sql wasn't designed to produce hierarchical data
Well, SQL was designed _because_ of the shortcomings of hierarchical DBMSes.
Also, the author uses PostgreSQL, so they could easily use ARRAY_AGG() and composite types to get the exact answer they need in one query, no need for hacks. They can even make their own custom type for type safety, if needed.
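For instance, a hedged sketch against the title/principal/person schema used in the article's example below (director_t is a made-up type):

create type director_t as (nconst text, primaryName text);

select t.primaryTitle,
       array_agg((p.nconst, p.primaryName)::director_t) as directors
from title t
join principal pr on pr.tconst = t.tconst
join person p on p.nconst = pr.nconst
where t.tconst = 'tt3890160' and pr.category = 'director'
group by t.primaryTitle;

(array_agg over composite rows keeps everything strongly typed, with no JSON detour.)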
select primaryTitle, person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'director';
are much nicer to write as:

select primaryTitle, person.primaryName
from title
join principal on principal.tconst = title.tconst
join person on person.nconst = principal.nconst
where title.tconst = 'tt3890160'
and principal.category = 'director';
Because you separate the filter from the joining, you're much less likely to forget a join condition. If ambiguity of column names is low enough, you could also use `join person using(nconst)`.
A query like
select users.*, orders.* from users left join orders on orders.user_id = users.id
should always have returned a structure like:

type SingleReturnRow = { users: {id: ..., }[], orders: {user_id: ..., id: ..., }[]}
type Return = SingleReturnRow[]
Mangling the columns together and _removing_ groupings that naturally appear is just so unnecessary. I don't think a larger change in the query language would even be needed.
Even better of course would be a return value like
type SingleReturnRow = { user: User, orders: Order[] }
But I see how that would require a fundamental change in the language. Of course, in PG you can now use
select users.*, json_agg(orders.*) as orders from users left join orders on orders.user_id = users.id group by users.id
but using JSON as an intermediate step just feels unnatural.

It often becomes more interesting to find which bits are reused in several places and can be broken out into a useful cache (like person id → person name, thumbnail url, person url, …) than making a custom query for each view.
That completely changes how you do queries, as you can cut a bunch of joins. Instead you get a partial data set from the query, which you then hand to a function or service that populates all the referenced ids from a cache and backfills (and caches) any missing ones (preferably with a single query for a list of people, in the example).
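The backfill can stay one round trip; a sketch assuming a hypothetical person(id, name, thumbnail_url) table, with $1 the array of ids that missed the cache:

-- fetch all missing people at once, then write them back to the cache
select id, name, thumbnail_url
from person
where id = any($1::bigint[]);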
We built a GraphQL / SQL hybrid expression language that does just this here: https://docs.gadget.dev/guides/data-access/gelly#gelly , curious what yall think.
Like, left joins have been around since SQL-92, so the query not returning a result when there is no director is a skill issue.
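With the title/principal/person schema from the comments above, the left-join version is a sketch like:

select t.primaryTitle, p.primaryName
from title t
left join principal pr
  on pr.tconst = t.tconst
  and pr.category = 'director'   -- filtering in ON, not WHERE, keeps director-less titles
left join person p on p.nconst = pr.nconst
where t.tconst = 'tt3890160';

The category test has to live in the ON clause; in the WHERE clause it would turn the left join back into an inner join.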
Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.
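In Postgres, for example, the XML flavor might look something like this sketch (same made-up Movie/Actor/ActorMovie tables as in the first comment):

select xmlelement(name movie,
         xmlattributes(m.id as id, m.title as title),
         -- nest each movie's actors as child elements
         (select xmlagg(xmlelement(name actor, a.name))
          from ActorMovie am
          join Actor a on a.id = am.actor
          where am.movie = m.id))
from Movie m;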
Also, the transactional database server is often more difficult to scale than the application servers, so from a technical standpoint it makes sense to do this glue work in app code.
                      +-------------------+
                      |                   |
                      |                   |
+------------------+  |       Genre       |
|                  |  |                   |
|                  |  +-------------------+
|      Movies      |
|                  |
|                  |  +-------------------+
|                  |  |                   |
|                  |  |      Actors       |
+------------------+  |                   |
                      |                   |
                      +-------------------+
Even then, if you really wanted to, you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.
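A closure-table sketch, for example, trades write-time bookkeeping for recursion-free reads (table names are made up):

create table category (id int primary key, name text);
create table category_closure (
  ancestor   int not null references category(id),
  descendant int not null references category(id),
  depth      int not null,             -- 0 for self-rows, 1 for children, ...
  primary key (ancestor, descendant)
);

-- all descendants of node 1, any depth, with a plain join
select c.*
from category c
join category_closure cc on cc.descendant = c.id
where cc.ancestor = 1 and cc.depth > 0;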
* Store relations in cells: the whole trick of so-called `nosql` is that it can do nested data. THAT IS ALL
* Then `join` is already inline
* Then, instead of the very poorly named `GROUP BY`, it could have a REAL `GROUP BY`!
That is all that's needed at the core
This is a very interesting area for exploring a new kind of ORM. I find the whole "CRUD monkey" shtick preached by DHH et al quite lacking for expressing the kind of hierarchical data structures developers need.
Don't get me wrong, JSONB is great but I see a lot of people simply slap on a JSON column to anything that requires more complexity without trying to first solve it through RDBMS relations.
And in practice, a lot of these frontenders end up storing a JSON blob with either no nesting or just one level of nesting that looks a lot like a database schema built by someone being stubborn about not normalising data. In some sense that is fine: databases seem to be learning to treat simple JSON blobs as normalised data, but it is hard to make that as efficient as a guaranteed schema. Plus, larger companies often end up having to hire a dedicated team of engineers to properly normalise the data so other people can use it.