with movieIds as (select id from Movie where title = $1),
actorIds as (select Actor.id from Actor join ActorMovie on ActorMovie.actor = Actor.id
    where ActorMovie.movie in (select id from movieIds)),
alsoActedIn as (select id from ActorMovie where actor in (select id from actorIds)),
movieResults as (select * from Movie where id in (select id from movieIds)),
actorResults as (select * from Actor where id in (select id from actorIds)),
alsoActedInResults as (select Movie.id, Movie.title, ActorMovie.actor
    from Movie join ActorMovie on ActorMovie.movie = Movie.id
    where ActorMovie.id in (select id from alsoActedIn))
select * from movieResults
full outer join actorResults on false
full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:

MovieId,MovieTitle,ActorId,ActorName,MovieId,MovieTitle,ActorMovie.actor
0,"Indiana Jones",null,null,null,null,null
null,null,0,"Harrison Ford",null,null,null
null,null,null,null,0,"Indiana Jones",0
null,null,null,null,1,"Star Wars",0
null,null,null,null,2,"The Fugitive",0
With a bit of client-side logic, it's easy to transform that into this:

[{
id: 0,
title: "Indiana Jones",
actors: [{
id: 0,
name: "Harrison Ford",
alsoActedIn: [
{id: 1, title: "Star Wars"},
{id: 2, title: "The Fugitive"},
]
}]
}]
without the need to use JSON as a middleman. This pattern has saved me from some truly awful query logic.
I'd expect anything in a DB to be described as "structured data". Flexible serialization formats such as JSON or XML are "semi-structured". And something from, e.g., an office document is "unstructured".
This is not a complaint or criticism. I understand the article just fine. It's just interesting how different perspectives can affect what words mean (:
SQL, as a language, is clunky, true. It can be patched here and there, either by PipeSQL or by various ORMs. I agree that it would be wonderful to have standardized tooling for generating JSON like in the post.
Yet with relational databases you can separate the concerns of what your data is and what you want to display. If you use a JSON-like way to store data, it can do the job until you want to change the data or the queries.
Codd was right in that if you want transactional semantics that are both quick and flexible, you'll need to _store_ your data in normalized relations. The system of record is unwieldy otherwise.
The article is right that this idea was taken too far - queries do not need to be restricted to flat relations. In fact the application, for any given view, loves hierarchical organization. It's my opinion that application views have more in common with analytics (OLAP), except perhaps in latency requirements - they need internally consistent snapshots (and ideally the corresponding trx id), but it's the "command" in CQRS that demands the normalized OLTP database (and so long as the view can pass along the trx id as a kind of "lease" version for any causally connected user command, as in git push --force-with-lease, the two together work quite well).
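A hedged sketch of that lease idea, assuming a hypothetical account table whose version column stands in for the trx id:

-- the view read (balance, version); the command only applies if the version still holds
update account
set    balance = balance - 100,
       version = version + 1
where  id = $1
and    version = $2;   -- 0 rows updated = lease lost, like git push --force-with-lease failing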
The issue, of course, is that SQL eschews hierarchical data even in ephemeral queries. It's really unfortunate that we generate jsonb aggregates to do this instead of first-class nested relations a la Dee [1] / "third manifesto" [2]. Jamie Brandon has clearly been thinking about this for a long time and I generally find myself nodding along with the conclusions, but IMO the issue is that SQL poorly expresses nested relations, and this has been the root cause of object-relation impedance since (AFAICT) before either of us was born.
[1] https://github.com/ggaughan/dee [2] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf
Classes/Structs know about their children.
Relations know about their parents. If you want crazier m*n relationships, you use an association table.
Did the author just not know? Or did he not see it as worthy of dismissal?
> Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations.
It really is.
> The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies.
Yes. Different tasks require different hierarchies. One particular way of doing things should not be baked into your 1970s relational model.
This is NOT the case with modern SQL, as it supports JSON. The typical Postgres use-case would be to produce denormalised views for the specific relationships and/or aggregates in JSON form. LATERAL, the correlated sub-query, is really convenient for avoiding double-counting. The rest of the article deals with object-mapping, which is really a poor man's VIEW. Time and time again, software engineers reinvent the data-access wheel out of ignorance of database capabilities.
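For example, a hedged LATERAL sketch against a made-up movie/actor/actor_movie schema, nesting each movie's actors as a JSON array:

select m.title, a.actors
from movie m
left join lateral (
  -- aggregate this movie's actors; coalesce so movies with no actors get []
  select coalesce(json_agg(json_build_object('id', act.id, 'name', act.name)), '[]'::json) as actors
  from actor_movie am
  join actor act on act.id = am.actor
  where am.movie = m.id
) a on true
where m.title = 'Indiana Jones';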
> A relation [table] whose domains [column types] are all simple [not relations] can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.
But non-normalized relations support the kind of nested structure the eminent Dr. Brandon wants, without resorting to JSON or abandoning the strong uniform typing we have with SQL. Darwen & Date's The Third Manifesto https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf includes group and ungroup operations which translate back and forth between normalized and non-normalized relations.
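Postgres arrays of a composite type can at least approximate GROUP/UNGROUP; a sketch using the Movie/Actor/ActorMovie tables from the first comment (actor_t is a made-up type):

create type actor_t as (id int, name text);

with grouped as (
  -- "GROUP": collapse each movie's actors into a nested, array-valued column
  select m.id, m.title, array_agg((a.id, a.name)::actor_t) as actors
  from Movie m
  join ActorMovie am on am.movie = m.id
  join Actor a on a.id = am.actor
  group by m.id, m.title
)
-- "UNGROUP": flatten the nested column back into a flat relation
select g.id, g.title, x.id as actor_id, x.name as actor_name
from grouped g, unnest(g.actors) as x;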
I've been playing around with some of these ideas using some rather horrific atrocities perpetrated against Lua operator overloading in http://canonical.org/~kragen/sw/dev3/cripes.lua. I haven't added joins or ungrouping to it yet, but I think it might be a prototype of something promising for this kind of use case.
"What if an SQL statement returned a database?"
- cte
- recursive
- with (in SQL code, not prose)
- connect (as in Oracle's CONNECT BY)
Nothing. Nothing! SQL deals with hierarchical data just fine. You just have to learn how (recursive CTEs).
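For anyone who hasn't seen one, a minimal recursive-CTE sketch, assuming a hypothetical employee(id, name, manager_id) adjacency list:

with recursive reports as (
  -- anchor: the root rows of the hierarchy
  select id, name, manager_id, 0 as depth
  from employee
  where manager_id is null
  union all
  -- recursive step: everyone who reports to a row already found
  select e.id, e.name, e.manager_id, r.depth + 1
  from employee e
  join reports r on e.manager_id = r.id
)
select * from reports order by depth;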
So one can technically create a projection/view that is tailor-made for a query that needs to display some data. Of course it is not often possible to retrieve all the data with a single select command.
So joins and multiple queries are simply inherent to the complexity of the data we store nowadays.
Anyway, years ago I moved to a DB model where every entity is stored as a blob in a dedicated column and every additional column, besides id, is indexed. So there is no wasted space on a ton of columns that only hold data but are not used for filtering. I can run data-efficient queries that yield a list of ids of blobs to load, or the blobs themselves, and then I extract any necessary data out of those blobs (entities) at the application level. So the database is purely a blob store + a few fast indices.
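A hedged sketch of what that model could look like in Postgres (table and column names are made up):

create table entity (
  id         bigserial primary key,
  payload    bytea not null,       -- the whole entity, serialized; parsed in app code
  kind       text not null,        -- every non-blob column exists only for filtering...
  created_at timestamptz not null
);
create index entity_kind_idx on entity(kind);              -- ...so each one gets an index
create index entity_created_at_idx on entity(created_at);

-- queries filter on the indexed columns and return ids or blobs
select id, payload from entity
where kind = 'movie' and created_at > now() - interval '1 day';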
With its JSON arrow operators and indexed expressions, SQLite makes a really nice JSON document store.
Assume you have a "cars" table with nothing but a "cars" column that is a JSON blob. Then selecting the models of all cars is just:
SELECT cars->>'model' FROM cars
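And the "indexed expressions" part, a sketch assuming SQLite 3.38+ for the arrow operators (cars_model_idx is a made-up name):

CREATE INDEX cars_model_idx ON cars(cars->>'model');

-- lookups on the extracted field can now use the index
SELECT * FROM cars WHERE cars->>'model' = 'Corolla';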
I wish MariaDB would follow suit and also support arrow operators. You can do it in MariaDB, but it becomes more cumbersome:

SELECT JSON_UNQUOTE(JSON_EXTRACT(cars, '$.model')) FROM cars
Another example is UUIDs. Instead of transferring 16 bytes, the libraries deal with a wasteful string representation. I'm sure you can think of other examples.
Nonetheless, for the majority of data, JSON as a DB output format is alright.
Well, LEFT JOIN is basic SQL knowledge; it's surprising the author doesn't know that.
> because sql wasn't designed to produce hierarchical data
Well, SQL was designed _because_ of the shortcomings of hierarchical DBMSes.
Also, the author uses PostgreSQL, so they could easily use ARRAY_AGG() and composite types to get the exact answer they need in one query, no need for hacks. They can even make their own custom type for type safety, if needed.
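For instance, a hedged sketch against the title/principal/person schema used in the article's example below (director_t is a made-up type):

create type director_t as (nconst text, primaryName text);

select t.primaryTitle,
       array_agg((p.nconst, p.primaryName)::director_t) as directors
from title t
join principal pr on pr.tconst = t.tconst
join person p on p.nconst = pr.nconst
where t.tconst = 'tt3890160' and pr.category = 'director'
group by t.primaryTitle;

(array_agg over composite rows keeps everything strongly typed, with no JSON detour.)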
select primaryTitle, person.primaryName
from title, principal, person
where title.tconst = 'tt3890160'
and title.tconst = principal.tconst
and principal.nconst = person.nconst
and principal.category = 'director';
are much nicer to write as:

select primaryTitle, person.primaryName
from title
join principal on principal.tconst = title.tconst
join person on person.nconst = principal.nconst
where title.tconst = 'tt3890160'
and principal.category = 'director';
Because you separate the filter from the joining, you're much less likely to forget a join condition. If ambiguity of column names is low enough, you could also use `join person using(nconst)`.
A query like
select users.*, orders.* from users left join orders on orders.user_id = users.id
should always have returned a structure like:

type SingleReturnRow = { users: {id: ..., }[], orders: {user_id: ..., id: ..., }[]}
type Return = SingleReturnRow[]
Mangling the columns together and _removing_ groupings that naturally appear is just so unnecessary. I don't think a larger change in the query language would even be needed.
Even better of course would be a return value like
type SingleReturnRow = { user: User, orders: Order[] }
But I see how that would require a fundamental change in the language. Of course, in PG you can now use
select users.*, json_agg(orders.*) as orders from users left join orders on orders.user_id = users.id group by users.id
but using JSON as an intermediate step just feels unnatural.

It often becomes more interesting to find which bits are reused in several places and can be broken out into a useful cache (like person id → person name, thumbnail url, person url, …) than making a custom query for each view.
That completely changes how you do queries, as you can cut a bunch of joins. Instead you get a partial data set from the query, which you then hand to a function or service that populates all the referenced ids from a cache and backfills (and caches) any missing ones (preferably with a single query for a list of people, in the example).
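The backfill can stay one round trip; a sketch assuming a hypothetical person(id, name, thumbnail_url) table, with $1 the array of ids that missed the cache:

-- fetch all missing people at once, then write them back to the cache
select id, name, thumbnail_url
from person
where id = any($1::bigint[]);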
We built a GraphQL / SQL hybrid expression language that does just this here: https://docs.gadget.dev/guides/data-access/gelly#gelly , curious what yall think.
Like, left joins have been around since SQL-92, so the query not returning a result when there is no director is a skill issue.
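With the title/principal/person schema from the comments above, the left-join version is a sketch like:

select t.primaryTitle, p.primaryName
from title t
left join principal pr
  on pr.tconst = t.tconst
  and pr.category = 'director'   -- filtering in ON, not WHERE, keeps director-less titles
left join person p on p.nconst = pr.nconst
where t.tconst = 'tt3890160';

The category test has to live in the ON clause; in the WHERE clause it would turn the left join back into an inner join.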
Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.
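In Postgres, for example, the XML flavor might look something like this sketch (same made-up Movie/Actor/ActorMovie tables as in the first comment):

select xmlelement(name movie,
         xmlattributes(m.id as id, m.title as title),
         -- nest each movie's actors as child elements
         (select xmlagg(xmlelement(name actor, a.name))
          from ActorMovie am
          join Actor a on a.id = am.actor
          where am.movie = m.id))
from Movie m;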
Also, the transactional database server is often more difficult to scale than the application servers, so from a technical standpoint it makes sense to do this glue work in app code.
                      +-------------------+
                      |                   |
                      |                   |
+------------------+  |       Genre       |
|                  |  |                   |
|                  |  +-------------------+
|      Movies      |
|                  |
|                  |  +-------------------+
|                  |  |                   |
|                  |  |      Actors       |
+------------------+  |                   |
                      |                   |
                      +-------------------+
Even then, if you really wanted to, you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.
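A closure-table sketch, for example, trades write-time bookkeeping for recursion-free reads (table names are made up):

create table category (id int primary key, name text);
create table category_closure (
  ancestor   int not null references category(id),
  descendant int not null references category(id),
  depth      int not null,             -- 0 for self-rows, 1 for children, ...
  primary key (ancestor, descendant)
);

-- all descendants of node 1, any depth, with a plain join
select c.*
from category c
join category_closure cc on cc.descendant = c.id
where cc.ancestor = 1 and cc.depth > 0;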
* Store relations in cells: the whole trick of so-called `nosql` is that it can do nested data. THAT IS ALL
* Then `join` is already inline
* Then, instead of the very poorly named `GROUP BY`, it could have a REAL `GROUP BY`!
That is all that's needed at the core
This is a very interesting area for exploring a new kind of ORM. I find the whole "CRUD monkey" shtick preached by DHH et al quite lacking for expressing the kind of hierarchical data structures developers need.
Don't get me wrong, JSONB is great but I see a lot of people simply slap on a JSON column to anything that requires more complexity without trying to first solve it through RDBMS relations.
And in practice, a lot of these frontenders end up storing a JSON blob with either no nesting or just one level of nesting that looks a lot like a database schema built by someone being stubborn about not normalising data. In some sense that is fine: databases seem to be learning to treat simple JSON blobs as normalised data, but it is hard to make that as efficient as a guaranteed schema. Plus, larger companies often end up having to hire a dedicated team of engineers to properly normalise the data so other people can use it.