Skip Navigation

Is there something better than SQL?

If you're modeling relational data, it doesn't seem like you can get around using a DB that uses SQL, which to me is the worst: most programmers aren't DB experts and the SQL they output is quite often terrible.

Not to dunk on the lemmy devs, they do a good job, but they themselves know that their SQL is bad. Luckily there are community members who stepped up and are doing a great job at fixing the numerous performance issues and tuning the DB settings, but not everybody has that kind of support, nor time.

Also, the translation step from binary (program) -> text (SQL) -> binary (server), just feels quite wrong. For HTML and CSS, it's fine, but for SQL, where injection is still in the top 10 security risks, is there something better?

Yes, there are ORMs, but some languages don't have them (rust has diesel for example, which still requires you to write SQL) and it would be great to "just" have a DB with a binary protocol that makes it unnecessary to write an ORM.

Does such a thing exist? Is there something better than SQL out there?

71 comments
  • SQL, where injection is still in the top 10 security risks

    This is absolutely true, but it's not what it looks like on the surface, and if you dig into the OWASP entry for this, you'll see they talk about mitigation.

    You can completely eliminate the possibility of injection attacks using well-understood technologies such as bind variables, which an ORM will usually use under the covers but which you can also use with your own queries. There are many, many database applications that have never once had a SQL injection vulnerability and never will.

    The reason SQL injection is a widespread security risk, to be blunt, is that there are astonishingly large numbers of inexperienced and/or low-skill developers out there who haven't learned how to use the tools at their disposal. The techniques for avoiding injection vulnerability are simple and have been well-documented for literally decades but they can't help if a lousy dev decides to ignore them.

    Now, a case could be made that it'd be better if instead, we were using a query language (maybe even a variant of SQL) that made injection attacks impossible. I agree in principle, but (a) I think this ends up being a lot harder than it looks if you want to maintain the same expressive power and flexibility SQL has, (b) given that SQL exists, "get bad devs to stop using SQL" doesn't seem any more likely to succeed than "get bad devs to use bind variables," and (c) I have too much faith in the ability of devs to introduce security vulnerabilities against all odds.

  • I have found that usually the problem isn’t SQL itself, it’s the design of the database or the application or both that makes it slow or hard to use because databases are tricky and it’s very easy to make them slow and difficult to optimize. This goes for any database, SQL or otherwise.

    I think the reason that you see so many people use relational databases for relational data is that they were designed for it and have been optimized to do that job for almost 50 years at this point, the right tool for the job if you will. That’s also why I don’t think any language builds it in, there isn’t one type of database that is good at every scenario so it would be difficult to pick a default as a language. Also keep in mind, most large applications are deployed across many servers so anything that uses a local file is out of the question.

    As far as performance directly though, I’m definitely not a DBA but I have spent a lot of time helping people troubleshoot slow databases and it seems many people write apps and design databases based only on how they want to store the data and not how the actual data is consumed. The other thing I think most folks don’t seem to realize (including myself for a long time) know the need for creating good indexes based on how you are accessing the data. It’s not an easy problem to solve for sure though.

    • Database optimization goes from the design of the schemas to the queries used to the configuration to the actual hardware it’s running on. It’s one of the hardest things to optimize in all of tech.

  • most programmers aren’t DB experts and the SQL they output is quite often terrible.

    Isn't that looking at it the wrong way / missing the point?

    If you're fine with simple queries, use an ORM of your tech stack. Once you have to understand querying performance and weigh usage patterns, constraints, and cache performance and size, then it's not about SQL anymore. It's about understanding the DBMS you use.

    You may ask "why can't I use a different language for the querying part of it". But I haven't seen anything better yet.

    Having a common, well-understood, established, documented, inter-product compatible language has a lot of value. Using a different language to the well-established industry standard is in itself an increased hurdle to adoption.

    Getting back to the original quote: I don't think anything else would serve bad developers / non-experts any better.

    • If you’re fine with simple queries, use an ORM of your tech stack. Once you have to understand querying performance and weigh usage patterns, constraints, and cache performance and size, then it’s not about SQL anymore. It’s about understanding the DBMS you use.

      Simple queries don't result in simple SQL. How many joins and subqueries do you think an SQL query would require in order fulfill "Give me the top 10 artists of the 90s whose albums were nominated for the MTV awards but didn't win"?

      In Django looks something like

       undefined
          
      nineties = (date(1,1,1990), date(31, 12, 1999)
      album_range=Q(albums__release_date__range=nineties)
      artists = Artists.objects.annotate(
        albums_sold=Sum("albums__sales", filter=album_range)),
      ).filter(
        album_range,
        nominations__date__range=nineties,
        nominations__won=False
      ).order_by("-albums_sold")
      top_artists = artists[:10]
      
        

      What if one method wants the result of that but only wants the artists' names, but another one wanted additional or other fields? In django you could simply use artists.only(*field_names) and each method would provide a different set of field names. What would that look like without a capable ORM? Do you think somebody would refactor the method to add a field_names argument? In my experience the result is a bunch of copy pasted queries that modify the query itself to add the fieldnames.

      Another common thing is querying related objects. Say you simply wanted to have information about the record label of the aforementioned artists while handling the artists. A many-to-one relationship (artist has one record label, record label has many artists). You could either artist.record_label while in your for-loop, but that would trigger an query for every artist (1+n problem). Or in django that's artists.select_related("record_label") and it will get all the record_labels in the same query.
      If it's a many-to-many relationship for example "festivals", then .prefetch_related() will first select the artists, then make a second query of festivals of those artists, and artist.festivals would be available.

      An ORM like django makes that simple. SQL, does not.

      So, before we even get to the DB optimisation part (which indices to create, whether a view is better or now, which storage engine to use, WAL size, yadayadayada), there's an entire interface / language that makes writing bad code very easy.

      • I'm too lazy to convert that by hand, but here's what chatgpt converted that to for SQL, for the sake of discussion:

         undefined
            
        SELECT 
            a.id,
            a.artist_name -- or whatever the name column is in the 'artists' table
        FROM artists a
        JOIN albums al ON a.id = al.artist_id
        JOIN nominations n ON al.id = n.album_id -- assuming nominations are for albums
        WHERE al.release_date BETWEEN '1990-01-01' AND '1999-12-31'
        AND n.award = 'MTV' -- assuming there's a column that specifies the award name
        AND n.won = FALSE
        GROUP BY a.id, a.artist_name -- or whatever the name column is in the 'artists' table
        ORDER BY COUNT(DISTINCT n.id) DESC, a.artist_name -- ordering by the number of nominations, then by artist name
        LIMIT 10;
        
          

        I like Django's ORM just fine, but that SQL isn't too bad (it's also slightly different than your version though, but works fine as an example). I also like PyPika sometimes for building queries when I'm not using Django or SQLAlchemy, and here's that version:

         undefined
            
        q = (
            Query
            .from_(artists)
            .join(albums).on(artists.id == albums.artist_id)
            .join(nominations).on(albums.id == nominations.album_id)
            .select(artists.id, artists.artist_name)  # assuming the column is named artist_name
            .where(albums.release_date.between('1990-01-01', '1999-12-31'))
            .where(nominations.award == 'MTV')
            .where(nominations.won == False)
            .groupby(artists.id, artists.artist_name)
            .orderby(fn.Count(nominations.id).desc(), artists.artist_name)
            .limit(10)
        )
        
          

        I think PyPika answers your concerns about

        What if one method wants the result of that but only wants the artists’ names, but another one wanted additional or other fields?

        It's just regular Python code, same as the Django ORM.

  • I am both a (T-)SQL expert and a language design enthusiast. IMO, SQL the language is mediocre in its grammar and extremely resistant to cleanliness. Once you get past that, the things you can actually do with it are extremely useful.

    I'd love for a better syntax to exist, but it's a Herculean task to make one. Modern SQL dialects have gargantuan, labyrinthine grammars, and they grow with each new product version. It's a lot easier to keep adding to that than to build a feature-complete replacement. This is also the reason why most ORMs are so frustratingly limiting: it's too much work to support the advanced features of one SQL dialect, let alone multiple.

  • it would be great to “just” have a DB with a binary protocol that makes it unnecessary to write an ORM.

    Other people have talked about other parts of the post so I want to focus on this one.

    The problem an ORM solves is not a problem of SQL being textual. Just switching to a binary representation will have little or no impact on the need for an ORM. The ORM is solving the problem that's in its name: bridging the conceptual gap between an object-oriented data model and a relational data model. "A relational data model" isn't about how queries are represented in a wire protocol; instead, it is about how data, and relationships between pieces of data, are organized.

    So, okay, what if you get rid of the relational data model and make your database store objects directly? You can! NoSQL databases had a surge in popularity not too long ago, and before that, there have been lots of object databases.

    What you're likely to discover in an application of any real complexity, though, and the reason the industry has cooled somewhat on NoSQL databases after the initial hype cycle, is that the relational model turns out to be popular for a reason: it is extremely useful, and some of its useful properties are awkward to express in terms of operations on objects. True, you can ditch the ORM, but often you end up introducing complex queries to do things that are simple in SQL and the net result is more complex and harder to maintain than when you started. (Note "often" here; sometimes non-relational databases are the best tool for the job.)

    And even in an object database, you still have to know what you're doing! Storing objects instead of relational tuples won't magically cause all your previously-slow queries to become lightning-fast. You will still need to think about data access patterns and indexes and caching and the rest. If the problem you're trying to solve is "my queries are inefficient," fixing the queries is a much better first step than ditching the entire database and starting over.

  • SQL is the industry standard for a reason, it's well known and it does the job quite well. The important part of any technology is to use it when it's advantageous, not to use it for everything. SQL works great for looking up relational data, but isn't a replacement for a filesystem. I'll try to address each concern separately, and this is only my opinion and not some consensus:

    Most programmers aren't DB experts: Most programmers aren't "experts", period, so we need to work with this. IT is a wide and varied field that requires a vast depth of knowledge in specific domains to be an "expert" in just that domain. This is why teams break up responsibilities, the fact the community came in and fixed the issues doesn't change the fact the program did work before. This is all normal in development, you get things working in an acceptable manner and when the requirements change (in the lemmy example, this would be scaling requirements) you fix those problems.

    translation step from binary (program): If you are using SQL to store binary data, this might cause performance issues. SQL isn't an all in one data store, it's a database for running queries against relational data. I would say this is an architecture problem, as there are better methods for storing and distributing binary blobs of data. If you are talking about parsing strings, string parsing is probably one of the least demanding parts of a SQL query. Prepared statements can also be used to separate the query logic from the data and alleviate the SQL injection attack vector.

    Yes, there are ORMs: And you'll see a ton of developers despise ORMs. They is an additional layer of abstraction that can either help or hinder depending on the application. Sure, they make things real easy but they can also cause many of the problems you are mentioning, like performance bottlenecks. Query builders can also be used to create SQL queries in a manner similar to an ORM if writing plain string-based queries isn't ideal.

  • The point about a binary protocol is interesting, because it would inherently solve the injection issue.

    However, constructing an ad-hoc query becomes tedious, as you're now dealing with bytes and text together. Doing so in a terminal can be pretty tedious, and most people would require a tool to do so. Compare this against SQL, where you can easily build a query in your terminal. I think the tradeoff is similar to protobuf vs json.

    You could do a text representation (like textproto), but guess what? Now injection is an issue again.

    Another thing would be the complexity of client libraries. With SQL client libraries, the library doesn't need to parse or know SQL - it can send off the prepared statement as-is. With a binary protocol, the client libraries will likely need to include a query builder that builds the byte representation since no developers are going to be concatenating bytes by hand, which makes the bar higher for open-source libraries. This also means that if you add a new query feature to your DB, all client libraries will likely need to be updated to use the feature.

    And you're still going to need to tune and optimize queries for this new DB. That's just the nature of the beast: scaling is hard especially when you can't throw money at the problem.

    Quite frankly, it's a lot of hard tradeoffs to not need to use prepared statements or query builders. Injection is still is an issue for SQL today, but it's been "solved" as much as it possibly can.

71 comments