ORMs are Leaky Abstractions

Leaky🚰? How So?

Never heard of the term leaky abstraction? Here's quick def from wikipedia.

"A leaky abstraction in software development refers to a design flaw where an abstraction, intended to simplify and hide the underlying complexity of a system, fails to completely do so. This results in some of the implementation details becoming exposed or 'leaking' through the abstraction, forcing users to have knowledge of these underlying complexities to effectively use or troubleshoot the system."

IMO calling all ORMs leaky abstractions is not really a hot take or a slight at all. With a little experience in that domain, it becomes overwhelmingly obvious. Sure, you can accomplish lots of stuff by blindly treating ORM objects like plain code objects, but if you are not aware of how traversing those objects produces SQL statements, you are going to have a really, really poor performing application.

What do I mean? Well let's take a look at some code. I could totally write paragraph after paragraph smearing ORMs for there potential for misuse and how they're awful for SQL beginners, but without any code, we'll have nothing to center this post on. Here's a little base model to go by here. The gist of the data models is that we have users, teams, and roles. This data model supports users having their own user-specific roles and also teams having roles as well that the user inherits down from the team - if they are apart of that team.

I also included a bit of silly seed data at the bottom that should hopefully paint the full picture of how things are connected here.


import logging
from sqlalchemy import (
    ForeignKey,
    create_engine,
    event,
)
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    Session,
    foreign,
    join,
    object_session,
    mapped_column,
    relationship,
    remote,
)
logging.basicConfig(level=logging.INFO)

class Base(DeclarativeBase): ...

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column()
    teams: Mapped[list["Team"]] = relationship(secondary=lambda: TeamUser.__table__, back_populates="users")
    
    @hybrid_property
    def roles(self):
        session: Session = object_session(self)
        direct_role_ids = [
            row.role_id for row in
            session.query(UserRole.role_id)
            .filter(UserRole.user_id==self.id)
            .all()
        ]
        team_given_role_ids = [
            row.role_id for row in
            session.query(TeamRole.role_id)
            .join(TeamUser, TeamUser.team_id==TeamRole.team_id)
            .filter(TeamUser.user_id==user.id)
            .all()
        ]
        return (
            session.query(Role)
            .filter(Role.id.in_(direct_role_ids + team_given_role_ids))
        )

class Team(Base):
    __tablename__ = "teams"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column()
    users: Mapped[list["User"]] = relationship(
        secondary=lambda: TeamUser.__table__,
        primaryjoin=lambda: Team.id == TeamUser.team_id,
        secondaryjoin=lambda: User.id == TeamUser.user_id,
        back_populates="teams",
        viewonly=True,
    )
    roles: Mapped["Role"] = relationship(secondary=lambda: TeamRole.__table__)

class Role(Base):
    __tablename__ = "roles"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column()

class TeamUser(Base):
    __tablename__ = "team_users"
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"), primary_key=True)
    user: Mapped["User"] = relationship(overlaps="teams")  # make insert easier
    team_id: Mapped[int] = mapped_column(ForeignKey("teams.id"), primary_key=True)

class UserRole(Base):
    __tablename__ = "user_roles"
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"), primary_key=True)
    role_id: Mapped[int] = mapped_column(ForeignKey("roles.id"), primary_key=True)

class TeamRole(Base):
    __tablename__ = "team_roles"
    team_id: Mapped[int] = mapped_column(ForeignKey("teams.id"), primary_key=True)
    role_id: Mapped[int] = mapped_column(ForeignKey("roles.id"), primary_key=True)

def _get_seed_objs() -> list[Base]:
    founder, friend, foe, bball, skater = range(1, 6)  # ids for roles
    nobara, gojo, lakers = range(1, 4)  # ids for teams
    return [
        Role(id=founder, name="founder"),
        Role(id=friend, name="friend"),
        Role(id=foe, name="foe"),
        Role(id=bball, name="hooper"),
        Role(id=skater, name="skater"),

        Team(id=nobara, name="Nobara Labs"),
            TeamRole(team_id=nobara, role_id=friend),
            TeamUser(team_id=nobara, user=User(id=1, email="orion@nobaralabs.com")),
                UserRole(user_id=1, role_id=founder),
                UserRole(user_id=1, role_id=skater),
            TeamUser(team_id=nobara, user=User(id=2, email="andrew@nobaralabs.com")),
                UserRole(user_id=2, role_id=founder),
                UserRole(user_id=2, role_id=bball),
        User(id=3, email="sukuna@world.slash"),
            UserRole(user_id=3, role_id=foe),
        Team(id=gojo, name="Team Gojo"),
            TeamRole(team_id=gojo, role_id=friend),
            TeamUser(team_id=gojo, user=User(id=4, email="satoru@gojo.com")),
            TeamUser(team_id=gojo, user=User(id=5, email="yuji@itadori.com")),
                UserRole(user_id=1, role_id=friend),
            TeamUser(team_id=gojo, user=User(id=6, email="nobara@hammer.nails")),
            TeamUser(team_id=gojo, user=User(id=7, email="megumi@fushi.guro")),
        Team(id=lakers, name="LA Lakers"),
            TeamRole(team_id=lakers, role_id=bball),
            TeamUser(team_id=lakers, user=User(id=8, email="lebron@king.com")),
            TeamUser(team_id=lakers, user=User(id=9, email="luka@doncic.com")),
            TeamUser(team_id=lakers, user=User(id=10, email="austin@lemon.daddy")),
            TeamUser(team_id=lakers, user_id=2),  # in my dreams ;)
    ]

Ok now we're going to run this little script against this sqlite engine (FYI because the engine just points to sqlite with no database file specification, everything just happens in memory and is not persisted). Sqlite is so great for so many things—especially in this use case!


import logging

if __name__ == "__main__":
    roundtrip_count = 0
    engine = create_engine("sqlite://", echo=False)
    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    Base.metadata.create_all(engine) # create schema
    with Session(engine) as session:
        session.add_all(_get_seed_objs())
        session.flush()
        logging.info(f"For those curious, we have done: {roundtrip_count} db trips so far, but we're about to reset the counter.")
        roundtrip_count = 0 # reset counter for fresh count
        map_of_user_roles = {}
        for team in session.query(Team):
            for user in team.users:
                map_of_user_roles[user.id] = [role.id for role in user.roles]
        session.commit()
        print(map_of_user_roles)
        logging.info(f"Just for that little map. We somehow managed to generate {roundtrip_count} db trips.")

# The logs for this script output:
-------------------------
"""
INFO:root:For those curious, we have done: 12 db trips so far, but we're about to reset the counter.
{1: [1, 2, 5], 2: [1, 2, 4], 4: [2], 5: [2], 6: [2], 7: [2], 8: [4], 9: [4], 10: [4]}
INFO:root:Just for that little map. We somehow managed to generate 34 db trips.
"""

The naive programmer who just wants to get the feature up and wants to crush the problem creates this problem. The dev just needed a list of these specific users roles, with team roles included, and they unfortunately unknowingly created a travesty that just won't scale. All for this simple dictionary object: {1: [1, 2, 5], 2: [1, 2, 4], 4: [2], 5: [2], 6: [2], 7: [2], 8: [4], 9: [4], 10: [4]}. That was hardly any data at all. Techniques like this in production at sufficient scale can start producing thousands of db trips per request if you're not careful.

A) So what happened? B) And is this really realistic?

A) So first, we have query for session.query(Team). That's 1 query for 3 teams. Then, we loop through each team to gets its users. That's 1 query for each team. 3 more = 4 in total. Then we have the real killer, which is making that unfortunate user.roles hybrid_property call. Now for each TeamUser entry, we are then making 3 more DB calls per user. That's how we get to 34.

B) Yes, absolutely it is. At work, I have seen requests doing over 2,500 roundtrips in a single request. It's absolutely possible, and it's so easy to do in a large codebase with lots of properties and methods treating every ORM object like a regular python without any awareness of the actual SQL that your code is actually emitting.

DB trips are so fast nowadays (usually). So does this really even matter? Let's say we have 1,200 trips made for a given requests. Let's say our DB and the server running our code are sitting right next to each other, so close that a roundtrip is around a millisecond (which isn't always the case). Now you're spending 1200ms just on network time. The queries themselves take time too - let's say 4ms. That's 6 seconds in total with network time included, which for a high-performing api is an eternity. You probably have to validate and process that data too, maybe loop around and make an INSERT/UPDATE/DELETE. If more than a couple seconds is just your journey to and fro your data store, frankly you're cooked.

Conclusion

You might be tempted to see this kind of thing and be like, no shit, this is just a Skill Issue, and sure, to the person that knows SQL well (and tbh I haven't found that many so far in my career), you're of course correct. You're correct, but you're also just proving my point. The ORM God can do a ton with an ORM: dynamic filters, crazy eagerloading, gather new instances into single inserts, etc., but that's because you understand what is going on under the hood. And unfortunately all these different ORMs are all different! Relatively recent versions of sqlalchemy, probably the most popular python ORM, differ amongst each other. It has this conflict of session flushing where you can do a bunch of creating and mutating to your ORM objects and then bam you session.flush() and everything goes to the DB "for efficiency", but the funny thing is that sqla v1.3 had pretty much no performance advantages, sqla v1.4 at least collected new objects for bulk INSERTs and DELETEs, and meanwhile the newest edition v2+ still does not have bulk UPDATEs (admittedly this is not an easy challenge - especially if you're trying to make the solution agnostic to any database you use).

Anyways, all I mean to show here is that using what the industry dictates you use, doesn't mean you don't have to learn what's going on the under the hood, and frankly, doesn't meant that your application will be any better. With leaky abstractions like this, learn your tool really well. Setup utilities to watch db roundtrips during tests and in production. Understand what your ORM is doing. Don't trust that the 3rd party library - just because it has a ton of medium articles and stars on GitHub about it is going to solve your woes without understanding and learning.