Better Queries, Less Code - How to Use COALESE, IFNULL and NULLIF
So I wouldn’t advocate putting raw SQL in your code in general. But there are some functions there that can be worth it, especially compared to the hassles that translating things into Activerecord and Arel can get you into sometimes.
Especially when that translation effort leaves you with more code – and more tangled queries – than really seem worth it.
All of this and more was at play recently when I started using COALESCE
, NULLIF
and IFNULL
while querying around in my db browser.
What I wound up needing to do was going to be a giant pain to translate into ActiveRecord, and likely would have left me with something like this:
some_records.joins(other_records: :still_more).
where(boolean: true, StillMore.arel_table[:this_thing].gt(1.week.ago)).
select("some_records.name, other_records.description, still_more.preferred_attribute as preferred, still_more.default_attribute as default")
Kind of real ugly, and that’s about half of what I was doing. More to the point, each of the objects returned would need to be sorted through so I could display and/or use the correct attribute
. And it would have left each of the return objects in an unpredictable state – maybe both attributes would have a value, maybe one or the other would be null.
So I’d still have more coding (and likely iterating) to do to deal with all that.
And that’s just for a simple select. If I needed to join there, it woulda gotten messier.
If only I could let the database figure this out for me, so that whatever needed these attributes had a single, predictable, non-null value to read when the time came.
Short takeaway: These MySQL functions move some routine, low-level and potentially n+1 logic having to do with null-handling and preferred vs. default values down into the database level, leveraging the approximately 200 years of db development and production to make your queries more useful for whatever other classes, views, helpers, methods or what-have-you need that info.
Read on for a run-down of these functions and, if yer really digging this, some skeletal use-cases afterward.
ENTER COALESCE, IFNULL, NULLIF
The documentation on this stuff is actually pretty good, but there aren’t a lot of Ruby examples there. I come from a Ruby background, so this matters to me.
Anyway, here’s what’s going on:
COALESCE(this_column, that_column)
: This returns the value from the first argument, if there is one. If it’sNULL
, return the second argument (in this case, the value of a different column). Best used when you’re assured of at least one non-null value and extremely handy with anas attribute_name
to resolve the ambiguity around preferred/default attributes.
Also, this is by no means the sole, or even primary, use of this function. It’s the one I’m mostly looking at in this case, though. It’s often going to be useful to have your db compare some values and return the first non-null one.
For my purposes here, you can think of this as operating much like the pipes (||
), only on the db level. If you plan on using this value in another query right away, it also saves you the hassle (and time) of selecting your rows, interating over them, then running at least another query.
IFNULL(some_col, return_val)
: This returns the first argument if it’s non-NULL. If it is NULL, it returns the second argument. This has the benefit, at least in the cases when I’ve used it, of adding more predictability to the query and saving me a little bit of code.
Those are easy enough to code around, of course; you can add a try(:some_col)
or a some_col.present?
or bang in an if some_col
at the end of a particular line, and there you go.
Letting the db istelf sort this out for you, though, lets your other methods and functions can do what they’re intended to do without worrying about what your db’s up to or what shape your data’s in.
As for the ugliness of tossing it in there, it’s easy (and useful!) enough to stash away like this:
class PaymentProfile < ActiveRecord::Base
scope :account, -> { select("IFNULL(preferred_account, default_account)") }
end
If this looks, to you, a lot like COALESCE
, you’re right. I’ll break out some of the differences below.
NULLIF(that_col, '')
: This function compares the two arguments. If they match, then it returns NULL. If not, it returns the first argument.
This one’s great for dealing with columns that can be populated both with empty strings and NULL
, despite, from your system’s point of view, being the same thing.
So in our brief little use case, if a bunch of empty strings (or even invalid values) snuck into preferred_account
, we can feed that in here to make sure we’re getting the default one.
This is how NULLIF
makes COALESECE
a lot more useful, in my book: you can use NULLIF
to “sanitize” one column’s value, so to speak, and make sure to select the one with the valid value.
And things of this sort come up from time to time. Developers from the past have this habit of not having predicted what we in the present will be wanting from the database. Or not seeing all the ways that other developers in the future might need protection from themselves.
So if a big backfill to clean up the db is out of the question – or really not worth the time in cases when its only effect is to require slightly trickier querying – this function comes in handy.
COALESCE AND IFNULL LOOK AWFULLY SIMILAR
That’s mainly down to the kinds of use cases I’ve mentioned here, which is what I’ve used them for almost exclusively.
In reality, they’re up to different things.
- IFNULL is control-flow, while COALESCE deals with comparison
- COALESCE can take several arguments and return the first non-NULL. Rather than just the control-flow-y uses I focus on here. IFNULL is stricly about comparison, with some instructions on what to do if the first one (“arg1 == NULL”) returns false. You can think of it sort of like the
.fetch(:key, fallback_value)
Ruby hash method. - when given all non-null values as arguments, COALESCE returns NULL. IFNULL just returns the first argument.
COALESCE(NULLIF(
is a lot easier on the eyes – or at least a lot less painful to look at – thanIFNULL(NULLIF(
, but that’s just me.
(Very) Simple Use Case
So in the case above, I was dealing with a situation like the following:
- A table has two columns for similar values. One is preferred but not always present, and there’s always at least one non-NULL value between them.
- Most of what needs to read this attribute doesn’t need to know which one it’s getting. They’ll just need a valid, non-NULL value.
- Ensuring that the preferred attribute is selected when present matters enough that the logic around that should be completely decoupled from anything that might read from it.
- We aren’t dealing with setting either of these attributes.
- Have one predictable name for whichever attribute gets returned is going to matter more, for most of what’s doing this, than which underlying db column it comes from.
- You need to minimize the number of round-trips to the db as much as humanly possible.
- You also would really benefit from minimizing the amount of post-query processing that needs to be done.
Consider an app that stores users’ payment information, (PaymentProfile
), including bank account numbers. You let your users add multiple accounts and specify which one they prefer. But they have to have at least one. Call ‘em preferred_account
and default_account
.
What you need, much of the time, is just a valid account number. Yeah, you want to choose preferred_account
if it’s there, but you don’t need to know whether that’s NULL
. You just need to be able to call something like user.account
and get a non-null value.
For one record at a time, this is no sweat. Just do something like
def account
preferred_account || default_accountpreferred_account
end
Even truthy but useless junk is easy enough to work around:
def accounts
preferred_account.present? ? preferred_account : default_account
end
And that’s fine, but you’re still reading that row in the db and introspecting it, just to get a non-null value. If you need to get these numbers from, say, a couple hundred thousand user records, then … well, meet tmux.
A Slightly Different Use Case
So let’s say the values stored at preferred_account
and default_account
aren’t actual account numbers because you aren’t crazy people.
Instead, they’re a kind of industry-standard identifier for records, people, accounts, whatever. And you’re building out a feature to integrate some info from some of those places onto your users’ accounts.
In this case, it’s publicly-available profile information, let’s say. You don’t need any OAuth-level security, and the main thing your users want, aside from incorporating some of this info, is not to be bothered with setting it up.
So you’re going to need a way to identify users on that other service and query their API for that info, hence this account field. And you’re probably going to be running jobs regularly to check for/update whole batches of users, maybe in cron or something.
So far, so easy:
users_for_query = User.joins(:this_thing).where(boolean: true, this_thing: {some_other_attr: "yup"}).pluck(:account)
some_query_logic(user_for_query)
do_the_thing_with_that_other_stuff
But for whatever reason, there’s two types of identifiers. Say a new industry standard came out that’s more secure and/or efficient, but everyone’s agreed to be backward-compatible with the old one. So now you have to deal with that.
This works:
users_for_query = User.(:this_thing).where(boolean: true, this_thing: {some_other_attr: "yup"}).map(&:find_account)
def find_account
preferred_account || default_account
end
But seriously, meh.
Also, you find out that there’s no callbacks/constraints/filters anything controlling how a preferred_account
with a value, to Ruby, of nil gets recorded. Or maybe it’s some other part of the code that’s not deleting these properly.
You don’t know – all you know is this table is used by things all over the codebase and, for one reason or another, a backfill isn’t an option here.
So now find_account
has to check for a null value, and whatever uses users_for_query
or similar things will have to account for this. If you need something from all of these tables, but the one with the preferred/optional attribute isn’t one of the attributes for the present class, then it gets a bit more complicated. Plus, you’re first querying the db and then iterating over the response.
And we haven’t even looked at the other end of this: matching the response objects from your API calls to users in your db so you can update the right fields.
As it happens, you’ve got a service that isn’t really specifying whether this is one of the old IDs or the new ones. And there’s no particularly easy way to identify these in code.
Let’s say the response object is simple enough for you to extract something like this:
{
info_we_want: 19893875,
identifier: 90980asdga098
}
And since you’re searching the API for a lot of IDs all at once, you really aren’t going to want to run a separate INSERT
for every one.
Here’s how COALESCE
can help:
First, make sure that users_for_query
thing has a account
field from the word go:
users_for_query = User.(:this_thing).where(boolean: true, this_thing: {some_other_attr: "yup"}).select("users.id, COALESCE(whatever.preferred_attribute, whatever.default_attribute")
This makes it relatively easy to set up your inserts by mapping users_for_query to response_objects by taking the identifier
attribute from the response objects, looking into the users_for_query
for that number and grabbing the user id (your internal identifier), then updating whatever table you’re working on accordingly.
If you’re simply adding new lines to a table (again, this is a new feature, so you’re pulling in data more than you are modifying and updating existing records), then you can hold your nose, put in some more raw SQL, and make it even easier.
Here’s how the process could work, more or less:
inserts = response_objects.map{ |obj| {user_id: users_for_query[obj["identifier"]], info_we_want: obj["info_we_want"] }
bulk_insert_statement = "INSERT INTO new_table (user_id, info_we_want) VALUES ("
inserts.each{ |k,v| bulk_insert_statment << "#{k}, #{v}), ("}
Not super-pretty, especially all that string manipulation (and the chomping you’ll have to do to make sure it closes correctly).
But you’ll make exactly 2 db calls: one to grab all the user info you need for the API call and one – regardless of how many records you get back – to insert the new rows.
So that’s a lot of stuff about a couple relatively simple functions. But spending a little more time thinking about how I could leverage things the db’s optimized to do to simplify my code and speed up my queries has kept on paying off even beyond the benefits I’ve so far wrung from these functions.