As you might know, Python’s slogan is batteries included
, which means that many of the common tasks that
one would want to do are possible with the standard library. Since Python is open source, I thought
that it might be fun to take a look at these batteries to see how Python implements these convenient
functions and classes.
One of the things that I find quite useful in the standard library is the Counter
class from the collections
module. As the name might have given away already, it allows you to easily count things.
Say you want to count count how often a specific letter appears in a word. You could write some code like this:
word = "montypython"
counter = {}
for letter in word:
if letter not in counter:
counter[letter] = 0
counter[letter] += 1
print(counter)
Output:
{'m': 1, 'o': 2, 'n': 2, 't': 2, 'y': 2, 'p': 1, 'h': 1}
Or - following the Easier to ask forgiveness than permission
motto - one could write the code like this:
word = "montypython"
counter = {}
for letter in word:
try:
counter[letter] += 1
except KeyError:
counter[letter] = 1
print(counter)
But since this is quite a common task, Python provides a class for this that you can use. When using the Counter
class,
your code could simply look like this:
from collections import Counter
word = "montypython"
counter = Counter()
for letter in word:
counter[letter] += 1
print(counter)
Strictly speaking, you could also just write Counter(word)
to get to the same result, but I want to look at the manual way
today. The question is: Why does this work? Shouldn’t it also raise a KeyError
since the keys that we are trying to increment
are not in counter
yet?
To understand what is happening here, we should take a look at the source code of Counter. 1
All we need is the following lines:
class Counter(dict):
def __missing__(self, key):
'The count of elements not in the Counter is zero.'
# Needed so that self[missing_item] does not raise KeyError
return 0
In the first line, we can see that Counter
is actually a subclass of dict
. dict
is Python’s dictionary type which
is implemented in C.
The actual implementation of Counter
has many more functions than just __missing__
, but it is this one that is interesting
to us. When we try to access the item of a dict (e.g. by saying give me the value for a
in the dict counter
: counter['a']
),
the C code makes a call to the dict_subscript
function. It looks (in part) like this:
ix = (mp->ma_keys->dk_lookup)(mp, key, hash, &value);
if (ix == DKIX_ERROR)
return NULL;
if (ix == DKIX_EMPTY || value == NULL) {
if (!PyDict_CheckExact(mp)) {
/* Look up __missing__ method if we're a subclass. */
PyObject *missing, *res;
_Py_IDENTIFIER(__missing__);
missing = _PyObject_LookupSpecial((PyObject *)mp, &PyId___missing__);
if (missing != NULL) {
res = PyObject_CallFunctionObjArgs(missing,
key, NULL);
Py_DECREF(missing);
return res;
}
else if (PyErr_Occurred())
return NULL;
}
The function tries to get the value for the given key. If it can find a value, it returns
it (further down) but if it cannot find a value (ix == DKIX_EMPTY || value == NULL)
, it makes a check to see if the current object is a subclass of dict
(!PyDict_CheckExact(mp)
). If that is the case the function looks up the special __missing__
function in this object and calls it to get
the value to be returned instead.
As we can see above the Counter
class actually implements the __missing__
function. It always returns 0. So in our
above example what happens is this:
In our example above, we iterate through "montypython"
. In the first iteration letter
is m
. Using +=
is syntactic sugar around __getitem__
and __setitem__
. Meaning that the call counter['m'] += 1
is (almost) shorthand for:
_count = counter.__getitem__('m')
_count += 1
counter.__setitem__('m', _count)
When this code is executed, __getitem__
will make a call internally to dict_subscript
, which will return DKIX_EMPTY
because
no value for m
is stored yet. Because it returns DKIX_EMPTY
, it will check if a __missing__
functions is defined instead
and if so it will use it’s return value. It will find that we have a __missing__
function defined on the Counter
object
and use the return value of 0
. This means that the second line will look like _count = 0 + 1
and the third line will now actually
add the value to the dictionary by calling counter.__setitem('m', 1)
. That is all the magic needed.
So all in all, the convenient Counter
module (which also offers a lot of other features like creating a counter directly from an iterable,
adding and substracting counters, getting most common values etc.) is not magic at all. Most of the code is written in Python and
the parts that interface with the C parts are not that difficult to understand 🚀.
If you found any mistakes or inconsistencies in this or have an idea for the next module from the python standard library that we should take a look at feel free to reach out to me via email or twitter
PS: Yes, for what I was showing here I could have just used a defaultdict(int)
but as far as I can tell that is completely
implemented in C so I would not have been able to show this C/Python intersection so nicely :)
PPS: Thanks a lot to Lisa for proofreading this!
1. Note that I'm not a python core developer though so the following should probably be taken with a grain of salt.↩