Keeping Your Sanity
When it comes to data science projects, reproducibility is always nice to have, not only for the sake of science, but also to preserve one’s sanity – revisited experiments should make sense rather than leave one baffled. Moreover, anyone who intends to share their project will want to make sure their peers get the same results using the same code and dataset. Here, we are not talking about trivial aspects of versioning and data management, nor about random state control parameters, which we hope all users of data science libraries are familiar with. What we want to talk about is the grey area of overlooked implementation details, and to highlight a case where the [gensim] library’s accidental exposure of a Python implementation detail almost made us lose our minds.
Gaslighting and [gensim]
One of our projects involved natural language processing (NLP). To perform the task we used the [gensim] library (v. 3.1.0), a Python library which provides implementations of various algorithms useful for NLP. We ran our experiments and collected results. Some time later we ran the experiments once again, using the same dataset and conditions. Surprisingly, the results of the two runs did not match, despite our efforts to control the experiment.
For unrelated reasons, the code was moved to another machine with a different environment. And, to some relief but even greater confusion, suddenly the results were reproducible! The only difference between the two environments was in their versions of Python: the first machine had Python 3.5.2 installed, the other had version 3.6.3. All libraries, including [gensim], were at the same version. A version jump from v. 3.5 to v. 3.6 doesn’t suggest substantial changes – certainly nothing on the scale of Python’s notorious v. 2 to v. 3 leap – but the devil is in the details.
Chasing at Shadows
The [gensim] library includes a class called [Dictionary] that maps words to integer identifiers, collects word frequencies, and converts textual documents to vectors. These kinds of operations are common in NLP tasks, where the frequency of specific words can indicate their importance and topics of discussion. For example, the Latent Dirichlet Allocation topic model works with textual documents represented as vectors in which each value is an integer indicating the frequency of a specific word (the ‘bag of words’ encoding). Mapping words to integer identifiers also serves a very practical purpose: it reduces the memory needed to store enormous text corpora. It’s a valuable feature, but the words, identifiers and frequencies [gensim] stores in a [Dictionary] are all linked together in a complicated way that means any change to one can affect the others.
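The core idea can be sketched in plain Python – a hypothetical miniature of what a word-to-identifier mapping and ‘bag of words’ encoding look like, not [gensim]’s actual implementation:

```python
from collections import Counter

# Toy word-to-id mapping in the spirit of gensim's Dictionary
# (illustrative only; gensim's real implementation differs).
docs = [["dog", "chases", "cat"], ["cat", "naps"]]

token2id = {}
for doc in docs:
    for word in doc:
        if word not in token2id:
            token2id[word] = len(token2id)  # next free integer id

def doc2bow(doc):
    """Encode a document as sorted (word_id, frequency) pairs."""
    counts = Counter(doc)
    return sorted((token2id[w], n) for w, n in counts.items())

print(token2id)
print(doc2bow(["cat", "cat", "dog"]))  # [(0, 1), (2, 2)]
```

Any scheme that assigns the integers consistently would do; trouble starts when the assignment depends on an unpredictable iteration order.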
We made some careful investigations and soon isolated the problem to the [Dictionary] class. Examining the source code, we found the functionality fairly straightforward, except for one curious use of the [dict] collection type, Python’s implementation of the hash table data structure that maps arbitrary keys to corresponding values. By design, hash tables don’t have a ‘natural’ order of elements as linear collections such as lists do, and depending on implementation choices the order of elements within a hash table may be unpredictable. Our word-integer mappings were unpredictable, so perhaps [dict] was to blame? Yet the evidence was confusing: within a single execution, recreating [Dictionary] instances gave us the consistency we sought, but between executions the results were very different.
The ordering of [dict] entries should never be relied upon due to the nature of hash tables, but that alone does not explain the non-determinism we observed. Why was this arbitrary order not even consistent between executions?
Descent into Madness
As mentioned above, the [dict] in Python is implemented using the hash table data structure: the keys of a [dict] are hashed, and each hash is mapped to a bucket in a dynamic table. When iterating over a [dict] – for example, to display its contents – the buckets are traversed in table order, skipping empty ones, so the position of an entry is determined by where its hash lands.
Let’s look at this in closer detail. Since the hash table has fewer buckets than there are possible hashes, an operation is needed to map many hashes onto a single bucket. Python uses the operation [hash & (number_buckets - 1)] to ensure every hash is mapped to an allocated bucket. For the small [dict]s we are going to construct in this example, Python allocates 8 buckets, hence the bucket selected is [hash & 7].
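The bucket calculation itself is easy to replicate (a small sketch; under a modern Python with hash randomisation enabled the printed buckets will vary between runs, and will not match the original Python 2.7 values):

```python
# Map a key's hash to one of 8 buckets: hash & (number_buckets - 1).
# The & mask always yields a bucket index in the range 0..7.
for key in ("dog", "cat"):
    bucket = hash(key) & 7
    print(key, "-> bucket", bucket)
```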
In our original Python 2.7 session, the key [“dog”] landed in bucket 5 and so appeared before the key [“cat”], which was placed in bucket 7. Hash randomisation already existed in Python 2.7, but it is disabled by default, so the hashes – and therefore the order of iteration over a [dict] – are consistent between executions.
Since Python 3.3, hash randomisation has been enabled by default. The same key can therefore hash to a different bucket in each run, and the iteration order of a [dict] may differ from one execution to the next.
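We can watch this happen without restarting our own session by launching fresh interpreters, each of which draws a new random seed when [PYTHONHASHSEED] is unset ([hash_in_fresh_interpreter] is our own helper name, not a standard function):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(word):
    """Hash a string in a brand-new Python process with randomisation active."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# Two fresh interpreters, two (almost certainly) different hashes:
print(hash_in_fresh_interpreter("dog"))
print(hash_in_fresh_interpreter("dog"))
```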
Since v. 3.3, the only way to get consistent hashing in Python is to set the [PYTHONHASHSEED] environment variable. In v. 3.6, the iteration order of Python’s [dict] changed yet again, as it was made consistent with the order in which elements are inserted into the collection. Thus, while hash randomisation is enabled by default in v. 3.6, its effect on [dict] iteration is obscured by the preservation of insertion order.
Now, even when [“cat”] hashes to an earlier bucket, [“dog”] was inserted first, so it is the first entry returned when iterating over the [dict]. Swap the insertion order and the iteration order swaps with it, regardless of the hash values.
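On a modern Python (3.7+, where insertion ordering became a language guarantee rather than a CPython implementation detail), this is easy to verify:

```python
# Iteration follows insertion order, whatever the keys hash to.
d1 = {}
d1["dog"] = 1
d1["cat"] = 2

d2 = {}
d2["cat"] = 2
d2["dog"] = 1

print(list(d1))  # ['dog', 'cat']
print(list(d2))  # ['cat', 'dog']
```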
Attitudes towards [dict]’s hashing behaviour have varied from version to version, and the result is that the behaviour of any software affected by the order of iteration of [dict] entries may change depending on the version of Python used to run it. Of course, the order of entries in a [dict] should never be relied upon, but a great deal of Python code uses [dict], and you may be depending on its order without even knowing. Which brings us back to [gensim].
Revealing the Killer
As previously mentioned, [gensim]’s [Dictionary] class uses [dict] to perform a mapping between words and integer identifiers. The order of iteration over a [dict] determines which words are assigned which integers, so as the order changes so do the identifiers each word is assigned. This is actually a case where the usage of [dict] is incidental – many approaches can be used to map words to integers, but the [gensim] implementers have chosen [dict] iteration to achieve this. As such, the [Dictionary] class has the essential property of being deterministic only when the iteration order of [dict] is consistent.
In one way, this is great! Everyone performing experiments using [gensim]’s [Dictionary] and the latest version of Python has reproducible experiments, provided their data transformations are deterministic. But the corollary is that many experiments are needlessly non-deterministic, and therefore not repeatable. Worse, this issue can manifest in confusing ways; our Python 3.5 experiments, for example, were producing different topic models because the initial vectors were different in each execution, causing the algorithm to converge in different ways. Imagine our surprise at staring at this curious result of that complex algorithm only to eventually arrive at the humble [dict].
The End… Or Is It?
The principle of reproducibility is essential to science, and all algorithms are fundamentally deterministic – even the hash functions [dict] depends on. Given the subtlety of some implementation details, it’s understandable that developers may get caught out. But the issue in [gensim]’s case is more profound: data science libraries need to ensure determinism if there is to be any confidence in the results they produce. To this end, we’ve submitted a PR that adds this property to [gensim]’s [Dictionary] by removing [dict] from the process of generating word identifiers and replacing it with a sorting policy that leaves no room for unpredictability. We are happy to report that the [gensim] maintainers have been receptive to this change, and we are hopeful that we can soon start using [Dictionary] with confidence that our experiments will be consistent.
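The essence of such a fix can be sketched in a few lines – our illustration of the sorting idea, not the actual patch:

```python
def deterministic_token2id(frequencies):
    """Assign word ids in sorted word order, so the mapping never
    depends on dict iteration order."""
    return {word: idx for idx, word in enumerate(sorted(frequencies))}

freqs = {"dog": 3, "cat": 5, "mouse": 1}
print(deterministic_token2id(freqs))  # {'cat': 0, 'dog': 1, 'mouse': 2}
```

Sorting costs a little extra work when the vocabulary is built, but the mapping then depends only on the words themselves, never on the interpreter.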
But it makes you think… how long before we encounter the next source of unintentional non-determinism in a data science library?
Read further: Data Management for Data Scientists
The foundation of daisee is a collaboration between Deakin and UNSW Universities. This is a featured blog article from the Deakin Software and Technology Innovation Lab DSTIL blog.
Header image courtesy of Dean Hochman.