18 Jun 2018
The scientific computing community has a problem. A fairly large one. Come to think of it, even
good old experimental science has this problem. I’m talking about the
reproducibility crisis.
Essentially we cannot repeat the results seen in a significant portion, somewhere between 65%
and 95%, of experiments performed in biomedical science. And even worse, scientists aren’t
incentivized to make their research reproducible or even to report a non-significant result.
Truth be told, science is hard.
It’s hard working with real data to find causes and effects. Being honest in science about the
fact that your model might not say anything new is not celebrated. Which leads me to my point.
If you work at a data science type tech company, you regularly have data scientists developing
models that try to make predictions, classify things, or make recommendations to people, much as in
science. These models are then passed along to engineers who create backend services which utilize
them. The crisis comes to a head when the engineer says to the scientist, “Hey, we need to test
this, got any ideas?”. The researcher will scratch their head, think very hard for a few minutes
and say one of the following:
- “I already tested it. It is a perfectly crafted model that needs no testing.”
- “I’m not sure how to test my model. Got any ideas?”
- “What do you mean test my model? What’s a test?”
- “Sure, here are my ideas….”
If you hear the last one, consider yourself blessed. If you hear the third one, you at least can
have a conversation about what testing is and why it’s important. The second response isn’t quite
so bad either, since the researcher is willing; it could lead to a very fruitful conversation about
the model, with both parties coming away with a greater appreciation for how integral we all are
to making the company work.
The first response is perhaps the least desirable. The researcher knows (or at least has some idea)
what testing is. The difficulty then is that the researcher has implicit faith in the absolute
truth of their model. I realize I am perhaps putting words in people’s mouths when I say this,
but it does seem odd to believe absolutely in the truth of a model when statistics says the
opposite. Models have uncertainty, even perfectly crafted ones. Perhaps in a perfect world we might be
able to believe a model without needing to validate it, but we are not in Kansas anymore, Dorothy.
But why do the testing? I’ll present a few reasons.
Models aren’t perfect
Sad to say this but I will quote George Box: “All models are wrong, but some are useful.” Models are
approximations of reality. This means we need to know how wrong our model is. Enter testing.
Your dataset isn’t complete
Unless you have unlimited resources and computing power, your dataset probably hasn’t sampled the
entirety of the population. Therefore you’ll probably encounter someone/something that wasn’t in
your dataset. Edge cases, corner cases. Testing protects us against completely new data never
seen before.
Your user population may change over time
People are born every day. Users join and leave over time. Your model that was built in 2012 might
not be as right in 2013. Your user base may have changed, new data might have become available,
demographics might have changed. Testing can surface this over time, hopefully as it happens. See
model drift.
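To make all of this concrete, here is a minimal sketch of what such tests might look like; the model object, data names, and the accuracy threshold are all hypothetical, and the point is only that a baseline check and an edge-case check are cheap to write.

# A minimal sketch, assuming a scikit-learn style model with a predict()
# method and a held-out evaluation set. Names like model, X_holdout,
# y_holdout, and the 0.80 threshold are illustrative, not prescriptive.
import numpy as np

def test_holdout_accuracy(model, X_holdout, y_holdout):
    # Guard against silent regressions: the model should stay above an
    # agreed-upon baseline on data it never trained on.
    accuracy = np.mean(model.predict(X_holdout) == y_holdout)
    assert accuracy >= 0.80

def test_handles_unseen_input(model, n_features):
    # Edge case: an input unlike anything in the training data should
    # still yield a prediction rather than an exception.
    weird_row = np.zeros((1, n_features))
    prediction = model.predict(weird_row)
    assert prediction.shape == (1,)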
So please, if you have a science model, test it. Test it good.
15 Jun 2018
The Manager’s Tale
Let’s say one day you’re sitting in your manager’s chair and one of your data scientists
comes up to you and says, “Hey, I have this fantastic idea that will give us a 10X boost in
customer satisfaction!”. The moment you hear this your eyes bug out and you leap from your
chair and holler:
And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!
Your delight is short-lived though, as you begin to think about how you will get this 10X
idea implemented. Images of irate middle managers, thundering vice presidents of product, and
a fuming CSO make you recall some very wise words passed along to you years ago:
Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!
You take a deep breath and think some more about what your available options are. It occurs to
you that there are several ways you could solve this problem:
- Have your Engineers write tools in the production language that scientists can use
for research and prototyping.
- Let your Scientists write in their language of choice and then have Engineers translate
to the preferred production language.
- Have your Engineers and Scientists use the same language for research and production.
Option 2 seems to be the status quo for most tech companies. Option 1 seems like a
really cool idea which you have heard some companies claim to do
(Stitch Fix for example).
Option 3 frankly sounds like a place that only unicorns work at. Let’s think about each of
these options one at a time.
A Common Tongue (Option 3)
In a perfect world, there would be only one programming language that did it all. It would
be the Swiss Army knife that could do CSV parsing, plotting, data science, and parallel
big data processing. Unfortunately we don’t live in that kind of world. What we have is more
along the lines of as-seen-on-TV programming languages that can do some things
fantastically but not everything.
I know at this point someone will be saying, “Hey, what about ___ ? It does it
all and is quite good at everything!”. I would counter that if this were the case then there
would actually be a company that only uses one language for their full stack, which I have
yet to hear about. Some of this can probably be traced back to having too many cooks wanting to use
their favorite framework/JVM/newly fashionable language to do the job.
In reality, certain languages are more suited to certain tasks than others along with the fact
that developers will develop what they are passionate about. Developers don’t seem to want to
build some programming Tower of Babel. That just doesn’t seem to be a thing. Individuals have tried
to advocate for certain languages being the final word but again, where is this “one
language to rule them all”?
Another issue that comes with this option is that you need data scientists and engineers to agree
on a language, and it will have to do both tasks well enough, which will probably disappoint everybody
at some point. So scratch that.
SSWEeeet Times (Option 2)
Like I mentioned, this option seems to be the standard method for putting models and improvements
into production. It then requires assigning some engineering resources to the translation of
researcher code.
One could argue that this arrangement separates the concerns of ideation and implementation. Let the
data scientists come up with the ideas and let the engineers make it work in the real world. While this
sounds great, I do think it leads to conflict, disappointment, and frustration. The Stitch Fix article
puts it pretty well I think:
In case you did not realize it, nobody enjoys writing and maintaining data pipelines or ETL. It’s
the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering
roles are the archetypal breeding ground of mediocrity.
Engineers should not write ETL. For the love of everything sacred and holy in the profession, this
should not be a dedicated or specialized role. There is nothing more soul sucking than writing,
maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
I would take some issue with the label of mediocre that the author places on all who wind up in this weird
middle ground. But his point is still valid. Nobody likes implementing someone else’s idea, particularly if
it’s poorly written in the first place. As a SSWE you then have two options:
- Rewrite the d@mn thing using proper software engineering design principles
- Give in to apathy and just pass the code along (mostly) unchanged
This arrangement also allows data scientists to abdicate responsibility for writing good code in the first
place. Why bother writing something that will scale, has good documentation, and is fast when you have an
engineer to do that for you?
The Nailsmith and The Knight (Option 1)
A small digression, please indulge me. I play Hollow Knight in my spare time,
when I’m not cooking, hiking with my wife and daughter, or building tiny lego towers with my daughter. The
knight in the game starts out with a basic nail to fight with. Partway through the game he meets the
nailsmith who offers to make the knight’s nail stronger and sharper. The knight gives his nail to the smith
who proceeds to make it sharper and more powerful. The knight can now defeat enemies quicker, deal more damage,
and explore more dangerous places on his quest. Why am I mentioning this?
I already elaborated on the previous two approaches and what their disadvantages are. The main failing usually is
that people wind up doing things they don’t really want to. For example, if everyone writes in the same
language, then either data scientists will feel underpowered and weak or engineers will feel hamstrung. In
option 2 we have engineers specifically hired to deal with other people’s messy inefficient code who dream
of doing better things.
In either of those cases we have woefully ill-equipped knights fighting monsters far bigger than they
originally thought they would.
So why not let people do what they are good at and passionate about?
Engineers are good at crafting well thought out tools. Data scientists are good at analysis, discovery, and
exploration. Let the engineers craft the tools for the data nerds. You wouldn’t tell a novice cook, “Hey,
if you want to make bread, you’ll need to go buy some wheat seed, plant it, tend it, harvest it, grind it,
make the bread, build your own brick oven, chop some wood, and then bake the bread!”. A cook cooks. A
farmer farms. The knight can’t fight monsters all that effectively without a good nail and he can’t have
good tools unless there’s a nailsmith to help him.
Epilogue
So where does that leave us? You’ve been twirling in your manager’s chair for some time now talking to yourself.
Your coworkers are giving you odd looks. The facilities manager has even wandered by and asked if you’re okay.
Unfortunately you can’t do much. You’re stuck with the organization structure that you have. You, as the
manager, need to make sure everyone is happy in what they’re doing. So my advice to you is this. Tell
your data scientist to get cozy with the engineer(s) they will be working with to implement the enhancement.
As we all know, ultimately the engineer is going to be implementing it, but we can ease the transition. The
data scientist can shepherd the new code from ideation through crafting by the engineer. Keep everyone in the
loop for the whole process. Once you allow someone to back out, it will create resentment. New ideas need
passionate people. The new idea is a baby that needs nurturing. And just like you need both parents involved to
have a happy, healthy child, you need your scientists and engineers talking, collaborating, and working
together if you want a scalable, performant, fault-tolerant system.
I don’t really have a silver bullet for you, but I do know a nailsmith who might come in handy. Good luck people.
02 Jul 2016
When people ask me what I do, I usually reply that I do statistics at a tech company.
This usually placates people. For the more tenacious questioner, I usually go on to say
something about developing statistical models and programming. At this point most
people are content but for the very few who press me further, I will admit that I
am not a computer scientist and my programming training has been largely spotty,
pragmatic, and not particularly formal.
Beginnings
The first clue to this is that my first language was Matlab. I took an introductory course
which was mostly filled with mechanical engineers and hence my major takeaway from the
class was how to do math with a computer. From here, I then learned IDL, which only
astronomers seem to use anymore. Again, this wound up being fancy calculator stuff. It
really wasn’t until I did a research internship with a planetary science professor that
I encountered Python.
Python
Python was something of a revelation. It was less a set of calculator instructions
and more an actual general-purpose programming language. You could do so much more
than math. You could process text, write a command line program, etc. This was the
first time I had encountered object oriented programming, list comprehensions, and
lambdas. My mind was expanded greatly to say the least.
I didn’t really get any formal training in computer science, which is not a huge regret to
me. I’m an astrophysicist who likes to use programming to solve things. Python made sense
for this kind of work. It is a general purpose language with lots of capabilities. And
from the first time I encountered it, circa 2008, it has mushroomed to include all sorts
of capabilities. It makes a pretty great first programming language.
And then there was Clojure
Fast forward a few years: I have my master’s, I’ve learned R, and I have a job at a tech company.
I’m using Python to do data science. It’s all very straightforward. But not quite.
I was hired to work in the Science group at my company, which essentially means that I
help determine what models and data get used in production, doing proof-of-concept work
in Python that is then handed off to software engineers for implementation. This is
somewhat standard practice at medium to large companies from what I can tell. The difficulty,
and I could probably write a whole other essay on this, is that Engineering implements
models in Clojure (or Scala to a lesser extent), so there is no continuity of language from
Science to Engineering.
To mitigate this mismatch, I recently decided to learn Clojure, which brings me to the actual
point of this post.
Not quite learning
Numerous self-paced coding resources exist these days, which is a blessing. Learning to code not
so long ago usually involved tracking down a book, motivating yourself to invent a project,
or taking a class (if you were lucky) at a local junior college. Now you can go to any of several
websites that make games out of learning coding. Which has its pros and cons.
For the most part I have been happy with what I have found. The assignments are usually bite
sized and teach a single concept at a time. Most websites also organize the content so that
assignments get progressively more complex. Feedback is usually very quick since tests are run
right at submit time.
While I applaud efforts like this, I have found some of these websites frustrating. The quality
of the assignments is very spotty. The issues range from poorly worded explanations of the problem
to unrealistic and obscure test cases. I probably sound bitter.
As a former math teacher I can see the merit of stretching students’ minds with corner cases, but
I also know that with self-motivated study programs there is usually a short window of time in which
the student needs to be rewarded for their efforts. Taunting a student with pathological test
cases becomes mean. You are no longer teaching, you are rubbing dirt in someone’s face.
A compromise?
While I wish I had a good compromise for this kind of situation, I don’t see a clear one. As I
become a more and more competent programmer, I have begun to realize that edge cases pop up all
the time, contrary to popular opinion. The needs of a production environment, however, demand
results yesterday. It is the agile philosophy of “move fast and break things”. In fact most tech
companies would probably subscribe to this philosophy.
This is a difficult tension to live in. Wanting to produce a product that satisfies a customer
while covering the edge case where a thousand customers request information from your server at
5:30 am and the service crashes.
Learning is much the same way. There is a tension between wanting the student to succeed and see
the connections between concepts, and the growing and stretching required to understand the subtleties
of a problem. So for now, I leave you with this tension and hope that we all will continue to
persevere in learning and building better learning tools.
29 Apr 2015
Usually for me there has to be some major motivating force for me to learn
something new, like an optimization technique, statistical model, etc.
Programming languages are like that too. It is only because I am transitioning
into doing half engineering, half research work that I am even motivated to
learn Clojure. Not to say that Clojure is bad. Or good for that matter.
Another wrinkle to the situation is that I am diving into a Lisp-like language
for the first time. I find myself chuckling and recalling a cartoon about
programming
languages. The
punchline, to ruin it for you dear reader, is that Lisp is a shiv and the
wielder
of said shiv is probably crazy and dangerous. Philosophically speaking, I sort
of get the joke but like all humor, there’s a grain of truth in the backhanded
compliment. Take a simple task, like filtering out certain things from a list.
In good old Python, which is described as a double barreled shotgun that you
always use the wrong barrel of, such a task would look like this:
[i for i in range(10) if i > 3]
while in Clojure we get:
(filter #(> % 3) (range 10))
You could probably say that the Clojure syntax is a little more elegant, but
at this point in my Clojure journey, the two methods look pretty similar in
brevity. And maybe this is the point, as far as list comprehensions go. List
comprehensions are actually quite concise, which is one of the reasons I like
Python. But all that said, list comprehensions are probably not a good place to
compare the two languages.
A better comparison might be a Project Euler problem
I tried out. The problem goes like this,
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
One clarification is needed though: this is not an exclusive or, where we only
have multiples of only 3 or only 5. Multiples can be multiples of 3, 5, or 3 and
5. If you don’t believe me, solve the problem and see the solution they give.
In Python, this sort of problem might be amenable to a list comprehension
(admittedly this is not the cleverest way, but really the naive straightforward
way), which would probably look like this:
sum([i for i in range(1000) if ((i % 3) == 0) or ((i % 5) == 0)])
Clojure:
(reduce + (filter #(or (= 0 (mod % 3)) (= 0 (mod % 5))) (range 1000)))
Not very different, you say. I would agree again. The surprising point comes when
you look at performance. For adding up the multiples of 3 or 5 up to 1000,
Python comes in at 466 microseconds, while Clojure clocks in at 623
microseconds. The point goes to Python this time, but what happens when you ramp up the upper limit? Not surprisingly, as you go up by factors of 10, Python starts to slow down:
| Upper limit | Clojure  | Python   |
|-------------|----------|----------|
| 1000        | 0.623 ms | 0.466 ms |
| 10000       | 2.38 ms  | 3.59 ms  |
| 100000      | 20.0 ms  | 25.2 ms  |
| 1000000     | 235 ms   | 350 ms   |
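For what it’s worth, here is a rough sketch of how the Python side of these timings could be reproduced with the standard library’s timeit module; the repeat count is arbitrary and the exact numbers will of course vary by machine and interpreter version.

import timeit

for upper in (1000, 10000, 100000, 1000000):
    # Time the naive list-comprehension solution, averaged over 100 runs
    seconds = timeit.timeit(
        lambda: sum([i for i in range(upper) if i % 3 == 0 or i % 5 == 0]),
        number=100,
    ) / 100
    print(upper, round(seconds * 1000, 3), "ms")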
Now I know this isn’t a really great benchmark but I do think it at least hints at what is going on under the hood and how each of the systems handles numerical computations. Ultimately, neither Clojure nor Python is built first and foremost for numerical computations. Python was originally conceived as a scripting language and has become an incredibly useful language that can do just about anything you want it to, from building websites to inverting matrices. Clojure is still young but it seems to me that, at least at this point, it is headed in roughly the same direction, i.e. a language that wasn’t quite originally intended to do everything but might actually get there some day.
At this, the beginning of my journey, I am hopeful for what Clojure can offer.
23 Feb 2015
As I have progressed deeper and deeper into the scientific Python stack over the
years, it struck me as surprising that I hadn’t encountered or tried Theano
before now. But first a little background on Theano.
Theano
Theano is a fairly
nice Python library that mainly concerns itself with efficient mathematical
operations. Think of it as an extension of Numpy, but
with some added features. Specifically, when you write out mathematical
operations in Theano, the operation gets turned into an operations graph, from which Theano
dynamically generates C code, which is always nice for faster mathematics.
Additionally there is the bonus of automatic differentiation
for easy gradient calculations. Now at this point I should probably let you know
that I only have a basic understanding of automatic differentiation, but I do
understand the practical upshot: easily computed gradients.
Being a statistician and applied mathematician, this brings me much joy, mostly
because it means I can write likelihood functions directly, take gradients, and
optimize them quickly. There is a little extra work but it’s worth it.
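As a small illustrative sketch (the variable names here are made up for the example), this is roughly what writing an expression and asking Theano for its gradient looks like:

import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')        # a symbolic vector of doubles
y = T.sum(x ** 2)         # a scalar expression built from it
grad_y = T.grad(y, x)     # gradient via automatic differentiation

# Compile the expression graph into callable functions
f = theano.function([x], y)
g = theano.function([x], grad_y)

print(f(np.array([1.0, 2.0, 3.0])))   # 14.0
print(g(np.array([1.0, 2.0, 3.0])))   # [ 2.  4.  6.]

Swap the toy sum of squares for a negative log-likelihood and the same pattern hands you both the objective and its gradient to feed to an optimizer.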
Theano in Practice
For me, while a tool may be amazing in its capabilities, if the initial learning curve is too steep I will probably give up and find something else. While this might be a suboptimal strategy when it comes to expanding my skills, I still would argue that this is what really separates widely used tools from tools that languish in obscurity. For example, this is why scikit-learn
is such a widely used, strongly supported, and actively developed Python machine learning package. All you really have to do is the following if you want to do a random forest regression:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(Xtrain, ytrain)
predictions = regressor.predict(Xtest)
regressor.score(Xvalidation, yvalidation)
That is a total of five lines of code to:
- Fit a model
- Predict at new x values
- Evaluate the model at some validation locations
It really doesn’t get much easier than this.
This is also why R is popular with most statisticians. The formula syntax for generalized linear models is incredibly easy to use. My point with these examples is that lower hurdles to entry for non-computer scientists generally allow for wider adoption of a tool. Which is why I have found myself coming to Theano so late rather than much earlier.
Not that Theano is difficult to use. The package’s main page does look very cluttered and is full of information whose meaning I can only guess at (note to Theano maintainers: please tidy up the home page). It does, however, have links to introductions which, once you find them, are actually quite informative. For the most part, you can pretty much write mathematical operations as you would have in numpy. This is a tremendous advantage. Hurdle number one is not so bad.
There is some thinking ahead that needs to be done regarding whether the objects you are working with are matrices or vectors, but this is pretty light. Definitely not on the order of having to declare the type and size of arrays like in a compiled language.
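For instance, the thinking ahead mostly amounts to picking the right symbolic type up front; a hypothetical illustration:

import theano.tensor as T

v = T.dvector('v')    # a 1-d array of doubles
M = T.dmatrix('M')    # a 2-d array of doubles
result = T.dot(M, v)  # shapes are only checked when the compiled function runs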
Theano for the statistician
A typical task for a statistician using Theano might go something like:
- Pick a statistical model, given your data
- Write out the model likelihood
- Estimate the model parameters using the negative log-likelihood (and possibly the gradient if available)
- Estimate the uncertainty of the parameters using the Fisher information (the Hessian of the negative log-likelihood), whose inverse approximates the covariance of the parameter estimates
None of the previous tasks with Theano turn out to be that difficult. Let’s take a look.
Writing out the likelihood