Awkward Python
Still figuring stuff out...

Test your code. Please.

The scientific computing community has a problem. A fairly large one. Come to think of it, even good old experimental science has this problem. I’m talking about the reproducibility crisis. Essentially, we cannot repeat the results of a significant portion, somewhere between 65% and 95%, of experiments performed in biomedical science. Even worse, scientists aren’t incentivized to make their research reproducible, or even to report a non-significant result.

Truth be told, science is hard. It’s hard to work with real data and tease out causes and effects. Being honest about the fact that your model might not say anything new is not celebrated. Which leads me to my point.

If you work at a data-science-heavy tech company, you regularly have data scientists developing models that, much like in science, try to make predictions, classify things, or make recommendations to people. These models are then passed along to engineers who create backend services that utilize them. The crisis comes to a head when the engineer says to the scientist, “Hey, we need to test this, got any ideas?”. The researcher will scratch their head, think very hard for a few minutes, and say one of the following:

  1. “I already tested it. It is a perfectly crafted model that needs no testing.”
  2. “I’m not sure how to test my model. Got any ideas?”
  3. “What do you mean test my model? What’s a test?”
  4. “Sure, here are my ideas….”

If you hear the last one, consider yourself blessed. If you hear the third one, you can at least have a conversation about what testing is and why it’s important. The second response isn’t so bad either, since the researcher is willing; it can lead to a very fruitful conversation about the model, with both parties coming away with a greater appreciation for how integral each of us is to making the company work.

The first response is perhaps the least desirable. The researcher knows (or at least has some idea) what testing is. The difficulty is that the researcher has implicit faith in the absolute truth of their model. I realize I am perhaps putting words in people’s mouths when I say this, but it does seem odd to believe absolutely in the truth of a model when statistics says the opposite. Models have uncertainty, even perfectly crafted ones. Perhaps in a perfect world we might be able to believe a model without needing to validate it, but we are not in Kansas anymore, Dorothy.

But why do the testing? I’ll present a few reasons.

Models aren’t perfect

It is sad to say, but I will quote George Box: “All models are wrong, but some are useful.” Models are approximations of reality. This means we need to know how wrong our model is. Enter testing.

Your dataset isn’t complete

Unless you have unlimited resources and computing power, your dataset probably hasn’t sampled the entirety of the population. Therefore you’ll probably encounter someone or something that wasn’t in your dataset. Edge cases, corner cases. Testing protects us against completely new, never-before-seen data.
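As a rough illustration, here is a minimal sketch of what such a check might look like, assuming a hypothetical fitted model with a scikit-learn-style predict method (the feature values are made up):

import numpy as np

def check_edge_cases(model):
    # deliberately extreme inputs the training data never covered (hypothetical values)
    edge_cases = np.array([[0.0, 0.0], [1e9, -1e9], [1e-9, 1e-9]])
    predictions = model.predict(edge_cases)
    # the model should return one finite prediction per row, not crash or emit NaNs
    assert predictions.shape[0] == edge_cases.shape[0]
    assert np.all(np.isfinite(predictions))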

Your user population may change over time

People are born every day. Users join and leave over time. Your model that was built in 2012 might not be as right in 2013. Your user base may have changed, new data might have become available, demographics might have changed. Testing can surface this over time, hopefully as it happens. See model drift.
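One crude but concrete way to watch for this is sketched below, assuming a scikit-learn-style model and a batch of recently labeled data (the names and the 0.05 tolerance are made up):

def check_for_drift(model, X_recent, y_recent, baseline_score, tolerance=0.05):
    # score the model on data gathered after it was trained
    current_score = model.score(X_recent, y_recent)
    # hypothetical rule: flag a drop of more than `tolerance` from the training-time baseline
    if current_score < baseline_score - tolerance:
        raise AssertionError(
            "possible model drift: %.3f now vs %.3f at training time"
            % (current_score, baseline_score)
        )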

So please, if you have a science model, test it. Test it good.

There be Jabberwockies, A Tale of Mismatched Needs

The Manager’s Tale

Let’s say one day you’re sitting in your manager’s chair and one of your data scientists comes up to you and says, “Hey, I have this fantastic idea that will give us a 10X boost in customer satisfaction!”. The moment you hear this your eyes bug out and you leap from your chair and holler:

And hast thou slain the Jabberwock? Come to my arms, my beamish boy! O frabjous day! Callooh! Callay!

Your delight is short-lived though, as you begin to think about how you will get this 10X idea implemented. Images of irate middle managers, thundering vice presidents of product, and a fuming CSO make you recall some very wise words passed along to you years ago:

Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!

You take a deep breath and think some more about what your available options are. It occurs to you that there are several ways you could solve this problem:

  1. Have your Engineers write tools in the production language that scientists can use for research and prototyping.
  2. Let your Scientists write in their language of choice and then have Engineers translate to the preferred production language.
  3. Have your Engineers and Scientists use the same language for research and production.

Option 2 seems to be the status quo for most tech companies. Option 1 seems like a really cool idea which you have heard some companies claim to do (Stitch Fix for example). Option 3 frankly sounds like a place that only unicorns work at. Let’s think about each of these options one at a time.

A Common Tongue (Option 3)

In a perfect world, there would be only one programming language that did it all. It would be the Swiss Army knife that could do CSV parsing, plotting, data science, and parallel big data processing. Unfortunately we don’t live in that kind of world. What we have is more along the lines of as-seen-on-TV programming languages that can do some things fantastically, but not everything.

I know at this point someone will be saying, “Hey, what about ___ ? It does it all and is quite good at everything!”. I would counter that if this were the case, there would actually be a company that uses only one language for its full stack, which I have yet to hear about. Some of this can probably be traced back to having too many cooks wanting to use their favorite framework/JVM/newly fashionable language to do the job.

In reality, certain languages are more suited to certain tasks than others, and developers will develop what they are passionate about. Developers don’t seem to want to build some programming Tower of Babel. That just doesn’t seem to be a thing. Individuals have tried to advocate for certain languages being the final word, but again, where is this “one language to rule them all”?

Another issue that comes with this option is that you need data scientists and engineers to agree on a language. And it will have to do both tasks well enough. Which will probably disappoint everybody at some point. So scratch that.

SSWEeeet Times (Option 2)

Like I mentioned, this option seems to be the standard method for putting models and improvements into production. It requires some assignment of engineering resources toward translating researcher code.

One could argue that this arrangement separates the concerns of ideation and implementation. Let the data scientists come up with the ideas and let the engineers make it work in the real world. While this sounds great, I do think it leads to conflict, disappointment, and frustration. The Stitch Fix article puts it pretty well I think:

In case you did not realize it, Nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise then that ETL engineering roles are the archetypal breeding ground of mediocrity.

Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.

I would take some issue with the label of mediocre that the author places on all who wind up in this weird middle ground, but his point is still valid. Nobody likes implementing someone else’s idea, particularly if it’s poorly written in the first place. As a SSWE you then have two options:

  1. Rewrite the d@mn thing using proper software engineering design principles
  2. Give in to apathy and just pass the code along (mostly) unchanged

This arrangement also allows data scientists to abdicate responsibility for writing good code in the first place. Why bother writing something that will scale, has good documentation, and is fast when you have an engineer to do that for you?

The Nailsmith and The Knight (Option 1)

A small digression, please indulge me. I play Hollow Knight in my spare time, when I’m not cooking, hiking with my wife and daughter, or building tiny Lego towers with my daughter. The knight in the game starts out with a basic nail to fight with. Partway through the game he meets the nailsmith, who offers to make the knight’s nail stronger and sharper. The knight gives his nail to the smith, who proceeds to make it sharper and more powerful. The knight can now defeat enemies quicker, deal more damage, and explore more dangerous places on his quest. Why am I mentioning this?

I have already elaborated on the previous two approaches and their disadvantages. The main failing is usually that people wind up doing things they don’t really want to. For example, if everyone writes in the same language, then either data scientists will feel underpowered and weak or engineers will feel hamstrung. In option 2 we have engineers specifically hired to deal with other people’s messy, inefficient code while dreaming of doing better things.

In either of those cases we have woefully ill-equipped knights fighting monsters far bigger than they originally expected.

So why not let people do what they are good at and passionate about?

Engineers are good at crafting well thought out tools. Data scientists are good at analysis, discovery, and exploration. Let the engineers craft the tools for the data nerds. You wouldn’t tell a novice cook, “Hey, if you want to make bread, you’ll need to go buy some wheat seed, plant it, tend it, harvest it, grind it, make the bread, build your own brick oven, chop some wood, and then bake the bread!”. A cook cooks. A farmer farms. The knight can’t fight monsters all that effectively without a good nail and he can’t have good tools unless there’s a nailsmith to help him.

Epilogue

So where does that leave us? You’ve been twirling in your manager’s chair for some time now, talking to yourself. Your coworkers are giving you odd looks. The facilities manager has even wandered by and asked if you’re okay.

Unfortunately you can’t do much. You’re stuck with the organizational structure that you have. You, as the manager, need to make sure everyone is happy in what they’re doing. So my advice to you is this: tell your data scientist to get cozy with the engineer(s) they will be working with to implement the enhancement. As we all know, ultimately the engineer is going to be implementing it, but we can ease the transition. The data scientist can shepherd the new code from ideation through crafting by the engineer. Keep everyone in the loop for the whole process. Once you allow someone to back out, it will create resentment. New ideas need passionate people. The new idea is a baby that needs nurturing. And just like you need both parents involved to have a happy, healthy child, you need your scientists and engineers talking, collaborating, and working together if you want a scalable, performant, fault-tolerant system.

I don’t really have a silver bullet for you, but I do know a nailsmith who might come in handy. Good luck people.

If you want to learn to program...

When people ask me what I do, I usually reply that I do statistics at a tech company. This usually placates people. For the more tenacious questioner, I usually go on to say something about developing statistical models and programming. At this point most people are content but for the very few who press me further, I will admit that I am not a computer scientist and my programming training has been largely spotty, pragmatic, and not particularly formal.

Beginnings

The first clue to this is that my first language was Matlab. I took an introductory course which was mostly filled with mechanical engineers, and hence my major takeaway from the class was how to do math with a computer. From there I learned IDL, which only astronomers seem to use anymore. Again, this wound up being fancy calculator stuff. It really wasn’t until I did a research internship with a planetary science professor that I encountered Python.

Python

Python was something of a revelation. It was less a set of calculator instructions and more an actual general-purpose programming language. You could do so much more than math. You could process text, write a command line program, etc. This was the first time I had encountered object-oriented programming, list comprehensions, and lambdas. My mind was expanded greatly, to say the least.

I didn’t really get any formal training in computer science which is not a huge regret to me. I’m an astrophysicist who likes to use programming to solve things. Python made sense for this kind of work. It is a general purpose language with lots of capabilities. And from the first time I encountered it, circa 2008, it has mushroomed to include all sorts of capabilities. It makes a pretty great first programming language.

And then there was Clojure

Fast forward a few years: I have my master’s, I’ve learned R, and I have a job at a tech company. I’m using Python to do data science. It’s all very straightforward. But not quite.

I was hired to work in the Science group at my company, which essentially means that I help determine what models and data get used in production, doing proofs of concept in Python, which are then handed off to software engineers for implementation. This is fairly standard practice at a medium- to large-sized company, from what I can tell. The difficulty, and I could probably write a whole other essay on this, is that Engineering implements models in Clojure (or Scala to a lesser extent), so there is no continuity of language from Science to Engineering.

To mitigate this mismatch, I recently decided to learn Clojure, which brings me to the actual point of this post.

Not quite learning

Numerous self-paced coding resources exist these days, which is a blessing. Not so long ago, learning to code usually involved tracking down a book, motivating yourself to invent a project, or taking a class (if you were lucky) at a local junior college. Now you can go to any of several websites that make games out of learning to code. Which has its pros and cons.

For the most part I have been happy with what I have found. The assignments are usually bite sized and teach a single concept at a time. Most websites also organize the content so that assignments get progressively more complex. Feedback is usually very quick since tests are run right at submit time.

While I applaud efforts like this, I have found some of these websites frustrating. The quality of the assignments is very spotty. The issues range from poorly worded explanations of the problem to unrealistic and obscure test cases. I probably sound bitter.

As a former math teacher I can see the merit of stretching students’ minds with corner cases, but I also know that with self-motivated study programs there is usually a short window of time in which the student needs to be rewarded for their efforts. Taunting a student with pathological test cases becomes mean. You are no longer teaching; you are rubbing dirt in someone’s face.

A compromise?

While I wish I had a good compromise for this kind of situation, I don’t see a clear one. As I become a more and more competent programmer, I have begun to realize that edge cases pop up all the time, contrary to popular opinion. The needs of a production environment, however, demand results yesterday. It is the “move fast and break things” philosophy, and most tech companies would probably subscribe to it.

This is a difficult tension to live in: wanting to produce a product that satisfies a customer, while also covering the edge case where a thousand customers request information from your server at 5:30 am and the service crashes.

Learning is much the same way. There is a tension between wanting the student to succeed and see the connections between concepts, and wanting them to grow and stretch to understand the subtleties of a problem. So for now, I leave you with this tension and hope that we all will continue to persevere in learning and building better learning tools.

First Steps in Clojure

Usually there has to be some major motivating force for me to learn something new, like an optimization technique, a statistical model, etc. Programming languages are like that too. It is only because I am transitioning into doing half engineering, half research work that I am even motivated to learn Clojure. Not to say that Clojure is bad. Or good, for that matter.

Another wrinkle to the situation is that I am diving into a Lisp-like language for the first time. I find myself chuckling and recalling a cartoon about programming languages. The punchline, to ruin it for you, dear reader, is that Lisp is a shiv and the wielder of said shiv is probably crazy and dangerous. Philosophically speaking, I sort of get the joke, but like all humor, there’s a grain of truth in the backhanded compliment. Take a simple task, like filtering certain things out of a list. In good old Python, which is described as a double-barreled shotgun that you always use the wrong barrel of, such a task would look like this:

[i for i in range(10) if i > 3]

while in Clojure we get:

(filter #(> % 3) (range 10))

You could probably say that the Clojure syntax is a little more elegant, but at this point in my Clojure journey, the two methods look pretty similar in brevity. And maybe this is the point, as far as list comprehensions go. List comprehensions are actually quite concise, which is one of the reasons I like Python. But all that said, list comprehensions are probably not a good place to compare the two languages.

A better comparison might be a Project Euler problem I tried out. The problem goes like this,

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

One clarification is needed, though: this is not an exclusive or, where we count only multiples of only 3 or only 5. Multiples can be multiples of 3, of 5, or of both 3 and 5.[1]

  1. If you don’t believe me, solve the problem and see the solution they give.

In Python, this sort of problem might be amenable to a list comprehension (admittedly this is not the cleverest way, but really the naive straightforward way), which would probably look like this:

sum([i for i in range(1000) if ((i % 3) == 0) or ((i % 5) == 0)])

Clojure:

(reduce + (filter #(or (= 0 (mod % 3)) (= 0 (mod % 5))) (range 1000)))

Not very different, you say. I would agree again. The surprising point comes when you look at performance. For adding up the multiples of 3 or 5 below 1000, Python comes in at 466 microseconds, while Clojure clocks in at 623 microseconds. The point goes to Python this time, but what happens when you ramp up the upper limit? Not surprisingly, as you go up by factors of 10, Python starts to fall behind:

Upper limit    Clojure     Python
1000           0.623 ms    0.466 ms
10000          2.38 ms     3.59 ms
100000         20.0 ms     25.2 ms
1000000        235 ms      350 ms
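(The exact timing harness isn’t shown here; one way to reproduce the Python side would be the standard-library timeit module, for example:)

import timeit

# average time for the list-comprehension solution with an upper limit of 1000
elapsed = timeit.timeit(
    "sum([i for i in range(1000) if ((i % 3) == 0) or ((i % 5) == 0)])",
    number=1000,
)
print(elapsed / 1000)  # seconds per run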

Now, I know this isn’t a really great benchmark, but I do think it at least hints at what is going on under the hood and how each system handles numerical computations. Ultimately, neither Clojure nor Python is built first and foremost for numerical computation. Python was originally conceived as a scripting language and has become an incredibly useful language that can do just about anything you want it to, from building websites to inverting matrices. Clojure is still young, but it seems to me that, at least at this point, it is headed in roughly the same direction, i.e. a language that wasn’t quite originally intended to do everything but might actually get there some day.

At this, the beginning of my journey, I am hopeful for what Clojure can offer.

First Steps with Theano

As I have progressed deeper and deeper into the scientific Python stack over the years, it struck me as surprising that I hadn’t encountered or tried Theano before now. But first a little background on Theano.

Theano

Theano is a fairly nice Python library that mainly concerns itself with efficient mathematical operations. Think of it as an extension of Numpy, but with some added features. Specifically, when you write out a mathematical operation in Theano, it gets turned into an operation graph, from which C code is dynamically generated, which is always nice for faster mathematics. Additionally, there is the bonus of automatic differentiation for easy gradient calculations. Now, at this point I should probably let you know that I have only a basic understanding of automatic differentiation, but I do understand the practical upshot: easily computed gradients.

As a statistician and applied mathematician, this brings me much joy, mostly because it means I can write likelihood functions directly, take gradients, and optimize them quickly. There is a little extra work involved, but it’s worth it.
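To give a flavor of how this works, here is a minimal sketch in the style of Theano’s tutorials: build a symbolic expression, ask for its gradient, and compile both into ordinary callable functions.

import theano
import theano.tensor as T

x = T.dscalar('x')       # a symbolic double-precision scalar
y = x ** 2 + 3 * x       # a symbolic expression built from it
dy_dx = T.grad(y, x)     # automatic differentiation: symbolically 2*x + 3

f = theano.function([x], y)        # compiles the expression graph (to C where possible)
df = theano.function([x], dy_dx)   # compiles the gradient graph

print(f(2.0), df(2.0))   # 10.0 and 7.0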

Theano in Practice

For me, while a tool may be amazing in its capabilities, if the initial learning curve is too steep I will probably give up and find something else. While this might be a suboptimal strategy when it comes to expanding my skills, I still would argue that this is what really separates widely used tools from tools that languish in obscurity. For example, this is why scikit-learn is such a widely used, strongly supported, and actively developed Python machine learning package. All you really have to do is the following if you want to do a random forest regression:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor()
regressor.fit(Xtrain, ytrain)

predictions = regressor.predict(Xtest)

regressor.score(Xvalidation, yvalidation)

That is a total of five lines of code to:

  1. Fit a model
  2. Predict at new x values
  3. Evaluate the model at some validation locations

It really doesn’t get much easier than this.

This is also why R is popular with most statisticians. The formula syntax for generalized linear models is incredibly easy to use. My point with these examples is that lower hurdles to entry for those of us who aren’t computer scientists generally allow for wider adoption of a tool. Which is part of why I found my way to Theano so late rather than much earlier.

Not that Theano is difficult to use. The package’s main page does look very cluttered and is full of information whose meaning I wish I knew (note to the Theano maintainers: the home page is very cluttered). That said, the home page that greets you does have links to introductions which, once you find them, are actually quite informative. For the most part, you can pretty much write mathematical operations as you would have in numpy. This is a tremendous advantage; hurdle number one is not so bad. There is some thinking ahead to be done regarding whether the objects you are working with are matrices or vectors, but this is pretty light, and definitely not on the order of having to declare the type and size of arrays as in a compiled language.
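Concretely, the thinking ahead mostly amounts to declaring the kind of each symbolic input up front; a small sketch (the names here are arbitrary):

import theano.tensor as T

X = T.dmatrix('X')   # a symbolic matrix of doubles
w = T.dvector('w')   # a symbolic vector of doubles
b = T.dscalar('b')   # a symbolic scalar

y_hat = T.dot(X, w) + b   # from here on it reads much like numpy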

Theano for the statistician

A typical task for a statistician using Theano might go something like this (a rough sketch follows the list):

  1. Pick a statistical model, given your data
  2. Write out the model likelihood
  3. Estimate the model parameters using the negative log-likelihood (and possibly the gradient if available)
  4. Estimate the uncertainty of the parameters using the Fisher information (the inverse of the Hessian of the negative log-likelihood)
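As promised, here is a rough sketch of steps 2 through 4 for a hypothetical i.i.d. normal model with the parameters packed into a vector [mu, log_sigma] (a real likelihood is written out in the next section):

import numpy as np
import theano
import theano.tensor as T

y = T.dvector('y')            # the observed data
params = T.dvector('params')  # [mu, log_sigma]
mu, sigma = params[0], T.exp(params[1])

# step 2: the negative log-likelihood, written out symbolically
nll = 0.5 * T.sum(T.log(2 * np.pi * sigma ** 2) + ((y - mu) / sigma) ** 2)

# steps 3 and 4: the gradient and the Hessian come almost for free
grad = T.grad(nll, params)
hess = theano.gradient.hessian(nll, params)

f_nll = theano.function([params, y], nll)
f_grad = theano.function([params, y], grad)
f_hess = theano.function([params, y], hess)

f_nll and f_grad can then be handed to an off-the-shelf optimizer such as scipy.optimize.minimize, and inverting f_hess at the optimum gives the Fisher-information-based uncertainty estimate from step 4.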

None of the previous tasks with Theano turn out to be that difficult. Let’s take a look.

Writing out the likelihood