“We have built a marvelous model using Deep Learning!”
Sure, but what does it mean when heading for production? Is the data as clean as it should be? Is it available? At what cost? What are the legal and privacy implications? How will customers react? Will decision makers understand its implications? Is the model stable over time? Robust to outliers? …
It’s fun to regale others with tales of how predictive your model turned out. But what really matters at the end of the day?
But now it’s the job of some developer to take your prototype model, which pulls from innumerable sources, and turn it into a production system.
All of a sudden there’s a “pipeline jungle”: a jumbled tangle of data sources and glue code for feature engineering and combination, built to produce programmatically and reliably in production what you only had to create once, manually.
It’s easy, in the research and design phase of a machine learning project, to over-engineer the product. Too many data sources, too many exotic and brittle features, and as a corollary, too complex a model. One trap is leaving low-powered features in your prototype model, because, well, they help a little, and they’re not hurting anyone, are they?
What’s the value of those features versus the cost of leaving them in? That’s extra code to maintain, maybe an extra source to pull from. And as the Google paper notes, the world changes, data changes, and every model feature is a potential risk for breaking everything.
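One way to put a number on that trade-off is a quick ablation: retrain without the suspect features and compare cross-validated accuracy. Below is a minimal sketch on synthetic data; the dataset and the idea that the last three columns are the “low-powered” features are assumptions for illustration, not anything from the text.

```python
# Feature ablation sketch: is the accuracy gain from marginal features
# worth the code and data sources needed to keep them alive?
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=5, random_state=0)

# Pretend the last three columns are the low-powered features we suspect
# cost more in maintenance than they return in accuracy.
X_small = X[:, :-3]

full = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
trimmed = cross_val_score(LogisticRegression(max_iter=1000), X_small, y, cv=5).mean()
print(f"full model:    {full:.3f}")
print(f"trimmed model: {trimmed:.3f}")
# If the gap is tiny, the extra features may not pay for their upkeep.
```

If the trimmed model scores within noise of the full one, the extra pipeline code is pure liability.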
Deep Learning models that are fed scraped data from the internet’s butthole are great for research papers published on sites like Arxiv.org. But it’s important to exercise a little self-control. As the authors of the technical paper put it, “Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice.” Amen!
Who’s going to own this model and care for it and love it and feed it and put a band-aid on its cuts when its decision thresholds start to drift? Since it’s going to cost money in terms of manpower and tied-up resources to maintain this model, what is it worth to the business? Deep Learning is great, but also very prone to complexity if you don’t have the right data, the right parameters, and most importantly, the right team. Sometimes a simple model, like logistic regression, can do the job with a few interaction terms.
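For the curious, “logistic regression with a few interactions” can be as short as this sketch: interaction terms generated mechanically in a scikit-learn pipeline, on synthetic data. The dataset and parameter choices are illustrative assumptions, not a recipe from the text.

```python
# Logistic regression with pairwise interaction terms: often a cheap,
# maintainable alternative to a complex deep model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

model = make_pipeline(
    # interaction_only=True adds products of feature pairs, no squares.
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
score = cross_val_score(model, X, y, cv=5).mean()
print(f"cv accuracy: {score:.3f}")
```

The whole thing is a handful of lines, every coefficient is inspectable, and any engineer on the team can maintain it.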
Humans are Feedback Loop Machines
Any attempt to train on present-day data, data which has now been polluted by the business’s model-driven actions, is fraught with peril. It’s a feedback loop. Of course, such feedback loops can be mitigated in many ways. Holdout sets, for example. But we can only mitigate a feedback loop if we know about its existence, and we as humans are awesome at generating feedback loops and terrible at recognizing them.
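The holdout-set idea can be made concrete with a small simulation: reserve a random slice of cases that never receive the model-driven action, so their outcomes stay undistorted and can serve as clean future training data. Everything below (sizes, thresholds, score distribution) is an illustrative assumption.

```python
# Holdout group as feedback-loop mitigation: a random ~10% of cases is
# exempt from the model-driven intervention, preserving unbiased data.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

holdout = rng.random(n) < 0.10        # ~10% never get the intervention
scores = rng.random(n)                # stand-in for model scores
act_on = (scores > 0.8) & ~holdout    # intervene only outside the holdout

print(f"holdout size: {holdout.sum()}")
print(f"acted on:     {act_on.sum()}")

# Sanity check: no holdout case was ever acted on, so their outcomes
# remain untouched by the business's own actions.
assert not np.any(act_on & holdout)
```

Retraining on the holdout rows later gives you data the model hasn’t contaminated, which is exactly what the feedback loop destroys elsewhere.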
Models are put in the hands of others to act on. And when the police predict that a community is full of criminals and then they start harassing that community, what do you think is going to happen? The future training data gets affected by the police’s “special attention.” Predictive modeling feeds back into systematic discrimination.
This is one of the risks of the massification of data science techniques. As we put predictive models more and more into the hands of laypeople, have we considered that we might be cutting out of the loop everyone who understands, or cares about, their misuse?
Integrate and correct
When data scientists treat production implementation as a black box to shove their prototypes through, and engineers treat ML packages as black boxes to shove data pipelines through, problems abound.
Mathematical modelers need to stay close to engineers when building production data systems. Both need to keep each other in mind and keep the business in mind. The goal is not to use deep learning. The goal is not to program in Go. The goal is to create a system for the business that lives on. And in that context, accuracy, maintainability, sturdiness…they all hold equal weight.
So as a data scientist, keep your team close and your colleagues from other teams (engineers, MBAs, legal, …) closer, with the goal of getting work done together. It’s the only way your models will survive past prototype.
At the end of the day, it’s all about humans, not machines or data.