Machine Learning is trendy, but there are nasty traps.
Machine Learning in short 📚
Machine Learning (ML) is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
There are different types of ML, for example:
- Supervised learning to learn patterns from large sets of labeled data that have common properties (e.g., weather conditions)
- Unsupervised learning to find structure in massive amounts of unlabeled data (e.g., grouping users in social networks, detecting anomalies) without human intervention
- Deep learning: using neural networks to mimic the human brain to learn patterns from unstructured data (e.g., image recognition, speech)
Most of the time, the idea is to train a model. You can see models as mathematical representations of objects and their relationships to each other (e.g., decision trees, graphical models, linear regression).
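To make "training a model" less abstract, here is a minimal sketch of the simplest model in that list, a linear regression, fitted with ordinary least squares in plain Python. The data (hours of sunshine vs. ice creams sold) and the names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours of sunshine vs. ice creams sold
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 19.0, 31.0, 42.0, 48.0]

# "Training" boils down to finding the two numbers that best fit the data
slope, intercept = fit_line(hours, sales)


def predict(x):
    """Use the trained model on a value it has never seen."""
    return slope * x + intercept


print(f"sales = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted sales for 6 hours of sun: {predict(6.0):.1f}")
```

The whole "model" here is two numbers. Real systems have many more parameters, but the idea is the same: pick the parameters that fit the data you have, then use them on data you have not seen.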
You can’t launch a new product without ML 💵
Wrong! While ML can help, it might not be the ultimate booster for your sales. Human editing can be way more efficient in some contexts.
So far, ML has not replaced engineering. Algorithms don’t replace killer features either.
Building a consistent and accurate ML system requires skills, time, and lots of data. Besides, it’s often an iterative process. You have to build pipelines you can trust.
As a first step, focus on your product. ML can be a way to extend and improve your business, but it won’t magically make decisions for you, and there is little point in using it until you have data.
Your model is not a butterfly 🦋
Most models require adequate supervision, and it’s not uncommon for ML systems to silently decay.
You have to manually inspect the data and statistics from time to time to reduce failures and refresh outdated tables.
ML requires monitoring.
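As a rough illustration of what such monitoring can look like, the sketch below compares recent values of one input feature with the values seen at training time and raises a flag when the average has moved too far. The function names, the sample data, and the threshold of 2 standard deviations are assumptions chosen for the example, not a recommended setting.

```python
import statistics


def drift_score(baseline, recent):
    """How many baseline standard deviations the recent mean has moved."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(recent) - base_mean)
    if base_std == 0:
        # Constant baseline: any change at all counts as drift
        return float("inf") if shift > 0 else 0.0
    return shift / base_std


def has_drifted(baseline, recent, threshold=2.0):
    """Flag the feature when the shift exceeds the (arbitrary) threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical feature: user age at training time vs. last week
training_ages = [34, 29, 41, 38, 30, 36, 33, 40]
last_week_ages = [52, 49, 57, 55, 50, 53]

if has_drifted(training_ages, last_week_ages):
    print("Input data has shifted: time to inspect and maybe retrain.")
```

A check like this only looks at one feature and one statistic, so it complements manual inspection rather than replacing it.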
Keep your model as simple as possible, especially at the beginning. It makes debugging easier, and it’s also more convenient to handle feedback.
For example, predictions are harder to check when the model is not probabilistic or when you have no specific value to expect.
Having simple goals such as reaching a particular metric can be an excellent start for your ML system.
Kill the noise 🧨
Noise is inevitable, but with the wrong data, you draw the wrong conclusions. In the worst-case scenario, the algorithm identifies the noisy data as a pattern!
Fortunately, there are techniques and de-noising tools to filter out the noise and clean the data.
However, my explanation is oversimplified here, as noisy data can be challenging, causing performance issues and requiring far more sophisticated approaches.
It’s not exactly like pushing the “delete noise” button. For example, it may involve the Bregman divergence. I can’t give you a simple explanation here and now, but let’s say that, in mathematics, a Bregman divergence measures the gap between a convex function and its tangent taken at another point, which turns out to be particularly convenient in ML.
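If you want to see that definition in action, here is a small sketch of the general formula, D_f(p, q) = f(p) - f(q) - <grad f(q), p - q>, in plain Python. It is not a de-noising method by itself, only the building block. The function names and the vectors `p` and `q` are made up for the example. Two classic choices of the convex function f are shown: the squared norm, which gives back the squared Euclidean distance, and the negative entropy, which gives back the Kullback-Leibler divergence when both vectors are probability distributions.

```python
import math


def bregman_divergence(f, grad_f, p, q):
    """Gap at p between f and the tangent of f taken at q."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner


# Convex function 1: squared Euclidean norm
def sq_norm(x):
    return sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return [2 * xi for xi in x]


# Convex function 2: negative entropy (entries must be > 0)
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x):
    return [math.log(xi) + 1 for xi in x]


# Two made-up probability distributions
p = [0.2, 0.3, 0.5]
q = [0.1, 0.6, 0.3]

squared_distance = bregman_divergence(sq_norm, sq_norm_grad, p, q)
kl_divergence = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)

print(f"squared Euclidean distance: {squared_distance:.4f}")
print(f"KL divergence: {kl_divergence:.4f}")
```

The convenient part is that one formula covers several familiar ways of measuring how far apart two things are, depending only on which convex function you plug in.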
Measure several times 🔬
Collecting data only once is not a good approach, especially if you need to make predictions for new phenomena.
Instead, data scientists recommend making multiple measurements from various sources to check the data’s consistency.
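A consistency check across sources can be as simple as the sketch below: average the readings of each source, take the median of those averages as a reference, and list the sources that stray too far from it. The station names, the readings, and the tolerance of 1 degree are invented for the example.

```python
import statistics


def inconsistent_sources(readings_by_source, tolerance):
    """Return the sources whose mean is too far from the median source mean."""
    means = {
        name: statistics.mean(values)
        for name, values in readings_by_source.items()
    }
    # The median is less sensitive to one broken source than the mean
    reference = statistics.median(means.values())
    return sorted(
        name for name, mean in means.items()
        if abs(mean - reference) > tolerance
    )


# Hypothetical temperature readings (in °C) for the same place and time
readings = {
    "station_a": [20.1, 19.8, 20.3],
    "station_b": [20.0, 20.4, 19.9],
    "station_c": [24.9, 25.3, 25.1],
}

suspects = inconsistent_sources(readings, tolerance=1.0)
print("Sources to double-check:", suspects)
```

With a single source, a miscalibrated sensor like the third one here would go unnoticed, which is the whole point of measuring several times.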
Can peers reproduce your model? 🤼
In the expression “data science”, there’s the word “science”.
If other scientists cannot reproduce your model with different ranges and conditions, then it’s not science, and you’ve probably not captured a meaningful phenomenon.
Causes and consequences 🤔
Data can be tricky. It’s not uncommon to find closely correlated results without being able to establish that one thing causes the other.
However, it’s easy to take the correlation as a meaningful signal and build wrong theories about the future. For example, people might conclude that sales stagnate because not enough users click on a specific button, while the main problem is elsewhere.
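Here is a small simulation of that trap, with entirely made-up numbers. A hidden factor, daily traffic, drives both button clicks and sales, so the two look tightly linked even though, by construction, clicks have no effect on sales. Once the effect of traffic is removed from both, the link should mostly disappear. The coefficients, noise levels, and function names are assumptions chosen for the illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def residuals(ys, xs):
    """What is left of ys after removing its best straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hidden common cause: daily traffic drives BOTH clicks and sales
traffic = [rng.uniform(100, 1000) for _ in range(500)]
clicks = [0.10 * t + rng.gauss(0, 5) for t in traffic]
sales = [0.02 * t + rng.gauss(0, 1) for t in traffic]

raw = pearson(clicks, sales)
controlled = pearson(residuals(clicks, traffic), residuals(sales, traffic))

print(f"clicks vs. sales: {raw:.2f}")
print(f"clicks vs. sales once traffic is accounted for: {controlled:.2f}")
```

The first number should come out close to 1 and the second close to 0. In real life, you rarely know the hidden factor in advance, which is exactly why a strong correlation alone is not enough to build a theory on.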
IMHO, it’s one of the biggest mistakes people make with their data.
Machine Learning is extra powerful, but it’s not magic at all. It won’t make critical decisions for you.
It requires monitoring, and data is sometimes hard to interpret correctly.
Photo by Drew Graham on Unsplash