13/01/2021 – Recommendation of the day

Quora.com Question (2019):

“Machine learning seems to have settled down into ~ 1000 algorithms. Can’t we simply automate the job of a data scientist by just trying them all on any particular case and retaining the best performing one?”

Some of the interesting answers:

“I am afraid you do not understand what the title “Data scientist” entails. There are 2 main branches in data science/machine learning at the moment: software development (which is also called data science since you are using machine learning frameworks/scalable solutions for these algorithms), and actual data science. …”

“Where did this 1000 number come from? There is an almost infinite number of valid network arrangements for every problem, and only a few dozen prominent types of techniques to apply. Neither is near 1000, nor are they things you can cycle through and just try. Let’s try something simpler, say the preparation of data for processing. You need to normalize it. This can mean rearranging columns or converting values into a different form, among many other tasks. Even those two things are simple, but we don’t have a way to just try all the options and see what works. If I need to compute the speed between two measurements, do I just let the computer randomly choose between operations and inputs until it hits upon the right one? No, of course not…”
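
To make the speed example from this answer concrete: a person knows that speed is the change in position divided by the change in time, so the operation and its inputs are chosen, not searched for. Below is a minimal Python sketch under assumptions that are not in the original answer: positions are one-dimensional and in metres, timestamps are in seconds, and the function name `speed_between` is made up for illustration.

```python
def speed_between(pos_a_m: float, t_a_s: float, pos_b_m: float, t_b_s: float) -> float:
    """Average speed in m/s between two (position, time) measurements."""
    elapsed_s = t_b_s - t_a_s
    if elapsed_s == 0:
        raise ValueError("measurements share a timestamp; speed is undefined")
    return (pos_b_m - pos_a_m) / elapsed_s


# Hypothetical measurements: 100 m at t = 0 s, then 250 m at t = 10 s.
print(speed_between(100.0, 0.0, 250.0, 10.0))  # 15.0
```

Nothing in the data itself says that position should be divided by time; that choice comes from understanding what the columns mean.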

“A data scientist doesn’t (or definitely shouldn’t) just throw all the existing algorithms at a problem to see which one sticks. A data scientist’s job is to create understanding out of raw data… A lot of that cannot be automated quite so easily. Throwing a classification algorithm at a regression problem is not going to work. The same goes for structured data. For example, I’m currently dealing with data coming from different sessions of buses; each point is a location and time stamp, together with some more information. A random machine learning algorithm thrown at that data will not understand that it needs to look at individual sessions separately and treat them as sequential data.”
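
The bus example in this answer describes a preparation step that has to happen before any model sees the data: split the points by session and put each session in time order. A rough Python sketch of that step follows. The record layout (`Ping` with `session_id`, `timestamp`, `lat`, `lon`) and the sample values are assumptions made up for illustration, not the answerer's actual data.

```python
from collections import defaultdict
from typing import NamedTuple


class Ping(NamedTuple):
    """One hypothetical bus measurement."""
    session_id: str
    timestamp: float  # seconds since some reference time
    lat: float
    lon: float


def group_by_session(pings):
    """Return {session_id: [pings sorted by timestamp]}."""
    sessions = defaultdict(list)
    for ping in pings:
        sessions[ping.session_id].append(ping)
    for sequence in sessions.values():
        sequence.sort(key=lambda ping: ping.timestamp)
    return dict(sessions)


# Made-up points arriving out of order and mixed across two sessions.
pings = [
    Ping("bus-A", 20.0, 44.80, 20.46),
    Ping("bus-B", 5.0, 44.82, 20.40),
    Ping("bus-A", 10.0, 44.79, 20.45),
]
sessions = group_by_session(pings)
print([ping.timestamp for ping in sessions["bus-A"]])  # [10.0, 20.0]
```

Knowing that `session_id` is a grouping key and `timestamp` an ordering key is domain knowledge; an algorithm fed the flat table would treat them as two more numeric or categorical columns.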

“Yes and No: It seems many problems arising from different fields of study like speech, image, text, music, control etc., can be solved using any of the “standard” algorithms. These standard algorithms include random forests, gradient boosting, Monte Carlo…and the list goes on.

When I say “Yes”: even though the internal workings of these algorithms are different, they still have a common objective. So, if you have a well-defined objective for your problem, you can iterate over all possible ML models and obtain more accurate and precise predictions. This works fine if you are only concerned about prediction accuracy. If you would like to infer something about the variables involved or the model itself, this brute-force approach is not going to help.

Now I come to the “No” part of my answer. As I mentioned above, accuracy is not always king. ML has been very successful in providing better predictions, but it is still in its infancy when compared to more mature subjects like mathematical statistics or physics. No general framework has been identified so far that explains why some of these algorithms are exceptionally good at solving one particular class of problems while others are not.”
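
The "Yes" part of this answer, iterating over candidate models against one well-defined objective and keeping the best, can be sketched in a few lines of Python. This is an illustration under loud assumptions: the three tiny hand-written models, the toy data, and the names `pick_best` and `CANDIDATES` are all made up here, and mean squared error on a held-out set stands in for the "well-defined objective". It is not how any particular AutoML product works.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average training target."""
    avg = mean(ys)
    return lambda x: avg


def fit_linear(xs, ys):
    """Ordinary least squares for a single input variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_nearest_neighbor(xs, ys):
    """Predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def pick_best(candidates, train, valid):
    """Fit every candidate on train, score on valid, return (best name, all scores)."""
    scores = {name: mse(fit(*train), *valid) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Made-up toy data: y = 3x + 2, with a small alternating disturbance on the training set.
train_x = [2.0 * i for i in range(10)]
train_y = [3 * x + 2 + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(train_x)]
valid_x = [2.0 * i + 1 for i in range(9)]
valid_y = [3 * x + 2 for x in valid_x]

CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest_neighbor": fit_nearest_neighbor}
best_name, scores = pick_best(CANDIDATES, (train_x, train_y), (valid_x, valid_y))
print(best_name)  # linear
```

The loop returns a name and a score and nothing else. It does not say why the linear model fits or what the slope means, which is the limitation the "No" part of the answer points to.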

“This is called ‘autoML’ and already exists. It’s also being developed by every major cloud provider. If this was a data scientist’s job, then they’d be very scared. However, it’s not.

Trying out different models is fun and fairly trivial in difficulty once the data is ready to go. The hard part of the job is everything around finding the best model.

Finding data, pipelining it, cleaning it, validating, wrangling it to the best functional form, mapping to reduce dimensionality, choosing a model that minimises the bias-variance tradeoff while still running in an acceptable amount of time, understanding your outputs, translating them to a business solution, putting the power of the tool in the right person’s hands, etc. There are so many other considerations that form a data scientist’s job.

Not to mention the fact that autoML could only work for supervised learning techniques and would not help with unsupervised or reinforcement techniques.

AutoML is super cool but it’s only replacing a very small part of any given data science solution.

(Also, lots of data solutions don’t even involve modelling, for instance visualisation or dashboarding projects.)”

Dragan Vukmirović

Blog o statistici
