Data Science is the latest trend in the industry. Several organizations have realized the potential of data science to generate useful information from structured and unstructured data. And with data science come “Data Scientists”.
It’s funny how everyone in the data science world calls themselves a Data Scientist. And believe me when I tell you these are a few of their reasons…
Being a data scientist is much more than that. It ranges from simple calculus, such as basic differentiation, to complex neural networks. I am not here to demoralize anyone. In fact, I am writing this blog to share my point of view on how to become one.
When I started my journey 3 years back, I faced several difficulties due to a lack of proper guidance. This blog is a high-level, step-by-step guide that gives you a better picture of the data science world as a whole. I am hopeful that it will help a few young curious souls out there.
So, let’s stop the chitchat and let’s just jump into the science pool. I am excited for you.
1. Statistics: First and most important is your stats knowledge.
a. Know the probability distributions. The picture below will surely help you. Read about these distributions.
I found a useful Cloudera blog for this: http://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/
Know the central limit theorem, which makes our lives so much easier.
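A quick way to convince yourself of the central limit theorem is to simulate it. Here is a minimal sketch using only Python's standard library. The exponential population, the sample sizes, the seed and the helper name `sample_means` are my own illustrative choices, not anything prescribed. It draws many samples from a clearly non-normal population and looks at how the sample means behave.

```python
import random
import statistics

def sample_means(n_samples, sample_size, rng):
    """Draw n_samples samples from an exponential population (mean 1.0)
    and return the mean of each sample."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable
means = sample_means(n_samples=2000, sample_size=50, rng=rng)

# The theorem says the sample means cluster around the population mean (1.0)
# with a spread of roughly sigma / sqrt(n) = 1 / sqrt(50), i.e. about 0.14,
# even though the population itself is heavily skewed.
clt_center = statistics.fmean(means)
clt_spread = statistics.stdev(means)
print(round(clt_center, 2), round(clt_spread, 2))
```

Plot a histogram of `means` with any plotting tool you like and you should see the familiar bell shape appear.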
b. Simple Stats and Jargon:
Next, you need to learn these:
Qualitative or quantitative data | Ratio/interval/ordinal/nominal data | Difference between population and sample – mean and variance | Skewness and kurtosis | Standard deviation, mean, quartiles | Chebyshev’s theorem, coefficient of variation, Bayes’ law | Least squares method | Probability theories – classical/relative frequency/subjective | Joint, marginal, conditional probability | Exclusive, exhaustive, independent events | Linear/non-linear correlation | Homo/heteroscedasticity | Outlier/anomaly
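Several of these terms become clearer once you compute them yourself. The sketch below uses Python's built-in `statistics` module on a small made-up list of numbers (the data is purely illustrative). It shows the population versus sample variance distinction, the coefficient of variation, and quartiles.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, for illustration only

mean = statistics.fmean(data)
pop_var = statistics.pvariance(data)    # divides by N     (treats data as the population)
sample_var = statistics.variance(data)  # divides by N - 1 (treats data as a sample)
sample_sd = statistics.stdev(data)
coeff_var = sample_sd / mean            # coefficient of variation: spread relative to the mean
quartiles = statistics.quantiles(data, n=4)  # three cut points: Q1, median, Q3

print(mean, pop_var, round(sample_var, 3), round(coeff_var, 3), quartiles)
```

Notice that the sample variance comes out larger than the population variance on the same numbers. That is the N - 1 correction at work.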
c. Estimations, Hypothesis testing:
Z estimate | Difference of two means | Analysis of variance (ANOVA) – one-way/multiple comparisons – Fisher, Bonferroni, Tukey methods | t-test, F-test, chi-square test
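To make the "difference of two means" idea concrete, here is a rough sketch of a two-sided z-test written with the standard library only. The function name `two_sample_z_test` and the simulated groups are my own assumptions for illustration. It also assumes the samples are large enough for the sample variances to stand in for the population variances. For small samples you would reach for a t-test instead.

```python
import math
import random
import statistics
from statistics import NormalDist

def two_sample_z_test(a, b):
    """Two-sided z-test for a difference of two means.

    Assumes large samples, so sample variances approximate the
    population variances (a textbook simplification).
    """
    diff = statistics.fmean(a) - statistics.fmean(b)
    std_err = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    z = diff / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

rng = random.Random(0)  # fixed seed so the run is repeatable
group_a = [rng.gauss(100, 15) for _ in range(200)]  # simulated data
group_b = [rng.gauss(110, 15) for _ in range(200)]  # true means differ by 10

z, p_value = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value here says the observed gap between the two group means would be very unlikely if the true means were equal.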
d. Know simple linear regression and correlation concepts in depth!
Correlation is not Causation. Get it!?
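If you want to see what simple linear regression and correlation actually compute, it helps to write them out once by hand. The sketch below fits a least squares line and Pearson's r from the textbook formulas. The helper name `fit_line` and the hours/scores numbers are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                    # change in y per unit change in x
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    r = sxy / math.sqrt(sxx * syy)       # strength of the linear relationship
    return slope, intercept, r

hours = [1, 2, 3, 4, 5]         # made-up data
scores = [52, 55, 61, 64, 68]
slope, intercept, r = fit_line(hours, scores)
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

A high r only tells you the two columns move together in a straight line. It says nothing about which one causes the other, if either.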
Do not take this section lightly. This is the backbone for becoming a data scientist. You can finish this in a week.
2. SQL
No matter where you go or work, basic SQL knowledge will definitely help you. You will use SQL (or a version of it) everywhere, even in Spark (as Spark SQL), which may be running on your Hadoop instance. And if you are a beginner in SAS, you can always use PROC SQL there.
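If you have no database at hand, you can practice basic SQL straight from Python with the built-in `sqlite3` module. The table name `orders` and its rows below are invented for this sketch. The SELECT / GROUP BY / HAVING pattern is the part that carries over broadly to other SQL flavors, though each dialect has its own quirks.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total spend per customer, keeping only customers who spent more than 10.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)  # [('alice', 2, 50.0), ('bob', 1, 12.5)]
```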
3. Now learn the math behind machine learning models and then, only then, use them!
Models are easy to use but sometimes hard to explain. Trial and error on models won’t give you correct output. Plus, in the end, you need insights that can be applied to the business. Without understanding your model, your insights may be wrong.
You need to know about distributions and their limitations, the types of analysis required, and the data transformations needed to fit a model’s requirements. You must learn how to measure predictions: which criterion should you use – accuracy, precision, recall or F1 score? Learn about ROC and AUC, multiclass evaluation, and dummy variables, and balance your classes (on heavily imbalanced classes, even 99.99% accuracy can be of no use!).
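The point about imbalanced classes is easy to check with a few lines of plain Python. In this sketch the function `classification_metrics` and the 1,000 made-up labels are my own illustration. The "model" simply predicts the majority class every time.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem.
    Precision, recall and F1 are reported as 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 1,000 made-up cases, only 10 of them positive (e.g. fraud).
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never flags anything

metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

Accuracy comes out at 99% while recall is zero: the model never finds a single positive case, which is usually the only thing the business cared about.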
Also, be aware of data leakage problems.
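One common form of leakage is computing preprocessing statistics on the full dataset before splitting it, so information from the test rows sneaks into training. The sketch below contrasts that with fitting on the training rows only. The helper names `fit_scaler` and `transform` and the numbers are hypothetical, chosen only to make the difference visible.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation used for standardization."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

feature = [10.0, 12.0, 11.0, 13.0, 9.0, 50.0, 55.0]  # made-up feature column
train, test = feature[:5], feature[5:]

# Leaky: the scaler has already "seen" the test rows.
leaky_mean, leaky_std = fit_scaler(feature)

# Safer: learn the statistics from the training rows only,
# then apply the same statistics to the test rows.
train_mean, train_std = fit_scaler(train)
train_scaled = transform(train, train_mean, train_std)
test_scaled = transform(test, train_mean, train_std)

print(round(leaky_mean, 2), train_mean)
```

The same rule applies to imputation, encoding and feature selection: fit on the training data, then apply to everything else.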
Here is a very useful cheat sheet from scikit-learn, an ML package in Python. It helps you choose the right model.
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Learn these models. It does look complex at first but, believe me, it’s not. Once you finish all the previous steps and the math behind the models, you will get the hang of the coding easily. Generally, just the function name needs to change from one model to the next.
To take it to the next level, learn these techniques:
- K-fold, Stratified, Leave-one-out Cross Validation | Validation curve
- Multiclass Evaluation – Macro and Micro Average
- Ensemble Learning
- Random Forest
- Voting Classifier – Soft and Hard
- Sampling Techniques – Bagging, Pasting
- Boosting – Adaptive Boosting and Gradient Boosting: can be used for both classification and regression
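To demystify the first item in the list, here is a bare-bones sketch of how k-fold cross validation splits the data. The generator `k_fold_indices` is a hypothetical helper written for this post, not a library function, and it deliberately skips shuffling and stratification. In practice you would normally use a library implementation.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for plain k-fold cross validation.
    No shuffling or stratification: folds are consecutive blocks."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_items=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out is just k-fold with k equal to the number of rows.
loo_folds = list(k_fold_indices(n_items=5, k=5))
```

Every row lands in the test set exactly once, so each model is always scored on rows it never trained on.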
Now if you feel you’ve got this, start exploring neural networks: Perceptron – Hebb’s rule, Multilayer perceptron – backpropagation, epoch, batch size. For this, Python does much better work than R, especially because of its amazing TensorFlow package.
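Before jumping to a deep learning package, it is worth writing a single perceptron by hand once. The sketch below uses the classic error-driven perceptron learning rule (not Hebb's rule in its pure form) to learn the logical AND function. The function names, learning rate and epoch cap are my own illustrative choices.

```python
def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Classic perceptron learning rule with a step activation.
    Weights are updated after every sample, i.e. a batch size of 1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):  # one epoch = one full pass over the data
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted
            if error:
                # Nudge the weights towards the correct answer.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                errors += 1
        if errors == 0:  # a full pass with no mistakes: stop early
            break
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]

weights, bias = train_perceptron(and_inputs, and_labels)
predictions = [predict(weights, bias, x) for x in and_inputs]
print(predictions)
```

A single perceptron can only learn linearly separable problems like AND. Try XOR and it never converges, which is exactly why multilayer perceptrons and backpropagation exist.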
But sometimes you’ll have no idea why one model gives far better results than all the others. You just trust it, especially with neural nets.
I personally prefer a simpler model with 1-1.5 % lower accuracy over a complex model for its ease of explainability and usability.
4. Data Visualization
In my view, there are two kinds of data visualization.
a. Used for model building & descriptive analytics – for data cleaning, model improvement, or any of the steps above. These are generally for developers and are non-fancy, mainly meant to give a sense of the data and predictions.
b. Used for non-tech stakeholders – In my opinion, this is the most important step because everything depends on communicating effectively. Believe it or not, many people miss this important skill. I feel that’s because they don’t understand the importance of visualization books. They will read textbooks for all the steps above except this one!
Tools: Tableau, Power BI. In fact, R & Python also have some nice packages for this.
Don’t be Fred!
Live and let others live.
5. Communication
The whole point of a Data Scientist’s job is to communicate the findings to senior-level people, including VPs and directors. If you can’t convey your message effectively, all your efforts will go to waste and your findings won’t convert into a product. You should always be ready to present or talk to a lot of people. Proper visualizations, as discussed above, and good public speaking skills will make it easy.
Normally, for an average data scientist, these skills are enough. But I do believe they shouldn’t depend on a Data Engineer for data every time. You request and re-request, and this goes on and on until you get cleaner, proper data. I feel I should be able to stand on my own two feet and pull the updated data whenever I want and however I want it. It increases my productivity by saving me time.
Easier said than done. Because of the data boom, SQL knowledge alone will not cut it. You should also know about the big data architecture you pull data from.
6. Big Data – Hadoop
Now the sample you build the model on is nothing compared to the actual world – the Big Fish. You need to pull data from the Hadoop infrastructure or ask one of your Data Engineers to do so. Data engineers are responsible for cleaning, preparing and optimizing data for consumption. I feel that being able to get the data you need on your own is one big step towards being a scientist.
Knowledge of Hadoop Architecture, Linux and HDFS Basic Commands, Sqoop, Impala, Hive, Pig Latin, Flume, Solr, Spark and Scala would be enough. Plus, you can always refer to the documentation if you need to get the code. Just basic knowledge of how to use it would be sufficient for a Data Scientist.
7. Amazon Web Services (AWS): Cloud Computing Services
Building and maintaining a big data architecture is very expensive and time-consuming. This is where cloud computing comes in. You can do all the analytical processing for your production instance in the cloud with the help of these services. The major giant in this market is Amazon Web Services (AWS). Basic knowledge of how to use services like EC2, S3, DynamoDB, EMR, Athena, Lambda, and Elasticsearch should be enough.
8. Certifications
Now, if you really want to showcase these skills to the world, you can start with the 4 certifications below:
1. Dell EMC – Data Scientist Associate (DECA-DS) Certification
2. Cloudera – CCA Spark and Hadoop Developer Exam (CCA175)
3. Amazon Web Services – AWS Certified Developer/ AWS Certified Solutions Architect – Associate
4. Tableau – Tableau Desktop/ Server – Associate
I also did a Salesforce certification to understand CRM behavior better. Salesforce Wave for Big Data can access data from on-premise Hadoop clusters and cloud-based big data repositories (AWS). It just broadens your knowledge (not necessary, your choice).
If you agree/disagree or have any other certification recommendations, do let the world know.
The world is one big data problem. Let’s solve it together with data science.