How to become an expert in Data Science: a step-by-step training plan.
Few people can predict events with one hundred percent accuracy, but data scientists have learned to get close. We have gathered the latest trends in Data Science and made a plan for those who want to explore this area in depth.
Language selection
There are currently two main languages used in data science: Python and R. The R language is used for complex financial analyses and scientific research, so studying it in depth can be put off until later.
At the initial stage, you can focus on learning the basics:
- nuances of RStudio;
- Rcmdr, rattle and Deducer libraries;
- container data types, vectors and primary data types;
- factors, structures and matrices.
The Quick-R site will help you quickly grasp the theory of the R language.
Python is more popular: it is easier to learn to write code in, and many packages for data visualization, machine learning, natural language processing, and complex data analysis are written for it.
What is important to learn in Python:
- functions, classes, objects;
- data structures;
- basic algorithms and libraries;
- debugging and testing code properly;
- Jupyter Notebook;
- Git.
Mastering the basic concepts of Python will take you about 4-6 weeks, provided that you spend 2-3 hours a day studying.
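As a rough illustration of the topics listed above (functions, classes, data structures, and testing), here is a minimal sketch. The `Sample` class and the helper names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """A hypothetical labelled measurement, used only for illustration."""
    label: str
    value: float


def mean_value(samples):
    """Return the arithmetic mean of the sample values."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s.value for s in samples) / len(samples)


def group_by_label(samples):
    """Collect sample values into a dict keyed by label."""
    groups = {}
    for s in samples:
        groups.setdefault(s.label, []).append(s.value)
    return groups


samples = [Sample("a", 1.0), Sample("b", 4.0), Sample("a", 3.0)]

# Plain assertions act as a minimal test here; a larger project
# would more likely use a test runner such as pytest or unittest.
assert mean_value(samples) == 8.0 / 3
assert group_by_label(samples) == {"a": [1.0, 3.0], "b": [4.0]}
```

Running the file silently means both checks passed; a failing check raises `AssertionError` and points at the offending line, which is a simple starting point for debugging.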
Python Libraries
NumPy
NumPy is a library for scientific computing. Almost every Python package for Data Science or Machine Learning depends on it: SciPy (Scientific Python), Matplotlib, Scikit-learn.
NumPy helps you perform mathematical and logical operations: it provides multidimensional (n-dimensional) arrays and matrices, along with high-level math functions for working with them.
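A short sketch of what this looks like in practice, assuming NumPy is installed. The array values are arbitrary and serve only to show element-wise operations, logical masks, and aggregation along an axis.

```python
import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)  # (2, 3)
print(a.ndim)   # 2

# Arithmetic and logical operations apply to every element at once.
doubled = a * 2
mask = a > 3            # boolean array with the same shape as `a`
large = a[mask]         # 1-D array of the elements where the mask is True

# Aggregations along an axis: axis=0 collapses the rows (one value per column),
# axis=1 collapses the columns (one value per row).
col_means = a.mean(axis=0)
row_sums = a.sum(axis=1)

# High-level math functions also work element by element.
roots = np.sqrt(a)
```

The same operations written with nested Python lists would need explicit loops, which is why most data science packages build on NumPy arrays.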
Why do you need to know mathematics? Why can't the computer just calculate everything?
Often, machine learning methods use matrices to store and process input data. Matrices, vector spaces, and linear equations are all linear algebra.
To understand how machine learning methods work, you need to know mathematics well. It is therefore better to complete a full algebra course, either on your own or with mentors.
In addition, mathematics and calculus are important for process optimization. Knowing them makes it easier to improve the speed and accuracy of machine learning models.
What is important to master:
- basics of linear algebra: linear combinations, linear dependence and independence, dot product and cross product, matrix transformations, matrix multiplication;
- inverse functions;
- arrays;
- processing of mathematical expressions and statistical data;
- visualization via Matplotlib, Seaborn or Plotly.
Where to brush up on NumPy: the official documentation.
Where to brush up on algebra: Calculus (Chapter 11) for Data Science.
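The linear algebra topics listed above map fairly directly onto NumPy calls. The following is a sketch with arbitrary example vectors and matrices, not a full treatment of the subject.

```python
import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

# Linear combination 2u + 3v.
combo = 2 * u + 3 * v

# Dot product (0 here, since u and v are orthogonal) and cross product
# (a vector perpendicular to both u and v).
dot = np.dot(u, v)
cross = np.cross(u, v)

# A matrix transformation: rotating a point in the plane by 90 degrees
# counter-clockwise is a matrix-vector multiplication.
rotate = np.array([[0.0, -1.0],
                   [1.0, 0.0]])
rotated = rotate @ np.array([1.0, 0.0])

# Solving the linear system A x = b:
#   2x +  y = 3
#    x + 3y = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# The inverse matrix and the rank; full rank means the rows are linearly independent.
A_inv = np.linalg.inv(A)
rank = np.linalg.matrix_rank(A)
```

In practice `np.linalg.solve` is usually preferred over multiplying by `A_inv`, since it avoids forming the inverse explicitly.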
Pandas
Pandas is an open source library built on NumPy. It lets you quickly analyze, clean, and prepare data. Think of it as a kind of Excel for Python.
The library works well with data from different sources: Excel sheets, CSV files, SQL databases, web pages.
What is important to master:
- reading and writing many different data formats;
- selection of data subsets;
- finding and filling in missing data;
- applying operations to independent data groups;
- data conversion into different forms;
- combining multiple data sets together;
- extended time series functionality.
Where to brush up on Pandas: PyData.
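The skills listed above can be sketched on a tiny table. The data, column names, and store names below are invented for illustration, and the CSV round trip goes through an in-memory string rather than a real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sales data with one missing revenue value.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "store": ["north", "south", "north", "south"],
    "revenue": [100.0, np.nan, 120.0, 80.0],
})
stores = pd.DataFrame({"store": ["north", "south"], "manager": ["Ann", "Bob"]})

# Selecting a subset of rows.
north = sales[sales["store"] == "north"]

# Finding and filling in missing data (here with the column mean, one of several options).
n_missing = int(sales["revenue"].isna().sum())
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Applying an operation to independent groups.
revenue_by_store = sales.groupby("store")["revenue"].sum()

# Combining two data sets.
merged = sales.merge(stores, on="store", how="left")

# Converting the data into a different shape: one column per store.
wide = sales.pivot(index="date", columns="store", values="revenue")

# Time series functionality: total revenue per two-day window.
two_day = sales.set_index("date")["revenue"].resample("2D").sum()

# Writing and reading a data format (CSV).
csv_text = sales.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
```

Filling gaps with the mean is only one strategy; depending on the data, dropping rows or interpolating may be more appropriate.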
Databases and Information Collection
If you are already familiar with Python, Pandas and NumPy, you can begin learning how to work with databases and parse information.
SQL
Although NoSQL and Hadoop have already taken root in data science, it is still important to be able to write and execute complex queries in SQL.
Often, raw data — from electronic medical records to customer transaction history — is stored in organized collections of tables called relational databases. To be a good data expert, you need to know how to process and extract data from these databases.
What you need to learn:
- adding, deleting and retrieving data from databases;
- performing analytical functions and transforming database structures;
- PostgreSQL;
- MySQL;
- SQL Server.
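The basic operations listed above can be tried without installing a database server. The sketch below swaps in SQLite through Python's built-in `sqlite3` module instead of PostgreSQL, MySQL, or SQL Server; the table and column names are hypothetical, and SQL dialects differ in details.

```python
import sqlite3

# An in-memory database that disappears when the connection is closed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)

# Adding data; the ? placeholders keep values separate from the SQL text.
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Eve")],
)
cur.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 10.0), (3, 5.0)],
)

# Deleting data: drop transactions below 10.
cur.execute("DELETE FROM transactions WHERE amount < ?", (10.0,))

# Transforming the structure: add a column to an existing table.
cur.execute("ALTER TABLE customers ADD COLUMN country TEXT DEFAULT 'unknown'")

# Retrieving data with an analytical query: a join plus aggregation per customer.
rows = cur.execute(
    """
    SELECT c.name, COUNT(t.id) AS n_transactions, COALESCE(SUM(t.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN transactions AS t ON t.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
    """
).fetchall()

conn.commit()
conn.close()
```

The `LEFT JOIN` keeps customers who have no remaining transactions, which is why the query uses `COALESCE` to turn their missing total into 0.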
Algorithms
Being a programmer without knowing algorithms is scary, and being a Data Scientist without knowing them is dangerous. So if you have already mastered Python, Pandas, NumPy, SQL and APIs, it's time to learn how to use these technologies for research.
The speed of a good specialist often depends on three factors: the question posed, the amount of data and the algorithm chosen.
Therefore, at this stage it is important to understand the key algorithms and data structures: Bellman-Ford, Dijkstra, binary search (with binary trees as a tool), depth-first search and breadth-first search.
Tproger (algorithms, data structures) and Khan Academy will help you brush up on these topics.
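Three of the algorithms named above fit in a few lines each. This is a minimal sketch that assumes a graph stored as a dict of dicts with non-negative edge weights; the example graph is made up, and Bellman-Ford and depth-first search are left out for brevity.

```python
import heapq
from collections import deque


def binary_search(sorted_items, target):
    """Return the index of target in a sorted list, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


def bfs_order(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def dijkstra(graph, start):
    """Return shortest distances from start; edge weights must be non-negative."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical weighted directed graph: graph[u][v] is the weight of edge u -> v.
graph = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}

index = binary_search([1, 3, 5, 7, 9], 7)
visit_order = bfs_order(graph, "A")
distances = dijkstra(graph, "A")
```

Binary search runs in logarithmic time but needs sorted input, while Dijkstra's algorithm fails on negative edge weights, which is the case Bellman-Ford handles.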
Machine learning and neural networks
It's time to apply your skills to real problems. By this stage it is important to know the mathematics: finding, cleaning and preparing data, building models in terms of mathematics and statistics, and optimizing them with calculus all depend on it.
Real problems are most often solved with the help of serious libraries like TensorFlow and Keras.
What you need to master:
- data preprocessing;
- linear and logistic regression;
- clustering and unsupervised learning;
- time series analysis;
- decision trees;
- recommendation systems.
You can further strengthen your knowledge of machine learning here: Machine Learning by Andrew Ng.
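To make linear regression, the simplest topic in the list above, concrete, here is a sketch that fits a line by least squares using NumPy only rather than TensorFlow or Keras. The data is synthetic (generated from y = 2x + 1 plus small noise), and the split sizes are arbitrary assumptions.

```python
import numpy as np

# Synthetic data: y = 2x + 1 with a little Gaussian noise (seeded for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

# Shuffle the indices and hold out 10 points for evaluation.
idx = rng.permutation(len(x))
train, test = idx[:40], idx[40:]


def design_matrix(values):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones_like(values), values])


# Least-squares fit: find the coefficients minimizing ||X w - y||^2 on the training set.
coef, *_ = np.linalg.lstsq(design_matrix(x[train]), y[train], rcond=None)
intercept, slope = coef

# Evaluate on the held-out points with mean squared error.
predictions = design_matrix(x[test]) @ coef
mse = float(np.mean((predictions - y[test]) ** 2))

print(f"intercept={intercept:.2f}, slope={slope:.2f}, test MSE={mse:.4f}")
```

The fitted slope and intercept should land close to 2 and 1. Libraries such as Scikit-learn, TensorFlow and Keras wrap the same idea in higher-level interfaces and extend it to far larger models.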