Next Steps

If you’ve made it through the other chapters in this book, you have built a basic working knowledge of some of the fundamentals of data science, including data visualization and manipulation, probability and statistics, and linear algebra and dimensionality reduction. You have learned the basics of some of the most important libraries for working with data in Python, including Pandas, NumPy, Matplotlib, SciPy, and Scikit-learn. However, data science is a very broad field that requires a diverse skill set. In this chapter, I provide some pointers to books and online courses that you might consider to further develop your data-science skill set.

Machine Learning, Statistical Learning, and Neural Networks

The next step for many readers of this book will be to learn about machine learning (ML) techniques. A good book to learn most of the basic concepts and techniques of ML is Machine Learning: A First Course for Engineers and Scientists[1] by Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön, published by Cambridge University Press, 2022 (ISBN 1108843603).

To put these ideas into practice, I recommend Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow[1] (3rd edition) by Aurélien Géron, published by O’Reilly Media, 2022 (ISBN 1098125975). This is a comprehensive book that includes both theory and practice. It builds upon our work with scikit-learn and leverages the popular Keras and TensorFlow libraries for creating neural networks. This book covers some of the most advanced and popular neural network architectures and techniques as of 2022, including attention and transformers.

Many ML techniques can be considered extensions of statistical techniques and thus are a type of statistical learning. For accessible but broad coverage of statistical learning techniques, I recommend An Introduction to Statistical Learning: with Applications in Python[1] by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, published by Springer, 2023 (ISBN 3031387465).

Online courses:

As an alternative to a textbook, Inria and France Université Numérique (FUN) periodically offer a massive open online course (MOOC) called “Machine learning in Python with scikit-learn.” Some of the scikit-learn core developers are involved in this course. The course runs for thirteen weeks and covers many of the machine-learning models in scikit-learn. (It does not cover neural networks.) For details, see https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/

To learn more about neural networks from an online course, I recommend Practical Deep Learning for Coders, from fast.ai: https://www.fast.ai/. This is a two-part course. Part 1 covers the basics of neural networks, and Part 2 delves into advanced topics, including popular neural network architectures and techniques as of 2022. Part 1 of the course is based on the book Deep Learning for Coders with fastai & PyTorch: AI Applications Without a PhD[1] by Jeremy Howard and Sylvain Gugger, published by O’Reilly Media, 2020 (ISBN 1492045527).

An Overview of Data Science

For readers who would like to get a broad overview of topics and techniques used in data science, I recommend Data Science from Scratch[1] (2nd ed.) by Joel Grus, published by O’Reilly Media, 2019 (ISBN 1492041130). As with the book you are reading, Data Science from Scratch uses Python. It covers a broader range of topics, including probability and linear algebra, statistical tests, collecting and working with data, machine learning, natural language processing, databases, and ethics. No single topic is treated in great depth, but the book gives you an idea of many important topics of interest to data scientists.

Pandas

The Pandas library is widely used by data scientists for loading, cleaning, manipulating, and storing data. There are several good choices to deepen your understanding of this powerful library, depending on your learning goals. Two that I recommend most are:

  • Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter[1] (3rd edition) by Wes McKinney, published by O’Reilly Media, 2022 (ISBN 109810403X). This book covers the most important Pandas techniques and how to apply them. It provides detailed background on Python and NumPy. This book also provides an introduction to several statistical and machine learning libraries, including Patsy, statsmodels, and scikit-learn.

  • For an advanced treatment that focuses solely on Pandas, I recommend Effective Pandas: Patterns for Data Manipulation[1] by Matt Harrison, independently published, 2021 (ISBN-13 979-8772692936). This book offers comprehensive and sophisticated coverage of Pandas, including advanced topics such as styling Pandas output and debugging Pandas code.

Data Visualization

In this textbook, we have used a variety of plots, including scatter plots, histograms, line plots, contour plots, and heatmaps. However, there are many important topics in data visualization that could not be covered. These include principles for visualizing data and techniques for making visualizations attractive and effective at conveying a message. Many books address these topics from different aspects, but I will suggest one that I feel is most targeted toward data scientists and most usable as a follow-up text to the book you are reading. That book is Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures[1] by Claus Wilke, published by O’Reilly Media, 2019 (ISBN 1492031089). This book provides a series of chapters for visualizing different types of data but also includes a variety of topics that address important principles of data visualization. The code to generate the figures in the book is provided on the author’s GitHub page in the R programming language at clauswilke/dataviz. However, code to create the visualizations in Python is also provided by Hoa Nguyen at hnguyentt/dataviz-python.

Databases

Databases are systems for storing data in an organized way. They facilitate the storage and maintenance of huge amounts of data and typically provide sophisticated techniques for retrieving selected data for further processing. Thus, understanding databases and how to work with them is an important skill for data science. For those who are new to databases, I recommend learning about Structured Query Language (SQL, pronounced “ess cue ell” or like the word “sequel”), which is a language for creating, manipulating, and querying databases. SQL databases are the most common structured databases.

For a general introduction to databases and SQL, I recommend Getting Started with SQL: A Hands-On Approach for Beginners[1] by Thomas Nield, published by O’Reilly Media, 2016 (ISBN 1491938617). This book has the advantage that it teaches users about SQL programming through the use of SQLite and SQLiteStudio. SQLite is a serverless SQL database, and SQLiteStudio provides a graphical interface for working with the SQLite database. The combination makes it easy for users to get started learning SQL. The book covers all the basic features of SQL and covers important topics, such as how to design a database structure. It also provides links to next steps.
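To give a feel for what working with a serverless SQL database looks like, here is a minimal sketch that uses Python's built-in sqlite3 module rather than SQLiteStudio. It is not taken from either recommended book. The table name, column names, and data values are made up for illustration only.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk and no server is needed.
conn = sqlite3.connect(":memory:")

# Create a table (hypothetical schema, for illustration only).
conn.execute("CREATE TABLE measurements (city TEXT, temp_c REAL)")

# Insert a few made-up rows using parameter placeholders.
rows = [("Austin", 31.0), ("Boston", 22.5), ("Austin", 33.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# Query the data: average temperature per city, sorted by city name.
query = """
    SELECT city, AVG(temp_c) AS mean_temp
    FROM measurements
    GROUP BY city
    ORDER BY city
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

The three SQL statements correspond to the three roles of SQL mentioned above: creating, manipulating, and querying a database. The query returns a list of (city, mean temperature) tuples.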

If you prefer to learn about SQL in the context of data science, I recommend SQL for Data Scientists: A Beginner’s Guide for Building Datasets for Analysis[1] by Renee M. P. Teate, published by Wiley, 2021 (ISBN 1119669367). This book uses an example of a Farmer’s Market Database throughout the book, and the book’s website provides an Interactive SQL Editor that can be used to interact with this database without having to install an SQL server.

Probability

For readers who want to learn more about probability, I recommend Probability, Statistics, and Random Processes for Electrical Engineering[1] (3rd edition) by Alberto Leon-Garcia, published by Pearson, 2008 (ISBN 0131471228). This is a traditional textbook targeted at undergraduate and graduate engineering students, and it treats probability and statistics using a theoretical and mathematical approach. This book does not require an electrical engineering (EE) background but does have some examples that are more targeted toward EE. This book covers a huge amount of material. It is very well written and has lots of examples to help readers master the material.

Keeping Up with What’s Next

“Towards Data Science” (https://towardsdatascience.com/) is a Medium publication that produces a constant stream of curated articles related to data science and machine learning. This website will help you keep up with current trends in the field, and it also has many tutorials that will help you learn new statistical techniques.