The 17 Best Free Tools for Data Science. Dataquest, November 8, 2019
“If you’re just getting started, we’ve picked out some of our absolute favorites: the best free tools for data science using Python, R, and SQL.
- R
A key benefit of the R language is that it was designed primarily for statistical computing, so many of the key features that data scientists need are built-in.
- Python
Like R, Python was also created in the 90s. But unlike R, Python is a general-purpose programming language. It’s often used for web development, and it is one of the most popular overall programming languages. If you learn Python and later decide that software development is a better fit for you than data science, a lot of what you’ve learned is transferable.
- SQL
SQL is a complementary language to Python and R; often it will be the second language someone learns if they’re looking to get into data science. Because most of the world’s data is stored in databases, SQL is an incredibly valuable language to learn. It’s common for data scientists to use SQL to retrieve data that they will then clean and analyze using Python or R.
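As a rough illustration of that last workflow, the sketch below uses SQL to retrieve rows and Python to summarize them. It relies on the sqlite3 module from Python's standard library; the sales table, its columns, and its values are made up for the example and are not from any real dataset.

```python
import sqlite3

# In-memory database with made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", None)],
)

# SQL does the retrieval: keep only rows that actually have an amount.
rows = conn.execute(
    "SELECT region, amount FROM sales WHERE amount IS NOT NULL"
).fetchall()
conn.close()

# Python does the analysis: total amount per region.
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

print(totals)
```

In real projects the database would usually be a server such as PostgreSQL or MySQL rather than an in-memory SQLite file, but the division of labor is the same.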
R Packages
R has a thriving ecosystem of packages that add functionality to the core R language. These packages are distributed by CRAN (the Comprehensive R Archive Network) and can be downloaded using R syntax (unlike Python, which uses separate package managers). The packages we list below are some of the most commonly used and popular packages for data science in R.
- Tidyverse
Technically, tidyverse is a collection of R packages, but we include it here as a single entry because it is the most commonly used set of packages for data science in R. Key packages in the collection include dplyr for data manipulation, readr for importing data, ggplot2 for data visualization, and many more.
- ggplot2
The ggplot2 package allows you to create data visualizations in R. Even though ggplot2 is part of the tidyverse collection, it predates the collection and is important enough to mention on its own.
ggplot2 is popular because it allows you to create professional-looking visualizations fast using easy-to-understand syntax.
R includes plotting functionality built-in, but the ggplot2 package is generally considered superior and easier to use, and it is the number one R package for data visualization.
- R Markdown
The R Markdown package facilitates the creation of reports using R. R Markdown documents are text files that contain code snippets interleaved with markdown text.
R Markdown documents are often edited in a notebook interface that allows the creation of code and text side by side. The notebook interface allows the code to be executed and the output of the code to be seen inline with the text.
R Markdown documents can be rendered into many versatile formats including HTML, PDF, Microsoft Word, books, and more!
- Shiny
The Shiny package allows you to build interactive web apps using R. You can build functionality that allows people to interact with your data, analysis, and visualizations as a web page.
Shiny is particularly powerful because it removes the need for web development skills and knowledge when creating apps and allows you to focus on your data.
- mlr
The mlr package provides a standard set of syntax and features that allow you to work with machine learning algorithms in R. While R has built-in machine learning capabilities, they are cumbersome to work with. mlr provides an easier interface so you can focus on training your models.
mlr contains classification, regression, and clustering analysis methods as well as many other related capabilities.
Python Libraries
Like R, Python also has a thriving package ecosystem, although Python packages are often called libraries.
Unlike R, Python’s primary purpose is not as a data science language, so use of data-focused libraries like pandas is more or less mandatory for working with data in Python.
Python packages can be downloaded from PyPI (the Python Package Index) using pip, a tool that comes with Python but is run from the command line rather than from inside Python code.
(A complementary alternative to pip is the conda package manager, which we’ll talk about later on.)
- pandas
The pandas library is built for cleaning, manipulating, transforming and visualizing data in Python. Although it’s a single package, its closest analog in R is the tidyverse collection.
In addition to offering a lot of convenience, pandas is also often faster than pure Python for working with data. Like R, pandas takes advantage of vectorization, which speeds up code execution.
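To make the vectorization point concrete, here is a minimal sketch that computes the same result twice: once with a Python-level loop and once with a single column-wise expression. The column names and values are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the idea.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Pure-Python approach: visit the rows one at a time.
loop_revenue = [p * q for p, q in zip(df["price"], df["quantity"])]

# Vectorized approach: one expression applied to whole columns at once.
df["revenue"] = df["price"] * df["quantity"]

print(df)
```

Both approaches give the same numbers. On three rows the difference in speed is not noticeable, but on large tables the vectorized form is typically much faster because the per-element work happens in compiled code rather than in the Python interpreter.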
- NumPy
NumPy is a fundamental Python library that provides functionality for scientific computing. NumPy provides some of the core logic that pandas is built upon. Usually, most data scientists will work with pandas, but knowing NumPy is important as it allows you to access some of the core functionality when you need to.
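One way to see the relationship between the two libraries is to drop from a pandas object down to the NumPy array underneath it. The sketch below assumes a small made-up Series named "area".

```python
import numpy as np
import pandas as pd

# Made-up values, purely for illustration.
s = pd.Series([1.0, 4.0, 9.0], name="area")

# Drop down to the underlying NumPy array.
arr = s.to_numpy()

# NumPy functions work on the array directly...
roots = np.sqrt(arr)

# ...and they also accept the pandas Series, returning a Series.
roots_series = np.sqrt(s)

print(type(arr).__name__, roots)
print(type(roots_series).__name__)
```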
- Matplotlib
The Matplotlib library is a powerful plotting library for Python. Data scientists often use the Pyplot module from the library, which provides a standard interface for plotting data.
The plotting functionality that is included in pandas calls Matplotlib under the hood, so understanding Matplotlib helps with customizing plots you make in pandas.
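A short sketch of what that looks like in practice: the plot is created through pandas, and the object that comes back is a Matplotlib Axes, so the usual Matplotlib methods apply. The data and labels are invented for the example, and the "Agg" backend is selected only so the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented numbers, purely for illustration.
df = pd.DataFrame({"year": [2017, 2018, 2019], "users": [100, 150, 230]})

# pandas draws the plot, but what it hands back is a Matplotlib Axes object.
ax = df.plot(x="year", y="users")

# From here on, ordinary Matplotlib methods customize the plot.
ax.set_title("Users per year")
ax.set_ylabel("Users")

# Render to an in-memory PNG instead of writing a file to disk.
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
plt.close(ax.figure)
```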
- Scikit-Learn
Scikit-learn is the most popular machine learning library for Python. The library provides a set of tools, built on NumPy and Matplotlib, that allow for the preparation and training of machine learning models.
Available model types include classification, regression, clustering, and dimensionality reduction.
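As a minimal sketch of the "preparation and training" steps, the example below chains a scaling step and a regression model into one pipeline, fits it, and asks for a prediction. The data is a toy set invented for the example (y is exactly twice x), and the choice of StandardScaler and LinearRegression is just one possible combination.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for the example: y is exactly 2 * x.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

# Preparation (scaling) and training (regression) chained into one object.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# Predict for an unseen input.
prediction = model.predict([[5.0]])
print(prediction)
```

Most scikit-learn models share this fit/predict interface, so swapping in a classifier or a clustering model mostly changes the import and the constructor line.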
- TensorFlow
TensorFlow is a Python library originally developed by Google that provides an interface and framework for working with neural networks and deep learning.
TensorFlow is ideal for tasks where deep learning excels, such as computer vision, natural language processing, audio/video recognition, and more.
Software
So far, we’ve looked at the best languages for data science and the best packages for two of those languages. (As a query language, SQL is a bit different and doesn’t use “packages” in the same sense.)
Next, we’ll look at some software tools that are useful for data science work. These aren’t all open-source, but they’re free for anyone to use, and if you work with data on a regular basis they can be big time-savers.
- Google Sheets
If this were not a list of free tools, then undoubtedly Microsoft Excel would be at the top of this list. The ubiquitous spreadsheet software makes it quick and easy to work with data in a visual way, and is used by millions of people around the world.
Google’s Excel clone has most of the core functionality of Excel, and is available free to anyone with a Google account.
- RStudio Desktop
RStudio Desktop is the most popular environment for working with R. It includes a code editor, an R console, notebooks, tools for plotting, debugging, and more.
Additionally, RStudio (the company that makes RStudio Desktop) is at the core of modern R development, employing the developers of the tidyverse, Shiny, and other important R packages.
- Jupyter Notebook
Jupyter Notebook is the most popular environment for working with Python for data science. Similar to R Markdown, Jupyter notebooks allow you to combine code, text, and plots in a single document which makes data work easy.
Like R Markdown documents, Jupyter notebooks can be exported in a number of formats including HTML, PDF, and more.
Dataquest’s guided Python data science projects almost all task students with building projects in Jupyter Notebooks, since that’s what working data analysts and scientists generally do in real-world work.
- Anaconda
Anaconda is a distribution of Python designed specifically to help you get the scientific Python tools installed. Before Anaconda, the only option was to install Python by itself, and then install packages like NumPy, pandas, and Matplotlib one by one. That wasn’t always a straightforward process, and it was often difficult for new learners.
Anaconda includes all of the main packages needed for data science in one easy install, which saves time and allows you to get started quickly. It also has Jupyter Notebooks built-in, and makes starting a new data science project easily accessible from a launcher window. It is the recommended way to get started using Python for data science.
Anaconda also includes the conda package manager, which can be used as an alternative to pip to install Python packages (although you can also use pip if you prefer).