Inferring from seemingly random data and finding insights is one of the key activities of a data analyst. Data is a resource that is continually assembled and refined by individuals, companies, and organizations for myriad uses. The data isn't inherently in a form that's directly usable for meaningful analysis, though. Often you'll have to clean the data and put it into a format that is more conducive to analysis. To unlock its value, you also need to present the data so that others can grasp it and see what you see; this involves converting data into what we call knowledge, and doing so requires tools and programs. An important part of data analysis is drawing valid and fair inferences from data. These inferences are often used in making decisions, and those decisions can affect many people and can reflect the organization's and, possibly, the analyst's view of the world.
1.1. What is Data Analysis?
The term "data analysis" can have different meanings, depending on the particular context in which it is used. However, we will refer to data analysis as the discovery and interpretation of patterns and trends in a dataset: "To look at and interpret the differences, commonalities, and spurious processes in (1) a single set of data (micro), (2) between parts of a set of data (meso), or (3) between the sets of data (macro) for hypothesis preservation and generation."
The application of data analysis refers to problems where people have gathered a set of data, and the goal is to discover important characteristics of that data. Specifically, the goal of this type of data analysis is to identify patterns and to classify and summarize the data to help make it understandable. Data analysis does not result in a different set of data. Instead, it helps to structure the data and to identify interesting characteristics that help to define it. Our purpose is to present data analysis in terms of its techniques and of the nature of both quantitative and qualitative data. Data analysis can complement research by identifying connections and patterns that are of value; therefore, data analysis is useful even when the data have not been collected to address any specific question, and it can be applied to new data as it arrives. The focus is on extracting useful information from large volumes of data, regardless of its original purpose, to help people make wise decisions or to predict the future.
1.2. Importance of Data Analysis
Data analysis has already become an essential tool in many applications. Roughly, it consists of transforming data into interpretable information. For what we will call observational data, the process of data analysis usually comprises the following main steps:
* Summarizing the data, usually by means of visual representations, like histograms, boxplots, and scatter plots, and numerical measures, like averages, standard deviations, etc.
* Fitting a selection of parametric probability models.
* Using the context to eliminate models that are unsuitable and to obtain information and interpretations about the phenomenon from the models that remain.
The goals of data analysis are often:
* Identifying patterns and outliers in the data, either by using models and distribution theory or simply by visualization.
* Summarizing the data using suitable descriptive statistics, which depend on the data type.
* More generally, doing a preliminary exploration of the data.
There are other contexts where we want to use the data to make predictions or to support a hypothesis. In the first case, also known as prediction, one wants to learn from the data a rule to predict a new or future outcome based on the tendency followed by the data already known. Many data analysis tools are used for this task, like time series, regression, and classification tools. In the second case, one wants to use the data to support a particular hypothesis. One of the main steps of data analysis, modeling, is related to and used in both situations. With this in mind, the sketch after this paragraph illustrates the summarizing and model-fitting steps, along with the outcome of a simple fitting procedure, on a small example.
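As a brief illustration of the summarizing and model-fitting steps, here is a minimal Python sketch on a sample that we simulate ourselves; the variable names and the choice of a normal model are purely illustrative and not tied to any dataset used later.

# A minimal sketch of the main steps, on a simulated sample.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)   # hypothetical observations

# Step 1: summarize the data numerically and visually.
print("mean:", data.mean(), "std:", data.std(ddof=1))
plt.hist(data, bins=30)
plt.title("Histogram of the sample")
plt.show()

# Step 2: fit a parametric probability model (here, a normal distribution).
mu, sigma = stats.norm.fit(data)
print("fitted normal: mu =", mu, "sigma =", sigma)

# Step 3: judge whether the model is suitable, e.g. with a goodness-of-fit test.
ks_stat, p_value = stats.kstest(data, "norm", args=(mu, sigma))
print("Kolmogorov-Smirnov p-value:", p_value)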
2. Getting Started with Python for Data Analysis
Data analysis can result in insightful and beneficial information. Some of these benefits are implementing efficient operations, identifying and refining critical areas, or enhancing customers’ experiences. No data analysis can start without having the data, and we are lucky enough that in the 21st century, getting data is not a big issue; in fact, we are rich in data. In most organizations, the significant problem is not having data, but choosing the right data and making sense of it, mainly from the data analytics perspective. A thorough data analysis can show whether historical data reveals patterns or behaviors that are likely to persist in the future, or whether a particular action plan that has been taken has led to valuable outcomes.
In the market, many tools are available to perform data analysis, ranging from simple Microsoft Excel to heavy-duty statistical packages. Each of these tools follows different sets of rules and guidelines to perform data analysis. In this specialization, we are going to utilize the power of Python to perform data analysis. In the first two modules, we will walk you through how the Python programming language can perform data analysis efficiently. Later sections will introduce you to the most significant libraries available in Python for performing data analysis. There are more tools available in Python to perform data analysis, but we are going to limit ourselves to those libraries. Let us begin to introduce Python in the upcoming section.
2.1. Setting up Python Environment
The first step is to set up the Python environment on your computer.
There are several options available, but to make things simple, I recommend using Anaconda. The Anaconda distribution comes pre-loaded with almost all the packages that you would need to develop machine learning applications. It also includes an easy-to-use IDE to enable rapid prototyping.
When you are installing Anaconda, make sure that Jupyter Notebook is included; it is an easy-to-use tool for running Python code interactively and conducting further analysis. You will learn more about Jupyter Notebooks soon.
After you have installed Python and Jupyter Notebooks, click on the Jupyter icon to start Jupyter Notebooks, where you can complete the rest of the tasks.
Note that there are also other options that you can use to write and run Python code. For instance, you can use tools such as Google Colab, which lets you write and execute Python code in a web browser. It is well-suited for those who don’t want to set up all the various software packages on their computers.
Next, let’s install a few packages that we will need to use in our assignments. Open a command prompt and type the following commands:
conda install numpy
conda install pandas
conda install scikit-learn
The conda install command installs the latest version of a package and resolves any package dependencies. After following the instructions, you are ready to move to the next tutorial, where you will learn more about Jupyter Notebooks.
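Once the installation finishes, a quick way to confirm that the packages can be imported is to run a few lines of Python; the exact version numbers you see will depend on what conda installed.

# Sanity check: import the packages and print their versions.
import numpy
import pandas
import sklearn

print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)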
2.2. Basic Python Concepts for Data Analysis
Python is a high-level, interpreted, general-purpose dynamic programming language that has been successfully used in a wide range of domains such as software engineering, signal and image processing, biomedicine, finance, and data analysis. A high percentage of people who do scientific programming use Python or some language that interoperates with Python. This is due, in part, to Python's ease of use and extensibility and to the many public and open-source Python tools available in the form of packages or libraries. The syntax of the Python language is simple, concise, and close to the pseudocode used in software engineering. It supports object orientation and the use of modules for splitting code into more manageable and easy-to-maintain pieces. Functions and objects are first-class citizens in the Python language, which allows, for example, passing functions as arguments to other functions and returning functions as results. In an object-oriented environment, data and methods are joined together in the same structure; Python facilitates this style of programming, in which methods are the primary workhorses. Python contains a substantial set of built-in data types, such as list, tuple, and dictionary, that can be used to model and analyze data. Furthermore, Python is ideal for interfacing with a large number of existing libraries developed in languages such as C, C++, and Java. In this way, Python is often used in a scripting environment to drive applications written in a lower-level language, providing a more rapid and powerful environment for users. There are alternatives to Python in the domain of scientific computing and data analysis, but none of them matches Python in terms of light weight, ease of use, open-source licensing, flexibility, compactness, and the number and quality of scientific tools.
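To make these ideas concrete, here is a small sketch of the built-in container types and of functions as first-class objects; all names and values are invented for the example.

# Built-in container types commonly used when modeling data.
measurements = [2.5, 3.1, 2.8, 3.4]           # list: ordered and mutable
point = (40.7128, -74.0060)                   # tuple: ordered and immutable
counts = {"red": 3, "green": 5, "blue": 2}    # dictionary: key-value pairs

# Functions are first-class objects: they can be passed as arguments
# to other functions and returned as results.
def apply_to_all(func, values):
    return [func(v) for v in values]

def make_scaler(factor):
    def scale(x):
        return x * factor
    return scale                              # returning a function

doubled = apply_to_all(make_scaler(2), measurements)
print(doubled)                                # [5.0, 6.2, 5.6, 6.8]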
3. Data Wrangling and Cleaning
Data wrangling is the process of converting data from its initial format to a format that may be better for analysis. Thus, data wrangling may involve various operations, such as handling missing data, removing duplicates, and adding more data. Missing values can arise for various reasons, like data collection errors, no value being recorded, or confidentiality. Each column in a DataFrame has a data type, and before changing data types it is necessary to have a sound understanding of the data or to perform data validation. Reducing data size may be motivated by the need to decrease memory usage, increase computation speed, or read large files on smaller systems. Data cleaning is the process of applying data wrangling and further understanding to produce cleaner data for easier analysis. Here, the data is cleaned using a combination of handling missing data, data normalization, data standardization, and data binning; in some cases, a data smoothing process is applied. Data can be prone to errors, whether from incorrect data entry, corrupted data, or known measurement errors. In addition to, or instead of, having a professional remove or filter the affected data, the following techniques can be used for outliers: binning the data, replacing values with a mean or median, imputing data, and interpolation. Feature engineering is yet another procedure to transform the data into a more useful representation in order to reduce the risk of overfitting and improve accuracy. Feature engineering applies techniques such as labeling, one-hot encoding, binning, polynomial transformations, and interaction terms to the dataset and produces a transformed, standardized feature set. A model is trained with the transformed dataset, and then it is tested. Data understanding and preparation are used to comprehend, verify, and prepare the dataset for analysis and training.
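As a sketch of some of these operations, the snippet below fills a missing value, converts a column's data type, bins a numeric column, and one-hot encodes a categorical one with pandas; the DataFrame and its column names are made up for illustration.

import numpy as np
import pandas as pd

# A small, made-up dataset with a missing value and mixed types.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "income": ["40000", "52000", "61000", "45000"],   # stored as strings
    "city": ["Lisbon", "Porto", "Lisbon", "Faro"],
})

# Handle missing data: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Fix the data type: income should be numeric, not string.
df["income"] = df["income"].astype(int)

# Binning: group ages into broad categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 100],
                         labels=["young", "middle", "senior"])

# One-hot encoding of the categorical column.
df = pd.get_dummies(df, columns=["city"])

print(df)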
3.1. Importing Data
Data analysis has become a primary tool to extract useful information from vast amounts of data. One type of data is known as structured data - think of a table of passwords, with one column containing usernames and the other passwords. To actually be useful, the password table must be loaded into software. In this lesson, we will describe different ways to import structured data into Python, using comma-separated values (CSV) and other common text-based structured formats. After data is imported into Python, we can focus on describing it, finding correlations in the data, or drawing inferences. In this first lesson on data analysis, we start by importing data into Python using one of several Python libraries. The DataFrames we will use can be likened to a sheet of paper, with row labels held in the "index" and column labels in "columns". The extension for a CSV file is .csv, which stands for Comma Separated Values.
In the early days of computing, files were simple and structured in similar ways. Not all structured data has as clean a format as .csv, and the CSV format has a very limited capacity to carry information within files - for instance, it is not possible to save a .csv file with multiple sheets. Excel files are XML-based archives in which cells are organized into rows and columns and can be represented in tabular form. Other possibilities include JSON.
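As a sketch, the pandas calls below read each of these formats into a DataFrame; the file names and sheet name are placeholders for whatever files you actually have.

import pandas as pd

# CSV: the most common plain-text tabular format.
sales = pd.read_csv("sales.csv")

# Excel: may contain multiple sheets; pick one by name or position.
budget = pd.read_excel("budget.xlsx", sheet_name="2023")

# JSON: record-oriented or nested text data.
users = pd.read_json("users.json")

print(sales.head())                 # first five rows
print(sales.index, sales.columns)   # row labels and column labels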
3.2. Handling Missing Data
Data analysis in Python is nearly equivalent to data analysis in Pandas, and Series and DataFrames are structures we will return to frequently during these sessions. An important initial step is to build some understanding of ways to turn data that has just entered Pandas into something that looks more manageable.
Missing data is very common in most datasets. In our discussion of statistical notation, we saw that missing data is denoted in many different ways, for example, 'NA', 'NaN', or even a blank space ' '. Pandas is designed to handle these conventions for you, making life far easier!
The read_csv function in Pandas is very powerful and scans the file to infer the type of each column. This means, for example, that integer and float columns will be picked up without any issue. Most importantly for missing-data handling, entries that read_csv recognizes as NA markers (such as 'NA' or an empty field) are loaded as missing values. If you want to control which strings are treated as missing, read_csv accepts the na_values parameter, and setting keep_default_na=False restricts the NA markers to those you supply; leaving keep_default_na=True keeps Pandas' default set of NA values.
In order to assist you in understanding the missing values per column, you can check the frequency of NAs by running dataframe.isnull().sum(). To check how many missing values a specific column contains, you can apply the same call to that column, as in dataframe['column'].isnull().sum(). The isna() method is an alias for isnull(), and notna() gives the opposite: it marks the values that are present. If isnull() returns True for a row, that row should get special attention, depending on the application.
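Putting these ideas together, here is a short sketch on a made-up DataFrame with a hypothetical 'age' column, showing how to count missing values and two common ways of dealing with them.

import numpy as np
import pandas as pd

# Made-up DataFrame containing missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 47, None],
    "city": ["Lisbon", "Porto", None, "Faro"],
})

# Count missing values per column.
print(df.isnull().sum())

# Count missing values in one specific column.
print(df["age"].isnull().sum())

# Boolean mask marking the rows that need special attention.
print(df["age"].isna())

# Two common remedies: drop incomplete rows, or fill in a value.
dropped = df.dropna(subset=["age"])
filled = df.fillna({"age": df["age"].median()})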
4. Exploratory Data Analysis
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted to encourage statisticians to explore datasets, to generate hypotheses that lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, as well as examining the data for outliers and data entry errors.
In data mining, exploratory data analysis is largely carried out by visualizing the data in forms such as tables, charts, and plots. There are many situations where a user may not have a data mining goal in mind but still would like to explore a large dataset. Various methods are used for EDA, such as data visualization and data summarization of univariate, bivariate, and multivariate characteristics. Statistical and machine learning models, such as regression, can be used as part of the analysis, but they should be planned as part of an iterative and interactive process; data analysis should incorporate an ever-growing understanding of and investigation into the data.
4.1. Descriptive Statistics
So you've managed to get some data, maybe by querying some data source or just by directly entering it into Python. What's next? After you've "cleaned up" your dataset (decided whether to remove missing data points, whether to split your data into different datasets based on a variable, and so on), you'll want to summarize, count, and visualize your data. This usually involves generating such things as the mean, the median, the mode, the variance, the standard deviation, the 25th and 75th percentiles, and so on; in other words, the descriptive statistics. Take a ratings dataset, for example. We might want to see the average of the ratings and the standard deviation of the ratings. We can do that with the mean and std functions.
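For example, assuming a DataFrame with a 'rating' column (the column name and values here are invented), the descriptive statistics can be computed as follows.

import pandas as pd

# Hypothetical ratings data.
df = pd.DataFrame({"rating": [4.0, 3.5, 5.0, 2.5, 4.5, 3.0]})

print(df["rating"].mean())                   # average rating
print(df["rating"].std())                    # standard deviation
print(df["rating"].median())                 # median
print(df["rating"].quantile([0.25, 0.75]))   # 25th and 75th percentiles

# Or get most of the descriptive statistics at once.
print(df["rating"].describe())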
4.2. Data Visualization
Data is best understood when it is visualized. In this unit, you will learn how to use a data visualization library.
When you want to analyze your data, staring at raw tables and values is rarely the best first step. To analyze and check your data, you need data visualization. Data visualization is the presentation of data in graphical form. It helps you understand the complex structure, general shape, outliers, and big picture of the dataset; thus, it is critical to making the data more understandable. It also helps in identifying variables that are correlated and those that are most important to the model. In this unit, you will learn how to use a powerful Python tool for this purpose. You can plot the data in different formats, which will help you visualize the data and gain insights that are not possible using simple summary statistics alone.
There are several varieties of plots, like line plots, histograms, and scatter plots. Line plots can overlay multiple series in one plot so that patterns can be compared. Histograms show the frequency distribution of a variable. Scatter plots show the relationship between two variables. A line plot is generally used to present observations collected at regular or uniform intervals, with the x-variable representing the time periods. The line connects consecutive observations, which makes trends over time easy to see and can suggest how a series might continue.
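A minimal sketch of these three plot types on made-up data, using the matplotlib library, might look like this.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(1, 13)
sales = rng.integers(80, 120, size=12)         # made-up monthly values
heights = rng.normal(170, 10, size=200)        # made-up measurements
ads = rng.uniform(0, 100, size=50)
revenue = 2 * ads + rng.normal(0, 10, size=50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(months, sales)        # line plot: observations over time
axes[0].set_title("Line plot")

axes[1].hist(heights, bins=20)     # histogram: frequency distribution
axes[1].set_title("Histogram")

axes[2].scatter(ads, revenue)      # scatter plot: relationship between variables
axes[2].set_title("Scatter plot")

plt.tight_layout()
plt.show()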
5. Statistical Analysis with Python
In the last section, we introduced the concept of descriptive statistics as a set of measures for identifying patterns in a given set of features. We then used these measures to better understand the different features of our data. Descriptive statistics provided a summary of our data, such as the mean and the standard deviation of each feature.
In our discussion, we also approached the concept of data distribution and noted that many machine learning algorithms assume a Gaussian (or normal) distribution. We also saw a normality test that provides an indication of the distribution of our data. In this section, we'll go deeper into the concepts of probability and inferential statistics and their application to data analysis. We'll discuss concepts such as normality, skewness, kurtosis, correlation, and sampling, and perform statistical tests to evaluate hypotheses about our data. As in the last section, we'll use Python packages to perform these tests. Note that a statistics library such as SciPy offers a considerable number of statistical tests that allow us to perform significance tests on a given sample of data.
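As a sketch of the kinds of quantities and tests discussed here, the snippet below uses SciPy on a simulated sample; the data is artificial and only serves to show the calls.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=300)    # simulated data

# Shape of the distribution.
print("skewness:", stats.skew(sample))
print("kurtosis:", stats.kurtosis(sample))

# Normality test (Shapiro-Wilk): a small p-value suggests non-normality.
stat, p_value = stats.shapiro(sample)
print("Shapiro-Wilk p-value:", p_value)

# Correlation between two variables.
other = 0.5 * sample + rng.normal(scale=0.5, size=300)
r, p = stats.pearsonr(sample, other)
print("Pearson r:", r)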
5.1. Hypothesis Testing
Statistical hypothesis testing is a key technique of classical statistics. It is a method to test the reliability of an experimental result. We have already been using hypothesis tests throughout this course, albeit informally: we have been calculating confidence intervals and arguing for or against the null hypothesis based on whether the confidence interval includes the null hypothesized value. In formal hypothesis testing, we use p-values for the same purpose.
Hypothesis testing is really about distinguishing the signal in a data set from the noise. In the baby weight data set, where the outcomes of interest were continuous, our null hypothesis was that the difference in weight between the babies born to smoking mothers and the babies born to non-smoking mothers was 0; that is, with respect to that data set, there was no signal. If we were not willing to reject that hypothesis, even though a difference was observed, it was because there was too much noise to be sure whether the difference was real. The p-value is about quantifying that assessment.
More generically, the null hypothesis is the default position of no effect or no difference, whereas the alternative hypothesis is the claim we are looking for evidence to support. With the baby weight data set, this was simply a two-sample t-test with the following null and alternative hypotheses.
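In symbols, writing μ₁ for the mean birth weight of babies born to smoking mothers and μ₂ for babies born to non-smoking mothers, the hypotheses are H₀: μ₁ − μ₂ = 0 and H₁: μ₁ − μ₂ ≠ 0. The sketch below runs such a two-sample t-test with SciPy on simulated weights; the numbers are invented and do not come from the baby weight data set.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated birth weights in grams; values are invented for illustration.
smoking = rng.normal(loc=3200, scale=500, size=100)
non_smoking = rng.normal(loc=3400, scale=500, size=120)

# Two-sample t-test of the null hypothesis that the two means are equal.
t_stat, p_value = stats.ttest_ind(smoking, non_smoking)
print("t statistic:", t_stat)
print("p-value:", p_value)

# A small p-value (say, below 0.05) argues for rejecting the null hypothesis.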
5.2. Correlation and Regression Analysis
The Pearson correlation coefficient measures the strength and direction of a relationship between two variables. The formula for the population correlation coefficient is:
ρ(x, y) = cov(x, y) / (σ_x σ_y)
where cov(x, y) = covariance between x and y, σ_x = standard deviation of x, and σ_y = standard deviation of y.
The formula for the sample correlation coefficient is:
r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)² Σ(y - ȳ)²)
Beyond measuring the strength of the relationship, we can estimate the linear relationship between the variables using simple linear regression, which minimizes the sum of squared differences between the predictions and the actual values. It is based on partitioning the variation into explained and unexplained sums of squares and is called least squares regression.
The line is described by the formula: ŷ = b₀ + b₁x, where ŷ = the predicted value of the dependent variable, b₀ = the intercept of the line, b₁ = the regression coefficient (slope), and x = the independent variable.
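The sketch below computes the sample correlation coefficient r and the least-squares estimates b₀ and b₁ on a small made-up sample, using scipy.stats.linregress; equivalent results can be obtained with NumPy or scikit-learn.

import numpy as np
from scipy import stats

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Sample correlation coefficient r.
r = np.corrcoef(x, y)[0, 1]
print("r:", r)

# Least-squares regression line  y_hat = b0 + b1 * x.
result = stats.linregress(x, y)
print("intercept b0:", result.intercept)
print("slope b1:", result.slope)

# Predict y for a new x value.
x_new = 7.0
print("prediction:", result.intercept + result.slope * x_new)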