**After completing this reading, you should be able to:**

- Discuss the issues unique to big datasets.
- Describe and assess various tools and techniques used to manipulate and analyze big data.
- Examine areas for collaboration between econometrics and machine learning.

The growing use of computer-mediated transactions has enabled automated collection of the data these transactions generate. These data need to be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well. However, some issues are unique to big datasets and may require different tools, as discussed below:

The large size of the data involved requires more powerful data manipulation tools. Besides the need for massive storage, there is a need for additional computing resources to process the huge dataset.

There may be more potential predictors than appropriate for estimation. Therefore, variable selection is required to select the most appropriate subset of predictors. Unnecessary predictors may lead to more noise in the estimation of other quantities of interest.

Large datasets may allow for more flexible and efficient modeling algorithms than simple linear models. Machine learning techniques may be more appropriate tools for modeling complex relationships. These techniques include decision trees, deep learning, and support vector machines, among others.

Historically, economists have dealt with datasets small enough to fit in a spreadsheet. However, this is changing as new and more detailed data becomes available. One reason is the growth of the internet, where a large share of activity is recorded. Additionally, the increased use of computer-mediated transactions has led to the emergence of more detailed data. The following are the tools for manipulating big data.

Relational databases offer a flexible way to store, manipulate, and retrieve data using a structured query language (SQL). SQL is easy to learn and very useful when dealing with medium-sized datasets. However, standard relational databases may become unmanageable when dealing with several million observations or several gigabytes of data.
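As a minimal sketch of the kind of storage and retrieval SQL makes easy on a small dataset, the following uses Python's built-in `sqlite3` module with an in-memory database. The table name `transactions`, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def total_spend_by_customer(rows):
    """Load (customer, amount) rows into a table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")  # temporary in-memory database
    try:
        conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM transactions "
            "GROUP BY customer ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data: (customer, amount) pairs.
sample_rows = [("alice", 20.0), ("bob", 5.5), ("alice", 4.0)]
summary = total_spend_by_customer(sample_rows)
print(summary)
```

The query returns one summarized row per customer, which is the typical shape of result an analyst would then pass to a spreadsheet or statistical package.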

Nonrelational databases, also called NoSQL databases, are more appropriate for managing several gigabytes of data or more. They provide a means of storing and extracting data that is not modeled in the tabular relations used in relational databases. Many firms find it necessary to develop systems that can process billions of transactions per day. Since the transactions are computer-mediated, the data is readily available. Some of the tools for manipulating big data based on nonrelational databases include the following:

The Google File System is a system that supports very large files. The files are so large that they must be distributed across hundreds or thousands of computers.

Bigtable is a data table built on the Google File System that supports sparse, semistructured data and can stretch over many computers.

MapReduce is a system used to access and manipulate large data structures such as Bigtables. It allows a user to use multiple machines to access the required data or database. The user's query is mapped to the machines and then applied in parallel to different fragments of the data. The partial calculations are then combined (reduced) to create the required summary table.
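The map and combine steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual MapReduce system: threads stand in for separate machines, and the function names (`map_fragment`, `reduce_partials`, `map_reduce`) and the sample data are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fragment(fragment):
    """Map step: compute partial totals per key for one fragment of the data."""
    partial = defaultdict(float)
    for key, value in fragment:
        partial[key] += value
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: combine the partial totals into one summary table."""
    summary = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            summary[key] += value
    return dict(summary)


def map_reduce(fragments, workers=2):
    """Apply the map step to each fragment in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_fragment, fragments))
    return reduce_partials(partials)


# Hypothetical data: each inner list is a fragment held by a different worker.
fragments = [
    [("US", 10.0), ("DE", 3.0)],
    [("US", 2.5), ("FR", 1.0)],
    [("DE", 4.0)],
]
totals = map_reduce(fragments)
print(totals)
```

Because each fragment is summarized independently, adding more fragments (or workers) does not change the logic of either step.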

Sawzall is a procedural domain-specific programming language used to create MapReduce jobs.

Go is a flexible, open-source, general-purpose programming language that makes parallel data processing easier.

Dremel is a tool that lets data queries be written in a simplified form of the structured query language (SQL). It makes it possible to run an SQL query on a petabyte of data (1,000 terabytes) in a few seconds.

After big data is processed with these manipulation tools, the outcome is usually a summarized table of data that is directly human-readable or can be loaded into an SQL database, a spreadsheet, or a statistical package. If the outcome is still inconveniently large, it is more appropriate to select a subsample for statistical analysis.
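One way to draw such a subsample when the data is too large to hold in memory is reservoir sampling, which the text does not name but which fits the situation described. The sketch below is a minimal version under that assumption; the function name `reservoir_sample` and the seed are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical "large" dataset: row identifiers 0..99999, read one at a time.
subsample = reservoir_sample(range(100_000), k=5)
print(subsample)
```

The stream is read once and only `k` rows are ever kept in memory, so the same function works whether the input has thousands or billions of rows.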

Data analysis in statistics and econometrics falls into four categories:

- Prediction
- Summarization
- Estimation
- Hypothesis-testing

The tools for big data analysis are aimed at achieving one or more of the above-named categories. The following are some of the tools for analyzing big data.

Regression analysis is the most common tool used for summarization and is widely preferred in economic applications. Elastic net regression is used for estimation. To formulate a statistical prediction, the main concern is understanding the conditional distribution of a variable y given other variables \(x=(x_1,\ldots,x_p)\). The analyst will have observed values of y and x and is interested in computing a good prediction of y given new values of x.
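As a minimal sketch of predicting y from a new value of x, the following fits an ordinary least squares line with a single predictor using only the standard library. The data is hypothetical and constructed to lie exactly on the line y = 1 + 2x so the result is easy to check by hand.

```python
def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predict y for a new value of x using the fitted line."""
    return intercept + slope * x_new


# Hypothetical observations generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(xs, ys)
prediction = predict(intercept, slope, 5.0)
print(intercept, slope, prediction)
```

With several predictors the same idea applies, but the coefficients are obtained by solving a system of equations rather than from two sums.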

Machine learning is mainly concerned with prediction. It is used to develop high-performance computer systems that provide predictions in the presence of computational constraints. Consider the conditional distribution of a variable y given other variables \(x=(x_1,\ldots,x_p)\). In machine learning, the x-variables are called predictors or features. A function is constructed to provide a “good” prediction of y as a function of x. Usually, “good” means that the function minimizes some loss function, such as the sum of squared residuals. While most analysts would think of using a linear or logistic regression when solving a prediction problem, there may be better choices, especially when a lot of data is available. These choices include nonlinear methods such as classification and regression trees (CART) and random forests, as well as penalized regression such as LASSO.
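To make the idea of minimizing a loss function concrete, the sketch below fits a one-split regression tree (a "stump"), the simplest building block of a CART-style model. It tries every candidate threshold on a single predictor and keeps the one with the smallest sum of squared residuals. The function names and data are hypothetical, and a full CART implementation would split recursively and handle many predictors.

```python
def sum_squared_residuals(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared-error loss.

    Returns None if all x values are identical, so no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between two identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        loss = sum_squared_residuals(left) + sum_squared_residuals(right)
        if best is None or loss < best[0]:
            best = (loss, threshold, sum(left) / len(left), sum(right) / len(right))
    return None if best is None else best[1:]


def predict_stump(stump, x_new):
    """Predict the mean of whichever side of the threshold x_new falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x_new <= threshold else right_mean


# Hypothetical data with a clear jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
stump = fit_stump(xs, ys)
print(stump, predict_stump(stump, 4), predict_stump(stump, 20))
```

A linear regression would fit a sloped line through this data, while the stump captures the step directly, which is the sense in which tree-based methods can model nonlinear relationships.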

Data science is mainly concerned with prediction and summarization, but also with data manipulation, visualization, and other similar tasks.

The major objective of econometrics is to find causal relationships in data, while machine learning aims at achieving predictive accuracy. Econometrics and machine learning thus differ in focus, purpose, and techniques. Nonetheless, both perform well in their separate domains. With the growth of big data and the demand for solving complex problems, there is an emerging trend of integrating econometrics and machine learning. The following are some of the areas where opportunities exist for productive collaboration between the two.

Causal inference is the most important area for collaboration between econometrics and machine learning. Econometricians have developed various tools for causal inference. These tools include instrumental variables, difference-in-differences, regression discontinuity, and various forms of natural and designed experiments. Although most machine learning work deals with pure prediction, theoretical computer scientists have made various contributions to causal modeling. However, these theoretical advances have not yet been incorporated into machine learning practice to a significant degree.
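Of the tools listed, difference-in-differences is the simplest to illustrate with arithmetic. The sketch below computes the basic two-group, two-period estimate: the change in the treated group minus the change in the control group. The outcome values are hypothetical, and the estimate has a causal interpretation only under the usual assumption that both groups would have followed parallel trends without the treatment.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes before and after a policy change.
treated_pre, treated_post = [10.0, 12.0], [15.0, 17.0]
control_pre, control_post = [9.0, 11.0], [11.0, 13.0]
effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

Here the treated group rises by 5 and the control group by 2, so the estimated effect is 3. A pure prediction model would have no reason to separate these two changes.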

Machine learning techniques mostly use cross-sectional data, which is assumed to be independent and identically distributed. However, Bayesian structural time series (BSTS) models indicate that some of these techniques can be applied to time series. Additionally, machine learning techniques may also be used to analyze panel data.

## Practice Question

Which of the following is a tool for analyzing big data?

A. Regression analysis

B. Data processing

C. Summarization

D. All of the above

The correct answer is A.

Regression analysis is the most commonly used tool for summarizing big data and is mostly preferred in economic applications. It can also be used for estimation as well as hypothesis testing.

B is incorrect: Data processing involves the manipulation of raw data into a more useful form for analysis, a process that occurs before analysis.

C is incorrect: Summarization is one of the four categories of data analysis; data analysis tools aim at achieving these categories.