Big Data: New Tricks for Econometrics

28 Apr 2020

After completing this reading, you should be able to:

Discuss the issues unique to big datasets.
Describe and assess various tools and techniques used to manipulate and analyze big data.
Examine areas for collaboration between econometrics and machine learning.

Issues Unique to Big Datasets

The growing use of computer-mediated transactions has enabled automated data collection from these transactions. These data need to be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well. However, big datasets raise unique issues that require different tools, as discussed below:

Requires More Powerful Data Manipulation Tools

The large size of the data involved requires more powerful data manipulation tools. Besides the need for massive storage, there is a need for additional computing resources to process the huge dataset.

Requires Variable Selection

There may be more potential predictors than appropriate for estimation. Therefore, variable selection is required to select the most appropriate subset of predictors. Unnecessary predictors may lead to more noise in the estimation of other quantities of interest.

Allows for More Flexible Modeling Algorithms

Large datasets may allow for more flexible modeling algorithms than simple linear models. Machine learning techniques may be more appropriate tools for modeling complex relationships. These techniques include decision trees, deep learning, and support vector machines, among others.

Tools and Techniques used to Manipulate and Analyze Big Data

Tools to Manipulate Big Data

Historically, economists have dealt with small datasets that could fit in a spreadsheet. However, this is changing as new and more detailed data become available. One reason is the growth of the internet, since much online activity is recorded. Additionally, the increased use of computer-mediated transactions has led to the emergence of more detailed data. The following are the tools for manipulating big data.

Relational Databases

Relational databases offer a flexible way of storage, manipulation, and retrieval of data using a structured query language (SQL). SQL is easy to learn and very useful when dealing with medium-sized datasets. However, standard relational databases may not be well suited when dealing with several millions of observations or several gigabytes of data as they become unmanageable.
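As a minimal sketch of how SQL is used to store, manipulate, and retrieve tabular data, the example below uses Python's built-in sqlite3 module with an in-memory database. The transactions table, its columns, and its rows are hypothetical and were invented purely for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("alice", 20.0),
        ("bob", 35.0),
        ("alice", 5.0),
        ("bob", 15.0),
        ("carol", 50.0),
    ],
)

# A declarative SQL query: total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM transactions "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The query states what result is wanted (a total per customer) and leaves the retrieval strategy to the database engine, which is part of what makes SQL approachable for medium-sized datasets.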

Nonrelational Databases

These databases are more appropriate for managing several gigabytes of data. Nonrelational databases, also called NoSQL databases, give a means of storing and extracting data that is not modeled in the tabular relations used in relational databases. Many firms find it necessary to develop systems that can process billions of transactions per day. Since the transactions are computer-mediated, the data are readily available. Some of the tools for manipulating big data based on nonrelational databases include the following:

Google File System or Hadoop File System

This is a system that supports files so large that they must be distributed across hundreds or thousands of computers.

Bigtable/Cassandra

This is a data table built on the Google File System that supports sparse, semistructured data and can stretch over many computers.

MapReduce/Hadoop

This is a system used to access and manipulate large data structures such as Bigtables. MapReduce allows a user to use multiple machines to access the required data or database. The user query is mapped to the machines and then applied in parallel to different fragments of the data. The partial calculations are then combined to create the required summary table.
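The map-then-combine pattern described above can be sketched in a few lines of Python. This is only a toy illustration, not Hadoop or Google's MapReduce: worker threads stand in for separate machines, and the data fragments and the function names map_fragment and combine are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical fragments of a large dataset; in a real system each
# fragment would live on a different machine.
fragments = [
    ["buy", "sell", "buy"],
    ["sell", "sell", "hold"],
    ["buy", "hold"],
]


def map_fragment(fragment):
    """Map step: compute a partial count for one fragment."""
    return Counter(fragment)


def combine(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


# Apply the same query to every fragment in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_fragment, fragments))

# Combine the partial calculations into one summary table.
summary = reduce(combine, partials, Counter())
print(dict(summary))
```

Because each fragment is processed independently, the map step scales by adding workers, and only the small partial results need to be brought together at the end.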

Sawzall/Pig

This is a procedural domain-specific programming language used to create MapReduce jobs.

Go

Go is a flexible, open-source, general-purpose programming language that makes parallel data processing easier.

Dremel, BigQuery/Hive, Drill, Impala

These are tools that let data queries be written in a simplified form of the structured query language (SQL). Dremel makes it possible to run an SQL query on a petabyte of data (1,000 terabytes) in a few seconds.

Tools for Analyzing Big Data

After big data has been processed with the manipulation tools, the outcome is usually a summarized table of data that is directly human-readable or can be loaded into an SQL database, a spreadsheet, or a statistical package. If the outcome is still inconveniently large, it is more appropriate to select a subsample for statistical analysis.
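Selecting a subsample can be as simple as drawing rows at random without replacement. The sketch below assumes the data fit in a Python list; the function name subsample, the sample size, and the fixed seed are illustrative choices rather than anything prescribed by the reading.

```python
import random


def subsample(rows, k, seed=0):
    """Draw k rows uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(rows, k)


# Hypothetical "large" table: one million row identifiers.
rows = list(range(1_000_000))
sample = subsample(rows, 1_000)
print(len(sample))
```

A uniform random draw keeps the subsample representative of the full table on average, so statistics estimated from it approximate those of the whole dataset.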

There are four categories of data analysis in statistics and econometrics; they include the following:

Prediction
Summarization
Estimation
Hypothesis-testing

The tools for big data analysis are aimed at achieving one or more of the above-named categories. The following are some of the tools for analyzing big data.

Regression Analysis

Regression analysis is the most common tool used for summarization and is widely preferred in economic applications. Elastic net regression is used for estimation. To formulate a statistical prediction, the main concern is understanding the conditional distribution of a variable \(y\) given other variables \(x=(x_1,…,x_p)\). The analyst will have observed values of \(y\) and \(x\) and is interested in computing a good prediction of \(y\) given new values of \(x\).
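To make the prediction task concrete, here is a minimal sketch of ordinary least squares with a single predictor, written in plain Python. The observations are made up for illustration, and the helper name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up observed values of x and y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)

# Prediction of y for a new value of x.
x_new = 6
y_pred = intercept + slope * x_new
print(round(intercept, 2), round(slope, 2), round(y_pred, 2))
```

The fitted line summarizes the observed relationship between x and y, and the same coefficients are then reused to predict y at a value of x that was not in the sample.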

Machine Learning

Machine learning is primarily concerned with prediction. It is used to develop high-performance computer systems that provide predictions in the presence of computational constraints. Consider the conditional distribution of a variable \(y\) given other variables \(x=(x_1,…,x_p)\). The x-variables are called predictors or features in machine learning. A function is generated to provide a “good” prediction of \(y\) as a function of \(x\). Usually, “good” means that it minimizes some loss function, such as the sum of squared residuals. While most analysts would think of using a linear or logistic regression when solving a prediction problem, there may be better choices, especially when a lot of data is available. These choices include nonlinear methods such as classification and regression trees (CART) and random forests, as well as penalized regression such as LASSO.
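The idea of choosing a prediction function that minimizes the sum of squared residuals can be sketched with the simplest possible regression tree: a single split, sometimes called a stump. This is a toy illustration under stated assumptions, not a full CART implementation; the data and the names ssr and fit_stump are hypothetical.

```python
def ssr(y, preds):
    """Loss function: sum of squared residuals."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))


def mean(values):
    return sum(values) / len(values)


def fit_stump(x, y):
    """Find the single split on x that minimizes the SSR.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate split between neighboring x values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = mean(left), mean(right)
        loss = ssr(y, [lm if xi <= t else rm for xi in x])
        if best is None or loss < best[0]:
            best = (loss, t, lm, rm)
    return best[1:]


# Made-up data with a jump that a straight line would fit poorly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.8, 1.0, 5.0, 5.2, 4.8, 5.0]

threshold, left_mean, right_mean = fit_stump(x, y)
stump_loss = ssr(y, [left_mean if xi <= threshold else right_mean for xi in x])
baseline_loss = ssr(y, [mean(y)] * len(y))  # predict the overall mean
print(threshold, stump_loss < baseline_loss)
```

A full regression tree repeats this search recursively inside each side of the split, and a random forest averages many such trees fitted to resampled data.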

Data Science

Data science is primarily concerned with prediction and summarization, but also with data manipulation, visualization, and other similar tasks.

Collaborations Between Econometrics and Machine Learning

The major objective of econometrics is to identify causal relationships in data, while machine learning aims at predictive accuracy. Econometrics and machine learning thus differ in focus, purpose, and techniques. Nonetheless, both perform well in their separate domains. Owing to the growth of big data and the demand for solving complex problems, there is an emerging trend of integrating econometrics and machine learning. The following are some of the areas that offer opportunities for productive collaboration between the two fields.

Causal Inference

This is the most important area for collaboration between econometrics and machine learning. Econometricians have developed various tools for causal inference. These tools include instrumental variables, difference-in-differences, regression discontinuity, and various forms of natural and designed experiments. Although most machine learning work deals with pure prediction, theoretical computer scientists have made various contributions to causal modeling. However, these theoretical advances have not yet been incorporated into machine learning practice to a significant degree.

Time Series Models

Machine learning techniques mostly use cross-sectional data, which are assumed to be independent and identically distributed. However, Bayesian structural time series (BSTS) models indicate that some of these techniques can also be applied to time series. Additionally, machine learning techniques may be used to analyze panel data.

Practice Question

Which of the following is a tool for analyzing big data?

A. Regression analysis

B. Data processing

C. Summarization

D. All of the above

The correct answer is A.

Regression analysis is the most commonly used tool for summarizing big data and is mostly preferred in economic applications. It can also be used for estimation as well as hypothesis testing.

B is incorrect: Data processing involves the manipulation of raw data to a more useful form for analysis, a process that occurs before analysis.

C is incorrect: Summarization is one of the four categories of data analysis; data analysis tools aim at achieving these categories.
