Big data is a term that describes large, complex datasets, which are analyzed computationally to uncover patterns and trends, particularly those related to human behavior. Big data includes traditional sources, such as company reports and government data, and non-traditional sources, such as social media, sensors, electronic devices, and data generated as a byproduct of a company’s operations. Big data is commonly characterized by four Vs: volume, velocity, variety, and veracity.
Volume: The amount of data collected in various forms, including files, records, tables, etc. Quantities of data reach almost incomprehensible proportions.
Velocity: The speed at which data are created and processed, which can be extremely high. In many cases, we deal with real-time data.
Variety: The number of types/formats of data. The data could be structured (e.g., SQL tables or CSV files), semi-structured (e.g., HTML code), or unstructured (e.g., video messages).
Veracity: The trustworthiness and reliability of data sources. Veracity is crucial when using big data to make predictions or draw conclusions. With big data, quantity does not guarantee quality: the sheer volume makes it harder to verify that the data are accurate and reliable.
Big Data can be structured, unstructured, or semi-structured:
Structured data refers to information with a high degree of organization. Items can be organized in tables and stored in a database where each field represents the same type of information.
Unstructured data refers to information with a low degree of organization. Items such as text messages, tweets, emails, voice recordings, pictures, blog posts, and raw feeds from scanners and sensors are unorganized and cannot readily be presented in tabular form.
Semi-structured data has qualities of both structured and unstructured data: it carries organizational tags or markers, but the content itself is free-form (e.g., an HTML page).
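To make the distinction concrete, here is a minimal Python sketch; the tickers, prices, and filing fields are invented for illustration. Structured data maps directly to a table, while semi-structured data carries tags that must be navigated:

```python
import csv
import io
import json

# Structured data: every row has the same fields, so it maps cleanly to a table.
csv_text = "ticker,price\nAAPL,189.5\nMSFT,410.2"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["price"])  # fields are addressable by name

# Semi-structured data: tagged and nested, but fields can vary by record.
json_text = '{"ticker": "AAPL", "filings": [{"year": 2023, "type": "10-K"}]}'
record = json.loads(json_text)
print(record["filings"][0]["type"])  # structure must be navigated, not assumed
```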
Professional investors, particularly quantitative ones, use alternative data sources extensively in their financial analysis and decision-making, relying on them to support data-driven investment models and decisions.
The following are the top three alternative data sources:
- Data generated by individuals, such as social media posts, online reviews, and web searches.
- Data generated by business processes, such as credit card transactions and other corporate exhaust data.
- Data generated by sensors, such as satellite imagery, GPS tracking, and Internet of Things devices.
Investment professionals must consider legal and ethical issues when they use non-public information. Web scraping, for example, can capture personal data that is legally protected or that individuals never knowingly consented to disclose.
Artificial intelligence (AI) and machine learning methods have been developed to handle large, intricate alternative datasets and to help analysts understand and evaluate this vast and complex data.
In broad terms, artificial intelligence refers to machines that can perform tasks in “intelligent” ways. It has much to do with developing computer systems that exhibit cognitive and decision-making abilities comparable to or superior to humans. It is the broader concept of machines being able to carry out tasks in a way that we would consider “smart.”
Early AI took the shape of expert systems, using “if-then” computer programming to mimic human knowledge and analysis. Neural networks, another early form, mimicked human brain functions in learning and processing information.
Machine learning is a current application of AI built around the idea that machines should simply be given access to data and allowed to learn for themselves, without any assumptions about the underlying probability distribution.
The idea is that, when exposed to more data, machines can adjust independently and arrive at solutions without relying on human expertise: they find the pattern and apply it.
In the context of investment, machine learning requires big data for training. The growth of big data has enabled AI algorithms to improve modeling and predictive accuracy.
In machine learning (ML), a computer algorithm receives inputs (datasets or variables) and outputs (the target data). The algorithm then learns how to map inputs to outputs or how to describe the data’s underlying structure. It learns by identifying relationships in the data and using them to improve its own performance.
In ML, the dataset is divided into three distinct subsets: a training dataset, a validation dataset, and a test dataset. The training dataset allows the algorithm to identify the link between inputs and outputs based on historical patterns in the data. These relationships are then validated, and the model is tuned, using the validation dataset.
As the name suggests, the test dataset tests how well the model predicts on new data. Note that machine learning still needs human intervention to understand the underlying data and choose suitable analysis techniques. In other words, before the data are used, they must be cleaned and free of bias and spurious values.
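As a minimal sketch of this three-way split (the 60/20/20 proportions and the synthetic data are assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic dataset: 1,000 observations, 5 features, one numeric target.
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# Shuffle the row indices, then carve out 60% train, 20% validation, 20% test.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = np.split(idx, [600, 800])

X_train, y_train = X[train_idx], y[train_idx]  # fit the model here
X_val, y_val = X[val_idx], y[val_idx]          # tune and validate here
X_test, y_test = X[test_idx], y[test_idx]      # assess performance on unseen data once
```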
A model overfits the data when it discovers “false” associations or unsubstantiated patterns that lead to prediction errors. In other words, overfitting happens when the ML model is overtrained on the data and treats noise as if it were signal.
Underfitting occurs when the model treats true signal as noise and fails to identify the relationships within the training data. In other words, the model is too simple to recognize the patterns in the data.
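A small synthetic experiment can illustrate both failure modes; the quadratic signal and the polynomial degrees below are assumptions chosen for illustration. A degree-1 fit underfits (high error everywhere), while a very high-degree fit typically drives training error down at the cost of test error:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: a quadratic signal plus noise, split into train and test halves.
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=200)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

for degree in (1, 2, 15):  # underfit, well-specified, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
```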
Machine learning models don’t follow explicit rules the way traditional software does; they learn from large amounts of data during training. This can make ML models “black boxes” that sometimes produce results that are hard to interpret or explain.
Under supervised learning, computers learn to model data based on labeled training data containing inputs and the desired outputs. After “learning” how best to model the relationships for the labeled data, the algorithms are employed to predict the results for the new datasets.
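As a minimal supervised-learning sketch (the labeled data and the choice of a least-squares linear model are illustrative assumptions), a model is fitted to labeled inputs and outputs and then used to predict for new data:

```python
import numpy as np

# Labeled training data: inputs X with known (desired) outputs y.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x

# Fit a simple linear model y = a*x + b by least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
(a, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Use the learned relationship to predict the output for a new input.
x_new = 5.0
print(f"predicted y for x = {x_new}: {a * x_new + b:.2f}")
```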
In unsupervised learning, computers get input data without labels and have to describe it, often by grouping data points. They learn from unlabeled data and react based on commonalities. For example, grouping companies based on their financial, not geographical or industrial, characteristics is unsupervised learning.
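Following the text’s example, a minimal k-means sketch (the company names and financial figures are made up) groups firms using only their financial characteristics, with no labels supplied:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical financial characteristics: [profit margin, debt-to-equity ratio].
companies = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [0.25, 0.3],  # high margin, low leverage
    [0.22, 0.4],
    [0.05, 1.8],  # low margin, high leverage
    [0.04, 2.1],
    [0.24, 0.5],
    [0.06, 1.9],
])

# No output labels are given; the algorithm groups companies by similarity alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(companies, labels):
    print(f"company {name} -> cluster {label}")
```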
Deep learning involves computers using neural networks to process data in multiple stages, identifying complex patterns. It employs both supervised and unsupervised machine learning methods.
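As a toy sketch of a neural network learning a pattern in stages (the XOR data, layer sizes, and other settings are illustrative assumptions), scikit-learn’s multilayer perceptron can be used:

```python
from sklearn.neural_network import MLPClassifier

# XOR is not linearly separable, so hidden layers are needed to capture it;
# the two hidden layers illustrate processing data in multiple stages.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(8, 8), activation="tanh",
                    max_iter=5000, random_state=1)
net.fit(X, y)
print(net.predict(X))  # ideally recovers [0, 1, 1, 0]
```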
Question
Which of the following best describes machine learning?

A. Autonomous acquisition of knowledge through the use of computer programs.
B. The ability of machines to execute coded instructions.
C. Selective acquisition of knowledge through the use of computer programs.
Solution
The correct answer is A.
Machine learning means that computers acquire knowledge autonomously through programs, enabling them to solve problems without explicit human instruction. Executing coded instructions (choice B) describes conventional software rather than learning, and “selective” acquisition (choice C) does not capture the autonomous, data-driven nature of ML.