Big data is a term that describes large, complex datasets, which are analyzed computationally to uncover patterns and trends, particularly those related to human behavior. Big data includes traditional sources, such as company reports and government data, and non-traditional sources, such as social media, sensors, electronic devices, and data generated as a byproduct of a company’s operations. Big data is commonly characterized by four Vs: volume, velocity, variety, and veracity.
Volume: The amount of data collected in various forms, including files, records, tables, etc. Quantities of data reach almost incomprehensible proportions.
Velocity: The speed at which data are created and processed, which can be extremely high. In many cases, we deal with real-time data.
Variety: The number of types/formats of data. The data could be structured (e.g., SQL tables or CSV files), semi-structured (e.g., HTML code), or unstructured (e.g., video messages).
Veracity: The trustworthiness and reliability of data sources. Veracity is crucial when using big data to make predictions or draw conclusions. With big data, quantity does not guarantee quality: the sheer volume makes it harder to verify that the data are accurate and reliable.
Big Data can be structured, unstructured, or semi-structured:
Structured data refers to information with a high degree of organization. Items can be organized in tables and stored in a database where each field represents the same type of information.
Unstructured data refers to information with a low degree of organization. Items such as text messages, tweets, emails, voice recordings, pictures, blog posts, and raw feeds from scanners and sensors are unorganized and cannot readily be presented in tabular form.
Semi-structured data has qualities of both structured and unstructured data: it carries organizational tags or markers, but the content itself is free-form (e.g., an HTML page).
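To make the distinction concrete, here is a minimal Python sketch; the tickers, prices, and filing fields are invented for illustration. Structured data maps directly to a table, while semi-structured data carries tags that must be navigated:

```python
import csv
import io
import json

# Structured data: every row has the same fields, so it maps cleanly to a table.
csv_text = "ticker,price\nAAPL,189.5\nMSFT,410.2"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["price"])  # fields are addressable by name

# Semi-structured data: tagged and nested, but fields can vary by record.
json_text = '{"ticker": "AAPL", "filings": [{"year": 2023, "type": "10-K"}]}'
record = json.loads(json_text)
print(record["filings"][0]["type"])  # structure must be navigated, not assumed
```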
Professional investors, particularly quantitative ones, use alternative data sources extensively in their financial analysis and decision-making, relying on them to support data-driven investment models and decisions.
The following are the top three alternative data sources:
- Data generated by individuals, such as social media posts, online reviews, and web searches.
- Data generated by business processes, such as credit card transactions and other corporate exhaust data.
- Data generated by sensors, such as satellite imagery, GPS tracking, and Internet of Things devices.
Investment professionals must consider legal and ethical issues when they use non-public information. Web scraping, for example, can capture personal data that is legally protected or that individuals never knowingly consented to disclose.
Artificial intelligence (AI) and machine learning methods have been developed to handle large, intricate alternative datasets and to help analysts understand and evaluate this vast and complex data.
In broad terms, artificial intelligence refers to machines that can perform tasks in “intelligent” ways. It has much to do with developing computer systems that exhibit cognitive and decision-making abilities comparable to or superior to humans. It is the broader concept of machines being able to carry out tasks in a way that we would consider “smart.”
Early AI took the shape of expert systems, using “if-then” computer programming to mimic human knowledge and analysis. Neural networks, another early form, mimicked human brain functions in learning and processing information.
Machine learning is a current application of AI built around the idea that machines should simply be given access to data and allowed to learn for themselves, without any assumptions about the underlying probability distribution.
The idea is that, when exposed to more data, machines can adjust independently and arrive at solutions without relying on human expertise: they find the pattern and apply it.
In the context of investment, machine learning requires big data for training. The growth of big data has enabled AI algorithms to improve modeling and predictive accuracy.
In machine learning (ML), a computer algorithm receives inputs (datasets or variables) and outputs (the target data). The algorithm then learns how to map inputs to outputs or how to describe the data’s underlying structure. It learns by identifying relationships in the data and using them to improve its own performance.
In ML, the dataset is divided into three distinct subsets: a training dataset, a validation dataset, and a test dataset. The training dataset allows the algorithm to identify the link between inputs and outputs based on historical patterns in the data. These relationships are then validated, and the model is tuned, using the validation dataset.
As the name suggests, the test dataset tests how well the model predicts on new data. Note that machine learning still needs human intervention to understand the underlying data and choose suitable analysis techniques. In other words, before the data are used, they must be cleaned and free of bias and spurious values.
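As a minimal sketch of this three-way split (the 60/20/20 proportions and the synthetic data are assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic dataset: 1,000 observations, 5 features, one numeric target.
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# Shuffle the row indices, then carve out 60% train, 20% validation, 20% test.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = np.split(idx, [600, 800])

X_train, y_train = X[train_idx], y[train_idx]  # fit the model here
X_val, y_val = X[val_idx], y[val_idx]          # tune and validate here
X_test, y_test = X[test_idx], y[test_idx]      # assess performance on unseen data once
```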
A model overfits the data when it discovers “false” associations or unsubstantiated patterns that lead to prediction errors. In other words, overfitting happens when the ML model is overtrained on the data and treats noise as if it were signal.
Underfitting occurs when the model treats true signal as noise and fails to identify the relationships within the training data. In other words, the model is too simple to recognize the patterns in the data.
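A small synthetic experiment can illustrate both failure modes; the quadratic signal and the polynomial degrees below are assumptions chosen for illustration. A degree-1 fit underfits (high error everywhere), while a very high-degree fit typically drives training error down at the cost of test error:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: a quadratic signal plus noise, split into train and test halves.
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=200)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

for degree in (1, 2, 15):  # underfit, well-specified, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
```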
Machine learning models don’t follow explicit rules the way traditional software does; they learn from large amounts of data during training. This can make ML models “black boxes” that sometimes produce results that are hard to interpret or explain.
Under supervised learning, computers learn to model data based on labeled training data containing inputs and the desired outputs. After “learning” how best to model the relationships for the labeled data, the algorithms are employed to predict the results for the new datasets.
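As a minimal supervised-learning sketch (the labeled data and the choice of a least-squares linear model are illustrative assumptions), a model is fitted to labeled inputs and outputs and then used to predict for new data:

```python
import numpy as np

# Labeled training data: inputs X with known (desired) outputs y.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x

# Fit a simple linear model y = a*x + b by least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
(a, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Use the learned relationship to predict the output for a new input.
x_new = 5.0
print(f"predicted y for x = {x_new}: {a * x_new + b:.2f}")
```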
In unsupervised learning, computers get input data without labels and have to describe it, often by grouping data points. They learn from unlabeled data and react based on commonalities. For example, grouping companies based on their financial, not geographical or industrial, characteristics is unsupervised learning.
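Following the text’s example, a minimal k-means sketch (the company names and financial figures are made up) groups firms using only their financial characteristics, with no labels supplied:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical financial characteristics: [profit margin, debt-to-equity ratio].
companies = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [0.25, 0.3],  # high margin, low leverage
    [0.22, 0.4],
    [0.05, 1.8],  # low margin, high leverage
    [0.04, 2.1],
    [0.24, 0.5],
    [0.06, 1.9],
])

# No output labels are given; the algorithm groups companies by similarity alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(companies, labels):
    print(f"company {name} -> cluster {label}")
```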
Deep learning involves computers using neural networks to process data in multiple stages, identifying complex patterns. It employs both supervised and unsupervised machine learning methods.
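As a toy sketch of a neural network learning a pattern in stages (the XOR data, layer sizes, and other settings are illustrative assumptions), scikit-learn’s multilayer perceptron can be used:

```python
from sklearn.neural_network import MLPClassifier

# XOR is not linearly separable, so hidden layers are needed to capture it;
# the two hidden layers illustrate processing data in multiple stages.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(8, 8), activation="tanh",
                    max_iter=5000, random_state=1)
net.fit(X, y)
print(net.predict(X))  # ideally recovers [0, 1, 1, 0]
```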
Question
Which of the following best describes machine learning?

A. Autonomous acquisition of knowledge through the use of computer programs.
B. The ability of machines to execute coded instructions.
C. Selective acquisition of knowledge through the use of computer programs.
Solution
The correct answer is A.
Machine learning means that computers acquire knowledge autonomously through programs, enabling them to solve problems without explicit human instruction. Executing coded instructions (choice B) describes conventional software rather than learning, and “selective” acquisition (choice C) does not capture the autonomous, data-driven nature of ML.