Data Exploration

The main objective of data exploration is to investigate and comprehend data distributions and relationships. Data exploration involves three critical tasks: exploratory data analysis, feature selection, and feature engineering.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the first step in exploring data. The data is summarized and observed using exploratory graphs, charts, and other visualizations. The main objective of EDA is to serve as a communication medium among project stakeholders and analysts. Additionally, EDA is intended to aid in understanding data properties, finding patterns and relationships in the data, examining basic questions and hypotheses, documenting data distributions, and planning modeling strategies for subsequent steps.

Definitions

Feature selection is the process of selecting, from the dataset, only those features that contribute most to the prediction variable or output for ML model training. Selecting fewer features decreases ML model complexity and training time.

Feature engineering is a process of generating new features by transforming existing features. Model performance depends heavily on feature selection and engineering.

Structured Data

1. Exploratory Data Analysis

For structured data, EDA can be performed either on one dimension (a single feature) or on multiple dimensions (multiple features). Histograms, bar charts, box plots, and density plots are one-dimension visualizations, whereas line graphs and scatterplots visualize multi-dimension data. Additionally, descriptive statistics, such as measures of central tendency and minimum and maximum values, are useful for summarizing continuous data, while counts and frequencies for categorical data give insight into the distribution of possible values.
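As a minimal sketch of these summaries (the DataFrame below is hypothetical), pandas can compute descriptive statistics for a continuous feature and counts and frequencies for a categorical one:

```python
import pandas as pd

# Hypothetical structured dataset: one continuous and one categorical feature.
df = pd.DataFrame({
    "annual_return": [0.05, 0.12, -0.03, 0.08, 0.01, 0.15],
    "sector": ["Tech", "Energy", "Tech", "Health", "Energy", "Tech"],
})

# Central tendency, spread, and minimum/maximum for the continuous feature.
print(df["annual_return"].describe())

# Counts and relative frequencies for the categorical feature.
print(df["sector"].value_counts())
print(df["sector"].value_counts(normalize=True))
```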

2. Feature Selection

Features of structured data are represented by different columns in a data table or matrix. The objective of the feature selection process is to identify the significant features that, when used in a model, retain the essential patterns and complexities of the larger dataset while requiring less data overall.

Feature selection methods are used to rank all features. If the target variable of interest is discrete, techniques such as the chi-square test, correlation coefficient, and information gain are applicable. These are univariate techniques that score feature variables individually.
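As a rough sketch (using a synthetic dataset in place of real structured features), scikit-learn’s univariate scoring functions can rank features against a discrete target:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif

# Synthetic dataset standing in for structured financial features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# chi2 requires non-negative inputs, so shift each feature first.
X_nonneg = X - X.min(axis=0)

chi2_scores, _ = chi2(X_nonneg, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

# Rank features by each univariate score (higher = more informative).
print("chi-square ranking:", np.argsort(chi2_scores)[::-1])
print("mutual info ranking:", np.argsort(mi_scores)[::-1])
```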

Both dimensionality reduction and feature selection seek to reduce the number of features in a dataset. However, dimensionality reduction creates new combinations of features that are uncorrelated, whereas feature selection includes or excludes features present in the data without altering them.

3. Feature Engineering

Feature engineering is the process of further optimizing and improving the features of the data. Feature engineering techniques methodically modify, decompose, or combine existing features to produce more significant features. More essential features allow an ML model to train more rapidly and efficiently. This process depends on the context of the project, the domain of the data, and the nature of the problem. An existing feature can be engineered into a new feature or decomposed into multiple features.

Categorical variables can be converted into a binary form (0 or 1) for machine-reading, a process called one-hot encoding. For example, if a single categorical feature represents gender identities with four possible values—male, female, transgender, gender-neutral—then these values can be decomposed into four new features, one for each possible value (e.g., is_male, is_female) filled with 0s (for false) and 1s (for true).
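A minimal sketch of this one-hot decomposition using pandas (the column and category names follow the example above):

```python
import pandas as pd

# Single categorical feature with four possible values.
df = pd.DataFrame({"gender": ["male", "female", "transgender", "gender-neutral"]})

# One-hot encode: one new binary (0/1) feature per category value.
one_hot = pd.get_dummies(df["gender"], prefix="is").astype(int)
print(one_hot)
```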

Unstructured Data

1. Exploratory Data Analysis

Text data incorporates a collection of texts (also known as a corpus) that are sequences of tokens. It is useful to perform EDA of text data by computing basic text statistics on the tokens. One such statistic is term frequency (TF), the ratio of the number of times a given token occurs across all the texts in the dataset to the total number of tokens. Other examples of basic text statistics include word associations, average word and sentence length, and word and syllable counts. Word associations, in particular, reveal patterns in the co-occurrence of words.
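As a minimal sketch (the tokenized corpus below is made up for illustration), term frequency can be computed by counting tokens across all texts:

```python
from collections import Counter

# Hypothetical corpus: each text is already tokenized.
corpus = [
    ["stock", "market", "rises", "on", "earnings"],
    ["earnings", "beat", "lifts", "the", "stock"],
]

tokens = [token for text in corpus for token in text]
counts = Counter(tokens)
total = len(tokens)

# Term frequency: occurrences of a token across all texts / total tokens.
tf = {token: count / total for token, count in counts.items()}
print(sorted(tf.items(), key=lambda kv: kv[1], reverse=True))
```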

Text modeling involves identifying the words that are most informative in a text by computing each word’s term frequency. Words with high term frequency values are often removed since they are likely to be stop words, making the resulting bag-of-words more compact. The chi-square measure of word association applies to sentiment analysis and text classification applications, helping to identify the words that appear significantly in negative and positive sentences in the text.

As with structured data, bar charts and word clouds can be used to visualize text data. Word clouds can visualize the most informative words and their term frequency values, with varying font sizes indicating the most commonly occurring words. Further, color can add more dimensions, such as the frequency and length of words.
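As a rough sketch (assuming the third-party wordcloud package is installed; the term-frequency values below are hypothetical):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Hypothetical term-frequency values, e.g., as computed in the earlier sketch.
tf = {"stock": 0.2, "earnings": 0.2, "market": 0.1, "rises": 0.1, "beat": 0.1}

# Font size scales with term frequency; more frequent tokens appear larger.
cloud = WordCloud(width=400, height=200).generate_from_frequencies(tf)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```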

2. Feature Selection

Feature selection for text data entails selecting a subset of the tokens in a dataset, effectively reducing the size of the bag-of-words and making the ML model more efficient and less complex. Feature selection eliminates noisy features from the dataset. Popular feature selection methods for text data are as follows:

  • Frequency measures are used for vocabulary pruning to eliminate noise features by filtering out the tokens with very high and very low term frequency values across all the texts. Noise features can be stopwords that occur repeatedly in all the texts across the dataset; at the other extreme, they can be rare terms that are present in only a few texts. Document frequency (DF), the number of documents (texts) that contain a given token divided by the total number of documents, helps discard noise features that carry no class-specific information because they are present across all texts (see the pruning sketch after this list).
  • The chi-square test assesses the independence of two events: the occurrence of the token and the occurrence of the class. Tokens with the highest chi-square test statistic values are selected as features for ML model training due to their higher discriminatory potential.
  • Mutual information (MI) gauges the amount of information a token contributes to a class of texts. The MI value equals 0 if the token’s distribution is the same in all text classes; otherwise, it approaches 1 as the token tends to occur more frequently in only one particular class of text.
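The frequency-measure pruning above can be sketched with scikit-learn’s document-frequency cutoffs (the corpus and thresholds are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus of short financial texts.
corpus = [
    "the stock market rises on strong earnings",
    "the market falls as the fund posts weak earnings",
    "the analyst upgrades the stock after strong earnings",
]

# min_df drops rare tokens (noise terms appearing in too few documents);
# max_df drops near-ubiquitous tokens such as stop words that appear in
# almost every document.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
bow = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
```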

3. Feature Engineering

Similar to structured data, feature engineering is a fundamental step that dramatically improves ML model training. Some techniques for feature engineering include:

  • Numbers: Different numbers are converted into different tokens. For example, 5-digit numbers can be replaced with “/number5/,” 10-digit numbers with “/number10/,” and so forth.
  • N-grams: These refer to n consecutive words. Discriminative multi-word patterns can be identified and their association kept intact. For example, “stock market,” which refers to an economic context, calls for a bigram that treats the two adjacent words as a single token, i.e., stock_market (see the sketch after this list).
  • Named entity recognition (NER): This is an algorithm that takes individual tokens as inputs and pinpoints relevant nouns such as a person, location, or organization. For example, the NER tag for the tokens “CFA” and “Institute” is “ORGANIZATION.”
  • Parts of speech (POS): Similar to NER, POS tagging uses language structure and dictionaries to tag every token in the text with a corresponding part of speech. For example, the POS tag for both of the tokens “CFA” and “Institute” is “NNP,” which refers to a proper noun. POS tags can be useful for separating verbs and nouns for text analytics.
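A minimal sketch of two of these techniques, number tokens and n-grams (the example text and the /number5/ placeholder follow the list above):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

text = "the stock market rose 12345 points"

# Numbers: replace a 5-digit number with a placeholder token such as /number5/.
engineered = re.sub(r"\b\d{5}\b", "/number5/", text)
print(engineered)

# N-grams: ngram_range=(2, 2) keeps adjacent word pairs such as "stock market"
# together as single bigram tokens.
vectorizer = CountVectorizer(ngram_range=(2, 2))
vectorizer.fit([text])
print(vectorizer.get_feature_names_out())
```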

Note that the fundamental objective of feature engineering is to maintain the semantic essence of the text while simplifying and converting it into structured data for ML.

Question

Amelia Parker is a junior analyst at ABC Investment Ltd. Parker is building an ML model with improved predictive power. She plans to improve the existing model, which relies purely on structured financial data, by incorporating finance-related text data derived from news articles and tweets relating to the company.

After preparing and wrangling the raw text data, Parker performs exploratory data analysis. She creates and analyzes a visualization that shows the most informative words in the dataset based on their term frequency (TF) values to assist in feature selection. However, she is concerned that some tokens are noise features for ML model training; therefore, she wants to remove them.

To address her concern in the exploratory data analysis, Parker is most likely to focus on those tokens that have:

     A. Low chi-square statistics.

     B. Very low and very high term frequency (TF) values.

     C. Low mutual information (MI) values.

Solution

The correct answer is B.

Frequency measures are used for vocabulary pruning to eliminate noise features. The tokens with very high and very low TF values are filtered across all the texts. Noise features are both the most recurrent and the rarest tokens in the dataset. On one end, noise features can be stopwords that typically occur frequently in all the texts across the dataset.

On the other end, noise features can be sparse terms that are present in only a few texts. Recurring tokens make it harder for the ML model to choose a decision boundary among the texts during text classification because those terms are present across all the texts. Sparse tokens mislead the ML model into classifying texts containing the rare terms into a specific class, a case of overfitting. Thus, identifying and eliminating noise features is an essential step of the feature selection process.

A is incorrect. The chi-square test assesses the independence of two events: the occurrence of the token and the occurrence of the class. Tokens with the highest chi-square test statistic values are selected as features for ML model training due to their higher discriminatory potential.

C is incorrect. Mutual information (MI) gauges the amount of information a token contributes to a class of texts. The MI value equals 0 if the token’s distribution is the same in all text classes; otherwise, it approaches 1 as the token tends to occur more frequently in only one particular class of text.

Reading 7: Big Data Projects

LOS 7 (c) Describe objectives, methods, and examples of data exploration
