The main objective of data exploration is to investigate and comprehend data distributions and relationships. Data exploration involves three critical tasks: exploratory data analysis, feature selection, and feature engineering.
Exploratory Data Analysis (EDA) is the first step in exploring data. The data is summarized and observed using exploratory graphs, charts, and other visualizations. The main objective of EDA is to act as a communication medium among project stakeholders and analysts. Additionally, EDA is intended to aid in: understanding data properties, finding patterns and relationships in data, inspecting basic questions and hypotheses, documenting data distributions, and planning modeling strategies for the next steps.
Feature selection is the process of selecting, from the dataset, only the features that contribute most to the prediction variable or output for ML model training. Selecting fewer features decreases ML model complexity and training time.
Feature engineering is a process of generating new features by transforming existing features. Model performance depends heavily on feature selection and engineering.
For structured data, EDA can be performed on one dimension (a single feature) or on multiple dimensions (multiple features). Histograms, bar charts, box plots, and density plots are one-dimensional visualizations, whereas line graphs and scatterplots visualize multi-dimensional data. Additionally, descriptive statistics such as central tendency measures and minimum and maximum values are useful for summarizing continuous data. Counts and frequencies for categorical data can be used to gain insight into the distribution of possible values.
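As a minimal sketch of these summaries, the snippet below computes descriptive statistics for a continuous feature and counts and frequencies for a categorical one, using only the Python standard library. The `returns` and `sectors` values are made-up illustrations, not data from the reading.

```python
import statistics
from collections import Counter

# Hypothetical continuous feature: daily returns in percent.
returns = [0.5, -1.2, 0.3, 2.1, 0.0, -0.4, 1.1]

# Central tendency, range, and dispersion for the continuous feature.
summary = {
    "mean": statistics.mean(returns),
    "median": statistics.median(returns),
    "min": min(returns),
    "max": max(returns),
    "stdev": statistics.stdev(returns),
}

# Hypothetical categorical feature: sector labels.
sectors = ["tech", "energy", "tech", "finance", "tech", "energy"]

# Counts and relative frequencies for the categorical feature.
counts = Counter(sectors)
frequencies = {label: n / len(sectors) for label, n in counts.items()}

print(summary)
print(counts, frequencies)
```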
Features of structured data are represented by different columns of data in a table or matrix. The objective of the feature selection process is to assist in identifying significant features that, when used in a model, retain the essential patterns and complexities of the larger dataset. Further, these features should require less data overall.
Feature selection methods are utilized to rank all features. If the target variable of interest is discrete, such techniques as chi-square test, correlation coefficient, and information gain would be applicable. These are univariate techniques that score feature variables individually.
Both dimensionality reduction and feature selection seek to reduce the number of features in a data set. The dimensionality reduction method generates new combinations of features that do not correlate. However, feature selection includes and excludes features present in the data without altering them.
Feature engineering is the process of optimizing and improving the features of the data further. Feature engineering techniques methodically modify, decompose, or combine existing features to produce more significant features. More essential features allow an ML model to train more rapidly and efficiently. This process depends on the context of the project, domain of the data, and nature of the problem. An existing feature can be engineered to a new feature or decomposed to multiple features.
Categorical variables can be converted into a binary form (0 or 1) for machine-reading, a process called one-hot encoding. For example, if a single categorical feature represents gender identities with four possible values—male, female, transgender, gender-neutral—then these values can be decomposed into four new features, one for each possible value (e.g., is_male, is_female) filled with 0s (for false) and 1s (for true).
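The decomposition described above can be sketched in a few lines of plain Python. The helper name `one_hot_encode` and the sample values are illustrative assumptions; in practice a library routine would typically be used instead.

```python
def one_hot_encode(values, categories=None):
    """Return one dict of 0/1 indicator features per input value."""
    if categories is None:
        # Derive the category list from the data, in sorted order.
        categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical column with the four possible values from the example.
genders = ["male", "female", "transgender", "gender-neutral", "female"]
encoded = one_hot_encode(genders)

for original, row in zip(genders, encoded):
    print(original, row)
```

Each row contains exactly one 1, in the column that matches the original value, and 0s elsewhere.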
Text data incorporates a collection of texts (also known as a corpus) that are sequences of tokens. It is useful to perform EDA of text data by calculating basic text statistics on the tokens. These may include term frequency (TF), which is the ratio of the number of times a given token occurs in all the texts in the dataset to the total number of tokens. Some examples of basic text statistics include word associations, average word and sentence length, and word and syllable counts. These statistics reveal patterns in the co-occurrence of words.
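Under the definition above, TF can be computed by counting tokens across the whole corpus and dividing by the total token count. The sketch below assumes simple lowercase whitespace tokenization and a made-up three-sentence corpus; the function name `term_frequencies` is hypothetical.

```python
from collections import Counter


def term_frequencies(corpus):
    """Map each token to (occurrences across all texts) / (total tokens)."""
    tokens = [token for text in corpus for token in text.lower().split()]
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


# Hypothetical corpus of three short texts.
corpus = [
    "the stock rose after the earnings call",
    "the bond market fell",
    "earnings beat forecasts",
]
tf = term_frequencies(corpus)

# Print tokens from most to least frequent.
for token, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{token}: {value:.3f}")
```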
Text modeling involves identifying the words that are most informative in a text by computing the term frequency of each word. The words with high term frequency values are removed since they are likely to be stop words, making the resulting bag-of-words more compact. The chi-square measure of word association applies to sentiment analysis and text classification applications, where it aids in understanding which words appear significantly in the negative and positive sentences of the text.
As with structured data, bar charts and word clouds can be used to visualize text data. Word clouds can be made to visualize the most informative words and their term frequency values. Varying font sizes can show the most commonly occurring words. Further, color is used to add more dimensions, such as frequency and length of words.
Feature selection for text data entails selecting a subset of the tokens in a dataset to reduce the bag-of-words size, making the ML model more efficient and less complicated. Feature selection also eliminates noisy features from the dataset. Popular feature selection methods for text data include frequency measures, the chi-square test, and mutual information (MI).
Similar to structured data, feature engineering is a fundamental step that dramatically improves ML model training. Some techniques for feature engineering include:
Note that the fundamental objective of feature engineering is to maintain the semantic essence of the text while simplifying and converting it into structured data for ML.
Question
Amelia Parker is a junior analyst at ABC Investment Ltd. Parker is building an ML model with improved predictive power. She plans to improve the existing model, which relies purely on structured financial data, by incorporating finance-related text data derived from news articles and tweets relating to the company.
After preparing and wrangling the raw text data, Parker performs exploratory data analysis. She creates and analyzes a visualization that shows the most informative words in the dataset based on their term frequency (TF) values to assist in feature selection. However, she is concerned that some tokens are noise features for ML model training; therefore, she wants to remove them.
To address her concern in the exploratory data analysis, Parker is most likely to focus on those tokens that have:
A. Low chi-square statistics.
B. Very low and very high term frequency (TF) values.
C. Low mutual information (MI) value.
Solution
The correct answer is B.
Frequency measures are used for vocabulary pruning to eliminate noise features. Tokens with very high and very low TF values are filtered out across all the texts. Noise features are both the most frequent and the rarest tokens in the dataset. On one end, noise features can be stop words that typically occur frequently in all the texts across the dataset.
On the other end, noise features can be sparse terms that are present in only a few texts. Frequently recurring tokens make it hard for the ML model to choose a decision boundary among the texts during text classification, because the terms are present across all the texts. Sparse tokens mislead the ML model into classifying texts containing the rare terms into a specific class, a case of overfitting. Thus, pinpointing and eliminating noise features are essential steps in feature selection procedures.
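The pruning logic can be sketched as a filter that keeps only tokens whose TF lies between a lower and an upper cutoff. The function name `prune_vocabulary`, the corpus, and the thresholds 0.05 and 0.2 are illustrative assumptions; suitable cutoffs depend on the dataset.

```python
from collections import Counter


def prune_vocabulary(corpus, min_tf, max_tf):
    """Keep tokens whose corpus-wide TF lies within [min_tf, max_tf]."""
    tokens = [token for text in corpus for token in text.lower().split()]
    total = len(tokens)
    counts = Counter(tokens)
    return {
        token
        for token, count in counts.items()
        if min_tf <= count / total <= max_tf
    }


# Hypothetical corpus: "the" is very frequent, "zzyzx" is very rare.
corpus = [
    "the profit rose and the shares rose",
    "the profit fell and the shares fell",
    "the outlook for the shares is zzyzx",
]
kept = prune_vocabulary(corpus, min_tf=0.05, max_tf=0.2)
print(sorted(kept))
```

With these cutoffs, the very frequent token "the" and the tokens that occur only once are dropped. A moderately frequent stop word such as "and" survives, which shows that the thresholds need tuning.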
A is incorrect. The chi-square test checks the independence of two events: occurrence of the token and occurrence of the class. Tokens with the highest chi-square test statistic values are selected as features for ML model training because of their higher discriminatory potential.
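For a single token and a two-class problem, the test of independence reduces to a 2x2 contingency table. The sketch below uses the standard shortcut formula for that table. The function name `chi_square_2x2` and the document counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 token-by-class contingency table.

    a: texts with the token in the positive class
    b: texts with the token in the negative class
    c: texts without the token in the positive class
    d: texts without the token in the negative class
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the statistic is undefined; report 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: the token appears mostly in positive texts.
discriminating = chi_square_2x2(30, 5, 20, 45)

# Hypothetical counts: the token is equally likely in both classes.
uninformative = chi_square_2x2(10, 10, 40, 40)

print(round(discriminating, 2), uninformative)
```

A token that is spread evenly across classes scores 0, while a token concentrated in one class scores high and would be kept as a feature.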
C is incorrect. Mutual information (MI) gauges the amount of information contributed by a token to a class of texts. The MI value equals 0 if the token's distribution is the same in all text classes. Otherwise, the MI value approaches 1 as the token tends to occur more frequently in only one particular class of text.
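The reading does not give a formula for MI. One common formulation, assumed here, is the mutual information in bits between a binary "token present" indicator and a binary class label, computed from the same kind of 2x2 table of text counts. The function name `mutual_information` and the counts are hypothetical.

```python
import math


def mutual_information(a, b, c, d):
    """Mutual information (in bits) between token presence and class.

    a: texts with the token in the positive class
    b: texts with the token in the negative class
    c: texts without the token in the positive class
    d: texts without the token in the negative class
    """
    n = a + b + c + d
    mi = 0.0
    # Each entry: (joint count, token-row total, class-column total).
    for joint, row, col in (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    ):
        if joint > 0:  # Empty cells contribute nothing to the sum.
            mi += (joint / n) * math.log2(joint * n / (row * col))
    return mi


# Token distributed identically in both classes: MI is 0.
same_distribution = mutual_information(10, 10, 40, 40)

# Token occurs in every positive text and no negative text: MI is 1 bit.
one_class_only = mutual_information(50, 0, 0, 50)

print(same_distribution, one_class_only)
```

Under this formulation, the upper value of 1 is reached only when the two classes are equally sized and the token perfectly separates them.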
Reading 7: Big Data Projects
LOS 7 (c) Describe objectives, methods, and examples of data exploration