Preparing, Wrangling, and Exploring Textual Data for Financial Forecasting

Sentiment analysis refers to the analysis of opinions or emotions from text data. In other words, it refers to how positive, negative, or neutral a particular phrase or statement is regarding a “target.” Such sentiment can provide critical predictive power for forecasting stock price movements for companies.

Here, we delve into a financial forecasting project to examine how effectively the sentiment of text data from financial and economic news sources can be classified. This project uses sentences from the text file Sentences_50Agree, each labeled with a positive, neutral, or negative sentiment class.

The following table shows a sample corpus made up of six sentences from the Sentences_50Agree text file. A corpus is a collection of text data in any form, including lists, matrices, or data tables.

$$\small{\begin{array}{l|l} \textbf{Sentence}&\textbf{Sentiment}\\ \hline\text{Cramo slipped to a pretax loss of EUR 6.7 million from a pretax profit of EUR 58.9 million.}&\text{Negative}\\ \hline\text{Profit before taxes decreased to EUR 31.6 mn from EUR 50.0 mn the year before.}&\text{Negative}\\ \hline\text{Also, construction expenses have gone up in Russia.}&\text{Negative}\\ \hline{\text{Finnish Bore that is owned by the Rettig family, has grown recently}\\ \text{through the acquisition of smaller shipping companies.}}&\text{Positive}\\ \hline{\text{The plan is estimated to generate some EUR 5 million (USD 6.5 m) in}\\ \text{cost savings on an annual basis.}}&\text{Positive}\\ \hline\text{Earnings per share EPS are seen at EUR 0.56}&\text{Positive}\\  \end{array}}$$
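For concreteness, such a file can be loaded into (sentence, sentiment) pairs. The sketch below assumes each line stores the sentence and its label separated by an "@" character, as in publicly distributed copies of this dataset; adjust the separator if your copy differs.

```python
# Load a sentiment-labeled corpus into (sentence, label) pairs.
# Assumed line format: "<sentence>@<label>", e.g. "...profit rose.@positive".

def load_corpus(lines):
    corpus = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last "@" so any "@" inside the sentence is preserved.
        sentence, _, label = line.rpartition("@")
        corpus.append((sentence, label))
    return corpus

sample = [
    "Cramo slipped to a pretax loss of EUR 6.7 million.@negative",
    "Earnings per share EPS are seen at EUR 0.56@positive",
]
print(load_corpus(sample))
```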

Text Preparation/Cleansing

Data obtained from different sources typically arrives in a raw format that is not suitable for analysis. The first step in converting raw data into a usable format is data cleansing.

Text preparation entails removing, or substituting appropriately for, extraneous information present in the text. In this case, we will eliminate punctuation, numbers, and white spaces that are not necessary for model training, as follows:

Step 1: Remove HTML tags.

There are no HTML tags present in this sample.

Step 2: Remove punctuation.

Percentage and dollar symbols are substituted with word annotations to retain their importance in the textual data. Further, periods, semicolons, and commas are removed. Regular expressions (regex) are commonly applied to remove or replace punctuation.

Step 3: Remove numbers.

Numbers present in the text should be removed, as the digits themselves have little use for sentiment analysis; it is the context in which the numbers are used that matters. Before removing numbers, however, abbreviations representing orders of magnitude, such as million (commonly represented by “m,” “mln,” or “mn”), billion, or trillion, should be replaced with the complete word.

Step 4: Remove white spaces.

Extra white spaces, such as tabs, line breaks, and new lines, should be identified and removed to keep the text intact and clean. The stripWhitespace function in R can be used to eliminate unnecessary white spaces from the text.
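The four cleansing steps above can be sketched in Python with the standard re module. The annotation tokens ("percentsign," "currencysign") and the abbreviation map are illustrative choices, not fixed conventions.

```python
import re

# Assumed expansion map for magnitude abbreviations (illustrative, not exhaustive).
ABBREV = {r"\bmn\b": "million", r"\bmln\b": "million", r"\bm\b": "million"}

def cleanse(text):
    text = re.sub(r"<[^>]+>", " ", text)             # Step 1: remove HTML tags
    text = text.replace("%", " percentsign ")        # Step 2: annotate percent symbols...
    text = re.sub(r"[$€£]", " currencysign ", text)  # ...and currency symbols
    for pattern, word in ABBREV.items():             # Step 3a: expand "mn"/"mln"/"m"
        text = re.sub(pattern, word, text)
    text = re.sub(r"\d+(\.\d+)?", " ", text)         # Step 3b: remove numbers
    text = re.sub(r"[^\w\s]", " ", text)             # Step 2 (cont.): drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()         # Step 4: collapse white space

print(cleanse("Profit before taxes decreased to EUR 31.6 mn from EUR 50.0 mn the year before."))
```

Running this on the second sample sentence reproduces the cleansed form shown in the table below.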

The cleansed data is free of punctuations and numbers, with useful substitutions, as shown in the table below:

$$\small{\begin{array}{l|l} \textbf{Sentence}&\textbf{Sentiment}\\ \hline\text{Cramo slipped to a pretax loss of a EUR million from a pretax profit of EUR million}&\text{Negative}\\ \hline\text{Profit before taxes decreased to EUR million from EUR million the year before}&\text{Negative}\\ \hline\text{Also, construction expenses have gone up in Russia}&\text{Negative}\\ \hline{\text{Finnish Bore that is owned by the Rettig family has grown recently}\\ \text{through the acquisition of smaller shipping companies}}&\text{Positive}\\ \hline{\text{The plan is estimated to generate some EUR million USD million in cost}\\ \text{savings on an annual basis}}&\text{Positive}\\ \hline\text{Earnings per share EPS are seen at EUR}&\text{Positive}\\  \end{array}}$$

Text Wrangling

After textual data is cleansed, it should be normalized. The normalization process in text processing entails the following:

Step 1: Lowercasing. Converting all letters to lowercase helps the computer treat identical words as one (for example, “AND,” “And,” and “and”).

Step 2: Stop words such as “the,” “for,” and “are” are usually removed to reduce the number of tokens in the training set for ML purposes. However, some stop words, such as “not,” “more,” “very,” and “few,” are retained because they carry significant meaning in financial texts and are useful for sentiment prediction.

Step 3: Stemming is a process of linguistic normalization that reduces words to their root form. For example, “connection,” “connected,” and “connecting” all reduce to the common root “connect.”

Step 4: Lemmatization is similar to stemming, except that it removes word endings only if the resulting base form is present in a dictionary. Lemmatization is more advanced, but also more computationally costly, than stemming.

Step 5: Tokenization is the process of breaking down a text paragraph into smaller chunks, such as words. A token is a single entity that is a building block for a sentence or a paragraph.
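The wrangling steps above (lowercasing, stop-word removal, stemming, and tokenization) can be sketched with only the standard library. The tiny suffix-stripping stemmer is a crude stand-in for a real stemmer such as Porter's, and the stop-word list is illustrative.

```python
# Illustrative stop-word list: common function words are dropped, while
# sentiment-bearing words such as "not" are deliberately NOT listed here.
STOP_WORDS = {"the", "for", "are", "a", "to", "of", "in", "is", "at"}

def crude_stem(token):
    # Toy stemmer: strips a few common suffixes (a stand-in for Porter stemming).
    for suffix in ("ing", "ed", "ion", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def normalize(sentence):
    tokens = sentence.lower().split()                    # Steps 1 and 5: lowercase, tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # Step 2: drop stop words
    return [crude_stem(t) for t in tokens]               # Step 3: stem each remaining token

print(normalize("Also construction expenses have gone up in Russia"))
```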

Data Exploration

Exploratory Data Analysis

Exploratory data analysis (EDA) performed on text data provides insights into the distribution of words in the text. These word counts can be used to examine the words that appear most and least commonly in the texts, i.e., the outliers.

From a frequency distribution of the sample tokens (before stop words are eliminated), we observe that “currencysign” and “million” are the most repeated words, owing to the financial nature of the data.

A word cloud is a data visualization approach used for representing text data. Word clouds can be made to visualize the most informative words and their term frequency (TF) values. Varying font sizes can show the most commonly occurring words. Further, color is used to add more dimensions, such as frequency and length of words.
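The term frequencies that would drive a word cloud's font sizes can be computed with collections.Counter; rendering the actual cloud image would require a plotting library (for example, the third-party wordcloud package) and is omitted here.

```python
from collections import Counter

# Toy normalized token stream (stand-in for the cleansed corpus).
tokens = (
    "currencysign million profit currencysign million loss "
    "currencysign million profit eps"
).split()

tf = Counter(tokens)                 # raw term frequency for each token
for word, count in tf.most_common(3):
    print(word, count)               # most frequent (largest-font) words first
```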

The feature selection process will eliminate common words and highlight useful words for better model training.


Which of the following visualizations is most likely to be appropriate in the exploratory data analysis step if our objective is to create a visualization that shows the most informative words in the dataset based on their term frequency (TF) values?

    A. Scatter plot.

    B. Word cloud.

    C. Document term matrix.


The correct answer is B.

A word cloud is a common visualization when working with text data as it can be made to visualize the most informative words and their TF values. Varying font sizes can show the most commonly occurring words in the dataset, and the color is used to add more dimensions, such as frequency and length of words.

A is incorrect. A scatter plot is a two-dimensional chart that can be employed to summarize and approximately measure the relationship between two features.

C is incorrect. A document term matrix (DTM) is a matrix in which each row represents a text file (document) and each column represents a token. The number of rows equals the number of text files in the sample text dataset, and the number of columns equals the number of tokens in the bag-of-words (BOW) built from all the text files; each cell records how many times the corresponding token appears in the corresponding document. A DTM is a data structure, not a visualization tool.
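A DTM of the kind described in the explanation can be built in a few lines; this sketch uses plain Python structures rather than a text-mining library.

```python
from collections import Counter

def build_dtm(documents):
    # Bag of words: every distinct token across all documents, in first-seen order.
    vocab, seen = [], set()
    for doc in documents:
        for token in doc.split():
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    # One row per document, one column per token; each cell is a token count.
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[token] for token in vocab])
    return vocab, rows

docs = ["profit rose sharply", "profit fell", "loss widened loss"]
vocab, dtm = build_dtm(docs)
print(vocab)   # columns: one per BOW token
print(dtm)     # rows: one per document
```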

Reading 7: Big Data Projects

LOS 7 (e) Describe preparing, wrangling, and exploring text-based data for financial forecasting
