{"id":12352,"date":"2021-03-09T16:26:41","date_gmt":"2021-03-09T16:26:41","guid":{"rendered":"https:\/\/analystprep.com\/study-notes\/?p=12352"},"modified":"2026-06-11T10:47:30","modified_gmt":"2026-06-11T10:47:30","slug":"feature-extraction-selection-engineering-textual-data","status":"publish","type":"post","link":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/","title":{"rendered":"Feature Extraction, Selection, and Engineering of Textual Data"},"content":{"rendered":"<p><script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"ImageObject\",\n  \"url\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png\",\n  \"caption\": \"\",\n  \"width\": 987,\n  \"height\": 226,\n  \"copyrightNotice\": \"\u00a9 2024 AnalystPrep\",\n  \"acquireLicensePage\": \"https:\/\/analystprep.com\/license-info\",\n  \"creditText\": \"AnalystPrep Design Team\",\n  \"creator\": {\n    \"@type\": \"Organization\",\n    \"name\": \"AnalystPrep\"\n  }\n}\n<\/script> <script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"ImageObject\",\n  \"url\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara.png\",\n  \"caption\": \"P\",\n  \"width\": 1024,\n  \"height\": 430,\n  \"copyrightNotice\": \"\u00a9 2024 AnalystPrep\",\n  \"acquireLicensePage\": \"https:\/\/analystprep.com\/license-info\",\n  \"creditText\": \"AnalystPrep Design Team\",\n  \"creator\": {\n    \"@type\": \"Organization\",\n    \"name\": \"AnalystPrep\"\n  }\n}\n<\/script> <script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"QAPage\",\n  \"mainEntity\": {\n    \"@type\": \"Question\",\n    \"name\": \"The TF\u2013IDF (term frequency-inverse document frequency) for the token \u201cdecrease\u201d in sentence 647 in term frequency measures Table 2 is 
most likely:\",\n    \"text\": \"The TF\u2013IDF (term frequency-inverse document frequency) for the token \u201cdecrease\u201d in sentence 647 in term frequency measures Table 2 is most likely: A. 19.64%. B. 20.58%. C. 22.53%.\",\n    \"answerCount\": 1,\n    \"acceptedAnswer\": {\n      \"@type\": \"Answer\",\n      \"text\": \"The correct answer is C. TF\u2013IDF = TF \u00d7 IDF. Using TF = 0.037 and IDF = 6.083, TF\u2013IDF = 0.2253 or 22.53%. This value reflects that the token occurs frequently within the sentence but relatively infrequently across the corpus, making it an important differentiating term.\"\n    }\n  }\n}\n<\/script> <iframe loading=\"lazy\" title=\"YouTube video player\" src=\"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY?si=FCvBJu4Us_j2s83X\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<h2>Feature Extraction<\/h2>\n<p>Feature extraction entails mapping the textual data to real-valued vectors. After the text has been normalized, the next step is to create a bag-of-words (BOW). A BOW is a simple representation of text as the collection of tokens it contains. 
It does not, however, represent the word sequences or positions.<\/p>\n<p>The following figure represents a BOW of the textual data extracted from the Sentences_50Agree file for the first 233 words:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-12353 aligncenter\" src=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-300x69.png\" alt=\"\" width=\"677\" height=\"156\" srcset=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-300x69.png 300w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-768x176.png 768w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-400x92.png 400w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png 987w\" sizes=\"auto, (max-width: 677px) 100vw, 677px\" \/><\/p>\n<h2>Feature Selection<\/h2>\n<p>Feature selection for text data entails keeping the tokens in the BOW that are informative and help discriminate between different classes of texts, i.e., those with positive sentiment and those with negative sentiment. The process of quantifying how important tokens are in a sentence and in the corpus as a whole is known as <strong>frequency analysis<\/strong>. It assists in filtering out unnecessary tokens (or features).<\/p>\n<p>Term frequency (TF) is calculated and examined to pinpoint noisy terms, i.e., outlier terms. TF at the corpus level, also known as collection frequency (CF), is the number of times a given word appears in the whole corpus divided by the total number of words in the corpus. Terms with low TF are mostly rare terms, for example, proper nouns or sparse terms, which rarely appear in the data and do not contribute to differentiating sentiment. On the other hand, terms with high TF are mostly stop words, which are present in most sentences and so do not contribute to differentiating sentiment either. 
Both the rare terms and the stop words are eliminated before forming the final document term matrix (DTM).<\/p>\n<p>The next step involves constructing the DTM for ML training. Different TF measures are calculated to fill in the cells of the DTM as follows:<\/p>\n<p><strong><em>1. SentenceNo:<\/em> <\/strong>This is a unique identification number given to each sentence in the order in which it appears in the original dataset. For example, sentence number 11 is the sentence in row 11 of the data table.<\/p>\n<p><strong><em>2. TotalWordsInSentence<\/em>:<\/strong> This is the total number of words present in the sentence.<\/p>\n<p><strong><em>3. Word:<\/em> <\/strong>This is a word token that is present in the corresponding sentence.<\/p>\n<p><strong><em>4. TotalWordCount:<\/em> <\/strong>This is the total number of occurrences of the word in the entire corpus or collection.<\/p>\n<p>$$\\text{TF (Collection level)}=\\frac{\\text{TotalWordCount}}{\\text{Total number of words in collection}}$$<\/p>\n<p><em><strong>5. WordCountInSentence: <\/strong><\/em>This refers to the number of times the token is present in the corresponding sentence. For example, in the sentence \u201cThe profit before taxes decreased to EUR 31.6 million from EUR 50.0 million the year before,\u201d the token \u201cthe\u201d is present two times.<\/p>\n<p><strong><em>6. SentenceCountWithWord:<\/em> <\/strong>This is the number of sentences in which the word is present.<\/p>\n<p><strong><em>7. Term Frequency (TF) at Sentence Level:<\/em> <\/strong>This is the ratio of the number of times a word is present in a sentence to the total number of words in that sentence.<\/p>\n<p>$$\\text{TF at Sentence Level}=\\frac{\\text{WordCountInSentence}}{\\text{TotalWordsInSentence}}$$<\/p>\n<p><strong><em>8. Document frequency (DF)<\/em>:<\/strong> This is calculated as the number of sentences that contain a given word divided by the total number of sentences. 
DF is essential since words frequently occurring across sentences provide no differentiating information in each sentence.<\/p>\n<p>$$\\text{DF}=\\frac{\\text{SentenceCountWithWord}}{\\text{Total number of sentences}}$$<\/p>\n<p><strong>9.\u00a0<em>Inverse Document Frequency (IDF)<\/em>:<\/strong> This is a relative measure of how unique a term is across the entire corpus. A low IDF implies that the word appears in a large share of the sentences in the corpus. In the calculations that follow, log denotes the natural logarithm.<\/p>\n<p>$$\\text{IDF}=\\text{log}\\bigg(\\frac{1}{\\text{DF}}\\bigg)$$<\/p>\n<p><strong>10.\u00a0<em>TF\u2013IDF<\/em>:<\/strong> TF at the sentence level is multiplied by the IDF of a word across the entire dataset to get a complete representation of the value of each word. High TF\u2013IDF values indicate words that appear more frequently within a smaller number of documents. This signifies relatively more unique terms that are crucial. Conversely, low TF\u2013IDF values indicate words that appear in many documents and carry little discriminating information. TF\u2013IDF values can serve as word feature values for training an ML model.<\/p>\n<p>$$\\text{TF}-\\text{IDF}=\\text{TF}\\times\\text{IDF}$$<\/p>\n<div style=\"text-align: center; margin: 30px 0;\"><a style=\"display: inline-block; background: #2f6fdf; color: #ffffff; padding: 15px 40px; border-radius: 30px; text-decoration: none; font-size: 18px; font-weight: 400;\" href=\"https:\/\/analystprep.com\/free-trial\/\" target=\"_blank\" rel=\"noopener noreferrer\"> Master feature engineering techniques with our Free Trial <\/a><\/div>\n<h2>Feature Engineering<\/h2>\n<p>Up to now, we have been dealing with single words or tokens, which, combined with the bag-of-words model, implies the assumption that words are independent of one another. One way of reducing the effect of this simplification is to add some context to the words by using <em>n<\/em>-grams. <em>N<\/em>-grams help us to understand the sentiment of a sentence as a whole.<\/p>\n<p>An <em>n<\/em>-gram refers to a sequence of <em>n<\/em> consecutive words that can be used as the building blocks of an ML model. 
Notice that we have so far been using the <em>n<\/em>-gram model for the particular case of <em>n<\/em>=1, which is called a unigram (for <em>n<\/em>=2, the <em>n<\/em>-gram is a bigram, and for <em>n<\/em>=3, a trigram). When dealing with <em>n<\/em>-grams, special tokens that denote the beginning and end of a sentence are sometimes used. Bigram tokens help keep negations intact in the text, which is crucial for sentiment prediction.<\/p>\n<p>The corresponding word frequency measures for the DTM are then computed based on the new BOW formed from <em>n<\/em>-grams.<\/p>\n<p>The following figure shows a sample of 3-gram tokens for the dataset sentences_50Agree for the first 233 words.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-12364 aligncenter\" src=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara-300x126.png\" alt=\"\" width=\"726\" height=\"305\" srcset=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara-300x126.png 300w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara-768x323.png 768w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara.png 1024w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/rereara-400x168.png 400w\" sizes=\"auto, (max-width: 726px) 100vw, 726px\" \/><\/p>\n<h3>Example: Calculating and Interpreting Term Frequency Measures<\/h3>\n<p>Peter Smith is a data scientist and wants to use the textual data from the sentences_50Agree file to develop sentiment indicators for forecasting future stock price movements. Smith has assembled a BOW from the corpus of text being examined and has pulled the following abbreviated term frequency measures tables. 
The text file has a total of 112,622 non-unique tokens that are in 4,820 sentences.<\/p>\n<h6 style=\"text-align: center;\">Term Frequency Measures Table 1<\/h6>\n<p>$$\\small{\\begin{array}{l|c|c|c|c|c} \\text{Sentence No} &amp; {\\text{TotalWordsIn}\\\\ \\text{Sentence}} &amp; \\text{Word} &amp; {\\text{TotalWord}\\\\ \\text{Count}} &amp; {\\text{WordCountIn}\\\\ \\text{Sentence}} &amp; {\\text{SentenceCount}\\\\ \\text{WithWord}}\\\\ \\hline &lt;\\text{int}&gt;&amp; &lt;\\text{int}&gt; &amp; &lt;\\text{chr}&gt; &amp; &lt;\\text{int}&gt; &amp; &lt;\\text{int}&gt; &amp; &lt;\\text{int}&gt;\\\\ \\hline7 &amp; 43 &amp; \\text{a} &amp; 1745 &amp; 3 &amp; 1373 \\\\ \\hline4 &amp; 34 &amp; \\text{the} &amp; 6068 &amp; 5 &amp; 3678\\\\ \\hline22 &amp; 43 &amp; \\text{of} &amp; 3213 &amp; 3 &amp; 2120\\\\ \\hline22 &amp; 43 &amp; \\text{the} &amp; 6068 &amp; 4 &amp; 3678\\\\ \\hline4792 &amp; 28 &amp; \\text{a} &amp; 1745 &amp; 3 &amp; 1373\\\\\u00a0 \\end{array}}$$<\/p>\n<h6 style=\"text-align: center;\">Term Frequency Measures Table 2<\/h6>\n<p>$$\\small{\\begin{array}{l|c|c|c|c|c} \\text{SentenceNo} &amp; {\\text{TotalWordsIn}\\\\ \\text{Sentence}} &amp; \\text{Word} &amp; {\\text{TotalWord}\\\\ \\text{Count}} &amp; {\\text{WordCountIn}\\\\ \\text{Sentence}} &amp; {\\text{SentenceCount}\\\\ \\text{WithWord}}\\\\ \\hline&lt;\\text{int}&gt; &amp; &lt;\\text{int}&gt; &amp; &lt;\\text{chr}&gt; &amp; &lt;\\text{int}&gt; &amp; &lt;\\text{int}&gt; &amp;&lt;\\text{int}&gt;\\\\ \\hline4 &amp; 34 &amp; \\text{increase} &amp; 103 &amp; 3 &amp; 100\\\\ \\hline647 &amp; 27 &amp; \\text{decrease} &amp; 11 &amp; 1 &amp; 11\\\\ \\hline792 &amp; 12 &amp; \\text{great} &amp; 6 &amp; 1 &amp; 6\\\\ \\hline4508 &amp; 47 &amp; \\text{drop} &amp; 7 &amp; 1 &amp; 7\\\\ \\hline33 &amp; 36 &amp; \\text{rise} &amp; 17 &amp; 1 &amp; 17\\\\\u00a0 \\end{array}}$$<\/p>\n<ol>\n<li>Determine and interpret term frequency (TF) at the collection level and at the sentence level for the token 
\u201cthe\u201d in sentence 4 in term frequency measures Table 1.<\/li>\n<li>Determine and interpret term frequency (TF) at the collection level, and the sentence level for the token \u201cdecrease\u201d in sentence 647 in term frequency measures Table 2.<\/li>\n<li>Determine and interpret TF\u2013IDF (term frequency-inverse document frequency) for the token \u201cthe\u201d in sentence 4 in term frequency measures Table 1.<\/li>\n<\/ol>\n<h5>Solution to 1:<\/h5>\n<p>$$\\begin{align*}\\text{TF (Collection Level)}&amp;=\\frac{\\text{Total Word Count}}{\\text{Total number of words in collection}}\\\\&amp;=\\frac{6,068}{112,622}\\\\&amp;=5.39\\%\\end{align*}$$<\/p>\n<p>TF at the collection level is an indicator of the frequency that a token is used throughout the whole collection of texts (here, 112,622). It is vital for identifying outlier words: Tokens with the highest TF values are mostly stop words that do not contribute to differentiating the sentiment embedded in the text (such as \u201cthe\u201d).<\/p>\n<p>$$\\begin{align*}\\text{TF at Sentence Level}&amp;=\\frac{\\text{Word Count In Sentence}}{\\text{Total Words In Sentence}}\\\\&amp;=\\frac{5}{34}\\\\&amp;=0.1471 \\ or \\ 14.71\\%\\end{align*}$$<\/p>\n<p>TF at the sentence level is an indicator of the frequency that a token is used in a particular sentence. Therefore, it is useful for understanding the importance of the specific token in a given sentence.<\/p>\n<h5>Solution to 2:<\/h5>\n<p>For the token \u201cdecrease,\u201d the term frequency (TF) at the collection level is given by:<\/p>\n<p>$$=\\frac{11}{112,622}=0.0098\\%$$<\/p>\n<p>This token has a meager TF value. 
Recall that tokens with the lowest TF values are mostly proper nouns or sparse terms that are also not important to the meaning of the text.<\/p>\n<p>$$\\begin{align*}\\text{TF at Sentence Level for the token &#8220;decrease&#8221;}&amp;=\\frac{1}{27}\\\\&amp;=0.037 \\ or \\ 3.70\\%\\end{align*}$$<\/p>\n<p>TF at the sentence level is an indicator of the frequency that a word is used in a particular sentence. Therefore, it is useful for understanding the importance of the specific token in a given sentence.<\/p>\n<h5>Solution to 3:<\/h5>\n<p>To calculate TF\u2013IDF, document frequency (DF) and inverse document frequency (IDF) also need to be calculated.<\/p>\n<p>$$\\text{DF}=\\frac{\\text{Sentence Count With Word}}{\\text{Total number of sentences}}$$<\/p>\n<p>For token \u201cthe\u201d in sentence 4,<\/p>\n<p>$$\\begin{align*}\\text{DF}&amp;=\\frac{3,678}{4,820}\\\\&amp;=0.7631 \\ or \\ 76.31\\%\\end{align*}$$<\/p>\n<p>Document frequency is important since tokens frequently occurring across sentences (such as \u201cthe\u201d) provide no differentiating information in each sentence.<\/p>\n<p>IDF is a relative measure of how important a term is across the entire corpus:<\/p>\n<p>$$\\text{IDF}=\\text{log}\\bigg(\\frac{1}{\\text{DF}}\\bigg)$$<\/p>\n<p>For token \u201cthe\u201d in sentence 4:<\/p>\n<p>$$\\begin{align*}\\text{IDF}&amp;=\\text{log}\\bigg(\\frac{1}{0.7631}\\bigg)\\\\&amp;=0.2704\\end{align*}$$<\/p>\n<p>Using TF and IDF, TF\u2013IDF can now be calculated as:<\/p>\n<p>$$\\text{TF}-\\text{IDF}=\\text{TF}\\times\\text{IDF}$$<\/p>\n<p>For token &#8220;the&#8221; in sentence 4, \\(\\text{TF}-\\text{IDF}=0.1471\\times0.2704=0.0398 \\ or \\ 3.98\\%\\)<\/p>\n<p>As TF\u2013IDF combines TF at the sentence level with IDF across the entire corpus, it provides a complete representation of the value of each word. A low TF\u2013IDF value indicates tokens that appear in most of the sentences and are not discriminative (such as \u201cthe\u201d). 
TF\u2013IDF values are useful in extracting the key terms in a document for use as features for training an ML model.<\/p>\n<blockquote>\n<h2>Question<\/h2>\n<p>The TF\u2013IDF (term frequency-inverse document frequency) for the token \u201cdecrease\u201d in sentence 647 in term frequency measures Table 2 is <em>most likely:<\/em><\/p>\n<p>\u00a0 \u00a0 \u00a0 A. 19.64%.<\/p>\n<p>\u00a0 \u00a0 \u00a0 B. 20.58%.<\/p>\n<p>\u00a0 \u00a0 \u00a0 C. 22.53%.<\/p>\n<h3>Solution<\/h3>\n<p><strong>The correct answer is C.<\/strong><\/p>\n<p>To calculate TF\u2013IDF, document frequency (DF) and inverse document frequency (IDF) also need to be calculated.<\/p>\n<p>$$\\text{DF}=\\frac{\\text{Sentence Count With Word}}{\\text{Total number of sentences}}$$<\/p>\n<p>For token &#8220;decrease&#8221; in sentence 647,<\/p>\n<p>$$\\begin{align*}\\text{DF}&amp;=\\frac{11}{4,820}\\\\&amp;=0.00228 \\ or \\ 0.228\\%\\end{align*}$$<\/p>\n<p>Document frequency is important since tokens frequently occurring across sentences (such as \u201cthe\u201d) provide no differentiating information in each sentence.<\/p>\n<p>IDF is a relative measure of how important a term is across the entire corpus.<\/p>\n<p>$$\\text{IDF}=\\text{log}\\bigg(\\frac{1}{\\text{DF}}\\bigg)$$<\/p>\n<p>For token &#8220;decrease&#8221; in sentence 647:<\/p>\n<p>$$\\begin{align*}\\text{IDF}&amp;=\\text{log}\\bigg(\\frac{1}{0.00228}\\bigg)\\\\&amp;=6.083\\end{align*}$$<\/p>\n<p>Using TF and IDF, TF\u2013IDF can now be calculated as:<\/p>\n<p>$$\\text{TF}-\\text{IDF}=\\text{TF}\\times\\text{IDF}$$<\/p>\n<p>For token &#8220;decrease&#8221; in sentence 647, \\(\\text{TF}-\\text{IDF}=0.037\\times6.083=0.2253 \\ or \\ 22.53\\%\\)<\/p>\n<p>As TF\u2013IDF combines TF at the sentence level with IDF across the entire corpus, it provides a complete representation of the value of each word. 
A high TF\u2013IDF value indicates the word appears many times within a small number of documents, signifying an important yet unique term within a sentence (such as \u201cdecrease\u201d).<\/p>\n<\/blockquote>\n<p>Reading 7: Big Data Projects<\/p>\n<p><em>LOS 7 (f) Describe methods for extracting, selecting and engineering features from textual data<\/em><\/p>\n<div style=\"text-align: center; margin: 50px auto 30px auto; max-width: 850px;\"><a style=\"display: inline-block; background: #2f6fdf; color: #ffffff; padding: 14px 36px; border-radius: 30px; text-decoration: none; font-size: 16px; font-weight: 600;\" href=\"https:\/\/analystprep.com\/free-trial\/\" target=\"_blank\" rel=\"noopener noreferrer\"> Start Free Trial \u2192 <\/a><\/p>\n<p style=\"margin-top: 18px; font-size: 16px; line-height: 1.6; color: #444; max-width: 750px; margin-left: auto; margin-right: auto;\">Practice feature extraction, feature selection, feature engineering, and textual data analysis techniques with CFA Level II study notes, practice questions, video lessons, and mock exams.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Feature Extraction Feature extraction entails mapping the textual data to real-valued vectors. After the text has been normalized, the next step is to create a bag-of-words (BOW). It is a representation of analyzing text. 
It does not, however, represent the&#8230;<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[102,229],"tags":[273,216,271,230,272],"class_list":["post-12352","post","type-post","status-publish","format-standard","hentry","category-cfa-level-2","category-quantitative-method","tag-and-engineering-of-textual-data","tag-cfa-level-2","tag-feature-extraction","tag-quantitative-method","tag-selection","blog-post","no-post-thumbnail","animate"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>TF\u2013IDF Feature Extraction | CFA Level II Notes<\/title>\n<meta name=\"description\" content=\"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"TF\u2013IDF Feature Extraction | CFA Level II Notes\" \/>\n<meta property=\"og:description\" content=\"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/\" \/>\n<meta property=\"og:site_name\" content=\"CFA, FRM, and Actuarial Exams Study Notes\" \/>\n<meta 
property=\"article:published_time\" content=\"2021-03-09T16:26:41+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-06-11T10:47:30+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png\" \/>\n\t<meta property=\"og:image:width\" content=\"987\" \/>\n\t<meta property=\"og:image:height\" content=\"226\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Irene R\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Irene R\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/\"},\"author\":{\"name\":\"Irene R\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/#\\\/schema\\\/person\\\/7002f30d8f174958802c1c30b167eaf5\"},\"headline\":\"Feature Extraction, Selection, and Engineering of Textual 
Data\",\"datePublished\":\"2021-03-09T16:26:41+00:00\",\"dateModified\":\"2026-06-11T10:47:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/\"},\"wordCount\":1914,\"image\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/233-words-300x69.png\",\"keywords\":[\"and Engineering of Textual Data\",\"CFA-level-2\",\"Feature Extraction\",\"Quantitative Method\",\"Selection\"],\"articleSection\":[\"CFA Level II Study Notes\",\"Quantitative Method\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/\",\"url\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/\",\"name\":\"TF\u2013IDF Feature Extraction | CFA Level II 
Notes\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/233-words-300x69.png\",\"datePublished\":\"2021-03-09T16:26:41+00:00\",\"dateModified\":\"2026-06-11T10:47:30+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/#\\\/schema\\\/person\\\/7002f30d8f174958802c1c30b167eaf5\"},\"description\":\"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-level-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#primaryimage\",\"url\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/233-words.png\",\"contentUrl\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/233-words.png\",\"width\":987,\"height\":226},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/cfa-lev
el-2\\\/quantitative-method\\\/feature-extraction-selection-engineering-textual-data\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Feature Extraction, Selection, and Engineering of Textual Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/#website\",\"url\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/\",\"name\":\"CFA, FRM, and Actuarial Exams Study Notes\",\"description\":\"Question Bank and Study Notes for the CFA, FRM, and Actuarial exams\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/#\\\/schema\\\/person\\\/7002f30d8f174958802c1c30b167eaf5\",\"name\":\"Irene R\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g\",\"caption\":\"Irene R\"},\"url\":\"https:\\\/\\\/analystprep.com\\\/study-notes\\\/author\\\/irene\\\/\"}]}<\/script>\n<meta property=\"og:video\" content=\"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY\" \/>\n<meta property=\"og:video:type\" content=\"text\/html\" \/>\n<meta property=\"og:video:duration\" content=\"3468\" \/>\n<meta property=\"og:video:width\" 
content=\"480\" \/>\n<meta property=\"og:video:height\" content=\"270\" \/>\n<meta property=\"ya:ovs:adult\" content=\"false\" \/>\n<meta property=\"ya:ovs:upload_date\" content=\"2021-03-09T16:26:41+00:00\" \/>\n<meta property=\"ya:ovs:allow_embed\" content=\"true\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"TF\u2013IDF Feature Extraction | CFA Level II Notes","description":"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/","og_locale":"en_US","og_type":"article","og_title":"TF\u2013IDF Feature Extraction | CFA Level II Notes","og_description":"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.","og_url":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/","og_site_name":"CFA, FRM, and Actuarial Exams Study Notes","article_published_time":"2021-03-09T16:26:41+00:00","article_modified_time":"2026-06-11T10:47:30+00:00","og_image":[{"width":987,"height":226,"url":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png","type":"image\/png"}],"author":"Irene R","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Irene R","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#article","isPartOf":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/"},"author":{"name":"Irene R","@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5"},"headline":"Feature Extraction, Selection, and Engineering of Textual Data","datePublished":"2021-03-09T16:26:41+00:00","dateModified":"2026-06-11T10:47:30+00:00","mainEntityOfPage":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/"},"wordCount":1914,"image":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#primaryimage"},"thumbnailUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-300x69.png","keywords":["and Engineering of Textual Data","CFA-level-2","Feature Extraction","Quantitative Method","Selection"],"articleSection":["CFA Level II Study Notes","Quantitative Method"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/","url":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/","name":"TF\u2013IDF Feature Extraction | CFA Level II 
Notes","isPartOf":{"@id":"https:\/\/analystprep.com\/study-notes\/#website"},"primaryImageOfPage":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#primaryimage"},"image":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#primaryimage"},"thumbnailUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words-300x69.png","datePublished":"2021-03-09T16:26:41+00:00","dateModified":"2026-06-11T10:47:30+00:00","author":{"@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5"},"description":"Learn TF\u2013IDF feature extraction, document frequency, and CF candidate frequency for analyzing financial text and forecasting with CFA Level 2 techniques.","breadcrumb":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#primaryimage","url":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png","contentUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/233-words.png","width":987,"height":226},{"@type":"BreadcrumbList","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/feature-extraction-selection-engineering-textual-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/analystprep.com\/study-notes\/"},{"@type":"ListItem","positio
n":2,"name":"Feature Extraction, Selection, and Engineering of Textual Data"}]},{"@type":"WebSite","@id":"https:\/\/analystprep.com\/study-notes\/#website","url":"https:\/\/analystprep.com\/study-notes\/","name":"CFA, FRM, and Actuarial Exams Study Notes","description":"Question Bank and Study Notes for the CFA, FRM, and Actuarial exams","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/analystprep.com\/study-notes\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5","name":"Irene R","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g","caption":"Irene 
R"},"url":"https:\/\/analystprep.com\/study-notes\/author\/irene\/"}]},"og_video":"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY","og_video_type":"text\/html","og_video_duration":"3468","og_video_width":"480","og_video_height":"270","ya_ovs_adult":"false","ya_ovs_upload_date":"2021-03-09T16:26:41+00:00","ya_ovs_allow_embed":"true"},"_links":{"self":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts\/12352","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/comments?post=12352"}],"version-history":[{"count":55,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts\/12352\/revisions"}],"predecessor-version":[{"id":44019,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts\/12352\/revisions\/44019"}],"wp:attachment":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/media?parent=12352"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/categories?post=12352"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/tags?post=12352"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}