{"id":12162,"date":"2021-03-08T02:15:02","date_gmt":"2021-03-08T02:15:02","guid":{"rendered":"https:\/\/analystprep.com\/study-notes\/?p=12162"},"modified":"2026-03-18T07:46:36","modified_gmt":"2026-03-18T07:46:36","slug":"preparing-wrangling-data","status":"publish","type":"post","link":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/","title":{"rendered":"Preparing and Wrangling Data"},"content":{"rendered":"<p><script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"ImageObject\",\n  \"@id\": \"https:\/\/analystprep.com\/study-notes\/images\/preparing-wrangling-data-img-19\",\n  \"contentUrl\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-1536x1334.jpg\",\n  \"url\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-1536x1334.jpg\",\n  \"caption\": \"Preparing and Wrangling Data \u2014 Example Image 19\",\n  \"width\": 1536,\n  \"height\": 1334,\n  \"copyrightNotice\": \"\u00a9 2024 AnalystPrep\",\n  \"acquireLicensePage\": \"https:\/\/analystprep.com\/license-info\",\n  \"creditText\": \"AnalystPrep Design Team\",\n  \"creator\": {\n    \"@type\": \"Organization\",\n    \"name\": \"AnalystPrep\"\n  },\n  \"isPartOf\": {\n    \"@type\": \"WebPage\",\n    \"@id\": \"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\"\n  }\n}\n<\/script> <script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"ImageObject\",\n  \"@id\": \"https:\/\/analystprep.com\/study-notes\/images\/preparing-wrangling-data-img-28\",\n  \"contentUrl\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-scaled.jpg\",\n  \"url\": \"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-scaled.jpg\",\n  \"caption\": \"Preparing and Wrangling Data \u2014 Example Image 28\",\n  \"width\": 1269,\n 
 \"height\": 2048,\n  \"copyrightNotice\": \"\u00a9 2024 AnalystPrep\",\n  \"acquireLicensePage\": \"https:\/\/analystprep.com\/license-info\",\n  \"creditText\": \"AnalystPrep Design Team\",\n  \"creator\": {\n    \"@type\": \"Organization\",\n    \"name\": \"AnalystPrep\"\n  },\n  \"isPartOf\": {\n    \"@type\": \"WebPage\",\n    \"@id\": \"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\"\n  }\n}\n<\/script><\/p>\n<h3 id=\"mce_22\" class=\"editor-rich-text__tinymce mce-content-body\" data-is-placeholder-visible=\"false\"><iframe loading=\"lazy\" title=\"YouTube video player\" src=\"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY\" width=\"611\" height=\"344\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/h3>\n<p>Data preparation and wrangling is a crucial step that entails cleaning and organizing raw data into a consolidated format that allows for more convenient consumption of the data. Data collection precedes the data preparation and wrangling stage. Recall that before data collection begins, it is essential to state the problem, define the objectives, identify useful data points, and conceptualize the model. The relevant data is then collected by exploring and downloading raw data from different sources.<\/p>\n<p>Before delving into data preparation and wrangling, it is vital to differentiate the two forms of data that can be collected, i.e., structured and unstructured data. 
The figure below gives the different features of the two data forms.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-14925\" src=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\" alt=\"\" width=\"1590\" height=\"1381\" srcset=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg 1590w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-300x261.jpg 300w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-1024x889.jpg 1024w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-768x667.jpg 768w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-1536x1334.jpg 1536w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19-400x347.jpg 400w\" sizes=\"auto, (max-width: 1590px) 100vw, 1590px\" \/>The data preparation and wrangling stage involves two crucial tasks: cleansing and preprocessing.<\/p>\n<p><strong>Data Preparation (Cleansing):<\/strong> This refers to the process of inspecting, pinpointing, and reducing errors in raw data. Raw data can be invalid, inaccurate, incomplete, or duplicated, either because of mistakes during manual data entry or, for data recorded by a system, because of server failures or system bugs.<\/p>\n<p><strong>Data Wrangling (Preprocessing):<\/strong> After data is cleaned, it needs to be processed by managing the outliers, extracting useful variables from the existing data points, and scaling the features of the data. 
This prepares the data for model consumption.<\/p>\n<div style=\"text-align: center; margin: 25px 0;\"><a style=\"display: inline-flex; align-items: center; justify-content: center; padding: 10px 18px; border: 2px solid #1a73e8; border-radius: 999px; color: #1a73e8; text-decoration: none; font-weight: 500; background-color: #f5f9ff; white-space: nowrap;\" href=\"https:\/\/analystprep.com\" target=\"_blank\" rel=\"noopener\"> Apply data wrangling techniques with CFA Level II practice <\/a><\/div>\n<h2>Structured Data<\/h2>\n<h3>Data Preparation (Cleansing)<\/h3>\n<p>To establish possible errors associated with structured data, we look at the following sample of raw data obtained from a credit company.<\/p>\n<p>$$\\small{\\begin{array}{l|l|l|l|l|l|l|l} 1 &amp; \\textbf{ID} &amp; \\textbf{Name} &amp; \\textbf{Gender} &amp; \\textbf{Salary (\\$)} &amp; \\textbf{Loan Amount (\\$)} &amp; \\textbf{Loan Outcome} &amp; \\textbf{Loan Type}\\\\ \\hline 2 &amp; 1 &amp; \\text{Ms. Phi} &amp; \\text{F} &amp; 120,000 &amp; 48,000 &amp; \\text{No default} &amp; \\text{Car}\\\\ \\hline 3 &amp; 2 &amp; \\text{Mr. Psi} &amp; \\text{M} &amp; 100,000 &amp; \\text{Unknown} &amp; \\text{No default} &amp; \\text{Student loan}\\\\ \\hline 4 &amp; 3 &amp; \\text{Ms. Tau} &amp; \\text{F} &amp; (40,000) &amp; 16,000 &amp; \\text{No default} &amp; \\text{Mortgage}\\\\ \\hline 5 &amp; 4 &amp; \\text{Mr. Epsilon} &amp; \\text{F} &amp; 90,000 &amp; 36,000 &amp; \\text{Defaulted} &amp; \\text{Car}\\\\ \\hline 6 &amp; 5 &amp; \\text{Mr. Rho} &amp; \\text{M} &amp; 83,000 &amp; 33,200 &amp; \\text{No default} &amp; {}\\\\ \\hline 7 &amp; 6 &amp; \\text{Mr. Rho} &amp; \\text{M} &amp; 83,000 &amp; 33,200 &amp; \\text{No default} &amp; \\text{Mortgage}\\\\ \\hline 8 &amp; 7 &amp; \\text{Ms. Chi} &amp; \\text{F} &amp; 95,000 &amp; 38,000 &amp; \\text{No default} &amp; \\text{Mortgage}\\\\\u00a0 \\end{array}}$$<\/p>\n<p><strong>1. 
Incompleteness error:<\/strong> Data is missing because some entries are absent. The first record for Mr. Rho (ID 5) shows an incompleteness error because the loan type is missing. Missing values should be omitted or replaced with NA. The NA can then be substituted with a value such as the mean, median, or mode, or with 0.<\/p>\n<p><strong>2. Invalidity error:<\/strong> Arises when some data values fall outside a meaningful range. The data shown for Ms. Tau contains an invalidity error since salary cannot be negative. Invalid entries should be verified against other data records.<\/p>\n<p><strong>3. Inaccuracy error:<\/strong> Occurs when the data is not a measure of the actual value. The data for Mr. Psi indicates an unknown loan amount, yet the loan outcome shows that he did not default. The lender must know how much it lent to this particular borrower.<\/p>\n<p><strong>4. Inconsistency error:<\/strong> Arises when some of the data conflicts with the corresponding data points or with reality. The data for Mr. Epsilon is likely to be inconsistent, as the <em>Name<\/em> column contains a male title, but the <em>Gender<\/em> column indicates female. Clarifying the data with another source resolves this error.<\/p>\n<p><strong>5. Non-uniformity error:<\/strong> Emerges when the data is ambiguous or is not presented in an identical format. For example, the monetary unit for the salary and loan amount is $. This is ambiguous because the dollar symbol can represent the US dollar, Canadian dollar, or others. Converting the data points into a preferred standard format can resolve this error.<\/p>\n<p><strong>6. Duplication error:<\/strong> Arises from the repetition of identical data points. The two records for Mr. Rho (IDs 5 and 6) show a duplication error. This error can be resolved by removing the duplicates.<\/p>\n<h3>Data Wrangling (Preprocessing)<\/h3>\n<p>What follows after data cleaning is data preprocessing, which involves transforming and scaling data. 
These transformations may include:<\/p>\n<p><strong>1. Extraction: <\/strong>This entails extracting a new variable from an existing variable to simplify the analysis and to use it for ML model training.<\/p>\n<p><strong>2. Aggregation: <\/strong>Two or more variables can be combined into one variable to consolidate similar variables.<\/p>\n<p><strong>3. Filtration: <\/strong>Data rows not necessary for the analysis must be identified and filtered out.<\/p>\n<p><strong>4. Selection: <\/strong>The data columns that are intuitively not required for the analysis can be eliminated. For example, the <em>Name<\/em> column, in this case, is not required for training the ML model.<\/p>\n<p><strong>5. Conversion: <\/strong>The different variable types in the dataset must be converted into appropriate types so that they can be processed further and analyzed correctly. For example, <em>Name<\/em> and <em>Loan Type<\/em> are nominal, <em>Salary<\/em> and <em>Loan Amount<\/em> are continuous, and <em>Gender<\/em> and <em>Loan Outcome<\/em> are categorical with two classes.<\/p>\n<p>The next step is to identify outliers present in the data. For normally distributed data, a data value more than 3 standard deviations from the mean may be considered an outlier. The interquartile range (IQR) can also be used to identify outliers. Data values more than 1.5 IQR beyond the quartiles are considered outliers, while those more than 3 IQR beyond are extreme values.<\/p>\n<p><em><strong>Trimming\/Truncation<\/strong> <\/em>refers to removing extreme values and outliers from the dataset. For example, a 10% trimmed dataset is one for which the 10% highest and the 10% lowest values have been eliminated. On the other hand, winsorization refers to replacing extreme values and outliers with the maximum and the minimum values of data points that are not outliers.<\/p>\n<p><em><strong>Feature scaling<\/strong><\/em> entails shifting and rescaling the data so that features are on a similar scale. 
Scaling is performed after the outliers have been removed.<\/p>\n<h4>Scaling Techniques<\/h4>\n<p><strong>1. Normalization <\/strong>is the process of rescaling one or more attributes to the range of 0 to 1. It is sensitive to outliers and can be used when the distribution of the data is unknown. To normalize a random variable X:<\/p>\n<p><strong>Equation 1:<\/strong><\/p>\n<p>$$X_{i(normalized)}=\\frac{X_{i}-X_{Min}}{X_{Max}-X_{Min}}$$<\/p>\n<p><strong>2.\u00a0Standardization\u00a0<\/strong>typically means adjusting data to have a mean of 0 and a standard deviation of 1 (i.e., unit variance). It is relatively less sensitive to outliers than normalization because it depends on the mean and standard deviation of the data rather than on the minimum and maximum. Standardization assumes that the data is normally distributed.<\/p>\n<p><strong>Equation 2:<\/strong><\/p>\n<p>$$X_{i(Standardized)}=\\frac{X_{i}-\\mu}{\\sigma}$$<\/p>\n<p>Where:<\/p>\n<ul>\n<li data-tadv-p=\"keep\">\\(\\mu\\) = The mean of the feature \\(X\\), of which \\(X_{i}\\) is the \\(i\\)th observation; and<\/li>\n<li data-tadv-p=\"keep\">\\(\\sigma\\) = The standard deviation of the feature \\(X\\).<\/li>\n<\/ul>\n<h2>Unstructured (Text) Data<\/h2>\n<p>The objective of data preparation and wrangling of textual data is to transform the unstructured data into structured data. The output of these processes is a document term matrix that can be read by computers. The document term matrix is similar to a data table for structured data. The cleansing and preprocessing of unstructured text data into a structured format is called <strong>text processing.<\/strong> Preparing unstructured data is more challenging than preparing structured data. This section uses English-language text data for illustration.<\/p>\n<h3>Text Preparation\/Cleansing<\/h3>\n<p>This step involves removing unnecessary HTML tags, punctuation, and white spaces from the raw text. 
The cleansing process is as follows:<\/p>\n<p><em><strong>Step 1:<\/strong> Remove HTML tags.<\/em><\/p>\n<p>HTML tags that are not part of the actual text can be removed using a programming language or a regular expression (regex).<\/p>\n<p><em><strong>Step 2:<\/strong> Remove punctuation.<\/em><\/p>\n<p>Some punctuation marks, such as percentage signs and question marks, may be useful for ML model training. Therefore, when such punctuation is removed, annotations such as \/percentSign\/ and \/questionMark\/ should be added to maintain their grammatical meaning in the text. Regex is commonly applied to remove or replace punctuation.<\/p>\n<p><em><strong>Step 3:<\/strong> Remove numbers.<\/em><\/p>\n<p>Numbers present in the text should be removed or replaced by annotations such as \/number\/. This is critical because computers treat each number as a separate word, which complicates the analysis or adds noise.<\/p>\n<p><em><strong>Step 4:<\/strong> Remove white spaces.<\/em><\/p>\n<p>Extra spaces such as tabs, line breaks, and new lines should be identified and removed to keep the text intact and clean. The <em>stripWhitespace<\/em> function in R can be used to eliminate unnecessary white spaces from the text.<\/p>\n<h3>Text Wrangling (Preprocessing)<\/h3>\n<p>We begin this section by defining token and tokenization to understand text processing further. A token corresponds to a word, while tokenization is the process of breaking down a given text into separate tokens. For example, \u201cThis is good\u201d has 3 tokens, i.e., \u201cThis,\u201d \u201cis,\u201d and \u201cgood.\u201d<\/p>\n<p>Text data also require normalization, just like structured data. 
The normalization process in text processing entails the following:<\/p>\n<p><em><strong>Step 1:<\/strong> Lowercasing the alphabet <\/em>helps the computer treat identical words (e.g., \u201cThis\u201d and \u201cthis\u201d) as the same token.<\/p>\n<p><em><strong>Step 2:<\/strong> Stop words<\/em> such as \u201cthe,\u201d \u201cfor,\u201d and \u201care\u201d are usually removed to reduce the number of tokens in the training set used for ML training.<\/p>\n<p><em><strong>Step 3:<\/strong> Stemming<\/em> is the process of converting words to their base forms or stems using crude heuristic rules. For example, the stem of the words \u201cincreased\u201d and \u201cincreasing\u201d is \u201cincreas.\u201d Stemming addresses the problem that emerges when some words appear very infrequently in a textual dataset, which poses the risk of training highly complex models.<\/p>\n<p><em><strong>Step 4:<\/strong> Lemmatization<\/em> is similar to stemming except that it removes endings only if the base form is present in a dictionary. Lemmatization is much more computationally costly and advanced than stemming.<\/p>\n<p>What follows after text normalization is creating a <strong>bag-of-words (BOW)<\/strong>. A BOW is a collection of the distinct tokens from all the texts in a dataset. 
It does not, however, represent the word sequences or positions.<\/p>\n<p>Suppose we have a series of sentences:<\/p>\n<p>\u201cThis is good.\u201d<\/p>\n<p>\u201cThis is valuable.\u201d<\/p>\n<p>\u201cThis is fine.\u201d<\/p>\n<p>The following figure is a bag-of-words representation of the three sentences before and after the normalization process.<\/p>\n<p style=\"text-align: left;\" data-tadv-p=\"keep\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-14927\" src=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-scaled.jpg\" alt=\"BOW\" width=\"1269\" height=\"2048\" srcset=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-scaled.jpg 1269w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-186x300.jpg 186w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-634x1024.jpg 634w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-768x1240.jpg 768w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-951x1536.jpg 951w, https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_28-400x646.jpg 400w\" sizes=\"auto, (max-width: 1269px) 100vw, 1269px\" \/>The final BOW after normalization is then used to build a <strong><em>document term matrix (DTM)<\/em>.<\/strong> It is a matrix where each row belongs to a text file, and each column represents a token. The number of rows is equivalent to the number of text files in a sample text dataset. 
The number of columns is equal to the number of tokens in the BOW built using all the text files in the dataset. Each cell contains the number of times the corresponding token is present in each document.<\/p>\n<p>The following table shows a DTM constructed from the resultant BOW of the three sentences.<\/p>\n<h6 style=\"text-align: center;\">DTM of Three Sentences Using Normalized BOW, Filled with Counts of Occurrence<\/h6>\n<p>$$\\small{\\begin{array}{l|l|l|l} {}&amp;\\textbf{Good}&amp;\\textbf{Valu}&amp;\\textbf{Fine}\\\\ \\hline\\text{Sentence 1}&amp;1&amp;0&amp;0\\\\ \\hline\\text{Sentence 2}&amp;0&amp;1&amp;0\\\\ \\hline\\text{Sentence 3}&amp; 0&amp;0&amp;1\\\\\u00a0 \\end{array}}$$<\/p>\n<p>As mentioned earlier, a BOW does not represent the word sequences or positions, which limits its use for some advanced ML training applications. For example, if a text contains the word \u201cno,\u201d it may be treated as a single token and removed as a stop word during normalization. The negative meaning it carries would then be lost.<\/p>\n<p>N-grams are used to overcome this problem, as they represent word sequences. A two-word sequence is a bigram, a three-word sequence is a trigram, etc.<\/p>\n<blockquote>\n<h2>Question<\/h2>\n<p>A data scientist at a large investment corporation is discussing with her senior manager the steps involved in preprocessing raw text data. She tells her senior manager that the process can be accomplished in the following three steps:<\/p>\n<p><em><strong>Step 1:<\/strong><\/em> Cleanse the raw text data.<\/p>\n<p><em><strong>Step 2:<\/strong><\/em> Split the cleansed data into a collection of words for them to be normalized.<\/p>\n<p><em><strong>Step 3:<\/strong><\/em> Normalize the collection of words and create a well-defined set of tokens from the normalized words.<\/p>\n<p>The data scientist\u2019s step 2 is <em>most likely<\/em> to be:<\/p>\n<p>\u00a0 \u00a0 \u00a0 A. Lemmatization.<\/p>\n<p>\u00a0 \u00a0 \u00a0 B. Standardization.<\/p>\n<p>\u00a0 \u00a0 \u00a0 C. 
Tokenization.<\/p>\n<h3>Solution<\/h3>\n<p><em><strong>The correct answer is C.<\/strong><\/em><\/p>\n<p>Tokenization refers to the process of dividing a given text into separate tokens. This step takes place after cleansing the raw text data, i.e., removing HTML tags, numbers, and extra white spaces. The tokens are then normalized to create the bag-of-words (BOW).<\/p>\n<p><em><strong>A is incorrect.<\/strong><\/em>\u00a0Lemmatization is a text normalization technique that reduces inflected words to a root form while ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma is the dictionary form or citation form of a set of words. For example, the lemma of the words \u201canalyzed\u201d and \u201canalyzing\u201d is \u201canalyze.\u201d<\/p>\n<p><em><strong>B is incorrect.<\/strong>\u00a0Standardization <\/em>is a scaling technique that entails adjusting data to have a mean \\((\\mu)\\) of 0 and a standard deviation \\((\\sigma)\\) of 1 (i.e., unit variance).<\/p>\n<\/blockquote>\n<p>Reading 7: Big Data Projects<\/p>\n<p><em>LOS 7 (b) Describe objectives, steps, and examples of preparing and wrangling data<\/em><\/p>\n<div style=\"text-align: center; margin: 40px 0;\"><a style=\"display: inline-flex; align-items: center; justify-content: center; padding: 12px 20px; border-radius: 999px; background-color: #1a73e8; color: #ffffff; text-decoration: none; font-weight: 600;\" href=\"https:\/\/analystprep.com\" target=\"_blank\" rel=\"noopener\"> Start Free Trial \u2192 <\/a><\/p>\n<p style=\"font-size: 15px; margin-top: 12px; color: #555;\">Practice data cleansing, preprocessing, and feature transformations for structured datasets with exam-style scenarios.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Data preparation and wrangling is a crucial step that entails cleaning and organizing raw data in a consolidated format that allows for more convenient consumption of the data. 
Data collection precedes the data preparation and wrangling stage. Recall that before&#8230;<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[102,229],"tags":[216,263,230],"class_list":["post-12162","post","type-post","status-publish","format-standard","hentry","category-cfa-level-2","category-quantitative-method","tag-cfa-level-2","tag-preparing-and-wrangling-data","tag-quantitative-method","blog-post","no-post-thumbnail","animate"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Data Wrangling and Data Preparation<\/title>\n<meta name=\"description\" content=\"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning models.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Wrangling and Data Preparation\" \/>\n<meta property=\"og:description\" content=\"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning models.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\" \/>\n<meta property=\"og:site_name\" content=\"CFA, FRM, and Actuarial Exams Study Notes\" \/>\n<meta property=\"article:published_time\" content=\"2021-03-08T02:15:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-18T07:46:36+00:00\" \/>\n<meta 
property=\"og:image\" content=\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1590\" \/>\n\t<meta property=\"og:image:height\" content=\"1381\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Irene R\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Irene R\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\"},\"author\":{\"name\":\"Irene R\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5\"},\"headline\":\"Preparing and Wrangling Data\",\"datePublished\":\"2021-03-08T02:15:02+00:00\",\"dateModified\":\"2026-03-18T07:46:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\"},\"wordCount\":2122,\"image\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\",\"keywords\":[\"CFA-level-2\",\"Preparing and Wrangling Data\",\"Quantitative Method\"],\"articleSection\":[\"CFA Level II Study Notes\",\"Quantitative 
Method\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\",\"url\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\",\"name\":\"Data Wrangling and Data Preparation\",\"isPartOf\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\",\"datePublished\":\"2021-03-08T02:15:02+00:00\",\"dateModified\":\"2026-03-18T07:46:36+00:00\",\"author\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5\"},\"description\":\"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning 
models.\",\"breadcrumb\":{\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage\",\"url\":\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\",\"contentUrl\":\"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg\",\"width\":1590,\"height\":1381},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/analystprep.com\/study-notes\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Preparing and Wrangling Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/#website\",\"url\":\"https:\/\/analystprep.com\/study-notes\/\",\"name\":\"CFA, FRM, and Actuarial Exams Study Notes\",\"description\":\"Question Bank and Study Notes for the CFA, FRM, and Actuarial exams\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/analystprep.com\/study-notes\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5\",\"name\":\"Irene 
R\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g\",\"caption\":\"Irene R\"},\"url\":\"https:\/\/analystprep.com\/study-notes\/author\/irene\/\"}]}<\/script>\n<meta property=\"og:video\" content=\"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY\" \/>\n<meta property=\"og:video:type\" content=\"text\/html\" \/>\n<meta property=\"og:video:duration\" content=\"3468\" \/>\n<meta property=\"og:video:width\" content=\"480\" \/>\n<meta property=\"og:video:height\" content=\"270\" \/>\n<meta property=\"ya:ovs:adult\" content=\"false\" \/>\n<meta property=\"ya:ovs:upload_date\" content=\"2021-03-08T02:15:02+00:00\" \/>\n<meta property=\"ya:ovs:allow_embed\" content=\"true\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Data Wrangling and Data Preparation","description":"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning models.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/","og_locale":"en_US","og_type":"article","og_title":"Data Wrangling and Data Preparation","og_description":"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning models.","og_url":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/","og_site_name":"CFA, FRM, and Actuarial Exams Study Notes","article_published_time":"2021-03-08T02:15:02+00:00","article_modified_time":"2026-03-18T07:46:36+00:00","og_image":[{"width":1590,"height":1381,"url":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg","type":"image\/jpeg"}],"author":"Irene R","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Irene R","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#article","isPartOf":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/"},"author":{"name":"Irene R","@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5"},"headline":"Preparing and Wrangling Data","datePublished":"2021-03-08T02:15:02+00:00","dateModified":"2026-03-18T07:46:36+00:00","mainEntityOfPage":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/"},"wordCount":2122,"image":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage"},"thumbnailUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg","keywords":["CFA-level-2","Preparing and Wrangling Data","Quantitative Method"],"articleSection":["CFA Level II Study Notes","Quantitative Method"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/","url":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/","name":"Data Wrangling and Data 
Preparation","isPartOf":{"@id":"https:\/\/analystprep.com\/study-notes\/#website"},"primaryImageOfPage":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage"},"image":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage"},"thumbnailUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg","datePublished":"2021-03-08T02:15:02+00:00","dateModified":"2026-03-18T07:46:36+00:00","author":{"@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5"},"description":"Learn data wrangling techniques and how structured and unstructured data are prepared for analysis and machine learning models.","breadcrumb":{"@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#primaryimage","url":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg","contentUrl":"https:\/\/analystprep.com\/study-notes\/wp-content\/uploads\/2021\/03\/Img_19.jpg","width":1590,"height":1381},{"@type":"BreadcrumbList","@id":"https:\/\/analystprep.com\/study-notes\/cfa-level-2\/quantitative-method\/preparing-wrangling-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/analystprep.com\/study-notes\/"},{"@type":"ListItem","position":2,"name":"Preparing and Wrangling Data"}]},{"@type":"WebSite","@id":"https:\/\/analystprep.com\/study-notes\/#website","url":"https:\/\/analystprep.com\/study-notes\/","name":"CFA, FRM, and Actuarial Exams 
Study Notes","description":"Question Bank and Study Notes for the CFA, FRM, and Actuarial exams","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/analystprep.com\/study-notes\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/7002f30d8f174958802c1c30b167eaf5","name":"Irene R","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/analystprep.com\/study-notes\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/33caf1e1bcb63ee970b36351f165c7bc714b19614993ab9c2c8bf36273b7df48?s=96&d=mm&r=g","caption":"Irene R"},"url":"https:\/\/analystprep.com\/study-notes\/author\/irene\/"}]},"og_video":"https:\/\/www.youtube.com\/embed\/ifHmwpgHWYY","og_video_type":"text\/html","og_video_duration":"3468","og_video_width":"480","og_video_height":"270","ya_ovs_adult":"false","ya_ovs_upload_date":"2021-03-08T02:15:02+00:00","ya_ovs_allow_embed":"true"},"_links":{"self":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts\/12162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/comments?post=12162"}],"version-history":[{"count":36,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/posts\/12162\/revisions"}],"predecessor-version":[{"id":42786,"href":"https:\/\/analystprep.com\/study-notes\/wp-j
son\/wp\/v2\/posts\/12162\/revisions\/42786"}],"wp:attachment":[{"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/media?parent=12162"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/categories?post=12162"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/analystprep.com\/study-notes\/wp-json\/wp\/v2\/tags?post=12162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}