Class 10 AI Data Science Notes
What is Data Science:
Data Science is an important field that brings together statistics, data analysis, machine learning, and deep learning. It helps us understand and analyse real-world scenarios and take fruitful decisions.
Data Science is not a single field; it uses concepts and principles of mathematics, statistics, computer science, and information science. It is capable of discovering hidden patterns in raw data and can also be used for making predictions. Observe the following picture, which differentiates between data analysis and data science.
Earlier, data processing was quite easy because data was limited and structured, so it could be analysed easily and effectively. Nowadays, more than 80% of data is unstructured, and traditional methods cannot work appropriately with unstructured data.
In addition to this, the number of internet users is increasing day by day, which in turn increases the amount of unstructured data. The unstructured data collected by various organisations through mobile apps, websites and other platforms can be used to serve the specific requirements of customers and users. This increases the demand for data science.
Now before we get into the concept of Data Sciences, let us experience this domain with the help of the following game:
Rock Paper and Scissors: Rock, Paper, Scissors | Afiniti (rockpaperscissors.ai)
Go to this link and try to play the game of Rock, Paper and Scissors against an AI model.
The challenge here is to win 20 games against the AI before the AI wins 20 against you.
Answer the following questions after playing the game:
Did you manage to win?
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
What was the strategy that you applied to win this game against the AI machine?
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
Was it different playing Rock, Paper & Scissors with an AI machine as compared to a human?
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
What approach was the machine following while playing against you?
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
…………………………………………………..
Applications of Data Science:
Data Science is not a new field. Data Sciences majorly work around analysing data, and when it comes to AI, this analysis helps in making the machine intelligent enough to perform tasks by itself. There exist various applications of Data Science in today’s world. Some of them are:
Fraud and Risk Detection: The earliest applications of data science were in finance. Companies were fed up with bad debts and losses every year. However, they had a lot of data which used to get collected during the initial paperwork while sanctioning loans. They decided to bring in data scientists in order to rescue them from losses.
Over the years, banking companies learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyse the probabilities of risk and default. Moreover, it also helped them to push their banking products based on customer’s purchasing power.
Genetics & Genomics: Data Science applications also enable an advanced level of treatment personalisation through research in genetics and genomics. The goal is to understand the impact of the DNA on our health and find individual biological connections between genetics, diseases, and drug response.
Data science techniques allow integration of different kinds of data with genomic data in disease research, which provides a deeper understanding of genetic issues in reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data, we will achieve a deeper understanding of the human DNA. The advanced genetic risk prediction will be a major step towards more individual care.
Internet Search: When we talk about search engines, we think ‘Google’. Right?
But there are many other search engines like Yahoo, Bing, Ask, AOL, and so on. All these search engines (including Google) make use of data science algorithms to deliver the best result for our searched query in a fraction of a second. Considering the fact that Google processes more than 20 petabytes of data every day, had there been no data science, Google wouldn’t have been the ‘Google’ we know today.
Targeted Advertising: If you thought Search would have been the biggest of all data science applications, here is a challenger – the entire digital marketing spectrum. Starting from the display banners on various websites to the digital billboards at the airports, almost all of them are decided by using data science algorithms. This is the reason why digital ads have been able to get a much higher CTR (Click-Through Rate) than traditional advertisements. They can be targeted based on a user’s past behaviour.
Website Recommendations: Aren’t we all used to the suggestions about similar products on Amazon? They not only help us find relevant products from the billions of products available with them but also add a lot to the user experience.
A lot of companies have fervidly used this engine to promote their products in accordance with the user’s interest and relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many more use this system to improve the user experience. The recommendations are made based on previous search results for a user.
Airline Route Planning: The Airline Industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratio and operating profits. With high rise in air-fuel prices and the need to offer heavy discounts to customers, the situation has got worse. It wasn’t long before airline companies started using Data Science to identify the strategic areas of improvements. Now, while using Data Science, the airline companies can:
- Predict flight delays
- Decide which class of airplanes to buy
- Decide whether to fly directly to the destination or take a halt in between (for example, a flight can have a direct route from New Delhi to New York, or it can choose to halt in some other country on the way)
- Effectively drive customer loyalty programs
Revisiting AI Project Cycle: Data Sciences is a combination of Python and Mathematical concepts like Statistics, Data Analysis, probability, etc. Concepts of Data Science can be used in developing applications around AI as it gives a strong base for data analysis in Python.
But, before we get deeper into data analysis, let us recall how Data Sciences can be leveraged to solve some of the pressing problems around us. For this, let us understand the AI project cycle framework around Data Sciences with the help of an example.
Humans are inherently social creatures, constantly engaging in various social gatherings. One of the most enjoyable activities is dining out with friends and family, which is why restaurants can be found almost everywhere. To cater to their customers’ preferences, many restaurants offer buffet options, providing a wide variety of food items.
Whether it’s a small eatery or a large establishment, every restaurant prepares food in large quantities, anticipating a significant number of patrons. However, at the end of the day, a considerable amount of food often goes to waste, as restaurants are unwilling to serve stale food to their customers the following day. Consequently, they prepare food in bulk daily, taking into account the expected number of customers.
Unfortunately, if these expectations are not met, a substantial amount of food is discarded, resulting in financial loss for the restaurant. They are left with no choice but to either dispose of the excess food or offer it to those in need for free. When considering the cumulative daily losses over the course of a year, the financial impact becomes quite significant.
Now that we have understood the scenario well, let us take a deeper look into the problem to find out more about the various factors around it.
Stage 1: Problem Scoping
Let us fill up the 4Ws problem canvas to find out.
Who Canvas – Who is having the problem?
| Who are the stakeholders? | Restaurants offering buffets; Restaurant chefs |
| What do we know about them? | Restaurants cook food in bulk every day for their buffets to meet their customers’ needs. They estimate the number of customers that would walk into their restaurant every day. |
What Canvas – What is the nature of their problem?
| What is the problem? | Quite a large amount of food is left over unconsumed at the restaurant every day, which is either thrown away or given for free to needy people. Restaurants have to bear everyday losses for this unconsumed food. |
| How do you know it is a problem? | Restaurant surveys have shown that restaurants face this problem of leftover food. |
Where Canvas – Where does the problem arise?
| What is the context/situation in which the stakeholders experience this problem? | Restaurants which serve buffet food, at the end of the day when no further food consumption is possible |
Why? – Why do you think it is a problem worth solving?
| What would be of key value to the stakeholders? | If the restaurant has a proper estimate of the quantity of food to be prepared every day, the food wastage can be reduced. Less or no food would be left unconsumed. |
| How would it improve their situation? | Losses due to unconsumed food would reduce considerably. |
Now that we have noted down all the factors around our problem, let us fill up the problem statement template.
| Our | Restaurant owners | Who? |
| Have a problem of | Losses due to food wastage | What? |
| While | The food is left unconsumed due to improper estimation | Where? |
| An ideal solution would be | To be able to predict the amount of food to be prepared for every day’s consumption | Why? |
The Problem statement template leads us towards the goal of our project which can now be stated as:
“To be able to predict the quantity of food dishes to be prepared for everyday consumption in restaurant buffets.”
Stage 2: Data Acquisition
Once the project goal has been finalised, it is essential to examine the different data features that have an impact on the problem at hand. As AI-based projects rely on data for testing and training, it is crucial to comprehend the type of data that needs to be gathered in order to progress towards the goal. In our specific scenario, there are several factors that influence the quantity of food to be prepared for buffet consumption the following day.
Let’s explore the correlation between these factors and our issue at hand. Utilising the System Maps tool will help us analyse how elements are interconnected with the project’s objective. Below is the System map depicting our problem statement.
In this system map, the relationship between each element is clearly defined in order to achieve the objectives of our project. It is important to note that positive arrows indicate a direct relationship between elements, while negative arrows represent an inverse relationship between elements.
After analysing the factors that impact our problem statement, it is now necessary to examine the data that needs to be obtained in order to achieve our goal. In this particular case, a dataset has been created for each dish prepared by the restaurant over a 30-day period, encompassing all the aforementioned elements. This dataset is collected offline through a regular survey, as it is tailored specifically to the needs of this particular restaurant.
The data collected falls into the following categories: dish name, dish price, daily dish production quantity, daily quantity of unconsumed dishes, total number of customers per day, and fixed number of customers per day.
Stage 3: Data Exploration
Once the database has been created, it is essential to analyse the gathered data and comprehend the necessary information. In the context of our project, which aims to forecast the amount of food to be prepared for the upcoming day, we require, for each dish, the quantity prepared per day and the quantity left unconsumed per day from the collected dataset.
Hence, we extract the necessary data from the carefully curated dataset and ensure its cleanliness by eliminating any errors or missing components.
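To get a feel for this step, here is a minimal, illustrative sketch of such cleaning using the Pandas library (introduced later in this chapter). The file name and column names are assumptions made only for demonstration, not the actual restaurant dataset:

```python
import pandas as pd

# Load the hypothetical 30-day buffet dataset collected through the survey.
df = pd.read_csv("buffet_data.csv")

# Remove rows with missing (NaN) values so that errors do not affect training.
df = df.dropna()

# Keep only the columns needed for the prediction goal (names are illustrative).
df = df[["dish_name", "quantity_prepared", "quantity_unconsumed"]]

print(df.head())
```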
Stage 4: Data Modelling
Once the dataset has been prepared, the model is trained using it. Specifically, a regression model is selected, with the dataset being input as a dataframe and trained accordingly. Regression is a type of Supervised Learning model that deals with continuous data values over a specific time frame.
Given that our dataset consists of continuous data spanning 30 days, utilising a regression model allows for predicting subsequent values in a similar manner. The 30-day dataset is split into a 2:1 ratio for training and testing purposes. Initially, the model is trained on the first 20 days of data, followed by evaluation on the remaining 10 days.
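The notes do not fix a particular library for this stage, but a regression model of this kind could be sketched with scikit-learn as follows. The file name, column names and target column are assumptions made purely for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataframe with one row per day for a single dish.
df = pd.read_csv("buffet_data.csv")
X = df[["quantity_prepared", "quantity_unconsumed"]]  # input features
y = df["quantity_needed_next_day"]                    # value to be predicted

# 2:1 split of the 30-day data: first 20 days for training, last 10 for testing.
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

model = LinearRegression()
model.fit(X_train, y_train)          # train on the first 20 days
predictions = model.predict(X_test)  # predict for the remaining 10 days
```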
Stage 5: Data Evaluation
After undergoing training on a 20 -day dataset, it is now imperative to assess the model’s functionality. Let us observe the model’s performance and evaluate its testing procedures.
Step-1
The trained model is fed data regarding the name of the dish and the quantity produced for the same.
Step-2
It is then fed data regarding the quantity of food left unconsumed for the same dish on previous occasions.
Step-3
The model then works upon the entries according to the training it got at the modelling stage.
Step-4
The Model predicts the quantity of food to be prepared for the next day.
Step-5
The prediction is compared to the testing dataset value. From the testing dataset, ideally, we can say that the quantity of food to be produced for the next day’s consumption should be the total quantity minus the unconsumed quantity.
Step-6
The model is tested for 10 testing datasets kept aside while training.
Step-7
Prediction values of the testing dataset are compared to the actual values.
Step-8
If the prediction values are the same as or very close to the actual values, the model is said to be accurate. Otherwise, either the model selection is changed or the model is trained on more data for better accuracy.
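Continuing the illustrative sketch from the modelling stage, the comparison between predicted and actual values can be summarised with a simple error measure. This is only one possible way of checking accuracy, not a method prescribed by the notes:

```python
from sklearn.metrics import mean_absolute_error

# Compare the predictions for the 10 test days with the actual values.
mae = mean_absolute_error(y_test, predictions)
print("Average prediction error (in servings):", mae)

# A small error suggests the model is accurate enough; a large error means we
# should change the model or train it on more data, as described above.
```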
Once the model has reached its peak efficiency, it is prepared for deployment within the restaurant to be utilised in real-time.
Data Collection: Data collection has been a part of our society for a long time, even before the advent of advanced technology. People have always found ways to maintain records and keep track of relevant information, even without a deep understanding of calculations. While data collection itself doesn’t require much technological knowledge, analysing the collected data can be a challenging task for humans, especially when dealing with numbers and alphanumeric data.
This is where Data Science comes into play. Data Science not only helps us gain a better understanding of the dataset, but also provides valuable insights and analysis. With the incorporation of AI, machines are now capable of making predictions and suggestions based on the data. Now that we have explored a Data Science project example, we have a clearer understanding of the types of data that can be used in such projects.
Data for domain-based projects is typically in numerical or alphanumeric format and is organised in tables. These databases are commonly found in various institutions for record keeping and other purposes. You may already be familiar with some examples of datasets used in these projects.
Now look around you and find out what are the different types of databases which are maintained in the places mentioned below. Try surveying people who are responsible for the designated places to get a better idea.
It is evident that the data types mentioned earlier are presented in table format, consisting of numeric or alpha-numeric information. However, a crucial question arises: is this data accessible to everyone? Should these databases be open to all individuals? What are the different origins of data that contribute to the creation of such databases? Let us explore these questions further!
Source of Data: There are various sources for data collection found nowadays in the market. The major kinds of sources for data collection are:
- Online
- Offline
| Online Sources | Offline Sources |
| Open-sourced web portals run by the Government | Sensors |
| Reliable private websites such as Kaggle | Surveys |
| Open-sourced websites of World Organisations | Interviews |
| | Observations |
Online Sources: Online sources provide data collection facilities through various websites, portals and apps. Users need to browse the web portal or download the app and follow the instructions. This method is not as popular as offline sources right now, but it may become popular in the future.
Offline Sources: Offline sources are often more effective and useful for data collection, as they give a clearer picture for decision making. Here are a few ways of collecting data offline.
Sensors: They are IoT-based devices which collect data from the physical world and transform it into digital form. They are connected through gateways to relay the data into the cloud and server.
Surveys: Surveys can be conducted using different questionnaires. This is the most popular method for collecting a large amount of data, though it should be handled carefully. Surveys are less expensive and easy to process, and are mostly conducted using forms, which can be online or offline.
Interviews: Interviews are among the best and most popular ways of data collection. A list of questions is prepared to conduct interviews and collect data. It is one of the primary collection methods, but also the most expensive. Interviews can also be conducted over the phone or through a web chat interface.
Observations: This includes collecting information without asking questions. It requires researchers and observers to apply their judgement to the data. Observation can capture the dynamics of a situation that cannot be measured through other data collection techniques, and it can be combined with additional information such as video.
The following point should be remembered while accessing data from any data sources:
- Data which is available for public usage only should be taken up.
- Personal datasets should only be used with the consent of the owner.
- One should never breach someone’s privacy to collect data.
- Data should only be taken from reliable sources as the data collected from random sources can be wrong or unusable.
- Reliable sources of data ensure the authenticity of data which helps in the proper training of the AI model.
Types of Data: For data science models or projects, generally, data is collected in the form of tables in different formats:
CSV: CSV (Comma-Separated Values) is a common and simple file format used to store data in tabular form. It can be opened with any spreadsheet software (MS Excel), documentation software (MS Word) or text editor (Notepad). Every line contains one record, each record has a number of fields, and these fields are separated by commas.
Spreadsheet: A spreadsheet contains rows and columns to represent data in tabular form. Spreadsheets are mostly used to calculate, manipulate, analyse and maintain data records. MS Excel is a well-known and popular spreadsheet software.
SQL: It stands for Structured Query Language. It is used to handle the data stored in a DBMS (Database Management System). It provides basic commands to create, alter, delete and manage transactions for database management.
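As a small illustration, Python’s built-in csv module can read such comma-separated records directly. The file name and its contents here are hypothetical:

```python
import csv

# Each line of a CSV file is one record; fields are separated by commas.
# "dishes.csv" might contain lines such as:  Paneer Tikka,250,40,5
with open("dishes.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)   # e.g. ['Paneer Tikka', '250', '40', '5']
```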
Data Access: Once data has been collected from different sources, it needs to be used for different purposes, so data access becomes a key factor.
There are a few Python modules and libraries which are very useful for data access. They are:
NumPy: NumPy, short for Numerical Python, serves as the essential library for performing mathematical and logical operations on arrays within Python. It is widely utilised for numerical computations. NumPy offers a broad spectrum of arithmetic operations on arrays, simplifying the process of working with numerical data. Additionally, NumPy supports arrays, which are essentially homogeneous collections of data.
An array represents a group of values that share the same data type. These values can be integers, characters, booleans, and so forth, but an array can only store one data type. In NumPy, arrays are referred to as NDarrays (N-Dimensional Arrays) due to the capability of creating arrays with multiple dimensions in Python. Comparatively, an array can be likened to a list. Let’s delve into the distinctions between the two:
| NumPy Arrays | Lists |
| 1. Homogeneous collection of data. | 1. Heterogeneous collection of data. |
| 2. Can contain only one type of data, hence not flexible with datatypes. | 2. Can contain multiple types of data, hence flexible with datatypes. |
| 3. Cannot be directly initialised; can be operated on only with the NumPy package. | 3. Can be directly initialised as they are a part of Python syntax. |
| 4. Direct numerical operations can be done. For example, dividing the whole array by a number divides every element by that number. | 4. Direct numerical operations are not possible. For example, dividing the whole list by a number does not divide every element by that number. |
| 5. Widely used for arithmetic operations. | 5. Widely used for data management. |
| 6. Arrays take less memory space. | 6. Lists take more memory space. |
| 7. Functions like concatenation, appending, reshaping, etc., are not trivially possible with arrays. | 7. Functions like concatenation, appending, reshaping, etc., are trivially possible with lists. |
| 8. Example: To create a NumPy array ‘A’: import numpy; A = numpy.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0]) | 8. Example: To create a list ‘A’: A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0] |
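A short sketch showing this difference in behaviour between a NumPy array and a list (the values used are arbitrary):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])   # a NumPy array (homogeneous)
L = [1, 2, 3, 4, 5]             # a plain Python list

print(A / 5)     # divides every element: [0.2 0.4 0.6 0.8 1. ]
# print(L / 5)   # would raise a TypeError: lists do not support this
print(L + [6])   # lists support easy concatenation: [1, 2, 3, 4, 5, 6]
```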
Pandas: Pandas is a Python software library designed for data manipulation and analysis. It provides data structures and functions for manipulating numerical tables and time series. The name “Pandas” is derived from “panel data,” which refers to data sets that contain observations for the same individuals over multiple time periods. Pandas is highly versatile and can handle various types of data effectively.
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational/statistical data sets. The data actually need not be labelled at all to be placed into a Pandas data structure.
Pandas, a library in Python, offers two main data structures: Series, which is 1-dimensional, and DataFrame, which is 2-dimensional. These data structures are designed to handle a wide range of use cases in fields such as finance, statistics, social science, and engineering. Pandas is built on top of NumPy and is specifically designed to seamlessly integrate with other popular scientific computing libraries.
Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
- Size mutability: columns can be inserted and deleted from DataFrame and higher-dimensional objects.
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations.
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
- Intuitive merging and joining of data sets.
- Flexible reshaping and pivoting of data sets.
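Here is a small, illustrative sketch of some of these features; the DataFrame and its column names are invented only for demonstration:

```python
import pandas as pd
import numpy as np

# A tiny DataFrame with one missing value (NaN).
data = {
    "dish_name": ["Paneer Tikka", "Dal Makhani", "Veg Biryani"],
    "quantity_prepared": [40, 60, np.nan],
    "quantity_unconsumed": [5, 12, 8],
}
df = pd.DataFrame(data)

print(df.describe())   # quick statistical summary of the numeric columns
print(df.dropna())     # drop rows containing missing data
print(df.fillna(0))    # or fill the missing values with 0
```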
Matplotlib: Matplotlib serves as an exceptional visualisation tool in Python for creating 2D plots from arrays. It is a versatile data visualisation library that is compatible with various platforms and relies on NumPy arrays. Visualising data offers a significant advantage by providing a clear representation of large datasets in a user-friendly manner.
Matplotlib offers a diverse range of plot options, which aid in interpreting trends, identifying patterns, and establishing correlations. These plots are essential tools for analysing quantitative data. Below are examples of the types of graphs that can be generated using this library:
You have the ability to not only create plots, but also customise them according to your preferences. By stylising your plots, you can enhance their descriptive and communicative qualities. These tools assist us in both accessing and exploring datasets, enabling us to gain a deeper insight into the data.
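As a quick illustration, a basic 2D plot needs only a few lines of Matplotlib code; the data used here is hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical data: customers visiting a restaurant over one week.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
customers = [45, 50, 40, 55, 80, 120, 110]

plt.plot(days, customers, marker="o", color="green")  # simple 2D line plot
plt.title("Customers per day")
plt.xlabel("Day of the week")
plt.ylabel("Number of customers")
plt.show()
```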
Basic Statistical Learning with Python: It is well known that Data Sciences revolves around the analysis of data and carrying out related tasks. Mathematics plays a crucial role in analysing both numeric and alphanumeric data in this field. Python also incorporates basic statistical methods from mathematics for the analysis and manipulation of datasets. Some commonly used statistical tools in Python include:
- Mean
- Median
- Mode
- Standard Deviation
- Variance
Mean: The average value of a sequence.
Median: 50th percentile value of a sequence.
Mode: Most frequent value of the sequence.
Standard Deviation: Measures the spread of the sequence around its average value.
Variance: Average of the squared differences from the mean.
In Python, these can be calculated with the built-in statistics module:
- Mean (arithmetic mean): statistics.mean()
- Median: statistics.median(), statistics.median_low(), statistics.median_high()
- Mode: statistics.mode(), statistics.multimode()
- Standard deviation: statistics.pstdev() (population), statistics.stdev() (sample)
- Variance: statistics.pvariance() (population), statistics.variance() (sample)
The advantage of using Python packages is that we do not need to work out our own formula or equation to find the results. There exist a lot of pre-defined functions in packages like NumPy and the statistics module which reduce this trouble for us. All we need to do is call that function and pass the data to it. It’s that simple!
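For example, the built-in statistics module can be used as shown below; the marks are arbitrary sample values chosen for illustration:

```python
import statistics

marks = [35, 40, 40, 45, 50, 55, 60]   # a small sample sequence

print(statistics.mean(marks))      # average value
print(statistics.median(marks))    # 50th percentile value
print(statistics.mode(marks))      # most frequent value
print(statistics.stdev(marks))     # sample standard deviation
print(statistics.variance(marks))  # sample variance
```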
Data Visualisation: During the process of data collection, it is possible to encounter errors within the data. Let us begin by examining the various types of issues that may arise with the data:
1. Erroneous Data: There are two ways in which the data can be considered erroneous:
Incorrect values: This refers to instances where the values within the dataset are incorrect at random positions. For example, a decimal value may be present in the phone number column or a name may be mentioned in the marks column. These incorrect values do not align with the expected data for that particular position.
Invalid or Null values: In certain cases, the values within the dataset become corrupted and therefore become invalid. NaN values are often encountered, which represent null values that hold no meaning and cannot be processed. Consequently, these values are removed from the database whenever they are encountered.
2. Missing Data: Some datasets may contain empty cells where the values are missing, leaving those cells blank. Missing data should not be interpreted as erroneous, since the values are simply absent rather than incorrect.
3. Outliers: Outliers refer to data points that fall outside the range of a specific element. To illustrate this, let us consider the example of student marks in a class. Suppose a student was absent for exams and received a score of 0. If this score is included in the calculation of the class average, it would significantly lower the overall average.
To address this, the average is calculated for the range of marks from highest to lowest, while treating this particular result separately. This ensures that the average marks of the class accurately reflect the data.
Analysing the collected data can be challenging, as it primarily consists of tables and numbers. While machines excel at processing numbers efficiently, humans often require visual aids to comprehend and interpret the information effectively.
Therefore, data visualisation techniques are employed to interpret the collected data, identify patterns, and uncover trends. Matplotlib, a Python package, is essential for visualising data and deriving insights from it.
As previously mentioned, this package enables us to create a wide range of graphs for data representation. Let’s delve into a few examples.
Scatter Plot:
Scatter plots are utilised for representing discontinuous data, which refers to data lacking a continuous flow. Discontinuity is introduced by gaps in the data. A two-dimensional scatter plot can effectively showcase information for up to four parameters.
In this scatter plot, 2 axes (X and Y) are two different parameters. The colour of circles and the size both represent 2 different parameters. Thus, just through one coordinate on the graph, one can visualise 4 different parameters all at once.
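A minimal sketch of such a scatter plot in Matplotlib, with made-up values for all four parameters:

```python
import matplotlib.pyplot as plt

# Each point carries four parameters: X, Y, circle size and colour value.
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
sizes = [50, 120, 80, 200, 150]   # third parameter shown as circle size
colours = [3, 7, 1, 9, 5]         # fourth parameter shown as colour

plt.scatter(x, y, s=sizes, c=colours)
plt.colorbar(label="fourth parameter")
plt.title("Scatter plot showing four parameters")
plt.show()
```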
Bar Chart:
Bar charts are widely utilised graphical tools that find application across various fields, from students to scientists. They serve as a simple yet informative means of visual representation.
Different variations of bar charts, such as single bar charts and double bar charts, cater to diverse data presentation needs.
This is an example of a double bar chart. The 2 axes depict two different parameters while bars of different colours work with different entities (in this case it is women and men). Bar chart also works on discontinuous data and is made at uniform intervals.
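A possible sketch of a double bar chart in Matplotlib, with hypothetical values for the two entities:

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ["Group A", "Group B", "Group C", "Group D"]
men = [20, 35, 30, 27]      # hypothetical values for entity 1
women = [25, 32, 34, 20]    # hypothetical values for entity 2

x = np.arange(len(groups))  # uniform positions for the groups
width = 0.35                # width of each bar

plt.bar(x - width / 2, men, width, label="Men")
plt.bar(x + width / 2, women, width, label="Women")
plt.xticks(x, groups)
plt.legend()
plt.title("Double bar chart")
plt.show()
```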
Histogram:
Histograms provide a precise depiction of continuous data. When focusing on illustrating the fluctuations in a single variable over a specific time frame, histograms are utilised.
They showcase the distribution of the variable across various time intervals through the use of bins.
In the given example, the histogram shows the variation in frequency of the plotted entity on the X–Y plane. On the left, the frequency of the element has been plotted as a frequency map, where the colours show the transition from low to high and vice versa. On the right, a continuous dataset has been plotted, which might not be describing the frequency of occurrence of the element.
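A simple histogram can be sketched in Matplotlib as shown below; the marks data is hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical continuous data: marks of 20 students.
marks = [35, 42, 48, 51, 55, 58, 60, 61, 63, 65,
         66, 68, 70, 72, 75, 78, 80, 84, 88, 95]

plt.hist(marks, bins=6, color="skyblue", edgecolor="black")  # 6 bins/intervals
plt.title("Distribution of marks")
plt.xlabel("Marks")
plt.ylabel("Number of students")
plt.show()
```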
Box Plot:
Box plots, also referred to as box and whiskers plots, provide a convenient way to visually represent the distribution of data across its range by dividing it into percentiles. These plots effectively display the data’s quartiles, making it easier to analyse and understand its distribution.
Here as we can see, the plot contains a box and two lines at its left and right are termed as whiskers. The plot has 5 different parts to it:
The first quartile represents data from the 0th percentile to the 25th percentile, where the whisker length depends on the range covered by the data. A smaller range results in a shorter whisker, while a larger range leads to a longer whisker.
The second quartile, from the 25th percentile to the 50th percentile, is plotted inside the box as it has minimal deviation from the mean.
The third quartile, from the 50th percentile to the 75th percentile, is also plotted inside the box due to its low deviation from the mean. The second and third quartiles together form the Inter Quartile Range (IQR). The length of the box varies based on the spread of the data distribution.
The fourth quartile, from the 75th percentile to the 100th percentile, represents the top 25 percentile data. Box plots are useful in identifying outliers, which are data points outside the range and are visualised as dots or circles on the graph.
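A minimal box plot sketch in Matplotlib, using hypothetical marks that include one outlier (an absentee’s score of 0):

```python
import matplotlib.pyplot as plt

marks = [0, 45, 48, 52, 55, 58, 60, 62, 65, 70, 75, 80]  # 0 is the outlier

plt.boxplot(marks, vert=False)   # the box shows the IQR, the whiskers the spread,
plt.title("Box plot of marks")   # and the outlier appears as a separate circle
plt.show()
```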
Let us now move ahead and experience data visualisation using Jupyter notebook. Matplotlib library will help us in plotting all sorts of graphs while Numpy and Pandas will help us in analysing the data.
Data Sciences: In this section, we would be looking at one of the classification models used in Data Sciences. But before we look into the technicalities of the code, let us play a game.
Personality Prediction:
Step 1: Here is a map. Take a good look at it. In this map you can see the arrows determine a quality. The qualities mentioned are:
1. Positive X-axis: People focussed: You focus more on people and try to deliver the best experience to them.
2. Negative X-axis: Task focussed: You focus more on the task which is to be accomplished and try to do your best to achieve that.
3. Positive Y-axis: Passive: You focus more on listening to people and understanding everything that they say without interruption.
4. Negative Y-axis: Active: You actively participate in the discussions and make sure that you make your point in front of the crowd.
Think for a minute and understand which of these qualities you have in you. Now, take a chit and write your name on it. Place this chit at a point in this map which best describes you. It can be placed anywhere on the graph. Be honest about yourself and put it on the graph.
Step 2: Now that you have all put up your chits on the graph, it’s time to take a quick quiz. Go to this link and finish the quiz on it individually: What Animal are YOU? (DiSC) – Personality Quiz (uquiz.com)
On this link, you will find a personality prediction quiz. Take this quiz individually and try to answer all the questions honestly. Do not take anyone’s help in it and do not discuss about it with anyone. Once the quiz is finished, remember the animal which has been predicted for you. Write it somewhere and do not show it to anyone. Keep it as your little secret.
Once everyone has gone through the quiz, go back to the board, remove your chit, and draw the symbol which corresponds to your animal in place of your chit. Here are the symbols:
Place the symbols where you have placed your names. Instruct 4 students not to do so and advise them to keep their animals a secret. Ensure that their name tags are on the graph so we can anticipate their animals using this map. We will now apply the nearest neighbor algorithm and attempt to forecast the potential animal(s) for these 4 unknowns. Examine these 4 tags individually.
Which animal appears most frequently in their proximity? If the lion symbol is most common near their tag, do you believe there is a high likelihood that their animal is also a lion? Let’s now make an educated guess for the animal for each of the 4 students based on their nearest neighbors. Once the animals are guessed, inquire with these 4 students if the guess is accurate or not.
K-Nearest Neighbour: The K-Nearest Neighbours (KNN) algorithm is widely utilised in machine learning for classification and regression purposes. It operates on the principle that data points with similar characteristics are likely to have similar labels or values.
In the training phase, the KNN algorithm stores the complete training dataset as a reference. When making predictions, it computes the distance between the input data point and all the training examples, employing a selected distance metric like Euclidean distance.
The algorithm proceeds by identifying the K closest neighbours to the input data point, taking into account their respective distances. In the context of classification, the algorithm determines the predicted label for the input data point by assigning it the most frequently occurring class label among the K neighbors. In regression, on the other hand, the algorithm predicts the value for the input data point by calculating either the average or weighted average of the target values of the K neighbors.
The KNN algorithm is known for its simplicity and comprehensibility, which has contributed to its widespread adoption across different fields. Nevertheless, the effectiveness of this algorithm can be influenced by the selection of K and the distance metric used. Therefore, it is crucial to perform meticulous parameter tuning in order to achieve the most favourable outcomes.
Some features of KNN are:
- The KNN prediction model relies on the surrounding points or neighbours to determine its class or group.
- It utilises the properties of the majority of the nearest points to decide how to classify unknown points.
- It is based on the concept that similar data points should be close to each other.
The personality prediction activity was a brief introduction to KNN. As you recall, in that activity, we tried to predict the animal for 4 students according to the animals which were nearest to their points. This is how, in layman’s language, KNN works. Here, K is a variable which tells us the number of neighbours that are taken into account during prediction. It can be any integer value starting from 1.
Let us look at another example to demystify this algorithm. Let us assume that we need to predict the sweetness of a fruit according to the data which we have for the same type of fruit. So here we have three maps to predict the same:
Here, X is the value which is to be predicted. The green dots depict sweet values and the blue ones denote not sweet. Let us try it out by ourselves first. Look at the map closely and decide whether X should be sweet or not sweet? Now, let us look at each graph one by one:
In this case, K is set to 1, indicating that only the closest neighbour is being considered. Since the closest value to X is blue, the 1-nearest neighbour algorithm predicts that the fruit is not sweet.
In the second chart, K equals 2. Considering the two closest nodes to X, it is evident that one is sweet and the other is not. This poses a challenge for the machine in making a prediction using nearest neighbour analysis, and as a result the machine is unable to provide a prediction.
In the third graph, the value of K is set to 3. Within this context, 3 nodes closest to X are selected, with 2 being green and 1 being blue. Based on this information, the model successfully predicts that the fruit is sweet.
On the basis of this example, let us understand KNN better:
K-nearest neighbours (KNN) algorithm aims to forecast an unfamiliar value by considering the known values. The model computes the dissimilarity between all the known points and the unknown point (where dissimilarity refers to the disparity between two values), and selects K number of points with the smallest dissimilarity. Based on this, the algorithm generates predictions.
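To connect this with code, here is a minimal sketch of the fruit-sweetness example using scikit-learn’s KNeighborsClassifier. The data points and labels are invented for illustration and are not part of the original activity:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical fruits described by [size, colour score], labelled sweet/not sweet.
X = [[7, 8], [6, 9], [8, 7], [3, 2], [2, 3], [4, 1]]
y = ["sweet", "sweet", "sweet", "not sweet", "not sweet", "not sweet"]

# K = 3: the three nearest known points vote on the label of the unknown point.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

print(model.predict([[6, 7]]))   # predicted label for an unknown fruit X
```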
Let us understand the significance of the number of neighbours:
Scenario-1: Lowering the value of K to 1 leads to a decrease in the stability of our predictions. For instance, imagine a scenario where K = 1 and X is situated among multiple green points and one blue point, with the blue point being the closest neighbour. Logically, one would assume that X is more likely to be green. However, because of the K = 1 setting, the KNN algorithm incorrectly predicts that X is blue.
Scenario-2: On the contrary, by increasing the value of K, our predictions become increasingly reliable as a result of majority voting/averaging. Consequently, they are more likely to yield precise predictions (up to a certain threshold). However, beyond this threshold, we start observing a growing number of errors. It is at this juncture that we realise we have exceeded the optimal value of K.
Scenario-3: When conducting a majority vote, such as selecting the mode in a classification scenario, it is common practice to choose an odd value for K in order to avoid ties.
Applications of KNN:
Data Preprocessing: When dealing with any Machine Learning problem, the initial step is to perform Exploratory Data Analysis (EDA). If missing values are found in the data, there are various imputation methods available. One effective method is the KNN Imputer, which is commonly used for sophisticated imputation techniques.
Pattern Recognition: KNN algorithms demonstrate excellent performance in pattern recognition; for example, when trained and evaluated on the MNIST dataset of handwritten digits, the accuracy achieved is remarkably high.
Recommendation Engines: The primary function of a KNN algorithm is to assign a new query point to a pre-existing group that has been created using a large corpus of datasets. This functionality is crucial in recommender systems, as it allows for the assignment of each user to a specific group and subsequently provides recommendations based on the preferences of that group.
Advantages of the KNN Algorithm:
- It is straightforward to comprehend and execute.
- It offers great flexibility for both regression and classification tasks.
- Fresh data can be incorporated into the dataset without requiring the model to be rebuilt.
- It achieves high precision through basic supervised learning methods.
- It proves to be highly beneficial for handling nonlinear data.
- It demonstrates increased efficiency when dealing with extensive training data.
Disadvantages of the KNN Algorithm:
- The algorithm’s precision relies on the quality of the data.
- The expense of forecasting the k-nearest neighbours is exceedingly steep.
- Inadequate for data with numerous features or parameters.
- Consumes a substantial amount of memory.
- Lacks scalability.
- The Curse of Dimensionality.
- Susceptible to Overfitting.