Peace, mercy, and blessings of God be upon you. How are you all doing? I'm Wafaa, an engineer from the Computer Systems Control Department at the Faculty of Engineering, Mansoura University. In this video we'll walk through the first project, Chapter 1 of Machine Learning — a quick explanation for anyone who missed it or isn't sure what it covers. The code comes from the Hands-On Machine Learning book and is already available on GitHub, so we have it; we just want to understand what's happening inside it. We're not required to focus too much on writing it ourselves, since we understand you started the course late in the semester. Now, the first code snippet: to run this code we'll be using Python, and the check says the version must be at least 3.5 — so this confirms we're on Python 3.5 or newer. The second check imports scikit-learn and confirms it's version 0.20 or newer. Okay, what is this project about? We have income data and also life satisfaction data for people: two datasets, one from the OECD and one from the IMF, which we'll work on. In this project the learning is supervised, it's regression, and it's model-based. It's supervised because we have, of course, the model answers — for each country in the training data we already know the real life satisfaction level. It's regression because we're trying to predict a number, and that number is a continuous value for a person's life satisfaction; we're not classifying it as 1 or 0, like passing or failing — that's the distinction. Okay, so why is it model-based learning?
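Before going further, the two version checks from the top of the notebook can be sketched roughly like this (the boolean variable names here are mine, for illustration; the book's notebook uses plain assert statements):

```python
import sys
import sklearn

# Python must be at least 3.5 for this code.
ok_python = sys.version_info >= (3, 5)

# scikit-learn must be at least 0.20 (a simple string comparison,
# which is fine for these particular version strings).
ok_sklearn = sklearn.__version__ >= "0.20"

print(ok_python, ok_sklearn)
```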
Because we assumed there's a mathematical equation — one that acts like a straight line connecting income and happiness — and we tried to make the model learn the best values for the equation's coefficients, θ0 and θ1, which we explained in section and which you'll see again in this video. So, what problems does machine learning run into? Notice that I'm not just focusing on the code; there are problems that can make the model fail. For example, non-representative data: some data points are missing here, and we removed certain points to avoid distorting the model. We also have a limited quantity of data — a model needs a lot of data to learn, so a shortage of data is a problem. Another problem we might encounter is overfitting. Overfitting means the model essentially memorizes the training data so thoroughly that it can't handle new cases when we test it. The opposite problem is underfitting. Underfitting means the model is completely lost and doesn't capture the relationship because it's too simple — for example, using a straight line for a very complex relationship. That's a problem too. Note that Chapter 1 teaches us that machine learning means letting the computer learn from data instead of writing fixed rules. We started with the example of GDP and life satisfaction to understand what linear regression means, and how a simple model can predict new values — keeping in mind that data quality is, of course, more important than the code itself. The code we've defined so far consists of two lines. The first line keeps only the total ("TOT") values, meaning we want the overall figure for each country, not the subgroup breakdowns. The second line is a pivot, needed because the original table in the data is very long.
This widens the table: each country goes on its own row, and each indicator, like life satisfaction, goes in its own column, which makes things easier while working with the code. Next we find two more lines. The first takes the column labeled "2015" in the GDP data and renames it "GDP per capita". The second sets the country as the key, or index, of the table, so that when we merge this table with the other one, we can control the join through this index. After that, we notice the code that links the two tables together by country name, ultimately producing one large table with all my data: the life satisfaction next to the input, the GDP per capita. Then we notice the sort_values line, which sorts the countries from poorest to richest. The code after it lists the indices 0, 1, 6, 8, 33, 34, 35 — seven countries that we remove from our data so we can use them for testing later, and so that when we draw the linear regression the relationship looks clear and straightforward, without odd values. Then it builds the kept indices, which is the entire list (36 values) minus the removed ones (seven values), so the output should be 29 countries, which we'll work on. This function's return — what it outputs — shows only the two columns we need: the input (the GDP per capita values) and the life satisfaction (the level of happiness or satisfaction a person has with their life). The second code block imports os, which stands for Operating System.
Its function is to let Python understand your operating system — Windows, Linux, or Mac — and interact with your folders and files. This part, the datapath, is built with os.path.join("datasets", "lifesat", ""). In Windows, for instance, paths use a backslash between "datasets" and "lifesat", while in Linux or Mac it's a regular forward slash, not a backslash; this code lets the system build the path correctly for your platform, so it can create the folders and follow the correct path. This is where the data files will be loaded later, so that when we start working, just mentioning the datapath points us to the location of our project files, the datasets. Of course, we already have the dataset, so we don't need to download it online. The code then sets up the plotting. The first line, %matplotlib inline, makes plots appear directly below the cell in the output; in previous versions we had to call a show command every time, but now the output displays automatically. The second part imports matplotlib — the library that will generate the drawings — and gives it the short name mpl. Next it configures my axes, the x and y lines I'll be drawing on. It sets the axes label size to 14: that's the font size for the headings on the x and y axes. The xtick size is 12: that's the size of the numbers written on the horizontal x axis. Similarly, the ytick size of 12 is for the numbers on the y axis. Then comes an example: we import matplotlib.pyplot as plt, put in code that draws a line with some x and y values, and set what will be written as the x and y headings.
Then plt.show() runs. This is a practical demonstration of the inline setting: the plot really does appear directly under the cell, as we can see. Of course, we do all these settings at the beginning of the code to ensure that all the plots in the project are consistent, meaning they share the same style. It wouldn't make sense to find one plot with a small font and another with a large one, so we adjust it from the beginning so everything is clear and looks better. The next code downloads my data — from a GitHub link, of course — but we already have it downloaded and placed in the same folder that contains the code we're working on, so everything lines up. os.makedirs, as the name says, makes the folders named in our datapath, so it will create datasets/lifesat exactly as we specified in the datapath, and the data files get downloaded into it. Later, when you come to read the data, you must of course remember the datapath so that everything stays consistent — I mean, stays correct for us. The next code reads our data so we can see it. It uses the pandas library, which I use to read CSV files. In the code there's an argument like thousands="," — what does this mean? When I write 20,000, I write 20, then a comma, then three zeros; this tells pandas to ignore the thousands comma, so you get the whole number recognized as a number, not as text. There's also a backslash-t in this part of the code: the delimiter, which is set to "\t".
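To see what thousands="," and delimiter="\t" actually do, here is a tiny in-memory sketch (the file contents are made up, just mimicking the real file's format; with the real file on disk you'd also pass encoding="latin1"):

```python
import io
import pandas as pd

# Tab-separated text with comma thousands separators, like the GDP file.
raw = "Country\t2015\nFrance\t37,675\nHungary\t12,239\n"

df = pd.read_csv(io.StringIO(raw), thousands=",", delimiter="\t")

# "37,675" was parsed as the integer 37675, not as text,
# so arithmetic on the column just works.
print(df.loc[0, "2015"] + 1)
```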
That's because in this dataset the values are separated by tabs, so we tell pandas the separator is a tab. There's also encoding="latin1", so that I can read the characters specific to European languages in general. This is the part we mentioned, written within the code. If you noticed, we had a function above, prepare_country_stats. That function performed all those preparation steps on my data; now I start using it on the two datasets. Next, I build X, which holds the feature — the income of a person in each country — and y, which is my labeled data, the life satisfaction. We use np.c_ to convert the data into a column array so the model can work with it. Then we draw a diagram — a scatter, which is points, like the output we see here. Notice that the points all tilt upward to the right: the higher the income, the higher the life satisfaction. An upward tilt to the right means there's a direct relationship between the two. Then we said the model is a scikit-learn model, a linear model: LinearRegression. We chose linear regression here because the model will assume the relationship is, for example, a straight line. So model.fit(X, y) takes my features and target and starts, roughly speaking, learning how to draw the best line so that it passes among all these points — a line everyone can agree on.
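Here's a minimal sketch of the same fit on invented numbers that lie exactly on a known line, so we can see what .fit() recovers (the data is made up, not the real country data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented incomes and satisfaction scores on the line y = 4 + 0.0001*x.
X = np.c_[[10_000, 20_000, 30_000, 40_000]]  # np.c_ gives a column array
y = np.c_[[5.0, 6.0, 7.0, 8.0]]

model = LinearRegression()
model.fit(X, y)  # ordinary least squares: the best line through the points

print(model.intercept_, model.coef_)  # recovers ≈ 4.0 and ≈ 0.0001
```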
Here we bring a new country — one that doesn't exist in our data. He used Cyprus here, saying the per capita income there is $22,587. So print(model.predict(X_new)) asks: can you figure out this country's life satisfaction — how happy are its people? That's what we ask our model, and thankfully it responds with a result of about 5.96 — which means we never actually programmed the computer with the answer. We just gave it our data, and it deduced the relationship from the line it learned to draw here in the code. The book shows that we can change the model's general approach — its whole way of thinking — with just one line of code. Here we used something called KNN, which stands for k-Nearest Neighbors. The idea is that instead of the model drawing a line and assuming a linear relationship, when we give it a value — Cyprus in the previous example — it looks at the three nearest points around it, takes those three points, and calculates their average, giving me an approximate value. That way it gets about 5.77 out of 10, which is quite close to the other answer. So this is another method it can work with. Now, the difference between linear regression, which we used, and k-nearest neighbors: linear regression assumes from the beginning that the relationship is a straight, regular line, while k-nearest neighbors says that similar countries sit close to each other — life satisfaction values are close, and people's incomes are close too. You'll also find that linear regression is model-based learning, while k-nearest neighbors is instance-based. So, which is better, linear regression or k-nearest neighbors?
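Before answering, here's a minimal sketch of the k-nearest-neighbors idea on invented data, showing that the prediction is literally the average of the three nearest targets (the numbers are illustrative, not the book's dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented training data: incomes and satisfaction scores.
X = np.c_[[10_000, 20_000, 23_000, 25_000, 60_000]]
y = np.array([5.1, 5.7, 6.5, 6.0, 7.3])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)

# The three incomes nearest to 22,587 are 20k, 23k and 25k,
# so the prediction is simply the mean of 5.7, 6.5 and 6.0.
pred = knn.predict([[22_587]])
print(pred[0])
```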
We can't say that one is better than the other. Both are good, but each has its weaknesses: if linear regression can underfit, k-nearest neighbors has weaknesses too — it can be overly sensitive to any changes in the neighbors. So our job is to determine which is better for how you use the code. Next, the code here configures folders. PROJECT_ROOT_DIR = "." means you're currently in the chapter's folder, and CHAPTER_ID = "fundamentals" is the name under which things will be stored. Then IMAGES_PATH tells me my figures will go in the images folder, with the chapter ID ("fundamentals") inside it. Where does it look for this images folder? In PROJECT_ROOT_DIR, the ".", which is the same folder you're currently in. Then we use the os library to create the folder: it creates the images/fundamentals folder, where all the figures will live. exist_ok=True means I'm telling it: if the folder exists, leave it, don't recreate it, as if nothing happened — if you find it, go into it; if you don't, create it. Then there's the save_fig function, the helper I'm supposed to use to save my image. It takes something called fig_id, which is the name my image will be recorded under. Also, tight_layout is set to True so the text doesn't get jumbled up, cut off, or have any other problems. fig_extension means I'm telling it to save this image with the PNG extension, and its resolution will be 300 DPI. Then it creates the file. This whole function saves the image for me, so I just mention its name, save_fig, and put the figure's name in parentheses.
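A self-contained sketch of this saving helper looks roughly like this (the Agg backend line and the return path line are my additions so the example runs headlessly and can be checked; the book's function just prints and saves):

```python
import os
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs anywhere
import matplotlib.pyplot as plt

IMAGES_PATH = os.path.join(".", "images", "fundamentals")
os.makedirs(IMAGES_PATH, exist_ok=True)  # no error if it already exists

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()  # keeps labels from being cut off
    plt.savefig(path, format=fig_extension, dpi=resolution)
    return path

plt.plot([0, 1], [0, 1])
saved = save_fig("demo_line")
```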
Then, as soon as it's finished, it tells me it's now "Saving figure" so-and-so. Now let's move on to the code np.random.seed(42). This fixes the random numbers I'll use later so the same sequence comes out every run and I don't get inconsistencies later. The next code organizes the data a bit: instead of the data sitting in raw table format, we adjust our table to be organized and sorted — it's like we're cleaning the table. The data might be all mixed up, but here we simplify and clarify it so we can understand what's going on. The first line reads the OECD dataset, and again it mentions thousands=",": my data has commas in it, so that when it says 20,000, as we said before, there's a comma in the middle — it removes the comma and treats those values as numbers, not text. The second line keeps only the totals. The data distinguishes, for example, men, women, children, adults, and so on, but I don't want all the data in such detail — no, I want the overall total so I can compare countries in general, so my comparison makes sense. Next: in the raw dataset, if I mention a country, each indicator sits on its own row — one row for its people's health, another row for life satisfaction, with the country repeated each time. The next line streamlines this. I want the index to be the country, and I want my columns organized: I don't want separate rows for life satisfaction and health — I want them all as columns opposite my country. This reduces several rows to one. We then adjust my values a bit further.
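Circling back to the np.random.seed(42) line for a moment, this tiny sketch shows what seeding buys us — the same "random" numbers on every run:

```python
import numpy as np

np.random.seed(42)
a = np.random.rand(3)  # three "random" numbers

np.random.seed(42)     # reset to the same seed...
b = np.random.rand(3)  # ...and the exact same numbers come out

print(a, b)
```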
Here, values="Value" means: put my values — the actual numbers in the data — into the cells as they are; don't compute anything. This makes the table a bit simpler, with each country having its own row and the corresponding data in columns — instead of being very long and complicated, everything is fine as it is. The next line displays the first two countries in the table so we can see what they look like. We notice the country now occupies a single row and is not repeated, and all our data has taken a column format — everything is more organized, making it easier for us to work with later. Next, when I want to display one column of a dataset, I take the variable where I stored the dataset and put the name of the column I want between square brackets. Here in the code I'm displaying the "Life satisfaction" column, and .head() shows the top part so we can see what's in it. Our goal here is the money side, so after that we start loading the GDP per capita data. The first line, of course, reads the data — and we already understand the thousands separator, the delimiter, the latin1 encoding, and the n/a values — then we rename the "2015" column to "GDP per capita", and we also make our index the country, so that when I merge the two tables everything will line up. We did that and displayed the first two lines to check. We notice it did the parts we asked for: the GDP column is renamed, and the country is my index here — the key at the start of each row — with my data and the GDP next to it. Okay? This is what will merge the two datasets.
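The merge, the sorting, and the country lookup can all be sketched on tiny made-up tables (the numbers are illustrative, not the real dataset values):

```python
import pandas as pd

# Toy stand-ins for the two prepared tables, both indexed by country.
oecd_bli = pd.DataFrame({"Life satisfaction": [6.5, 4.9, 7.2]},
                        index=["France", "Hungary", "United States"])
gdp = pd.DataFrame({"GDP per capita": [37_675.0, 12_239.0, 55_805.0]},
                   index=["France", "Hungary", "United States"])

# Merge on the shared country index: one row per country, all columns.
full = pd.merge(left=oecd_bli, right=gdp,
                left_index=True, right_index=True)

# Sort from poorest to richest so a plot reads left to right.
full = full.sort_values(by="GDP per capita")
print(full.index.tolist())  # ['Hungary', 'France', 'United States']

# Look up a single country's row by its index label.
print(full.loc["United States"])
```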
We have one dataset that contains life satisfaction, health, and a set of other indicators, and we have the other dataset that contains GDP per capita. So now we'll start merging them. We put one on the left and one on the right — he's given them the names left and right. He's telling it here that left_index and right_index are both True, which means the index in both tables is my country. For example, if I'm looking up a country, say the United States, it searches for it in both tables, retrieves its data, and puts it all on one row. That's how the merge works. The second line then sorts the table: the richer the country, the further along it appears. So, ordered by wealth and development, as they say, when we draw the diagram things will be much clearer and more logically ordered. Of course, we can then see where the rich countries sit and whether life satisfaction increases with wealth or not — this will be useful for us. The next line prints the final result we just stored, full_country_stats. It starts printing, and we notice it really did merge the tables and values — more values, and a more complete picture overall. Next, we notice the code — knowing the full set of columns is present in my data — asks to display only the GDP per capita and life satisfaction parts, specifically for the United States. So it displays only the United States row, showing a GDP per capita of about 55,800 and a life satisfaction of 7.2. Then it performs the split we discussed, creating one portion for training and another for testing. Of course, what we subtract for the test portion is the seven countries we chose — and the person who created the code didn't choose them randomly; they selected them deliberately.
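The index arithmetic for this train/test split is simple enough to reproduce directly:

```python
# 36 countries numbered 0..35; seven are held out for testing.
remove_indices = [0, 1, 6, 8, 33, 34, 35]
keep_indices = sorted(set(range(36)) - set(remove_indices))

print(len(keep_indices), len(remove_indices))  # 29 and 7
```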
Some of these countries are very rich but don't have high life satisfaction, while others are much poorer but have high life satisfaction. So the values for these points sit slightly off the line. He excluded them initially so that when he adds them back, we can see how well the model copes. So, in the first line, we said the data is numbered from 0 to 35 because there are 36 countries, and he chose 0, 1, 6, 8, 33, 34, and 35. In the next code I'm building the keep list — the part that will remain, which is what I'll use in training — by subtracting the seven removed indices from the range of 36 countries, resulting in 29 countries. We store this in sample_data: sample_data will contain the training data (the 29 countries), while missing_data will contain the seven countries we'll use for the final test. The code we have next displays our data as a visual representation. The first line is sample_data.plot with kind="scatter", which draws the sample data as a scatter — meaning points. We also set figsize=(5, 3), so my image will be smaller and more organized. Then there's plt.axis: the horizontal axis will run from 0 to 60,000 and the vertical axis from 0 to 10. As for position_text, this is a set of labels that will be written on the drawing. For example, the word "Hungary" is written at the coordinates (5000, 1): 5000 is the x value, and 1 is the y value. Likewise Korea has its x and y, and so on, until the drawing starts. What does this mean? For Hungary we said (5000, 1), so it goes to x = 5000, y = 1, and from that point it starts writing the word "Hungary", and so on.
It also gets the coordinates for Korea and writes "Korea" there, and so on until it writes the rest of the labels. After that we have, for example, the plt.plot call for the points themselves. What does this do for me? It marks the location of the actual point — the point has its x and y coordinates, so we can see where it sits according to our data, whether income or life satisfaction. Notice here that a group of countries was selected and drawn in red — that's the "ro" argument, short for red circle, in plt.plot. It mentions the x and y and adds the "r" so that, from the country's position in the data, it gets its GDP per capita for x; similarly its life satisfaction gives the y, and it calculates where that lands and draws a red point there so we can see it. The purpose of these points is to check whether it's true that as the numbers increase, satisfaction increases, and whether we can draw a sensible line through them. In the next parts, we'll notice sample_data.to_csv, which saves my sample data into a file in a specific folder. I also have sample_data.loc[...]: this .loc takes a specific slice of the table and cuts it so I can display it. I cut a specific set of rows, and it shows me the five labeled countries with their GDP per capita and life satisfaction. We do this so we can inspect our data — these are the red-point values that appeared in the previous code's output. Next, I'm setting up a comparison where I tell it to draw lines. I make the x coordinates with np.linspace(0, 60000, 1000) — that places 1000 points between 0 and 60,000 — and we also set the y axis to run from 0 to 10, so we can draw over our data. This will create lines in three colors: red, green, and blue.
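Instead of judging the candidate lines by eye, we can score them numerically on a few invented points (all numbers below are illustrative; the book picks its own θ values for the plot):

```python
import numpy as np

# The plot's x range: 1000 evenly spaced incomes from 0 to 60,000.
X = np.linspace(0, 60_000, 1000)

# Invented data points lying on y = 4 + 0.0001*x.
incomes = np.array([10_000, 25_000, 40_000])
satisf = np.array([5.0, 6.5, 8.0])

# Three candidate lines (theta0, theta1), one per plot color.
candidates = {"red": (0.0, 2e-4), "green": (8.0, -1e-5), "blue": (4.0, 1e-4)}

# Sum of squared errors of each line on the points: lower is better.
sse = {name: float(((t0 + t1 * incomes - satisf) ** 2).sum())
       for name, (t0, t1) in candidates.items()}
best = min(sse, key=sse.get)
print(best, sse)  # the "blue" line fits these points best
```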
There's another important part here: the text wrapped between dollar signs inside the quotation marks. This displays the equation on the plot: the part outside the dollar signs is plain text written by me, while the part inside is my equation, so it's rendered like a mathematical equation you'd write by hand — it appears in terms of thetas. So what are θ0 and θ1, and what are these lines? The red line has life satisfaction starting at zero when income is zero, so the higher the income, the higher the satisfaction. The green line starts with a high life satisfaction, but the higher my income, the lower my life satisfaction gets. And of course we have the blue line, the sensible one for our data: you definitely start with some life satisfaction — not very high, but acceptable — and then the higher your income, the higher your life satisfaction. You'll also notice that the data points already sit around it, so it's the most suitable option to choose for the project. In the previous code we were experimenting, meaning we were trying lines by guesswork and seeing which is closest to our points. But here we'll use the scikit-learn kit, so we can find the best option for me automatically. So now he'll of course use the linear model and fit a regression for us. He sets up the x as the GDP per capita and the y as the life satisfaction, using np.c_ to make each a column so it can be used as an array. Now, this line is the model's training step — what does it do? It takes the x and the y — the x is of course the income and the y is the life satisfaction — and applies a mathematical method called Ordinary Least Squares.
The goal of Ordinary Least Squares is to find the line that makes the sum of the squared distances between the points and the line as small as possible — the error is very small, so the line sits as close as possible to my points. Okay, after that it will of course show me θ0 and θ1 in the output. We'll notice here that θ1, for example, is a very small number — why is that? Because we're comparing dollars — tens of thousands of them — with happiness levels that run from about 1 to 10. It's perfectly logical that an increase of just one dollar affects only a very small part of the happiness level. Here, he used the values the model calculated — the θ0 and θ1, the intercept and the coefficient that came from fitting the data — to draw the blue line. He then applied the equation directly with the values shown here, θ0 and θ1 — the actual fitted values — and it appears as we see now. Next, we'll work with Cyprus and see what its data gives. The first thing he did is cut out a slice of the table: Cyprus's GDP per capita from the GDP data file. It prints the value, which we got as 22587.49. Now it will predict, to get my life satisfaction value. It displays it in square brackets because we are working with a matrix, and with the two brackets together so it appears in floating-point format, giving me the whole value. So, when we applied the Cyprus prediction, we got approximately 5.96 — that's our output for Cyprus. As additional information, this predict call is actually computing θ0 + θ1·x to get the life satisfaction.
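We can reproduce that prediction by hand with the approximate fitted values the book's output prints, θ0 ≈ 4.85 and θ1 ≈ 4.91×10⁻⁵ (rounded here, so the result only matches to a couple of decimal places):

```python
# Approximate fitted parameters from the book's output.
t0, t1 = 4.85, 4.91e-5
cyprus_gdp = 22_587.49

# predict() is just theta0 + theta1 * x for a linear model.
prediction = t0 + t1 * cyprus_gdp
print(round(prediction, 2))  # ≈ 5.96, matching model.predict
```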
It takes the fitted values θ0 and θ1 that we saw previously and evaluates an equation like this: approximately 4.85 plus, opening a parenthesis, 4.91 times 10 to the power of minus 5, multiplied by 22587, which is the income value. Whatever the exact numbers, the important thing is that it performs this specific calculation to give me the predicted value — my life satisfaction — at the end, and that's what I need. As for the next code, it only shows the projection of Cyprus onto the line — how it looks. It just shows you, as a display, that Cyprus sits here, placing its point at the x and y according to the data we saw a little while ago. After that, we find sample_data[7:10], which tells it to show me the data from row seven up to row ten. Here the country is in fact my index, alongside the GDP per capita and life satisfaction we had already stored. These three data points are the ones closest to the Cyprus value, which is about 22,000. Then it says: now I'll do the "neighbors" thing, which is the average of the three values. So it takes the values, which are 5.1, 5.7, and 6.5, adds them, and divides by their count, 3, giving approximately 5.77. This is another prediction, and we'll notice it's approximately the same — there's no significant difference between it and what came before. Now, I display my missing_data, the seven data points we separated. It then displays their GDP per capita and life satisfaction. Of course, we know their true values, so we can see them on the graph. We start by drawing their positions.
First, it creates position_text2, because position_text held the labels we wrote above; this one is for the missing data. It then sets the coordinate locations so the words can be written there. After that comes the code that fits a straight line to these seven numbers along with the rest, while the old line — the blue dotted one — is the previous line we created. So now we've made the difference between the two clear: this one sits in one region and that one in another. Now we run a test to see what the machine does, and here the overfitting section appears. The problem is that when he fit a very high-degree polynomial — with coefficients up to something like θ30 — the curve started chasing every next point, and everything went wrong. It's no longer nice, because it's now overfitting very badly. Next, let's move on to the data. We want to display only the group of countries whose names contain the letter W. So now I'm telling it to display only the life satisfaction part — I don't want anything else — and it displays the group of countries containing the letter W along with their life satisfaction values. In the next code, we'll notice it displays all the information about them, not just the life satisfaction. We added .head() here just to display the first five rows. It's almost the same code, but with the .head() added and without restricting to the life satisfaction column — the difference is in the last part. We'll notice it displayed the data with the country as the index, and the first five rows appeared because of .head(). This makes things easier: the table is large, and I don't want to be overwhelmed with data here. The last code we have provides a solution to the overfitting problem and the missing data — this is the concept of regularization.
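Regularization's effect can be sketched with scikit-learn's Ridge on invented noisy data (the alpha value and the data here are mine, chosen so the shrinkage is visible; the book uses its own settings):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Invented noisy data around the line y = 4 + 0.5*x.
rng = np.random.RandomState(42)
X = np.c_[rng.uniform(0, 6, 30)]
y = 4.0 + 0.5 * X.ravel() + rng.normal(0, 0.3, 30)

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # large alpha = strong penalty

# The penalty shrinks the slope toward zero: a flatter, less
# data-sensitive line, at the cost of a slightly worse fit.
print(lin.coef_[0], ridge.coef_[0])
```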
We also used a model called Ridge Regression, which adds a penalty term so we can control how flexible our model is, and the plot compares three different fitted lines. For example, we've drawn the basic data, which represents the 29 countries. Now, looking at the output below: the blue dots are the 29 training countries and the red markers are the remaining seven countries; the dashed line is the linear model trained on all the data, the blue dotted line is the linear model trained only on the partial data, and the solid line was generated using the Ridge Regression model on the partial data. And here's the point: the regularized model solves our problem, making things a bit clearer, and it also addresses the overfitting issue that was present. In short, it solves many problems, and I hope everything is clear now. Here, we used regularization, and the model is now a bit less sensitive to changes in the data. The regularized line, even though it never saw the missing data, stays very close to the line learned from all the data and also close to the blue points. Therefore, it's considered to have generalized better, and the problem starts to be solved. That's the end of the first project. I hope everything is clear and the information is more understandable. God willing, I'll upload Chapter 2 as soon as I finish it. Good luck!
Chapter 1 of Hands-On Machine Learning (project code)