Peace, mercy, and blessings of God be upon you. How are you all doing? This is Engineer Wafaa from the Accounting and Control Systems Department, Faculty of Engineering, Mansoura University. God willing, today we will work on Chapter 2 of the Hands-On Machine Learning book, specifically the practical project. As we agreed before, we are following the book: we get the code from the book's GitHub repository and work through it to understand how the code works and what the authors did with it.

So we currently have Chapter 2, which is a complete project. Let's understand what this project is. Imagine I am a large real estate company in California, and our data is concentrated around California. As a company, I have historical data about various districts, but if I receive the specifications of a particular house or district, I don't know how to estimate its price. So we'll build a model: we input the specifications of a specific district, and it tells us what the median house price there is. This will naturally involve regression, because it predicts a continuous value.

Here are the features we'll need for the project. The first is location: the location is given by horizontal and vertical coordinates (longitude and latitude) within the dataset, plus ocean proximity, which tells us whether the district is near the ocean or not. The second is population: is the district I'm targeting remote, crowded, or not? I also want to know the median income of the people in that district, the number of rooms, the number of bedrooms, and the general specifications of the housing. So the project, in general, takes data about a district and outputs its median house price.

Now let's see what my problem category is. It's supervised learning, of course, because we have model answers (the house prices) that the model learns from. It's also a regression task, because we always predict a continuous value, not just a 0 or a 1; for example, the model might output a price of $250,000. More precisely, it's multiple regression, because I'm working with more than one feature: income, location, population, and so on. It's also univariate regression, because I'm predicting only one thing, the price. If I were predicting both price and rent, for example, that would be multivariate, but here it's univariate; for me, it's just one target. Finally, it's batch learning: my data is fixed and I train the model on it once. It's not an online system that learns in real time, moment by moment, so it's called batch learning.

The features we use, like income or room counts, look like columns of a table, and we have a specific target column that we want to predict, which is the median house price. Each row of the table represents a district, not a single house. And the labels are already there in the table, so the data comes with its answers, which we'll need too.

Now, the problems we'll encounter. The first is empty cells in a column; we'll solve it by cleaning the data or imputing values, depending on how we handle it in the project. Second, machine learning only understands numbers, so we'll have to convert the text features to numbers. Another problem is that there are districts whose price is capped in the data at $500,000; no matter what features or capacity we add, the model will unfortunately never predict above $500,000. The model can also suffer from overfitting, where it memorizes the data so thoroughly that it can't predict the value of new houses, and from underfitting, where the model is overly simplistic, like fitting a straight line to a complex relationship.

To measure performance, we'll use an error metric, specifically root mean square error (RMSE). This is the primary metric for our project; it measures the typical difference between the predicted price and the actual price. The lower it is, the better the model, and the closer our predictions are to being correct.
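To pin that down, the standard definition of RMSE (with $m$ the number of districts, $\mathbf{x}^{(i)}$ the feature vector of district $i$, $y^{(i)}$ its actual price, and $h$ our model's prediction function) is:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^2}$$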
There's also a data engineering side here. The data pipeline is the path the data travels through: we clean it, then transform it, and then it enters the model automatically. There's also feature engineering, where, instead of just having the total number of rooms, I derive more informative features, like the number of rooms per household. So instead of saying a district has 5,000 rooms in total, I'd say each household there has about five rooms, and so on. We learn all these things because, in reality, when data comes to someone to work with, it's often very messy, so we have to learn how to prepare it and build a pipeline that cleans the data and converts it to numbers according to what we're working on.

Now let's get into the code and work with it. First, the environment needs to be set up. Our first cell imports sys and asserts that the Python version is greater than or equal to 3.5, meaning we need at least that version of Python. We also assert that Scikit-Learn is at least version 0.20 (which we discussed last time), and we import NumPy and os. The next section is for plotting; we explained it in video number one, so if you haven't seen it, you can watch it and understand what all these lines mean. This is also where we configure where images are stored: my project will save figures in a folder called "end_to_end_project". We do this image saving because it generates a high-quality image that I can use, for example, in a Word document or anywhere else. For example, if a client bought the project from you, they would need a document showing all the outputs of your work, so it's very useful to save figures directly from the code.
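As a rough sketch, the setup cell looks something like this (the folder layout and the save_fig helper follow the book's notebook conventions we described in video one; treat it as an approximation, not the exact cell):

```python
import sys
assert sys.version_info >= (3, 5)      # the notebook requires Python 3.5+

import sklearn
assert sklearn.__version__ >= "0.20"   # and Scikit-Learn 0.20+

import os
import numpy as np
import matplotlib.pyplot as plt

# Where save_fig() will store the high-resolution figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    """Save the current matplotlib figure as a high-resolution image."""
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
```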
Its quality will be good. Then there's the save_fig function itself, which we explained in the last video. The next cells are supposed to download the data from the internet; since we already have the data, we don't need to execute that code or the one after it, which fetches the archive the book downloads from GitHub. So let's go straight to the import: I import pandas. Why pandas? Because I'm going to read the CSV file from the folder where I saved it.

Pay attention to this code: it defines a function that will load my data, which means it won't run until I call it. So what's in the parentheses? The argument I need, which is the location of the dataset. You'll find HOUSING_PATH here; it's defined in the download section, where the book put the downloaded files. We changed it, of course, to point to wherever you put your dataset; we explained earlier that ours is in the same folder as the project. The function then reads the file, which is the one named "housing.csv" inside that folder. This is our dataset, the raw table that hasn't been cleaned or transformed yet. Then I call the function so that the folder is read and the dataset is loaded, and I store the result in a variable called housing, so that whenever I say housing, that means my dataset.

Then you'll find housing.head(). We agreed that head() shows me the first part of the dataset, usually the first five rows, so it showed me the first five rows, and we started looking at what our data looks like. The longitude and latitude columns are the horizontal and vertical coordinates, the geographical location. We also have housing_median_age, which is the median age of the houses in the district. There's total_rooms, the total number of rooms in the district, and total_bedrooms as well. We also have population, which we mentioned earlier, referring to how crowded the district is, along with households. Then there's median_income, the median income per household in this area, and median_house_value, the median price our model will try to predict. That's the target it learns from the existing data, so it can predict for a new district I give it. Finally, ocean_proximity shows me whether the district is near the ocean, near the bay, inland, and so on; it describes the location.

The next code is housing.info(). housing.info() gives me more information about the dataset. Here we'll notice that it lists my columns in an organized way and shows how many non-null entries each one has, and also what the data type of each column is: are they floats, objects, or what exactly? It depends on our work and what we consider necessary. For example, from info() we notice that total_bedrooms has fewer non-null entries than the others, which means I have missing data there. You'll also find that ocean_proximity has an object (text) data type, while the model will need numbers. This is another problem currently present in the data.
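A minimal sketch of the loading step, assuming the dataset already sits in a local datasets/housing folder (adjust HOUSING_PATH to wherever you actually put housing.csv):

```python
import os
import pandas as pd

HOUSING_PATH = os.path.join("datasets", "housing")

def load_housing_data(housing_path=HOUSING_PATH):
    """Read the raw housing.csv file into a pandas DataFrame."""
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()    # first five rows of the raw table
housing.info()    # column dtypes and non-null counts
```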
After that, we run value_counts(), which counts the values in a column so I can explore in a bit more depth: how many times does each category in this column repeat, and what exactly do I have? For example, in ocean_proximity, the value "<1H OCEAN" is repeated 9,136 times, "INLAND" about 6,500 times, "NEAR OCEAN" about 2,600 times, "NEAR BAY" about 2,300 times, and "ISLAND" only five times. That last output should alert me: I now know there are only five island districts, so the model might not learn from them effectively because their data is so limited. We need to be careful. What I'm seeing in the output are counts, the number of occurrences of each value inside ocean_proximity, which, as we agreed, is an object (text) column.

Next we call describe(), which gives more than value_counts() did. For the numerical columns it shows me the count, the mean, the minimum, the maximum, the standard deviation, and the 25%, 50%, and 75% percentiles: what values 25% of the districts fall below, or 75% of them. For example, looking at housing_median_age: if the 25% percentile is 18, that means 25% of the districts in this dataset have a median house age of 18 years or less. If the 75% percentile is 37, then the value in the middle, the median, is around 29 years.

Then the next code draws the columns we examined above, the features of the dataset, as histograms instead of tables. It shows me, for example, the distribution of longitude and latitude, and the distribution of housing_median_age. You'll also notice in the median house value histogram the spike at $500,000, which is the cap we mentioned; for now it's just displayed as a drawing. housing.hist is responsible for how the data is binned: bins=50 divides each column into 50 bins, and figsize=(20, 15) is the size of my figure. There's also a save call, where the string in parentheses, "attribute_histogram_plots", is the name of the image that will be stored in the folder we created earlier, and plt.show() displays the figures directly in the output, right under the cell.
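Those three exploration steps, sketched in code (save_fig is the helper from the setup sketch above):

```python
import matplotlib.pyplot as plt

# How many districts fall into each ocean_proximity category?
housing["ocean_proximity"].value_counts()

# Summary statistics for the numerical columns
housing.describe()

# One histogram per numerical attribute, 50 bins each
housing.hist(bins=50, figsize=(20, 15))
save_fig("attribute_histogram_plots")
plt.show()
```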
Notice the line np.random.seed(42). We said last time that it fixes my random number generation so results are reproducible. The next code defines a function responsible for dividing the dataset into two parts, a train part and a test part, according to the ratio you pass in, which is the test ratio. If I set it to 0.2, for example, then 20% is test and the rest is train. It shuffles the row numbers and takes the first chunk of the shuffled array as test indices and the rest as train indices, so this is a manual split in general. The last line, return data.iloc[...], uses iloc to pull out the actual rows based on the index numbers we pass in. So this code splits the table into two tables, one for training and one for testing.

Then we can run our function with the test set at, say, 20%, and display the sizes. It shows me 16,512 training rows out of a total of about 20,600, and the test is about 4,100, so the train data is 80% and the test is 20%.

Another piece of information: the random seed we set solves a reproducibility problem. When we're splitting, if you run the code without a seed, you'll get a different output each time; you could get different splits if you run it ten times. If I set the seed, it pins down the randomness (the Scikit-Learn functions have their own random_state for the same purpose) and ensures my split stays constant no matter how many times I run my code.

The next code solves a bigger problem: how to keep the test set consistent even if we add new data to the project later. The underlying idea is the test_set_check function, which relies on hashing. What is hashing? Instead of relying on randomness that changes all the time, I use a digital fingerprint, a specific fingerprint for each row. That's what crc32 does here: it fingerprints the ID of each row. The part with & 0xffffffff ensures the number stays a consistent unsigned 32-bit value across operating systems and Python versions. Then the id_column argument determines which column I'll use as the key, like an identifier for the table. Next, split_train_test_by_id applies this test_set_check to the IDs: it identifies the rows that are not in the test and puts them in the train, then pulls the test rows out with data.loc. So it builds my test set, and my split is complete.

We used the zlib library there, while the code after it uses the hashlib library. This is another way to perform hashing, slightly different in that it uses the MD5 fingerprint, which is a longer, more elaborate fingerprint than CRC32. This ensures our randomness is distributed fairly across all the data, meaning the 20% really represents all of the dataset. Here we find the line return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio. This line converts the digital fingerprint into a sequence of bytes, takes the last byte of the fingerprint, and checks its value against a threshold. We already know a byte is between 0 and 255, so for a test ratio of 20%, the threshold is 256 × 0.2, which is about 51. That's the calculation: if the last byte of the row's fingerprint is below 51, the row goes into my test set. Why is this method good? Because a byte has roughly a 20% probability of being below 51, and MD5 mixes its input much more thoroughly. The bytearray wrapper is there to ensure Python handles the fingerprint, the hash, as a list of numbers.
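Putting the two splitting approaches side by side, a sketch close to the book's code (this is the crc32 variant; the MD5 variant swaps hashlib.md5 in as the hash and compares the last byte of the digest, as described next):

```python
import numpy as np
from zlib import crc32

np.random.seed(42)  # fix the randomness so reruns give the same shuffle

def split_train_test(data, test_ratio):
    """Naive split: shuffle the row indices and cut off test_ratio of them."""
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

def test_set_check(identifier, test_ratio):
    """Stable split: a row goes to the test set if its id's hash falls
    in the lowest test_ratio slice of the 32-bit hash space."""
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")  # e.g. 16512 + 4128
```

Run it ten times and the hash-based split stays identical, which is the whole point.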
This is the practical application of the previous part: we run split_train_test_by_id with the test ratio, passing hash=hashlib.md5. The bytearray, as I said, ensures the hash is handled as a list of numbers, so we can easily take the last number in it and compare it to our threshold, according to the 20% or whatever ratio we chose.

After that, the next few cells build the ID column for me. Since we don't have a ready-made ID column, we can use the index, the row number in my table. That's fine, but it comes with a condition we must respect: new data has to be appended at the end, and you can't ever delete an old row. If you delete a row, the indices shift, the test set changes, and the split is corrupted. That's the drawback. So the book tells you the smarter solution is to work with a geographic ID, which is better. That's the last line we work with here: housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]. This solution bases the ID on the location of the district, and since the longitude and latitude don't change, it's a stable, essentially unique number, so it can serve as my ID. The goal is for the ID to be tied to the actual location of the district, not its position in the table; its location won't change, it will stay constant. So we've covered that part.

Now, what's the second part? I display the test set from the part above and look at my index. I start to see my data appearing, and now I have the ID column ready to drive the test split. The next part is the quick way to do all of this, so it's the fastest: Scikit-Learn's train_test_split, which does the split in one line with random_state=42. Random State means we're fixing the shuffle, so we get the same split every time we run the code. This is a bit tidier than the manual version.

Now we have a problem: if the data is small, or if there's a specific, very important feature distributed unevenly, then a purely random split might be unfair. So what will we do? We'll work on the median income, the feature that captures average income. Statistically, we'll divide districts into income brackets, so that when we take the test set, we can make sure we didn't get only the rich districts, or only the poor ones, or only the average ones. No, I want to sample from everyone together, so the split will be fair. Then the code converts the income numbers into categories: the first category takes incomes from 0 to 1.5, the second from 1.5 to 3, the third from 3 to 4.5, the fourth from 4.5 to 6, and the fifth from 6 upward. I've divided them into five strata, and I can work with these five strata later. Now we want to make sure the categorization went well, so we go to housing["income_cat"] and call value_counts() to show how many districts fall in each category.
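A sketch of this whole stage, following the book's approach (StratifiedShuffleSplit is what the book uses; the bin edges are the book's income brackets):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# A geographic id: stable because a district's coordinates never change
housing_with_id = housing.reset_index()   # adds an "index" column as a fallback id
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]

# The quick, purely random way (random_state=42 fixes the shuffle)
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Bucket median_income into 5 strata for stratified sampling
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

housing["income_cat"].value_counts()   # how many districts per stratum
```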
You'll notice that the categories run from 1 to 5. It tells me category 3 contains 7,000-and-something districts, category 2 contains 6,000-and-a-bit, and so on. We know what this is: we're counting how many times category three appears, how many times category two, how many times category four, and so on. The important thing is that it shows me the size of each stratum, and then it plots this statistic as a bar chart.

Here, after we do the split, we check what the test set consists of: the percentage of category 3 in the test set versus the percentage of category 3 in the overall data. For example, if I have a group of 100 people, and 30 of them live in Cairo, 40 live in Mansoura, and the rest live in Alexandria, then when I take a training group or a test group, whatever it is, I should preserve the same percentages. So if 30% are from Cairo, I should also have 30% in the test and 30% in the train; likewise if 40% are from Mansoura, I should take 40% inside the test, and so on. So now I have to make sure that after the stratified split, the percentages really match. I notice in the test set that category 3 takes about 35%, and in the full data it's 35% as well; category 2 takes approximately 31%, and here it's also 31%. You'll find the percentages nearly identical, differing only in the last digit, which won't affect anything overall. So the test set truly sampled from the whole data, and naturally the rest, the train part, also reflects the whole data. So the work will be organized and things will go a little more smoothly, God willing, and I won't have any problems with the code after that.

The code we have next proves with numbers why we went to the trouble of stratified splitting instead of just random splitting. The first function, income_cat_proportions, computes the share of each income category in any dataset we give it. So if category 3 appears 7,000 times in a dataset of 20,000, it tells us its proportion is approximately 35%. Then we create a DataFrame to compare the income-category percentages across three cases. The first is Overall: the actual proportions in the original housing data. The Stratified column holds the proportions that came out in our test set when we used the stratified split, the more careful method. And what is Random? It holds the proportions obtained in the test set when we used the very simple random split. We also compute the error percentages: the random %error column measures how far the random split's proportions are from the true percentages, and likewise the stratified %error measures the deviation of the stratified split from the true percentages. Then we look at the comparison table, which shows the values we computed (Overall, Stratified, Random, and the two error columns) side by side. Looking at my numbers: in the random split, some errors are sizeable, while in the stratified split they're tiny. This makes it clear that it's the better option.
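The comparison code, roughly as in the book (it builds on the housing, strat_test_set, and test_set variables from the previous sketches; note the random split is redone here so that the income_cat column exists in it):

```python
def income_cat_proportions(data):
    """Fraction of districts in each income category."""
    return data["income_cat"].value_counts() / len(data)

# Redo the plain random split now that income_cat exists
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props
```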
So of course the stratified split wins, because all its error values are smaller than the random ones, to the point of being negligible. The percentage differences are almost nil, so there's essentially no error and everything is fine. The random split, on the other hand, has varying percentages: sometimes it happens to work and shows a very small error, and sometimes it shows a large one, depending on luck.

Then the next code deletes the column we used, income_cat, the one we used to create the five strata, if you remember. We don't need it now, so it selects it and drops it. axis=1 tells it that this is a column, not a row, and inplace=True means delete it from the table itself, not from a copy, from my main table, because I don't need it anymore.

Okay, then we reach strat_train_set.copy(). We've now started the data exploration stage, so we make a copy of our data because we're going to modify it. We'll add new columns and remove things to better understand the relationships and connections. Since we'll be changing our data, we take a copy so that if anything goes wrong, we can go back to the original version.

Now we'll start working on the geographic section; we want to look at the data geographically, as the book says. It uses a scatter plot to display the districts as a set of points, with longitude and latitude as the geographic coordinates. The result is a plot of the coordinates, which we then save. So it has drawn the entire dataset for me through its geographic information. What does it look like? The resulting image is the outline of California, drawn as points, because the data covers all areas. The book even calls it a "bad visualization." Let's see why. When you look at the image, you'll see it's made of many points, all blue, heavily concentrated in certain areas; there's a lot of overlap, so I can't tell where the dense regions are or how many houses there are. None of this is clear to me. If we try to work with this image, we won't understand the points well because they're all crammed together. Naturally, we start from this initial stage and modify it until it becomes clear.

So we'll work with alpha. The alpha setting makes my points transparent so that, for example, an area with many points stacked on top of each other appears dark, while areas with few points appear faded. This lets me understand how dense each area is and where the sparse areas are. The code repeats the same scatter plot with the same coordinates, but sets alpha=0.1. Now it has corrected the previous plot: we notice that this sparse part appears faded, while this part is dark, meaning there are points stacked on top of each other. alpha is the transparency ratio of the points, and we notice the value is small.
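In code, roughly (continuing from the stratified split above; the figure names follow the book's notebook):

```python
# income_cat was only needed for the split; remove it from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Work on a copy so the original training set stays untouched
housing = strat_train_set.copy()

# Raw geographic scatter plot: the "bad visualization"
housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")

# alpha=0.1 makes overlapping points darken, revealing dense areas
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
```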
The next part draws it more artfully. Now we're encoding four things, or at least three. The first is geography, longitude and latitude, which determine the position of each point on the map. The second is the size of the point: the bigger it is, the larger the population there. We set the size to population divided by 100. Why divide? So the markers don't become too large; look at the shape of the plot we'd get otherwise. So crowded districts get big circles and small ones get small circles, and so on. The third thing is color. What does the color depend on? The color here starts at blue, which is the cheapest, and keeps shifting until it reaches red, for the most expensive districts. You'll notice, of course, that the colors appear according to the price of each area. There's also the color bar, the gradient running from blue to red; this is effectively the fourth element, showing what house value each color represents. With this gradient, the plot, or really the map, appears much more elegant.

Let me clarify a bit. So far, on top of the basic plot, we've added alpha, which creates transparency for our circles so that when they're stacked on top of each other, we can see what's underneath. We've also added a color bar, so the column on the right shows the exact dollar value of each color; this is also set in the code, so things are a bit clearer. There's also a cosmetic fix applied to the tick labels, a technical adjustment ensuring the axis and color-bar labels appear clearly and without formatting issues, so the latitudes and other elements appear properly and I don't have problems afterward.

So far we're fine with the plot of circles over the districts, but it still feels a bit off, because we can't see the map: we don't know what these places are or what their names are. The next code is designed to retrieve a map image of California from the internet and place it in our folder, "end_to_end_project". It then loads the downloaded map, and we overlay our points on top of it so it looks natural. So I first read the image file, then draw my scatter of points, and then set the longitude and latitude extent of the map, the bounding box of California in general (we've written the coordinate limits here), so that when we place our points, they align perfectly. Each point sits in its correct position on the map, with no shifts or errors. The remaining part, about the tick values and labels, is primarily aesthetic, to improve the overall appearance.
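A sketch of both versions of the fancy plot. The extent values below are the approximate California bounding box used to line the points up with the map; california.png is assumed to have been downloaded into IMAGES_PATH already:

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Size encodes population, color encodes median house value
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100,   # marker size ~ population
             label="population", figsize=(10, 7),
             c="median_house_value",          # marker color ~ price
             cmap="jet",                      # blue (cheap) -> red (expensive)
             colorbar=True)
plt.legend()
save_fig("housing_prices_scatterplot")

# Overlay variant: same scatter drawn on top of the California map image
california_img = mpimg.imread(os.path.join(IMAGES_PATH, "california.png"))
housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10, 7),
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=False, alpha=0.4)
# extent = the map's longitude/latitude bounding box, so points line up
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05],
           alpha=0.5, cmap="jet")
plt.show()
```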
For example, the " LNB Dot Line Space" is supposed to divide the prices into 11 equal sections, like the one shown on the ruler at the bottom right. Then there's something called "Site Way Take Labels," which is meant to simplify the numbers. Instead of writing 500,000, it might write 500 kilodollars, making the area slightly smaller. Notice that our output already includes this. The map was created, and our points were added to it. The right side started to show the layers we had added, and we made it look much neater. The lower numbers are now much neater than before. Next, we'll notice that he created a housing core. What does this mean? He's now using the core relationship coefficient. The core, as a code, is a function that calculates the strength of the relationship. For example, between one column and another, if the relationship is always limited to two values, either 1 or -1, it means that if the first column increases, the second column increases exactly the same. If it's -1, then if the first column increases, the second column decreases exactly the same. However, if the value of the coefficient is zero, then there is no relationship between the two columns at all. This means that zero means we've set a condition that we have numerical values. Only equals true, meaning you'll now only work with numbers. If you see text, ignore it to avoid errors. We've already discussed how ocean proximity contains text, which can cause problems. So, we'll only work with numbers. This code will output values from the core matrix. Let's see how we can use this output. For example, we have the medium house value (value 1), which is the average house price. The medium incumbent value is 68, which is very close to 1. You'll notice below: 13, 11, 6, 4, and 4. These values are not close. This means the two largest values are the medium house value and the medium incumbent value. The relationship between them is that the higher the incumbent value, the higher the value. A direct relationship, which is the value we mentioned before, which is one. This is what we deduce from the graph we have, or the output. Here, we have a graph, which is the sector matrix. This is a set or network of graphs. We will notice that it shows our four points, which are the medium house value, the medium incumbent, the total rooms, and the medium house edge. If we look at the incumbent or the medium house value, we will notice that their points are rising upwards. This means that there is a stronger relationship; the more this increases, the more that increases. So, the issue we have is that their value is indeed greater, and this is the point that appeared. We will notice that they are rising upwards, so you will find at 5000. Look, there is a long, straight line. This line is the problem; it will cause us a problem, so we must We're focusing on cleaning up the data by removing the rows with horizontal lines so the model doesn't learn that prices always stop at 5000 and above. We want it to increase and for things to improve. In this code, we've created a relationship—we've drawn a diagram, essentially zooming in—between the medium incumbent and the medium house value. When we view these two parts together, we've noticed three flaws we need to be aware of. First, there's a very strong correlation between the points, but these points are scattered, meaning their direction is upward. This confirms that the strongest element is the incumbent versus the value. We'll also notice a horizontal line in the... 
That zoomed-in plot of median_income against median_house_value shows three quirks we need to be aware of. First, the correlation is indeed very strong: the points are scattered, but their direction is clearly upward, confirming that income versus value is the strongest relationship. Second, there's the horizontal line at $500,000: any district whose median price is $500,000 or above, a million, whatever, is recorded in the data as $500,000. That's a problem. Third, if you look closely, you'll find other, fainter horizontal lines, around $450,000, perhaps around $350,000, and around $280,000. You'll notice a group of these dim lines, probably due to quirks in data collection or rounding in the numbers. We'll have to look at this issue and decide how to deal with it.

Next, we experiment with attribute combinations. The idea is simply that raw data is sometimes not useful on its own, but when we combine columns, they give us useful information. For example, rooms per household: we compute it by dividing the total number of rooms by the number of households to get the average number of rooms per household. Saying a district has 1,000 rooms in total tells me little about its standard, but saying the average household there has five rooms is much more accurate in judging the area's standard and price, and so on. Next, the bedrooms ratio is calculated by dividing total_bedrooms by total_rooms, giving me the share of rooms that are bedrooms. This lets me judge the character of a house from its room mix. For example, if a house has five rooms and only one of them is a bedroom, that suggests a spacious reception area and living room, so the house is a bit more luxurious and its price is higher. The room mix really does make a difference in this respect. Then there's population per household: I divide the population by the number of households to get a crowding measure. Crowded districts, with many people per household, are usually cheaper, while districts of small or independent households are priced a bit higher.

Then we recompute the correlation matrix. Here we'll notice that the bedrooms ratio has a negative correlation: when the share of bedrooms is small, house prices are higher, but when it's large, meaning the house is crammed with bedrooms, the district is much cheaper than when there are fewer bedrooms. Now look at rooms per household: the more rooms per household, the more luxury this implies. It's not just many rooms in the district overall; it means the typical house is spacious by nature, so its price will be higher.

Now we've drawn a plot of rooms per household against median house value. We set alpha, made the rooms axis run from 0 to 5, and the value axis from 0 to about 500,000. In the output we notice there's still the cap line at the top, and the smaller faint lines in the middle are still there.
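The attribute-combination step, sketched (column names follow the book; the axis limits are the zoom we just described):

```python
# Combine raw columns into ratios that carry more signal
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Recompute correlations with the new ratio columns included
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])   # zoom: 0-5 rooms, $0-520k
plt.show()
```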
Then we call housing.describe() to examine the data again: the count, the mean, the minimum, the 25%, 50%, and 75% percentiles, and the maximum. We start with the count to check for missing data. Notice that total_bedrooms is indeed lower than the rest, at approximately 16,350, while the other columns total approximately 16,500. This indicates there are empty cells in this column that need to be addressed before the model runs, as machine learning algorithms can't handle missing values.

Next, we investigate the outlier values. For example, in the maxima we'll see, in the rooms_per_household column, that the average is about five, but the highest value is 141, a strikingly large ratio. That suggests an outlier is involved, indicating something is amiss in that district's data.

We'll also examine the scales. Look at the summary of median_income: it deals in small numbers, mostly units and tens; a typical value is around 3 or 4. But median_house_value, unfortunately, deals in the hundreds of thousands. So the features come in very different sizes, and of course we'll have to apply feature scaling to address this issue and prevent our model from getting confused by the difference in magnitudes. This is another problem that showed up for us through the exploration.
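As a preview of that fix (part 2 territory, so treat this as a minimal sketch rather than the project's final pipeline), Scikit-Learn's scalers handle it; here I drop the rows with missing values just for illustration, since the real pipeline will impute them instead:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustration only: drop rows with missing total_bedrooms for now;
# the real pipeline in part 2 will impute them instead
num = housing.select_dtypes(include="number").dropna()

minmax = MinMaxScaler().fit_transform(num)      # squeeze each column into [0, 1]
standard = StandardScaler().fit_transform(num)  # mean 0, variance 1 per column
```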
We'll stop here, at the end of the first part of the project. The second and third parts will follow, God willing; we'll put them together in one video, which we'll send later, but we'll stop here for today, guys. Good luck, and God willing, they'll be sent to you by Friday at the latest. Good luck, goodbye.

ch2 in hands-on machine learning (project code)—part 1