EfficientML.ai Lecture 7 - Neural Architecture Search Part I (MIT 6.5940, Fall 2024, Zoom recording) Instructor: Prof. Song Han Slides: https://efficientml.ai
Okay, good afternoon everyone, let's get started. So today we are going to jump into lecture seven, about neural architecture search; it's a new section. As a review, we talked about pruning and we talked about quantization. The pruning lab is due today, so make sure you submit lab one today. Also, lab two is out, about quantization: we'll implement a lot of cool techniques like K-means quantization and linear quantization, and derive all the equations ourselves to see integer quantization, four-bit quantization, etc. So for both lab one and lab two, make sure you check out efficientml.ai to see the labs. We are going to start a new section about neural architecture search. There are many trade-offs between the efficiency metrics and the accuracy metrics, so it's very hard to get the best of both worlds. In general we want low latency, high accuracy, low energy, and also low storage, but it's very hard to get all of them. From the latency perspective, you obviously want a model to run fast: for example, if you want to do text-to-image generation, you want to generate an image within seconds, or tens to hundreds of milliseconds; you don't want to wait for like ten minutes to generate a single image. Also, in a self-driving car, you want to take timely action rather than colliding with someone or with some objects, so real time matters a lot. Also for robots: robots interacting with people and other things need real-time interaction. And energy is crucial, right? Sometimes we want to run these applications on mobile devices, like on the phone, and these edge devices are battery-constrained; you don't want an application to drain your battery and run out within half a day, so energy matters a lot. Energy also matters for training in large-scale data centers: cooling is a big issue, CO2 emission is a big environmental issue, and access to those large power plants is not easy; we are
talking about megawatt-scale power plants, which is very high power consumption. Storage also matters: you don't want the model to be like 70 gigabytes, which would make it very hard to download on the phone. Some of you may want to make a startup in the future, build some AI applications, and upload them to the App Store or Google Play; you don't want the user's download to eat up a whole month of their 5G data plan on the phone. So storage also matters. But none of this is as important as accuracy, right? Even if you have low latency, low energy, and low storage, if the accuracy is poor it's very hard to attract users to your AI application. So accuracy certainly is the foundation; that's why we put it at the bottom, as the foundation. When we are optimizing the latency, the energy, and the storage, we want to make sure the accuracy doesn't hurt, or sometimes even improves; that's a very complicated process. So today we are going to talk about: rather than compressing an existing model by pruning and quantization, can we design a small, compact, and efficient model to begin with? That's the motivation for neural architecture search. In order to achieve that, we'll go through the following sections. We first talk about the primitive operations and review those classic building blocks: if you are an architect constructing a new building, what are the building blocks? Now we are constructing a new network architecture; what are its building blocks? Then we'll talk about neural architecture search: what is the search space, how do we design the search space, and how do we search, what are the search strategies within that search space? So let's start with primitive operations, which is a review. First let's recap the fully-connected layer: the input feature has C_in dimensions, n is the batch size, and the output feature has C_out dimensions. The weight is C_in × C_out, and there's also a bias. The number of MACs is just
C_in × C_out (times the batch size n). We also learned about the convolution layer, 1D convolution and 2D convolution: the input feature size is C_in × W, and if you have both height and width, that's the 2D scenario. The weight is four-dimensional: kernel height, kernel width, channels in, channels out, and the way to calculate the MACs is to multiply six terms: C_in and C_out, H and W, K_h and K_w. We also learned about the group convolution layer: basically we divide the channels into several groups, such that the number of weights is divided by G, and similarly the group convolution's MACs are also reduced by G times compared with the original convolution. We also introduced the depthwise convolution layer, which is a special case of group convolution where the number of groups equals the number of channels, so each channel basically interacts only with itself. For the number of weights, there's no C_in × C_out product: since C_in equals C_out, it's just K_h × K_w × C, and for the MACs we have only five terms: rather than C_in times C_out, it's only C_out. And finally we have the 1×1 convolution, where K_h and K_w, the kernel size, is just one, which basically projects the input from C_in to C_out channels. Okay, those are the primitive operations; using these primitive operations we can construct the building blocks. There is the famous ResNet-50 bottleneck block, which consists of three layers: a 1×1 layer, followed by a 3×3 convolution, followed by another 1×1 layer. And there is also a bypass connection from the input to the output, such that these three layers are basically learning the delta, the residual. It has a very special design called the bottleneck, which is a very good trade-off between expressiveness and computational complexity. As we know, the 3×3 convolution is pretty expensive: nine ops per output compared with only one for a 1×1. So we want to first reduce the
number of channels here by 4×: a 1×1 convolution projects from 2k channels down to 512, then we do a 3×3 convolution from 512 to 512, and finally we project back to 2k using another 1×1 convolution. That's the bottleneck design. So we don't feed the original feature map, which has 2k channels, to the 3×3 convolution; we feed the reduced feature map, which has only 512 channels, and finally we expand the channels back to 2k at the end so that we have good expressiveness. So how do we calculate the total number of MACs? The first layer has 2k input channels and 512 output channels, times H times W, times 1×1. The middle 3×3 convolution has 512 input and 512 output channels, times H times W, times 9, since this is a 3×3 convolution. The last layer goes from 512 to 2k with a 1×1 convolution, times H times W. That's the total number of MACs. How does that compare with having only the 3×3 convolution? If we only had a single 3×3 convolution projecting from 2k to 2k, that's actually about 8.5 times more MACs compared with the bottleneck architecture. So the bottleneck really shrinks the number of channels going into the 3×3 convolution, such that we can reduce the MACs. What about the number of parameters? The first 1×1 is 2k × 512, then the 3×3 is 9 × 512 × 512, then another 1×1 is 512 × 2k; that's the total number of parameters. Again, compared with just using a single 3×3 convolution from 2k to 2k, which would cost 9 × 2k × 2k, the bottleneck is about 8.5× cheaper. So the bottleneck architecture in ResNet-50 is quite efficient with respect to both the number of MACs and the number of parameters. After ResNet there's ResNeXt, the next version of ResNet, which introduced group convolution. Originally we have a 256-dimensional input and a 256-dimensional output: a 1×1 convolution reduces 256 channels to 128, followed by a 3×3 convolution.
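As a quick sanity check, the MAC formula above can be turned into a tiny script; the helper name `conv_macs` and the 7×7 spatial size are my own illustrative assumptions, not from the slides:

```python
# Hypothetical helper: MACs of a (group) convolution with an h x w output map.
def conv_macs(h, w, c_in, c_out, kh, kw, groups=1):
    return h * w * kh * kw * (c_in // groups) * c_out

h, w = 7, 7          # assumed spatial size (it cancels out in the ratio)
c, mid = 2048, 512   # "2k" channels, reduced to 512 inside the bottleneck

bottleneck = (
    conv_macs(h, w, c, mid, 1, 1)      # 1x1 reduce: 2048 -> 512
    + conv_macs(h, w, mid, mid, 3, 3)  # 3x3 conv:   512 -> 512
    + conv_macs(h, w, mid, c, 1, 1)    # 1x1 expand: 512 -> 2048
)
plain = conv_macs(h, w, c, c, 3, 3)    # a single 3x3 conv: 2048 -> 2048

print(round(plain / bottleneck, 2))    # 8.47, i.e. roughly the 8.5x above
```

The same helper covers group and depthwise convolution by setting `groups` (depthwise: `groups = c_in`).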
The 3×3 projects 128 to 128, but here they actually use a group convolution with a group count of 32. What does that mean? It's like having 32 independent smaller convolutions: the first 1×1 is no longer one 256-to-128 projection; each path actually projects from 256 to 4, rather than from 256 to 128, and we have 32 of them, so 4 times 32 is 128. We can go further, since the last layer from 128 to 256 can also be viewed as 32 projections from 4 to 256, since 4 × 32 = 128. So altogether we can think of it as 32 paths, and these paths working together are equivalent to a multi-path block; potentially you can convert it into a mixture of experts. That's the group convolution in ResNeXt. There is also the very widely used MobileNet architecture, which introduced the depthwise separable block. We use the depthwise convolution to capture the spatial information, and the 1×1 convolution to capture the information across channels. Why do we call it depthwise separable? We are separating the modeling of the spatial information from the channel information: the spatial information is modeled by the depthwise convolution, only spatial information, and then the pointwise convolution models the channel-wise interactions. We separate them so that we can reduce the computation. And what does that look like? MobileNetV2 introduced the inverted bottleneck block. Different from a bottleneck block, the number of channels doesn't decrease but increases, expanded by six times. What is the reason behind that? Here we have a depthwise convolution, and remember that the depthwise convolution's FLOPs are very low: it has one less dimension, only five terms multiplied together, so its expressiveness is weak. To compensate for that, we want to introduce more channels, so here we expand the number of channels.
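Before looking at the exact numbers, here is a sketch of why the separation saves compute, with an illustrative 56×56 feature map of 128 channels (the sizes and the `conv_macs` helper are my assumptions):

```python
# Hypothetical helper: MACs of a (group) convolution with an h x w output map.
def conv_macs(h, w, c_in, c_out, kh, kw, groups=1):
    return h * w * kh * kw * (c_in // groups) * c_out

h, w, c = 56, 56, 128
full = conv_macs(h, w, c, c, 3, 3)         # one 3x3 conv mixes space + channels
separable = (
    conv_macs(h, w, c, c, 3, 3, groups=c)  # depthwise 3x3: spatial only
    + conv_macs(h, w, c, c, 1, 1)          # pointwise 1x1: channels only
)
print(round(full / separable, 1))          # 8.4: ~8x fewer MACs when separated
```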
Using a 1×1 conv, we expand the number of channels by six. In the bottleneck block we reduced them by 4× because the normal 3×3 convolution is very expensive; here we expand them because the depthwise convolution is very cheap, and we want good expressiveness and model capacity. So here we expand the number of channels rather than shrinking them. How do we calculate the MACs in this case? The 1×1 convs stay the same: height times width times input channels times output channels. The middle part is where it gets interesting: we expanded the number of channels from 160 to 960, and since this is a depthwise convolution there is no input-channels times output-channels product; it's just a single factor of 960 rather than 960 × 960. So that is the total number of MACs. What about the total number of parameters? Again, using real numbers as an example; I really want to use real numbers rather than just letters so that you can get some intuition. The 1×1 convs contribute input channels times output channels each, and the depthwise layer contributes 9 times 960, since we have 960 channels with a 3×3 kernel. So the number of parameters in the depthwise layer is very, very small, just 9 × 960. That's why MobileNet gets deployed on mobile devices a lot: it has very low MACs and few parameters, and the key reason is this inverted bottleneck and the depthwise convolution, which has one less dimension compared with normal convolution. But it's not perfect. What is swept under the rug here? The activation memory is actually not efficient for this kind of inverted bottleneck structure. Why is that the case? Here you expand the number of channels by 6× to compensate for the depthwise convolution's low computational capacity.
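The inverted-bottleneck arithmetic with the 160 to 960 expansion can be checked the same way (the 7×7 spatial size and the `conv_macs` helper are illustrative assumptions):

```python
# Hypothetical helper: MACs of a (group) convolution with an h x w output map.
def conv_macs(h, w, c_in, c_out, kh, kw, groups=1):
    return h * w * kh * kw * (c_in // groups) * c_out

h, w = 7, 7
c, mid = 160, 960  # 6x channel expansion

macs = (
    conv_macs(h, w, c, mid, 1, 1)                  # 1x1 expand: 160 -> 960
    + conv_macs(h, w, mid, mid, 3, 3, groups=mid)  # 3x3 depthwise on 960 ch
    + conv_macs(h, w, mid, c, 1, 1)                # 1x1 project: 960 -> 160
)
# Depthwise weights have no c_in x c_out product: just 3 x 3 x 960.
params = c * mid + 3 * 3 * mid + mid * c
print(macs, params)  # the depthwise layer contributes only 9 * 960 weights
```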
With six times the channels, the feature map that was originally this big is now six times bigger, so it's actually not activation-memory friendly. If we look at inference, going from ResNet-18 to MobileNetV2 the model size shrinks by about 75% at similar accuracy on ImageNet; the number of parameters is reduced by four to five times from ResNet to MobileNet. However, the peak activation didn't decrease; it even increased, by about 80%. So the activation is even larger, but peak activation is not the bottleneck for inference: for inference at batch size one, the activation is smaller than the weights. However, on the training side, where you have a pretty large batch size, the activation matters a lot: here the parameters are reduced by about four times, but the activation only improved by about 10%. So MobileNet is very good for inference-time model size reduction and FLOP reduction, but it's not good at reducing the activation memory, which is the bottleneck for training. At training time you wouldn't observe much memory reduction by switching from ResNet-50 to MobileNetV2 with a very large batch size. All right, the next building block: ShuffleNet. Here we are familiar with this pattern already: 1×1 conv, 3×3, 1×1, and these are group convolutions, now with three groups. What is the problem here? Each group only interacts with itself: the red, green, and blue groups are independent. Now we want to encourage different groups to talk to each other, so we shuffle the channels across the groups. After the channel shuffle, the red group has components from the red, the blue, and the green groups; this shuffling encourages information to flow between different channels. There is also the Transformer building block, which is getting super popular these days: in Vision Transformers, basically all the operations are Transformer blocks.
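The channel shuffle just described is only a reshape and a transpose; here is a minimal NumPy sketch (the NCHW layout and g = 3 groups are assumptions for illustration):

```python
import numpy as np

def channel_shuffle(x, groups):
    # Split channels into groups, swap the group/channel axes, flatten back.
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

x = np.arange(6).reshape(1, 6, 1, 1)  # channels [0 1 | 2 3 | 4 5]
y = channel_shuffle(x, groups=3)
print(y.ravel().tolist())             # [0, 2, 4, 1, 3, 5]: groups interleaved
```

After the shuffle, each of the three groups in the next group convolution sees one channel from every original group.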
Inside a Transformer block there is multi-head self-attention (MHSA). What is that? We have a tensor, and we first project it into query, key, and value with three different projection matrices. After converting it, we pass it through the attention mechanism, which we'll discuss in detail in the Transformer lecture; it is basically scaled dot-product attention between Q, K, and V. We first compute Q K^T and normalize it by the square root of the hidden dimension, pass it through the softmax, apply a mask (like a causal mask), and finally multiply by V, so that every token can attend to every token; in the causal case, each token can attend only to the previous tokens, not to future tokens. Unlike convolution, which has a very limited receptive field in one layer, so that you would need many layers to get a global or very large receptive field, this attention mechanism lets you attend to all the previous tokens in just one layer, immediately giving you a very large receptive field, which is good for both vision and language tasks. After the MHSA we concatenate the outputs from the different heads; each head runs the attention mechanism independently, Q K^T, softmax, times V, and finally we concatenate them and project again using the output projection matrix. Although we have three input projection matrices, W_Q, W_K, and W_V, they can be concatenated into a single three-times-larger matrix so we can do one single kernel call for the matrix multiplication; we want to minimize the number of kernel calls to reduce the GPU kernel launch overhead. The right-hand side depicts the process of the attention mechanism: we have three tensors, Q, K transpose, and V.
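The fused QKV projection mentioned above can be sketched in NumPy; the shapes (4 tokens, hidden size 8) are arbitrary assumptions, and a real implementation would launch one GPU matmul kernel instead of three:

```python
import numpy as np

n, d = 4, 8  # n tokens, hidden dimension d (illustrative sizes)
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

# Three separate projections: three matmul calls.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# One fused projection: a single (d x 3d) weight, one matmul, then split.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)

print(np.allclose(q, q2), np.allclose(k, k2), np.allclose(v, v2))
```

Both paths produce identical Q, K, and V; the fused version just does it in one call.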
Q is n × d: we have n tokens, each of dimension d; K transpose is d × n. We do the matrix multiplication, and we know how to count the MACs of a matrix multiplication: the output Q K^T is n × n, and to compute each entry we need d MACs, so altogether it's n² d. We apply the softmax, and then do another matrix multiplication with V: the n × n result times the n × d value matrix is again n² d MACs. So altogether we have 2 n² d: it's O(n²), growing quadratically with the number of tokens. As the resolution gets bigger, as the context gets larger and the documents get longer, the token count grows, and this becomes quite computationally expensive. So that's the attention part. What does the attention map look like? Here is a real example from our HPCA paper; the sentence is "the video game is a lot more fun than the film", and we can see, for example, that "game" attends heavily to "video", "fun" attends to "a lot" and "more", and "than" relates "more fun" to "the film". So the attention map captures the relationships between different tokens, and naturally you can see that some tokens don't attend to anything: their rows are pretty much all white, meaning they are not attending to anything. In a future lecture we are going to talk about token pruning: tokens like that can be safely pruned. Question: why is the map not symmetric about the diagonal? It doesn't have to be symmetric: the future attending to the past versus the past attending to the future are not necessarily the same. And when we talk about multi-head attention, we basically concatenate the attention results across the different heads and pass them through the output transformation matrix W_O.
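The 2 n² d count derived above, as a one-liner (a sketch: it counts only the two big matmuls and ignores the softmax and the projections):

```python
def attention_macs(n, d):
    qk = n * n * d  # Q (n x d) @ K^T (d x n): n^2 entries, d MACs each
    av = n * n * d  # softmax(Q K^T) (n x n) @ V (n x d)
    return qk + av  # 2 * n^2 * d in total

# Doubling the token count quadruples the cost: O(n^2) in sequence length.
print(attention_macs(2048, 64) // attention_macs(1024, 64))  # 4
```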
That concludes our building-block section. We talked about the FC layer, normal convolution, group convolution, and depthwise convolution, and we used those convolutions to build building blocks such as the bottleneck, the inverted bottleneck, and the multi-head attention block. So we have a lot of building blocks now; next, how do we select among the different building blocks? We introduce neural architecture search. Conventionally, we used to design models manually: how many layers, which kernel size to choose among 1×1, 3×3, 5×5, 7×7, which resolution to pick, maybe 224, maybe 256, the connectivity, bypass layer or no bypass layer, and the number of channels, starting with three and then gradually adding more channels while reducing the resolution. There is a lot of black magic in manual architecture design, and it's very hard to get a model that exactly fits the demands of the hardware. For example, the first generation of the Google TPU has 28 megabytes of on-chip SRAM, so a model of exactly that size fits perfectly: everything can live on-chip. But how do we design a model that is exactly 28 megabytes, or 100 megabytes, or any particular size? If it fits in the cache it will be very fast; if it doesn't, it will be very slow. This motivates us to build automated design space exploration methods, so that we can use machine learning to design these neural architectures rather than designing them manually. This plot compares manual design versus automated search over the design space: the x-axis is the number of MACs, and we want low MACs so the model is fast, but we want high accuracy, so ideally we want a model in the top-left corner, with both low computation and high accuracy. And we hope the circle size is small; the circle size
indicates the number of parameters: this big means 2 million parameters, here 4 million, here 8 million, here 64 million. All these models are hand-crafted designs: ResNet-50 is roughly 4 billion MACs at about 76% top-1 accuracy, while ResNet-101 and ResNeXt-101 reach roughly 80% top-1 accuracy at a much higher MAC count. There's no free lunch: every increment in accuracy comes at the cost of FLOPs. What about automated design? An approach designed by our lab needs only about half a billion MACs but matches, or even exceeds, the accuracy of ResNeXt and Xception, showing that neural architecture search is a very powerful tool for designing very small, low-computation models that match the accuracy of much bigger human-designed models. And here is a whole family of models designed by AutoML: looking at the circles in the middle, you can easily distinguish the hand-crafted models from the auto-designed models, and we have a bunch of dots in the top-left corner, showing that, given this huge design space, manual design is sometimes inferior to such automated design approaches. With that as motivation, let's see what the components of neural architecture search are and what its goal is. Given the design space, these are the building blocks: the search space is built from lots of building blocks; then we have a search strategy and also a performance estimation strategy. We propose an architecture, collect the performance feedback, and iterate this process: if we find a good model we can stop the iteration; otherwise we continue, proposing a new architecture and estimating its performance: efficiency from different perspectives, such as latency, energy, and storage, and of course accuracy.
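That loop can be sketched as a toy random search; the search space, the latency limit, and the `estimate` function below are entirely made-up stand-ins (a real NAS system trains or profiles each candidate to get this feedback):

```python
import random

search_space = {"kernel": [3, 5, 7], "width": [48, 64, 96], "depth": [2, 3, 4]}

def propose(rng):
    # Search strategy: sample one candidate architecture from the space.
    return {k: rng.choice(v) for k, v in search_space.items()}

def estimate(arch):
    # Hypothetical performance estimation: bigger -> more accurate but slower.
    latency = arch["kernel"] ** 2 * arch["width"] * arch["depth"] / 1e3
    accuracy = 0.5 + 0.001 * arch["width"] + 0.01 * arch["depth"]
    return accuracy, latency

rng = random.Random(0)
best, best_acc = None, 0.0
for _ in range(100):                      # propose -> estimate -> feedback
    arch = propose(rng)
    acc, lat = estimate(arch)
    if lat <= 20.0 and acc > best_acc:    # keep the best under the constraint
        best, best_acc = arch, acc
print(best, round(best_acc, 3))
```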
We give that feedback and repeat the search process again, until we find a model that fits our needs. To begin with, let's talk about the search space: the set of candidate neural architectures we are going to choose from. We introduce two search spaces here: one is at the cell level, the other at the network level. Let's first look at the cell level. What does that mean? From the image all the way to the output we have many layers: initially we have a large resolution, like 224, and finally we have a very low resolution, even a single prediction, so we have to go through many downsampling layers to reduce the resolution and increase the number of channels. Those are the reduction cells; reducing the resolution can be done by strided convolution or by pooling. And between those reduction cells we have the normal cells, and the normal cells are simply repeated. The philosophy of this cell-level search space is that you just search one building block, and then you repeat that building block many, many times: repeated n times here, another n times there. And for the reduction cell, we reuse the same reduction cell across the different stages. So basically all the building blocks boil down to two: the normal cell and the reduction cell, with the normal cells repeated between the reduction cells. That makes a very complicated problem much simpler: previously we had, say, 50 layers to design; now we only need to search two, and we repeat these two across the whole network. The reduction cell's responsibility is to reduce the resolution: initially it's 224, and at the end it becomes 7×7, so you reduce from 224×224 to 112×112, 56×56, 28×28, 14×14, and finally 7×7:
five reduction cells, with the normal cells in between. Let's see the result first, and then we'll go deeper into how it worked. From this neural architecture search paper, the searched cell architecture is quite unreadable; that's what a cell searched by machine learning looks like. Roughly: this is layer h_i and this is h_{i-1}, so we can take the input either from the previous layer or from the layer before the previous layer, apply some operation like a separable 3×3 convolution or identity, then apply some combining operation like add on each pair, and finally concatenate all the choices together and feed them to the output. Similarly for the reduction cell: choose the inputs from h_i or h_{i-1}, pass them through two operations, add them together, and concatenate at the end. So how does the search select these choices? This paper used a recurrent neural network as a controller to generate the candidate cell. We don't use RNNs much these days, but let's review: autoregressively, it first selects one hidden state, then selects another hidden state, then selects the operation for the first hidden state, then the operation for the second hidden state, and then selects the method to combine the two hidden states; then we repeat these steps. In the example here, it first (the gray part) selects one hidden state, layer A, then selects another hidden state, layer B, then selects the operation on the first hidden state, here a 3×3 conv, then the operation on the second hidden state, here 2×2 max pooling, and then selects the method to combine them, here the add operation. We repeat this process B times. The question: what is the size of the design space? Say we have two candidate inputs, exactly like this case, and we have M candidate
operations to transform the inputs (the yellow part has M candidates), and then we need to combine them (the green part), and there are N potential operations to combine them; add is one of them, but there are N potential methods. So what is the size of the search space if we have B layers? I'll give you one minute to think about it; think step by step, which is a very commonly used prompt. What is step one? Step one is to select one hidden state: as mentioned, we can select either the previous state or the state before the previous state, so we have two choices for the first hidden state, and then another two choices for the second hidden state. Then we select the operation on the first hidden state: there are M candidate operations, and the second one also has M possibilities. After that, we have N approaches to combine the two hidden states. That's just one layer, and we repeat this for B layers, so we raise it to the power of B: the total design space is (2 × 2 × M × M × N)^B. Assuming M equals 5, which is actually very typical (1×1 conv, 3×3 conv, 5×5 conv, identity, and pooling: you easily get five choices), and N equals 2 (element-wise addition or element-wise multiplication; we can easily have two or more combining operations), and repeating for B layers, with this example we get on the order of 10^11 candidates in the design space. That's a pretty huge design space, which motivates the later part of this lecture, where we learn how to search within such a big search space. So that is the cell-level search space; the philosophy is that we take a super big design space and just pick a single cell, which is very elegant, and then repeat this cell many, many times.
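The counting argument above in code form (`cell_space_size` is a hypothetical helper name):

```python
def cell_space_size(m, n, b):
    # Per layer: 2 input choices x 2 input choices x M ops x M ops x N combiners.
    per_layer = 2 * 2 * m * m * n
    return per_layer ** b  # repeated independently for B layers

size = cell_space_size(m=5, n=2, b=5)
print(f"{size:.1e}")  # 3.2e+11, i.e. on the order of 10^11 candidates
```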
But another school of thought is that this may not be optimal, because deeper layers and shallower layers may want different cells; their behavior is very different, so it may not be the best solution to repeat the same building block, the same cell, across the whole depth of the neural network. It might be better to think about the whole network as one larger design space, where different blocks may have different architectures. So what is the design space in this case? Starting from the input all the way to the output, we can have a variable number of layers: previously, when repeating, we assumed a certain depth, but now we can also change the depth. For example, here we have five stages, each stage at the same resolution, reducing from 224 all the way down to 7; the first stage can have depth one, two, or three, and similarly a later stage can choose among depths like three, five, seven, or nine. We also have the resolution dimension: the input resolution doesn't have to be 224×224; it can be 128×128 or 256×256, and at training time you can dynamically, randomly select different resolutions, which usually helps; as in the previous quantization lecture, adding randomness usually helps with generalization and training. So we can sometimes train with 192×192 and sometimes with 256×256. That's the resolution dimension. We also have the width dimension: you can choose 48, 64, or 96 channels early on, and when going deeper you want a larger width, for example 1,280 channels, etc., so we can choose different widths.
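Sampling one candidate from such a network-level space might look like this sketch; the option lists and the five-stage layout are illustrative assumptions, not the actual space from the slides:

```python
import random

space = {
    "resolution": [128, 160, 192, 224, 256],
    "depth": [2, 3, 4],        # layers per stage
    "width": [48, 64, 96],     # channels per stage
    "kernel": [3, 5, 7],       # kernel size per stage
}

def sample_network(rng, num_stages=5):
    # One input resolution for the whole net, independent choices per stage.
    return {
        "resolution": rng.choice(space["resolution"]),
        "stages": [
            {k: rng.choice(space[k]) for k in ("depth", "width", "kernel")}
            for _ in range(num_stages)
        ],
    }

net = sample_network(random.Random(0))
print(net["resolution"], net["stages"][0])
```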
We can also choose among different kernel sizes; the different colors here mean different sizes, for example 3×3 conv, 5×5 conv, 7×7 conv. It's all very flexible. We can also search the topology, the connectivity: for example, when downsampling from 1× to 2×, 4×, 8×, all the way to 32×, we can go through different paths; we can stay at 8× downsampling for a while and then jump to 16× until the end. For some classification tasks, that is the way to go: keep reducing the resolution. But for detection or segmentation tasks we want pixel-wise predictions, so the output resolution and input resolution are pretty much the same: we first downsample, then upsample. This also applies to autoencoders, which nowadays are very widely used in diffusion models (LDMs, latent diffusion models, which we'll learn about later in this lecture series): you first downsample and then upsample, and you can also downsample, upsample, downsample, upsample. So the trajectory across the grid actually represents the architecture of the model: when do you downsample? For example, downsampling at one layer but not the next, versus downsample, downsample, downsample, then upsample, upsample, upsample. Different paths, different trajectories within this grid, correspond to different network architectures in the search space. That's the network-level topology and connectivity search space. Now let's take a break before we jump into how to design the search space. All right, welcome back; let's continue the discussion. Now let's talk about designing the search space. We just said there is a huge design space, at either the cell level or the network level; how do we narrow it down? For example, when you are searching
for something, we don't want to search the whole forest. If you get lost, the forest is pretty big, so the first thing you want to do is narrow down which part of the forest you got lost in. Just like here: we are given a pretty big search space for our target design, and we want to first narrow it down, say to the left part of the forest. So we first optimize the search space, and then we do the model architecture search, the specialization, within the smaller, optimized search space. We will introduce TinyNAS: how we do automated search space optimization, searching for the search space itself, and then resource-constrained model architecture search, where we specialize the model for a particular hardware constraint. Talking about hardware constraints: on mobile devices, latency matters, you want the device to react very fast, and energy matters, you don't want an application to drain your phone's battery quickly. On IoT devices the memory constraint is also pretty stringent. And recently, for large language models, memory on the GPU is also a very scarce resource; sometimes you only have 80 gigabytes of memory, and you want to utilize it really well. So latency, energy, and memory are the big constraints, especially as we go from cloud AI to mobile AI to tiny AI. Looking at memory size, we are talking about tens of gigabytes, versus a few gigabytes, versus a couple of hundred kilobytes for tiny AI, so there is a pretty big gap in the available memory: a few hundred kilobytes versus several gigabytes, not to mention large language models, which are just so big. If you compare the memory of a GPU or a mobile phone to a microcontroller, the microcontroller's memory is almost invisible, it is so small: here we have three orders of magnitude smaller memory and four orders of magnitude smaller storage. As a computer scientist, you want to design across the whole space, both big and small, and that is pretty challenging. So we want to
first narrow down the design space. So how do we find a good design space? First of all, here we list a bunch of design spaces, with different widths (channel widths) and different resolutions, like 112, 96, 160, and so on. Given these different design spaces, which is good and which is bad? Ideally we would evaluate the accuracy by training each model to the end and observing the accuracy, but that is very computationally costly. So how do we evaluate a design space without training? Here we use a very simple but effective heuristic: a good design space is more likely to achieve high FLOPs under the same memory constraint. You only have this amount of memory, so the space that gives you the most FLOPs wins, assuming the model capacity is proportional to the FLOPs. If, given the same memory constraint, you can fit more FLOPs, that is considered a good design space. Also, from the hardware perspective, computation is cheap while data movement, memory access, is expensive, so if you can fit more FLOPs within the same amount of memory, that model architecture is considered more efficient. Therefore we plot the relationship between the FLOPs and the cumulative probability for different design spaces. This red one corresponds to a width of 0.5 and a resolution of 144; this black design space corresponds to a width of 0.3 and a resolution of 160, so it is higher resolution but smaller width. Lower resolution with higher width, or higher resolution with lower width: you don't know which is better, it's hard to tell, but with this chart we can easily tell the difference. In the red design space, 80% of the models have more than 50.3 million MACs, but in the black design space 80% of the models have only 32 million MACs, so the red design space can fit more FLOPs given the same memory budget. You can treat that as the better design space. We verified this by training a bunch of models within each design space and observing the best accuracy in each.
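This FLOPs-under-a-memory-budget heuristic can be sketched as below; the tuple encoding, the cost model (FLOPs ~ w²r², memory ~ w·r²), the budget, and the specific width/resolution values are all illustrative assumptions, not the measured numbers from the slides:

```python
import random

def flops_percentile_20(sample_fn, flops_fn, memory_fn,
                        memory_budget, n=1000, seed=0):
    """Lecture heuristic: under the same memory budget, prefer the design
    space whose models carry more FLOPs. Returns the 20th-percentile FLOPs,
    i.e. the value that ~80% of memory-feasible sampled models exceed."""
    rng = random.Random(seed)
    feasible = sorted(
        flops_fn(a)
        for a in (sample_fn(rng) for _ in range(n))
        if memory_fn(a) <= memory_budget
    )
    return feasible[len(feasible) // 5] if feasible else 0.0

# Toy architecture: (width multiplier, input resolution).
# Illustrative cost model: FLOPs ~ w^2 * r^2, peak memory ~ w * r^2.
flops_fn  = lambda a: (a[0] ** 2) * (a[1] ** 2)
memory_fn = lambda a: a[0] * (a[1] ** 2)

def make_space(widths, resolutions):
    return lambda rng: (rng.choice(widths), rng.choice(resolutions))

red   = make_space([0.4, 0.5], [128, 144])   # lower res, higher width
black = make_space([0.25, 0.3], [144, 160])  # higher res, lower width

budget = 0.5 * 144 ** 2   # both toy spaces fit under this budget
```

With these toy numbers, the red-style space (higher width, lower resolution) ends up with the higher 20th-percentile FLOPs, mirroring the comparison on the slide.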
The best accuracy in the red design space is about 78.7%, versus a lower accuracy in the black one, which verified our idea that a design space that gives you more FLOPs under the same memory constraint is a better design space. So we just analyze the FLOPs distribution: larger FLOPs means larger model capacity, which is more likely to give higher accuracy, and we use this principle to design the design space. It is actually quite effective: compared with ResNet-18 at 224 resolution, which actually runs out of memory on microcontrollers, our design space is superior, even compared with a larger, huge design space or a random design space. Also, if you plot the minimum FLOPs of each search space on the x-axis versus the accuracy, they are very positively correlated: if you have higher FLOPs, then very likely you have higher accuracy. All right, so we use this approach to narrow down the design space, with the heuristic that you want more FLOPs given the same memory budget. Now that we have finalized the design space, how do we search on top of it, and how do we automate the process without human intervention? That matters if you want to build a product, build a tool: you don't want to specialize the model using human experts, which is not scalable. You want a push-button solution: no matter what scenario comes, given a new memory constraint, a new latency constraint, an accuracy target, you want to fit exactly that design target. How do we get a push-button solution to find the neural network architecture? We will talk about several search strategies on top of the search space, including grid search, random search, reinforcement learning, gradient descent, and evolutionary search. Let's start with grid search, which is the simplest and sometimes works very well. Grid search is a very traditional way of hyperparameter optimization; you can have many knobs to optimize.
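For example, a grid search over two knobs that eliminates constraint-breaking designs before training might look like this sketch; the knobs, the latency model, and the budget are illustrative assumptions:

```python
import itertools

# Grid search over two knobs (step sizes are illustrative).
resolution_mults = [1.0, 1.1, 1.2]
width_mults      = [1.0, 1.1, 1.2]

# Hypothetical cost model: relative latency grows as r^2 * w^2.
def relative_latency(r, w):
    return (r ** 2) * (w ** 2)

LATENCY_BUDGET = 1.6   # relative to the 1.0x / 1.0x baseline

# The search space is the Cartesian product of the per-knob grids.
# Eliminate designs that break the constraint *before* training them.
candidates = [
    (r, w)
    for r, w in itertools.product(resolution_mults, width_mults)
    if relative_latency(r, w) <= LATENCY_BUDGET
]
# Each surviving candidate would then be trained briefly and the most
# accurate one kept; training is omitted in this sketch.
```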
In this scenario we use two for illustration purposes: the resolution, 1.0x, 1.1x, or 1.2x, and the width, 1.0x, 1.1x, or 1.2x, so we are trading off between resolution and width. We call it grid search because each dimension has a step size, and we divide the whole space into a grid of smaller cells; each dot on the grid is one design in the search space. The entire search space is represented as the Cartesian product of the single-dimension design spaces; this is basically the cross product between the resolution choices and the width choices. To obtain the accuracy of each candidate network, we have to train it from scratch; we don't have to train all the way to convergence, but we do train for a while to observe the accuracy. That's the bad part of it: you have to train every design. If you have a lot of GPUs, that's probably a very easy approach: just two for-loops, over resolution and width, train them all, then pick the best one. Of course, we also want to satisfy the constraint, so we can prune some of them: a design with a very large width and a pretty big resolution may not fit, may not satisfy the constraint, so we eliminate the designs that break the constraint even before training them. This is very easy to implement, but it may take quite a long time to evaluate the candidates one by one, so make sure you prune away the obviously bad ones, like the red ones in this figure. Now we want to introduce compound scaling. You have so many dimensions to search over: do you push one to the extreme, or do you tune each knob a little so that they complement each other? Given this baseline architecture, from input to output, we can scale the width, making the network much wider; we can also scale the depth, making it much deeper; and we can scale the resolution, higher or lower. The best way is actually not to push any single design knob to the extreme.
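The compound scaling rule can be made concrete with a small calculation. Assuming FLOPs scale roughly as depth × width² × resolution², the per-dimension multipliers are chosen so that their compounded product is about 2x; the particular values below are the commonly cited EfficientNet-style coefficients, used here only for illustration:

```python
# Compound scaling sketch: FLOPs scale roughly as
#     depth * width^2 * resolution^2,
# so to make the model ~2x larger we want
#     alpha * beta^2 * gamma^2 ≈ 2
# rather than pushing any single knob to 2x.
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution multipliers

flops_multiplier = alpha * (beta ** 2) * (gamma ** 2)
# 1.2 * 1.21 * 1.3225 ≈ 1.92, i.e. close to the 2x target

# Scaling further by a factor phi compounds each dimension:
def scaled_flops_multiplier(phi):
    return (alpha ** phi) * (beta ** (2 * phi)) * (gamma ** (2 * phi))
```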
Don't only scale the width or only the depth; we want to do compound scaling. That's the lesson here: make it a little deeper, but not to the extreme, a little higher in resolution, a little wider. Here the target is to make the model 2x the original network, so you don't want to make any single dimension 2x; rather, the width, the depth, and the resolution multiplied together should give you 2x, not 2x for each of them. That's the principle. Similarly, if you want to compress a network, you don't want to do just quantization or just distillation; you want to put them together, do a little bit of each and combine them, and that usually works pretty well. So that's grid search. Random search seems very trivial: compared with grid search, where you enumerate designs with two for-loops, for x and for y, and evaluate each of them, here you just randomly sample and evaluate designs from the design space. But it is actually quite nontrivial and quite helpful as a sanity check to start with. Even in the homework, make sure you run random search for a little while: you don't want to see the reward stay constant with no improvement at all, because that means something is wrong in your design space, your search, your evaluation, or your reward function. You want to do a first-order approximation, a quick sanity check, by running random search; after a few iterations you should observe some improvement in the latency-accuracy trade-off. So random search is really the first step. Reinforcement learning is a more advanced approach, a bit tricky and not that easy to use in practice, but I want to cover it here. You design a sequential decision-making problem: you decide the number of filters, the filter height, the filter width, the stride height, the stride width, and so on, layer by layer, step by step, using an RNN controller. Nowadays, RNNs are no
longer popular; you would probably use some kind of autoregressive model, not necessarily an RNN, as the controller. You use that controller to propose an architecture, consisting of the filter height, the filter width, the filter size, and so on; the controller proposes this architecture with some probability p, then you train a child network with this architecture to get its accuracy and latency, which gives you a reward. Then you compute the gradient of p, the probability, scale it by the reward R, and update the controller, so that the updated controller in the next iteration encourages high-reward architectures and discourages low-reward ones, and you repeat that process as the reward gets higher and higher. We can also use gradient descent to make this search process differentiable. For a path from this node to this node, we can have many different choices, many different configurations: here a 3x3 conv, a 5x5 conv, a 1x1 conv, and so on. How do we choose among them? We can treat the output as a weighted sum over the different edges, the different colors here; we are learning the probability of choosing each option. For example, here we end up choosing this blue path, which might be a 1x1 conv, but during training time we actually run inference through all of them; you can imagine the memory requirement is much larger, because we have to hold the activations for the 3x3, the 5x5, and the 1x1 while learning the probabilities for all three. At inference time we just sample according to the learned probabilities and pick the path with the highest probability, so training takes a lot of GPU memory, but inference is cheap. That's how we make the architecture choice differentiable. How about latency: can we also make latency differentiable? If that's the case, that would be interesting. Here we have four different choices for this layer: 3x3 conv, 5x5 conv, identity, and pooling.
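The weighted-sum-of-candidate-ops idea above can be sketched as below; the candidate ops are toy scalar functions standing in for real convolutions, and the architecture parameters are the learnable logits:

```python
import math

# Differentiable op selection sketch: the output of an edge is a weighted
# sum of every candidate op, weighted by a softmax over learnable
# architecture parameters. Ops here are toy scalar functions for clarity.
candidate_ops = {
    "conv3x3": lambda x: 0.9 * x,
    "conv5x5": lambda x: 0.8 * x,
    "conv1x1": lambda x: 1.0 * x,
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(x, arch_params):
    """Training-time behavior: run *all* ops (hence the large memory cost
    mentioned in the lecture) and mix their outputs."""
    probs = softmax(arch_params)
    outs = [op(x) for op in candidate_ops.values()]
    return sum(p * o for p, o in zip(probs, outs))

def derive_op(arch_params):
    """Inference time: keep only the highest-probability op."""
    probs = softmax(arch_params)
    names = list(candidate_ops)
    return names[max(range(len(probs)), key=probs.__getitem__)]
```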
From the accuracy perspective, between the 3x3 and the 5x5, the 5x5 probably has higher expressiveness, higher capacity; but from the latency perspective those convolutions are slow, and the identity would be the fastest. So how do we take that into account when we are doing stochastic gradient descent? Previously we only cared about the cross-entropy loss, which doesn't factor in latency; the identity may be a lot better than the 5x5 with respect to latency. So how do we make latency differentiable? We learn the probability of choosing each path, and we pre-measure the latency of each building block: measure the latency of the 3x3 convolution (it is the same given the same input size), the latency of the 5x5, the latency of the identity, the latency of the pooling. Then the expected latency is the sum over the candidates of probability times latency: probability times latency, plus probability times latency, and so on. That's the expected latency given this probability distribution, and we can optimize the probability distribution by adding this expected latency to the loss function. So the loss has the cross-entropy term, the regularization term for the weights, and the latency term. In this way we make latency differentiable: we plug the expected latency in as part of the loss, so that when we do gradient descent we also take care of latency, and we can learn a model that gives both high accuracy and low latency. Any questions about this part, about how to make latency differentiable? Okay, good. The next search strategy is evolutionary search. A quick review: we started with grid search and random search, then reinforcement learning and gradient descent; now let's talk about evolutionary search, which mimics the evolution process in biology. We have a population, which comes from our design space, and then we have a fitness function.
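Stepping back for a moment, the expected-latency loss just described can be sketched as follows; the latency table entries are made-up measurements, and in a real system an autograd framework would differentiate this with respect to the architecture parameters:

```python
import math

# Making latency differentiable: pre-measure each candidate op's latency
# once (a lookup table); the *expected* latency is then the probability-
# weighted sum, which is differentiable w.r.t. the architecture parameters.
latency_table = {            # milliseconds, illustrative measurements
    "conv3x3":  1.0,
    "conv5x5":  2.3,
    "identity": 0.01,
    "pool":     0.4,
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_latency(arch_params):
    probs = softmax(arch_params)
    return sum(p * lat for p, lat in zip(probs, latency_table.values()))

def total_loss(ce_loss, weight_l2, arch_params, lam=1e-4, beta=0.1):
    # cross-entropy + weight regularization + latency term, as on the slide
    return ce_loss + lam * weight_l2 + beta * expected_latency(arch_params)
```

Pushing probability mass toward the identity op lowers the expected latency, which is exactly the pressure the latency term adds to gradient descent.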
The fitness function is basically the accuracy, the latency, and so on. Then we do mutation and crossover, select the survivors, assuming they are the best species, and repeat this process: mutation, crossover, then survival selection, over and over, and then we terminate. Let's see how that works. Fitness selection: we are given a sub-network (from the once-for-all network, which we will talk about in the next lecture and in the homework), and we measure its latency and accuracy. We want low latency and high accuracy; otherwise it doesn't fit, it's too slow, or the accuracy is too low, and we need to resample, get the feedback, sample again, and repeat the process. So the fitness function basically trades off between accuracy and efficiency, where efficiency can be latency, throughput, memory consumption, energy, and so on. What is mutation? We start with a network architecture with two stages, a depth of three in the first stage and a depth of three in the second stage, and we mutate the depths, so that the first stage can become a depth of two and the second stage can become a depth of four, or three, or five. Besides the depth, we can also mutate the operators: for example, mutate from a 3x3 convolution to a 5x5 convolution, or from a 3x3 convolution to a 7x7 convolution. That's mutation. We can also do crossover. Mutation takes one input, but crossover takes two inputs: this is input one, this is input two, two parent architectures we want to cross over. For the first layer, the two parents are the same, so taking it from parent one is no problem; for the second layer we can choose from either parent one or parent two, and here the crossover happens to choose from parent two; similarly, the last layer is also chosen from parent two. So for each layer we randomly choose one operator among the two choices from the two parents. That's crossover; we are going to implement it in the third lab.
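Mutation (one parent) and crossover (two parents) can be sketched as below; the list-of-operators encoding and the toy evolution loop are illustrative, not the lab's actual API:

```python
import random

# Evolutionary search sketch: an architecture is encoded as a list of
# per-layer operator choices (a toy encoding for illustration).
OPS = ["conv3x3", "conv5x5", "conv7x7"]

def mutate(arch, rng, p=0.3):
    """Mutation takes ONE parent: randomly perturb some operator choices."""
    return [rng.choice(OPS) if rng.random() < p else op for op in arch]

def crossover(parent1, parent2, rng):
    """Crossover takes TWO parents: for each layer, randomly pick the
    operator from one of the two parents."""
    return [rng.choice([a, b]) for a, b in zip(parent1, parent2)]

def evolve(population, fitness_fn, rng, generations=10, survivors=2):
    population = list(population)
    for _ in range(generations):
        # survival selection: keep the fittest individuals as parents
        population.sort(key=fitness_fn, reverse=True)
        parents = population[:survivors]
        children = [mutate(rng.choice(parents), rng) for _ in range(2)]
        children.append(crossover(parents[0], parents[1], rng))
        population = parents + children
    return max(population, key=fitness_fn)
```

In the real pipeline the fitness function would query accuracy and latency of each sub-network and reject candidates that break the hardware constraint before they enter the population.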
So make sure you are clear about this: mutation takes one input; crossover takes two inputs, two parents. With that, we conclude this lecture on neural architecture search. We first introduced the primitive operations and used those primitive operations to construct the building blocks; then we talked about the search space, the large design space, at the cell level versus the network level; then how to design the search space so that we can get high FLOPs given the same memory constraint; and finally how to search on top of the search space using grid search, random search, reinforcement learning, gradient descent, and evolutionary search. All right, lab two is out, feel free to check it out, and let us know if you have any questions in the office hours. Here are some references. Thank you, that's all for the lecture today.