The past week was the start of the fall quarter at Stanford. I am taking CS330 Deep Multi-Task and Meta Learning this quarter and I am super excited about it.
Meta-learning is one of the promising lines of work that aim to solve the small-data problem in the machine learning field. Currently, many people working on AI are thinking day and night about how to scale AI systems and improve their profit margins. One main challenge is how to quickly build an AI model that reaches human-level performance on classes with only a few samples. Andreessen Horowitz recently had a very good discussion on this issue of improving AI economics.
In the space of manufacturing, for example, people generally believe in the potential of AI technology, and there is evidence that AI models beat traditional pattern-matching-based CV techniques in accuracy and robustness. However, a fundamental constraint blocking many AI applications from successful deployment is that whenever a new defect type appears or the process changes, the AI models need to be re-trained, which usually requires a non-trivial number of samples and a few rounds of model iteration. Deploying AI systems will be more practical and scalable if AI models can reach >99% precision and recall by training on only a few samples, if their recognition and detection capabilities can transfer across different tasks, and if they can master learning how to learn new tasks. This capability of learning how to learn is what we call meta-learning. That's what makes me very excited about this course.
The instructor of this course is Chelsea Finn, who is teaching it for the second year. Chelsea has worked on reinforcement learning and meta-learning at Berkeley BAIR and Google Brain, and she shares her research publicly on Twitter a lot. I really like her passion and mindset, which is another main factor that attracted me to this course.
The first two lectures have been fine. In the second lecture, Chelsea started on multi-task learning, mostly focusing on how to condition on the task descriptor and how to design the architecture of shared parameters in multi-task models. When she came to the objective and explained the loss function that combines losses from different tasks, I realized that mainstream object detection models largely overlap with this area. From the perspective of multi-task learning, object detection models like RetinaNet are learning two tasks at the same time: localizing the target objects (by creating bounding boxes) and classifying the objects into one of the designated classes (by assigning a class). Both tasks depend on the model's deep understanding of the image content, so they share lots of weights, and the multi-head architecture is the mainstream design (see the sketch below). In past work, we always tied the bounding box labels and classification labels closely: each bounding box needs to have a class. However, from the perspective of multi-task learning, such a practice is useful but not necessary: you can separate them into two completely independent tasks, so it's fine if an image has bounding boxes but no classification labels (though maybe not the reverse). That means you can discard or change some of the classification labels while retaining the bounding boxes when, for example, you split a class into multiple sub-classes or update your class definitions.
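To make the shared-backbone, multi-head idea concrete, here is a minimal PyTorch sketch. It is not RetinaNet itself, and every name in it (TwoHeadDetector, feat_dim, the head shapes) is an illustrative assumption: a shared backbone produces one feature vector, and two task-specific heads read from it.

```python
import torch
import torch.nn as nn

class TwoHeadDetector(nn.Module):
    """Illustrative multi-task model: one shared backbone, two task heads."""

    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        # Shared parameters: both tasks rely on the same image features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Task 1: localization head, predicting box coordinates.
        self.box_head = nn.Linear(feat_dim, 4)
        # Task 2: classification head, predicting class logits.
        self.cls_head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)
        return self.box_head(feats), self.cls_head(feats)
```

Because the classification head is a separate set of parameters on top of shared features, you can redefine or drop class labels without touching the localization head or its bounding-box labels.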
As I learn more about multi-task learning, I find the optimization approach is not very distinct from a vanilla deep neural network: you still use back-propagation on the gradient of the combined loss, and you handle overfitting/underfitting in a similar fashion. You still approach the problem with the same three key components: the input data distribution, the output data distribution given the input, and the loss function. The core underlying technology doesn't change from single-task learning; only the problem statement and the architecture do. This extends the application of deep learning into a bigger scope of work and powers real-life applications like YouTube's multi-task recommendation system and Pinterest's text embedding system. The knowledge stored in a neural network may not transfer that well across problems, but our human knowledge certainly does.
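Continuing the sketch above, a training step under this framing is just ordinary back-propagation on a weighted sum of the per-task losses. The loss weights, loss choices, and dummy tensors here are my own assumptions for illustration, not values from the lecture.

```python
import torch
import torch.nn.functional as F

model = TwoHeadDetector()  # the illustrative model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 8 images with box targets and class labels.
images = torch.randn(8, 3, 64, 64)
box_targets = torch.randn(8, 4)
cls_targets = torch.randint(0, 10, (8,))

box_pred, cls_pred = model(images)

# Multi-task objective: a weighted sum of per-task losses.
# w_box and w_cls are hypothetical hyperparameters.
w_box, w_cls = 1.0, 1.0
loss = (w_box * F.smooth_l1_loss(box_pred, box_targets)
        + w_cls * F.cross_entropy(cls_pred, cls_targets))

optimizer.zero_grad()
loss.backward()  # plain back-propagation, exactly as in single-task learning
optimizer.step()
```

Note how the decoupling from earlier falls out for free: for a batch with boxes but no class labels, you can zero out the classification term and the localization task keeps training on the shared backbone.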