Side Project Ideas with Large Pretrained Foundation Models

January 29, 2023 Ivan Zhou

With the advent of large pre-trained Foundation Models (DALL-E, GPT-3, CLIP, Stable Diffusion, etc.), the opportunities to build unique and impactful applications keep growing while the technical barrier keeps getting lower. These models are trained on large amounts of broad data and can be adapted to a wide range of downstream tasks. They have shown impressive generative and few-shot learning abilities.

I have been brainstorming possible side project ideas with friends to try with Foundation Models. In this post, I’ll share some of the more interesting and practical ones. Whether you're a seasoned developer or a beginner, I hope this will inspire you to unleash your creativity and build something amazing.

Created with OpenAI’s DALL·E 2

Image Search with General Descriptions

My phone album contains hundreds of thousands of images. I want to find a photo of “myself holding a basket of cherries and a cheese platter”.

I want to find a photo in my private collection using a general description.

Foundation models trained on image-text pairs are good at associating text descriptions with image features. I’ve previously made a few toy projects that leverage pre-trained foundation models, like zero-shot object detection in images and background removal without any annotation. Beyond that, a study shows that such models can even pick up abstract concepts like “many”, “bird-view”, and “step-by-step”.

Therefore, I am thinking of an application where users can search through their photo stock with natural language queries. It would be similar to the search feature in Google Photos, but it would work with general descriptions (instead of keywords) and with private image repositories.

Some possible features:

It can answer queries like “find me the photo of me holding a basket of cherries and a cheese platter”
It can search with an image as a query and find all similar images.
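
Both features above reduce to the same operation if the model maps text and images into one shared embedding space, as CLIP-style models do: embed the query, then rank the stored photos by similarity. Here is a minimal sketch of that ranking step only. The three-dimensional vectors are hand-made stand-ins for real model embeddings, and the file names and the `search_photos` helper are hypothetical, so treat this as an illustration rather than a working photo search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_photos(query_embedding, photo_index, top_k=3):
    """Return the names of the top_k photos most similar to the query.

    photo_index maps a file name to its precomputed embedding. The query
    embedding may come from a text description or from another image.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in photo_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# ASSUMPTION: toy 3-d vectors standing in for embeddings produced by a
# real image-text model. Real embeddings have hundreds of dimensions.
PHOTO_INDEX = {
    "cherries_picnic.jpg": [0.9, 0.8, 0.1],
    "beach_sunset.jpg": [0.1, 0.2, 0.9],
    "cheese_board.jpg": [0.7, 0.9, 0.2],
}

if __name__ == "__main__":
    # Pretend this vector is the embedding of a text query.
    text_query = [0.85, 0.85, 0.1]
    print(search_photos(text_query, PHOTO_INDEX, top_k=2))

    # Image-as-query: reuse a stored photo's embedding as the query.
    print(search_photos(PHOTO_INDEX["beach_sunset.jpg"], PHOTO_INDEX, top_k=1))
```

For an album with hundreds of thousands of images, this linear scan would likely be replaced by an approximate nearest-neighbor index, but the ranking idea stays the same.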

Relevant paper/project:

OpenAI’s CLIP: https://github.com/openai/CLIP
Microsoft’s GLIP: https://github.com/microsoft/GLIP
Visual exploration of vision transformers: https://arxiv.org/abs/2212.06727

ChatGPT with private domain knowledge

ChatGPT knows the Internet of 2021. If you ask it any question about 2023 or your company’s latest return-to-office rule, it won’t be able to answer.

ChatGPT knows the Internet of 2021 and nothing beyond.

A fixed Large Language Model (LLM) can only answer from what it saw during training. If we instead first retrieve relevant documents (online or elsewhere) and then have the LLM turn the query and those documents into an answer, we get an alternative to today's web search. The effect would be akin to having an LLM run a web search and summarize the results. The answers can be more specific, more accurate, and better tailored to a custom context than those of a fixed LLM like ChatGPT. One example would be a chatbot that searches through Notion or personal wiki repositories and answers questions in natural language.
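
The retrieve-then-answer flow can be sketched in a few lines. The version below is only an illustration under simplifying assumptions: retrieval is plain word overlap (a real system would more likely use embeddings), the notes in `NOTES` are made up, and the final call to an LLM is left out entirely, so the sketch stops at building the prompt. The helper names `retrieve` and `build_prompt` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Return up to top_k documents sharing the most words with the query.

    ASSUMPTION: word overlap is a crude stand-in for a real retriever.
    """
    query_tokens = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_tokens & tokenize(doc))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, documents, top_k=2):
    """Assemble a prompt that asks the LLM to answer from retrieved context."""
    context = retrieve(query, documents, top_k)
    lines = ["Answer the question using only the context below.", "", "Context:"]
    lines += [f"- {doc}" for doc in context]
    lines += ["", f"Question: {query}", "Answer:"]
    return "\n".join(lines)


# ASSUMPTION: made-up private notes standing in for a Notion or wiki export.
NOTES = [
    "Return-to-office policy: employees come in on Tuesdays and Thursdays starting March.",
    "The cafeteria menu rotates weekly.",
    "Expense reports are due on the last Friday of each month.",
]

if __name__ == "__main__":
    # The printed prompt is what would be sent to the LLM of your choice.
    print(build_prompt("What is the return-to-office policy?", NOTES, top_k=1))
```

Because the private documents are injected at query time, the underlying model does not need to be retrained when the notes change.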

Relevant paper/project:

Meta's Atlas: https://arxiv.org/abs/2208.03299
DeepMind's RETRO: paper and PyTorch implementation
GPT Index: data structure and interface implementation to help with tasks like summarization and Q&A

Video Analysis

Video is among the most natural media for people to watch, communicate with, and be entertained by. However, video data is highly unstructured, so it is difficult to organize and search. Can we extract structured data from raw video content, such as metadata, transcripts, key figures, and descriptions?

As an example project, we can pull the videos from a selected YouTube channel, process them, and extract structured data. Then we can enable features like:

Statistical analysis to answer questions like how many times a certain keyword appears
Search with a concept and return video clips
Editing: remove unwanted filler words
Detect unsafe content
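
Once a transcript with timestamps exists, the first and third features above become simple text processing. The sketch below assumes the transcript has already been produced by some speech-to-text step and is available as `(start_seconds, text)` pairs. The sample transcript, the filler list, and the helper names are all made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: a made-up transcript as (start_seconds, text) segments.
TRANSCRIPT = [
    (0.0, "Um, welcome back to the channel."),
    (4.5, "Today we, uh, fine-tune a diffusion model."),
    (9.0, "The diffusion model needs, um, a lot of data."),
]

# ASSUMPTION: a tiny filler-word list; a real one would be longer.
FILLERS = {"um", "uh"}


def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_hits(transcript, keyword):
    """Return (total count, start times of segments that mention keyword)."""
    keyword = keyword.lower()
    total = 0
    starts = []
    for start, text in transcript:
        count = words(text).count(keyword)
        if count:
            total += count
            starts.append(start)
    return total, starts


def filler_counts(transcript):
    """Count how often each filler word appears across the transcript."""
    counts = Counter()
    for _, text in transcript:
        counts.update(word for word in words(text) if word in FILLERS)
    return counts


if __name__ == "__main__":
    print(keyword_hits(TRANSCRIPT, "diffusion"))
    print(filler_counts(TRANSCRIPT))
```

The start times returned by `keyword_hits` are what a clip search would use to jump to the right moment, and the filler positions could drive an automatic edit.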

Search for a video clip with a general description.

Other ideas worth mentioning

There are other ideas that I found interesting and practical while brainstorming, so I am going to share them below.

Ask GPT to perform tasks on the browser

A chatbot that takes human instructions and performs multi-step tasks in the browser: search through Wikipedia, summarize an article, post a tweet, etc. It can be built by combining an LLM with Selenium.
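
One possible way to wire the two together is to have the LLM emit a structured plan and let a small executor carry it out, rejecting anything outside a fixed set of actions. The sketch below is an assumption about how such a bridge could look, not a description of any existing project. The JSON plan format, `run_plan`, and `RecordingBrowser` are all hypothetical, and the browser here only records calls. In a real build, its methods would wrap Selenium WebDriver calls such as `driver.get(url)`.

```python
import json


class RecordingBrowser:
    """Stand-in for a real browser driver; it only logs requested actions."""

    def __init__(self):
        self.log = []

    def open(self, url):
        self.log.append(("open", url))

    def type_text(self, selector, text):
        self.log.append(("type", selector, text))

    def click(self, selector):
        self.log.append(("click", selector))


# ASSUMPTION: a hypothetical action vocabulary and the fields each action needs.
ALLOWED_ACTIONS = {
    "open": ("url",),
    "type": ("selector", "text"),
    "click": ("selector",),
}


def run_plan(plan_json, browser):
    """Validate and execute a JSON list of actions; return the step count."""
    steps = json.loads(plan_json)
    for step in steps:
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action: {action!r}")
        args = [step[name] for name in ALLOWED_ACTIONS[action]]
        if action == "open":
            browser.open(*args)
        elif action == "type":
            browser.type_text(*args)
        else:
            browser.click(*args)
    return len(steps)


# ASSUMPTION: a plan of the kind an LLM might be prompted to produce.
EXAMPLE_PLAN = json.dumps([
    {"action": "open", "url": "https://en.wikipedia.org"},
    {"action": "type", "selector": "input[name=search]", "text": "foundation model"},
    {"action": "click", "selector": "button[type=submit]"},
])

if __name__ == "__main__":
    browser = RecordingBrowser()
    run_plan(EXAMPLE_PLAN, browser)
    for entry in browser.log:
        print(entry)
```

Keeping the action list small and explicit means a confused model can fail loudly instead of doing something unexpected in the browser.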

Reference:

Ask an AI Assistant to order a pizza through DoorDash. Created by Div Garg and Behzad Haghgoo.

AI-assisted writing on mobile

AI-assisted writing on mobile devices. Users select a theme, template, style, length, etc., and the app generates the text for them.

Reference:

https://twitter.com/russelljkaplan/status/1617243806390431744?s=20

AI-assisted writing on mobile

In the End

By building on large pre-trained Foundation Models, you can bring creative side projects to life with far less effort than before. I’d love to get my hands dirty with these models and try out interesting applications. I will keep this post updated with new ideas as they come to mind.