Ivan Zhou

My Ray Summit Talk - Building Multi-Modal Foundation Models for Document Automation

I gave a presentation at the Ray Summit on my work building Multimodal Foundation Models for Document Automation at Uber. It is always a great pleasure to publicly share what I have been building over the past year!

About My Presentation

Here is a recording of my presentation. The title is “Building Multi-Modal Foundation Models for Document Automation”:

Here is my slide deck: Ray Summit 2024 - Foundation Models for Document Automation.

Uber receives a hundred million documents every year to onboard new drivers, validate grocery receipts, process invoices, and more. There is a strong business need to automate document processing quickly and accurately. The performance bar is high enough to rule out most general-purpose multimodal LLMs, so we invested in pretraining a foundation model from scratch on the in-house data accumulated over the years.

We built a multimodal foundation model that takes document text, layout, and images as input modalities. We carefully designed our own tokenizer to cover the languages of Uber’s key markets and tailored it specifically to document-processing needs. We pretrained the model on tens of millions of documents spanning 120 different types.
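To make the text-plus-layout-plus-image idea concrete, here is a minimal sketch of one common way such inputs are fused (in the style of layout-aware document models): each text token's embedding is summed with embeddings of its quantized bounding-box coordinates, and visual patch features are appended as extra sequence positions. All names, dimensions, and the fusion scheme here are illustrative assumptions, not Uber's actual architecture.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not Uber's configuration.
VOCAB_SIZE = 32_000   # custom document tokenizer vocabulary
COORD_BINS = 1_000    # bounding-box coordinates quantized to a 1000-bin grid
D_MODEL = 64          # embedding width (tiny, for illustration)

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB_SIZE, D_MODEL))
coord_emb = rng.normal(size=(COORD_BINS, D_MODEL))  # shared table for x and y

def embed_document(token_ids, boxes, image_patches):
    """Fuse text, layout, and image modalities into one input sequence.

    token_ids:     (T,) int token ids from the document tokenizer
    boxes:         (T, 4) normalized [x0, y0, x1, y1] in [0, 1) per token
    image_patches: (P, D_MODEL) pre-computed visual patch features
    """
    text = token_emb[token_ids]                       # (T, D_MODEL)
    bins = (np.asarray(boxes) * COORD_BINS).astype(int)
    layout = coord_emb[bins].sum(axis=1)              # sum of the 4 coordinate embeddings
    fused_text = text + layout                        # text + layout per token
    # Append visual patch tokens so a transformer can attend across modalities.
    return np.concatenate([fused_text, image_patches], axis=0)

seq = embed_document(
    token_ids=np.array([5, 17, 291]),
    boxes=np.array([[0.10, 0.10, 0.30, 0.15],
                    [0.35, 0.10, 0.50, 0.15],
                    [0.10, 0.20, 0.40, 0.25]]),
    image_patches=rng.normal(size=(4, D_MODEL)),
)
print(seq.shape)  # (7, 64): 3 fused text tokens + 4 image patches
```

The resulting sequence would then be consumed by an ordinary transformer encoder; the additive layout embedding is one of several possible fusion choices.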

The resulting model is compact and fast enough to process documents in real time, and it achieved higher accuracy than general LLMs, including GPT-4o and Llama 3.2. It now supports over 20 different document types in production, and the list is still growing, as we have platformized this technology to onboard new document types through a data and training flywheel. It is an exciting technical challenge to work on!

About Ray Summit

I had an awesome experience at the Ray Summit. The Ray community is a group of enthusiastic and curious practitioners focused on ML systems and applications. The summit featured numerous talks on multimodality, LLMs, and training & serving at scale. Teams from many leading-edge companies came to share their work. The audience was highly supportive and eager to learn more about the innovative projects everyone was working on. The atmosphere was electric with positivity and excitement.

Here are a few other videos that I enjoyed watching during the summit. I am still catching up on a few more that I missed:

  • Keynote of Day 1 - I particularly enjoyed the part where Ion Stoica shared the concept of the AI Complexity Wall and the trends in GenAI systems and infrastructure.

  • Keynote by Instacart co-founder Brandon Leonardo on the attempts (both failed and successful) at Generative AI within his company. It is fascinating to see how GenAI is being incorporated into both customer experiences and day-to-day internal workflows.

  • OpenAI’s CPO Kevin Weil discussed managing product at OpenAI, o1, distillation, and the real-time voice API they had just announced at a recent DevDay.

  • Niket Agarwal from Nvidia talked about their efforts on video curation. The amount of compute required for large-scale video understanding, training, and selection is mind-boggling. This massive technical challenge is a perfect fit for Nvidia to showcase their muscle.

  • Amjad Almahairi from Anyscale presented the new Ray Compiled Graphs, which are positioned to improve the training efficiency of multimodal AI models.