Insights 7: Open Source Large Models, Vespa

April 7

In this issue we discuss interesting developments in applied AI, including search technology with Vespa.ai, few-shot entity extraction, and open-source large models. We then review AI research announcements from big tech: Google AI, Meta AI, and Microsoft Research.

We kick things off with our pick of cool AI startups, DeepDub, which recently raised a $20M Series A. The company provides AI-powered dubbing services for film, TV, gaming, and advertising, powered by neural networks that split and isolate voices and replace them in the original tracks. According to VentureBeat, DeepDub’s work includes “Every Time I Die,” a 2019 English-language thriller that the startup dubbed into Spanish and Portuguese. It marks one of the first times that an entire dubbed movie used voice clones based on the voices of the original cast.

AI in Practice

As engineers we are constantly on the lookout for new tools that help us do our job better. It’s no different for AI engineers. In this issue we will cover three such tools: modern search with Vespa.ai, few-shot named entity recognition with large language models, and open-source large models.

A Modern Search Platform: Vespa.ai

After almost 8 years at Bing, I joined AI2 in 2014 to start Semantic Scholar with Oren Etzioni. Our options for a search platform were limited then, and we ended up picking ElasticSearch over Solr. ElasticSearch had released version 1.0 just two months earlier. Fast forward 8 years: ES is now at version 8.0, with many more features and much wider community adoption, and Elastic is an $8B public company. We also witnessed a public fight between Elastic and Amazon.

Today we have at least a couple of additional options for a search platform. The first is Jina.ai, which brands itself as an open-source neural search platform and has 14K GitHub stars. Jina recently raised a $30M Series A. The second is Vespa.ai, the search platform behind many of Yahoo.com’s sites (News, Finance, Sports, etc.). Yahoo open-sourced Vespa in September 2017. Vespa currently has 3.8K GitHub stars, far fewer than Jina, but it appears to be more battle-tested: in addition to Yahoo.com sites, Vespa is also used at OkCupid and, most recently, at Spotify. At the AI2 incubator we are trying out Vespa on a project and will share our experiences and learnings in the future.

Few-shot NER

Named entity recognition (NER) is a common NLP task. Traditionally, we train NER models using tools such as spaCy and NLTK. This approach requires gathering a large amount of training data (using tools such as Prodigy), which can be quite time-consuming. Large pre-trained models such as GPT-J and GPT-NeoX (EleutherAI) and Cohere.ai’s models offer a quicker alternative: instead of training a model, we convert just a handful of examples into a prompt and let the model complete it. For example, we can extract job titles with a few-shot prompt (source: NLP Cloud).

In this case study, the prompt provides three example sentences with their job titles, then asks for the job title in a fourth sentence. The output should be “CTO”.
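The original prompt is not reproduced here, so the following is only a minimal sketch of how such a few-shot prompt could be assembled. The example sentences, the `[Text]`/`[Position]` labels, the `###` separator, and the helper names `build_prompt` and `parse_completion` are all illustrative assumptions, not NLP Cloud's actual prompt or API.

```python
from typing import List, Tuple

# Hypothetical labeled examples (sentence, job title); not the original NLP Cloud prompt.
EXAMPLES: List[Tuple[str, str]] = [
    ("Maria has been leading the company as CEO since 2015.", "CEO"),
    ("After ten years in finance, Tom was promoted to CFO.", "CFO"),
    ("Priya joined the startup as a data scientist last spring.", "data scientist"),
]

# Assumed separator between examples; it can also be passed to the model as a stop sequence.
SEPARATOR = "###"


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Turn labeled examples plus one unlabeled sentence into a few-shot prompt."""
    blocks = [f"[Text]: {text}\n[Position]: {title}" for text, title in examples]
    # The last block leaves the label empty so the model fills it in.
    blocks.append(f"[Text]: {query}\n[Position]:")
    return f"\n{SEPARATOR}\n".join(blocks)


def parse_completion(completion: str) -> str:
    """Keep only the first line the model wrote before any separator."""
    first = completion.split(SEPARATOR)[0].strip()
    return first.splitlines()[0].strip() if first else ""


prompt = build_prompt(EXAMPLES, "Alex built the first prototype and now serves as CTO.")
print(prompt)
```

The resulting string would be sent to whichever text-completion endpoint is in use; `parse_completion` then trims the raw completion down to the extracted title, since large models tend to keep generating further examples after the answer.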

This sounds pretty amazing, but what’s the catch? There are two factors to consider: accuracy and cost. Few-shot learning has made huge progress with the rise of large models, but accuracy can often be improved further with dedicated training or fine-tuning. Running inference with large models over a large number of inputs (e.g., in a Spark job) can be both expensive and tricky. Consequently, few-shot learning with large models is best used in prototyping or MAP-type efforts. Nevertheless, we expect rapid progress in addressing these issues in both academia and industry.

Open GPT: EleutherAI, BigScience, and PolyCoder

OpenAI revolutionized few-shot learning with their work on GPTs, particularly with GPT-3 and the OpenAI API. Startups such as AI21 Labs and Cohere.ai quickly followed with their own large models and APIs. All of these models, however, are closed source. For practitioners who want to get under the hood of large models, this was not possible until EleutherAI came along, first with GPT-J (a 6B-parameter model) and more recently with GPT-NeoX (a 20B-parameter model). These models are available for prototyping via NLP Cloud and GooseAI (running on CoreWeave Cloud).

In addition to EleutherAI, a noteworthy effort is BigScience:

During one-year, from May 2021 to May 2022, 900 researchers from 60 countries and more than 250 institutions are creating together a very large multilingual neural network language model and a very large multilingual text dataset on the 28 petaflops Jean Zay (IDRIS) supercomputer located near Paris, France. During the workshop, the participants plan to investigate the dataset and the model from all angles: bias, social impact, capabilities, limitations, ethics, potential improvements, specific domain performances, carbon impact, general AI/cognitive research landscape. All the knowledge and information gathered during the workshop is openly accessible and can be explored on our Notion.

The BigScience team just started training a 176B-param model, using data from 46 languages, on a 416-GPU low-carbon cluster. You can follow the progress live on Twitter. The training is expected to complete some time in June.

Lastly, there’s now an open-source counterpart to OpenAI’s Codex called PolyCoder, created by a group of researchers at CMU. It is based on the GPT-2 architecture, has 2.7 billion parameters, and was trained on 249GB of code across 12 programming languages. The team claims that PolyCoder is able to write in C with greater accuracy than all known models, including Codex.

Big Tech AI

When it comes to pushing the envelope in large-scale AI research, only a handful of companies have the talent and resources necessary to achieve results: Google, Meta, and Microsoft. In this section we review their recent announcements and published papers in this area, with commentary on their relevance to AI practitioners.

Size matters, not just with models, but also with training data.

With GPT-3, models got large in a hurry, so there is plenty of room for all sorts of optimization. Google (DeepMind) published a paper showing that the recent crop of large models (GPT-3, AI21’s Jurassic, Microsoft/NVIDIA’s Megatron-Turing, DeepMind’s own Gopher, etc.) is undertrained. By keeping the amount of training data roughly in proportion to model size, the DeepMind team shows a nice performance improvement. As a case study, they introduced a new 70B-parameter model called Chinchilla, which outperforms the 280B-parameter Gopher largely because it was trained on 4x more data.
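To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that training compute is about 6 x parameters x tokens, and it uses token counts as we recall them from the published papers (about 300B tokens for Gopher and about 1.4T for Chinchilla). Both the rule of thumb and the token counts are assumptions that do not come from this newsletter.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the rule of thumb C ~ 6 * N * D."""
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """How many training tokens the model sees per parameter."""
    return tokens / params


# Parameter counts are from the text above; token counts are assumed (see lead-in).
GOPHER = {"params": 280e9, "tokens": 300e9}
CHINCHILLA = {"params": 70e9, "tokens": 1.4e12}

for name, cfg in (("Gopher", GOPHER), ("Chinchilla", CHINCHILLA)):
    print(
        f"{name}: {training_flops(**cfg):.2e} FLOPs, "
        f"{tokens_per_param(**cfg):.1f} tokens per parameter"
    )

# Ratio of the two training budgets under these assumptions.
compute_ratio = training_flops(**CHINCHILLA) / training_flops(**GOPHER)
print(f"Chinchilla / Gopher compute: {compute_ratio:.2f}x")
```

Under these assumptions the two models land on a similar training budget, so the comparison is less about spending more compute and more about spending it on data rather than parameters.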

Hyperparameter tuning for large models.

When a neural network is too large to pretrain more than once, tuning its hyperparameters is practically impossible. Enter Microsoft Research, which came up with a cool idea called μTransfer: tune the hyperparameters on a much smaller proxy model, then transfer them to the full-size model. In a case study, Microsoft shows that μTransfer significantly improves the performance of a 6.7B-parameter GPT-3 with only 7% compute overhead. This model matches the performance of a 13B-parameter model that did not use μTransfer.

Go deeper.

The adjective “deep” in deep learning signifies that modern neural networks have many more layers than those of the pre-deep-learning era. The question of how deep we should go remains interesting, since training extremely deep transformer networks has been plagued by instability. Microsoft Research published a paper titled “DeepNet: Scaling Transformers to 1,000 Layers”, which goes an order of magnitude deeper than previous state-of-the-art transformers. The interesting quote is: "Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction."

(Cost-)Efficient serving.

Fine-tuning pre-trained models is common in NLP, but forking the model for each task can be expensive in production where many fine-tuned models are served. To address this problem, Google came up with the idea of prompt tuning, which adds a small set of learnable vectors to the input and can match fine-tuning quality while sharing the same frozen model across all tasks.
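The mechanics can be sketched in a few lines. This is a toy illustration only, not Google's implementation: the `FrozenModel` class, the dimensions, and the task names are hypothetical, and the training loop that would update the soft prompts by gradient descent is omitted. It shows only how each task's learnable vectors are prepended to the embeddings produced by one shared, frozen model.

```python
import random
from typing import Dict, List

Vector = List[float]

EMBED_DIM = 8      # toy embedding size (real models use thousands of dimensions)
PROMPT_LENGTH = 4  # number of learnable vectors prepended per task


def random_vectors(count: int, dim: int, rng: random.Random) -> List[Vector]:
    """Small random initialization for a list of vectors."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


class FrozenModel:
    """Stand-in for a pre-trained model whose weights are never updated."""

    def __init__(self, vocab_size: int, dim: int, rng: random.Random) -> None:
        self.embeddings = random_vectors(vocab_size, dim, rng)

    def embed(self, token_ids: List[int]) -> List[Vector]:
        return [self.embeddings[i] for i in token_ids]


rng = random.Random(0)
model = FrozenModel(vocab_size=100, dim=EMBED_DIM, rng=rng)

# One small soft prompt per task; these would be the only trainable parameters.
task_prompts: Dict[str, List[Vector]] = {
    "sentiment": random_vectors(PROMPT_LENGTH, EMBED_DIM, rng),
    "ner": random_vectors(PROMPT_LENGTH, EMBED_DIM, rng),
}


def model_inputs(task: str, token_ids: List[int]) -> List[Vector]:
    """Prepend the task's soft prompt to the frozen token embeddings."""
    return task_prompts[task] + model.embed(token_ids)


def trainable_params(task: str) -> int:
    """Count the parameters that would be trained for one task."""
    return sum(len(vector) for vector in task_prompts[task])


inputs = model_inputs("ner", [5, 17, 42])
print(len(inputs), "input vectors;", trainable_params("ner"), "trainable parameters")
```

Serving a new task then means storing one more small prompt matrix rather than a full copy of the model, which is where the cost saving comes from.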

Size matters, again.

Google joined the rare group of organizations that have pretrained a model with more than 500B parameters. Their entrant is the Pathways Language Model (PaLM), a 540B-parameter model that (of course) surpasses the performance of every large model that came before it. It can even explain jokes. It’s so good that we have to reshare here:

(Figure: PaLM explaining a joke.)

Meta AI also shared their latest work on scaling up models: SEER 10B. Meta described it as better, fairer computer vision through self-supervised learning on diverse datasets, with 10x more parameters than its predecessor.

Real-world deployment.

Microsoft made some advances in machine translation, productionizing a mixture-of-experts architecture called Z-code. The new update now supports 100 languages. They aim to roll out support for about 1,500 low-resource languages in the future.

Ethical issues with large models.

A new paper from Google Research found that large models have a memorization problem that is bigger than expected and that grows more serious as models get larger.

TL;DR for entrepreneurs looking to apply cutting-edge AI: the pace of innovation in AI continues to be rapid. We expect many of these advances to show up as more powerful capabilities that technologists can tap into when building real-world applications. At the AI2 incubator, we strive to help founders navigate this fast-changing landscape in their journeys building the next enduring companies.

Additional Readings That We Found Interesting
