AI2 Incubator Insights 3

This is the September 2021 edition of the technology newsletter from the AI2 startup incubator. After a brief community update on Ozette and Modulus, two companies that work at the intersection of AI and the life sciences, we’ll discuss the Transformers effect, self-supervised learning (SSL), large language models (LLMs), updates from Hugging Face, and more. We then wrap up with a roundup of interesting fundraises in AI startup land.

Community Updates

This month, our community updates have an AI ⋂ life sciences flavor:

  • Chris Picardo of Madrona Venture Group interviewed Ali Ansary, Ozette’s CEO.
  • TheSequence interviewed Bryce Daines, Modulus Therapeutics’s Chief Data Scientist.

What do Ozette and Modulus do again? Below are one-paragraph descriptions of these two companies that we proudly call alums.

  • Ozette: Ozette’s immune profiling platform provides insights that help answer some of the most pressing questions, such as whether or not a therapy works for an individual patient or if we can determine a patient’s disease before they have physical manifestations. Answering these critical questions and understanding the complexity of the immune system in health and disease helps us drive towards better patient outcomes—a core motivation for Ozette.
  • Modulus Therapeutics: We envision a future where the design of cell therapies is guided by machine learning to treat more diseases and patients than ever before. From advances in genetic engineering to genomics and artificial intelligence, we’re deepening the understanding of immune cell behavior.

Cool AI/ML News

Before diving into the main topics (Transformers, SSL, LLMs, and so on), let’s spotlight a couple of interesting updates.

First up is AI at the edge. Deploying neural networks to edge environments (ARM, x86, and WebAssembly architectures on Android, iOS, Windows, Linux, macOS, and Emscripten) requires a fair amount of optimization. XNNPACK is a highly optimized library of neural network inference operators for such scenarios. This month the TensorFlow team announced Faster Quantized Inference with XNNPACK. Quantization is among the most popular methods to speed up neural network inference on CPUs. A year ago, TensorFlow Lite improved performance for floating-point models by integrating the XNNPACK backend. Google has now extended the XNNPACK backend to quantized models. Compared to the default TensorFlow Lite quantized kernels, the average speedup across computer vision models is 30% on ARM64 mobile phones, 5x on x86-64 laptop and desktop systems, and 20x for in-browser inference with WebAssembly SIMD. As a (former) performance engineer, I like the sound of a 20x speedup.
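
For readers who have not worked with quantization, here is a minimal sketch of the general idea: store values as 8-bit integers plus a scale and a zero point, and convert back when needed. This is a generic affine scheme written in plain Python for illustration. It is not XNNPACK's or TensorFlow Lite's implementation, and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using an affine (scale, zero_point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range must include 0.0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of roughly one quantization step at most. The speedups quoted above come from running the arithmetic on those small integers with optimized kernels, which this sketch does not attempt.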

Second, in another update from Google that seems relevant to Michael Petrochuk and the WellSaid team, Google AI blogged about Recreating Natural Voices for People with Speech Impairments:

recreation of his voice generated by a machine learning (ML) model

The Transformers effect

In case you have not been paying attention, neural network Transformers are overloading news outlets around the world. Below is a sampling of Transformers-related developments this month.

  • Primer: Train transformers more efficiently. So et al. from Google Research proposed a new approach that aims to reduce the costs of Transformers by searching for a more efficient variant. Primer significantly reduces training cost compared to the original Transformer used in auto-regressive language modeling. The improvements are attributed to two changes: squaring ReLU activations and adding a depthwise convolutional layer after each Q, K, and V projection in self-attention. Results show that Primer's gains increase as compute scale grows, following a power law with respect to quality at optimal model sizes. On C4 auto-regressive language modeling, the T5 model's training cost can be reduced by up to 4x. The savings carry over to other settings as well: in a one-shot regime, Primer needs less compute to reach performance similar to the original Transformer's.
  • Scale transformers more efficiently. Recently, there have been several efforts to better understand the scaling properties of Transformers. A huge motivation behind this effort is to make better scaling decisions that reduce both financial and environmental costs. Tay et al. from Google Research and DeepMind proposed a scaling strategy that achieves quality similar to canonical model sizes with 50% fewer parameters while being 40% faster. The bonus: they publicly released over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
  • Dealing with long input (Fastformer). In a well-deserved break from the relentless updates from Google, we call your attention to a paper by Wu et al. of Tsinghua University and Microsoft Research Asia. It has a cool title: Fastformer: Additive Attention Can Be All You Need. Instead of modeling the pairwise interactions between tokens, Fastformer first uses an additive attention mechanism to model global contexts, and then further transforms each token representation based on its interaction with the global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.
  • Transformers meet document parsing. Need to parse multimodal (text/image/layout) documents? Microsoft Research released LayoutLMv2 and its multilingual version LayoutXLM on Hugging Face.
  • CoAtNet: Transformers (and convolution) meet image recognition. Dai et al. from Google Research (yes, we are back to Google updates) made the observation that convolution often has better generalization (i.e., a smaller performance gap between training and evaluation) due to its inductive bias, while self-attention tends to have greater capacity (i.e., the ability to fit large-scale training data) thanks to its global receptive field. By combining convolution and self-attention, hybrid models can achieve both better generalization and greater capacity. Compared to previous results, CoAtNet models are 4-10x faster while achieving a new state-of-the-art 90.88% top-1 accuracy on the well-established ImageNet dataset. The source code and pretrained models are on the Google AutoML GitHub. CoAtNet was found with neural architecture search.
  • Pix2Seq: Transformers meet object detection. If you have worked with object detection (YOLO, RetinaNet, and such), you probably had to deal with hacks such as non-maximum suppression. Chen et al. from Google Research cast object detection as language modeling conditioned on pixel inputs, trained with, you guessed it, a transformer architecture. They showed that this simple and generic approach can achieve competitive results on the challenging COCO dataset, compared to highly specialized and well-optimized detection algorithms.
  • Textless NLP. Facebook’s SSL update for this month is Textless NLP. They train a transformer language model from audio only. Facebook’s Jerome Pesenti tweeted: This work opens up a new era of textless NLP: easier to deploy across multiple languages including low resource ones, and capturing the rich expressive content of speech (laughter, emotion, etc.). Fascinating!
  • Foundation models/Mistral. In the second week of September, a debate broke out among AI luminaries on Twitter around the term "foundation models", which are essentially LLMs. This term was introduced last month by the researchers (32 faculty and 117 students and postdocs) at Stanford's Center for Research on Foundation Models (CRFM). Among the coauthors are Fei-Fei Li, Percy Liang, Chris Ré (Snorkel's co-founder), Chris Manning, Stefano Ermon, and Matei Zaharia (Databricks' CTO). They wrote a 212-page report. Some on Twitter, including Pedro Domingos, Gary Marcus, Tom Dietterich, and Judea Pearl, objected to the use of the "foundation" adjective. We personally don't find this debate interesting. What's more interesting is that CRFM started an effort, called Mistral, which is a "framework for transparent and accessible large-scale language model training, built with Hugging Face hugs". Why not Because it's somewhat opaque and works on Google's TPU only. Why not just use Hugging Face? Because it's not scalable yet, hence the need to build on top of it.
  • AI21 Labs released AI21 Studio, an alternative to OpenAI’s API. Pros: no wait list. Cons: low token limit and a less developed ecosystem. Last month AI21 Labs announced their Jurassic LLM with 178B parameters. For comparison, GPT-3 has 175B parameters and Wu Dao 2.0 has 1.75 trillion. AI21 Labs has raised $35M so far.
  • Finally, Google AI shared work on SSL for anomaly detection. It’s an anomaly that they did not use Transformers. Perhaps Transformers cannot transform everything after all?
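
To make the two Primer changes from the first bullet concrete, here is a small plain-Python sketch of a squared ReLU and a depthwise convolution along the sequence axis. It is an illustration under assumptions, not the paper's code: the helper names are hypothetical, and the causal look-back and the kernel width of 3 are choices made for this example.

```python
def squared_relu(x):
    """ReLU followed by squaring."""
    return max(x, 0.0) ** 2


def depthwise_conv1d(seq, kernels):
    """Causal depthwise convolution over a [time][channel] sequence.

    Channel c is filtered only with its own kernel, kernels[c]; channels are
    never mixed with each other, which is what "depthwise" means.
    """
    width = len(kernels[0])
    out = []
    for t in range(len(seq)):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for k in range(width):
                src = t - (width - 1) + k  # only look at current and past steps
                if src >= 0:
                    acc += kernel[k] * seq[src][c]
            row.append(acc)
        out.append(row)
    return out


activations = [squared_relu(x) for x in [-2.0, 0.5, 3.0]]

# Stand-in for the output of a Q, K, or V projection: 3 time steps, 2 channels.
projected = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Channel 0 keeps the current step; channel 1 copies the value from two steps back.
kernels = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
mixed = depthwise_conv1d(projected, kernels)
```

Both changes are cheap to add to an existing Transformer block, which is part of why the result is interesting.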
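
The Fastformer idea from the list above (summarize all tokens into a global context, then let each token interact with that summary) can also be sketched in a few lines. This is a simplified single-head version in plain Python, written from the description above. The function names are hypothetical, the learned projections are replaced by fixed vectors `w_q` and `w_k`, and the final linear layer of a real model is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_pool(vectors, w):
    """Summarize N vectors into one global vector with additive attention, in O(N)."""
    d = len(w)
    scores = [sum(wi * vi for wi, vi in zip(w, v)) / math.sqrt(d) for v in vectors]
    alphas = softmax(scores)
    return [sum(a * v[j] for a, v in zip(alphas, vectors)) for j in range(d)]


def fastformer_like(queries, keys, values, w_q, w_k):
    """One pass over the tokens per step, so the cost grows linearly with length."""
    global_q = additive_pool(queries, w_q)
    # Each key interacts with the global query through an element-wise product.
    mixed_keys = [[g * k for g, k in zip(global_q, key)] for key in keys]
    global_k = additive_pool(mixed_keys, w_k)
    # Each value interacts with the global key; the query is added as a residual.
    return [
        [g * v + q for g, v, q in zip(global_k, val, qry)]
        for val, qry in zip(values, queries)
    ]


queries = [[1.0, 2.0]] * 3
keys = [[1.0, 1.0]] * 3
values = [[3.0, 4.0]] * 3
outputs = fastformer_like(queries, keys, values, w_q=[0.5, -0.5], w_k=[1.0, 1.0])
```

Note that no step compares every token with every other token, which is where the quadratic cost of standard self-attention comes from.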

Updates from Hugging Face

A newsletter focusing on Transformers simply cannot omit updates from Hugging Face, the maintainer of the beloved open source library for all sorts of transformers.

  • GPT-J is now available on Hugging Face. Eleuther is a grassroots effort to replicate large models and make them accessible to the public. GPT-J is Eleuther’s largest model to date with 6B parameters.
  • HF’s project Optimum. This open source project helps with quantizing, pruning, and efficiently training transformers on hardware from Intel (via its Low Precision Optimization Tool, LPOT), Qualcomm (Snapdragon), and Graphcore (Intelligence Processing Unit, IPU).
  • HF’s CTO, Julien Chaumond, shared that most production usage at HF is around document or token classification.

AI Startup Scene

  •, makers of spaCy, sold $6M at a $120M valuation to SignalFire. Congrats Ines and Matthew!
  • raised a $40M series A round. They want to make LLMs accessible and useful for everyone. One of the co-founders is a co-author of the famous attention paper.
  • Mobius Labs raised a $6M series A. The startup offers an SDK that lets the user create custom computer vision models fed with a little of their own training data — as an alternative to off-the-shelf tools which may not have the required specificity for a particular use case.
  • PolyAI raised a $14M series A led by Khosla. In a statement, Vinod Khosla said: “PolyAI is one of the first AI companies using the newest generation of large pre-trained deep learning models (akin to BERT and GPT-3) in a real-world enterprise product. This means they can deploy automated AI agents in as little as two weeks, where incumbent providers of voice assistants would take up to six months to deploy an older version of this technology.”
  • Roboflow raised a $20M series A. With Roboflow, customers can annotate images while assessing the quality of datasets to prepare them for training. (Most computer vision algorithms require labels that essentially “teach” the algorithm to classify objects, places, and people.) The platform lets developers experiment to generate new training data and see what configurations lead to improved model performance. Once training finishes, Roboflow can deploy the model to the cloud, edge, or browser and monitor the model for edge cases and degradation over time.
  • Tonic raised a $35M series B. Company CEO and co-founder Ian Coe says the goal of the company is to provide production-like data for developers that keeps governance and compliance folks inside an organization happy. “Tonic is a data transformation company that leverages synthetic data, differential privacy and distributed computing. We de-identify sensitive data, while preserving all the value of that data so that developers can use it for building and testing software,” Coe explained to me.
  • DataChat raised a $25M series A. Spun out of U. of Wisconsin, DataChat empowers non-technical users to self-serve data science by simply chatting with their all-in-one platform using controlled natural language.
  • Databricks raised $1.6B at a $38B valuation. Enough said.
  • Vowel raised $13.5M series A. Vowel is launching a meeting operating system with tools like real-time transcription; integrated agendas, notes and action items; meeting analytics; and searchable, on-demand recordings of meetings. Vowel is out to bring Slack, Figma and GitHub components to meetings by recording audio and video that can be paused at any time. Users can add notes and see where those notes fall within a real-time transcription that enables people who arrive late or could not make the meeting to catch up easily. After meetings are over, they can be shared, and Vowel has a search function so that users can go back and see where a particular person or topic was discussed.
  • Cellino, a company developing a platform to automate stem cell production, won TechCrunch Disrupt 2021’s Startup Battlefield. Its system, which combines AI technology, machine learning, hardware, software (and yes, lasers!), could eventually democratize access to cell therapies. It aims to bring down costs associated with the manufacturing of human cells, while also increasing yields.
  • HomeLight closed $100 million in a Series D round of funding and $263 million in debt financing. HomeLight’s initial product focused on using artificial intelligence to match consumers and real estate investors to agents. Since then, the company has expanded to also providing title and escrow services to agents and home sellers and matching sellers with iBuyers. In July 2019, HomeLight acquired Eave as an entry into the (increasingly crowded) mortgage lending space.

If you’re ready to build an AI-first startup, then we want to talk to you

We invite talented engineers, researchers, and entrepreneurs to join our incubator on a rolling basis. If you have ever thought of starting an AI-first company, now is the time.

Apply Now