Insights 3: Ozette, Modulus, and the Transformers Effect
Community Updates
Ozette
Ozette’s immune profiling platform provides insights that help answer some of the most pressing questions, such as whether a therapy is working for an individual patient or whether we can detect a patient’s disease before it has physical manifestations. Answering these critical questions and understanding the complexity of the immune system in health and disease helps drive towards better patient outcomes, a core motivation for Ozette.
Modulus Therapeutics
We envision a future where the design of cell therapies is guided by machine learning to treat more diseases and patients than ever before. From advances in genetic engineering to genomics and artificial intelligence, we’re deepening the understanding of immune cell behavior.
Cool AI/ML News
AI at the edge
When deploying neural networks to the edge, across ARM, x86, and WebAssembly architectures on Android, iOS, Windows, Linux, macOS, and Emscripten, a fair amount of optimization is necessary. XNNPACK is a highly optimized library of neural network inference operators for exactly these scenarios. This month the TensorFlow team announced Faster Quantized Inference with XNNPACK. Quantization is among the most popular methods to speed up neural network inference on CPUs. A year ago, TensorFlow Lite improved performance for floating-point models by integrating the XNNPACK backend. Google has now extended the XNNPACK backend to quantized models, reporting, on average across computer vision models, a 30% speedup on ARM64 mobile phones, a 5X speedup on x86-64 laptop and desktop systems, and a 20X speedup for in-browser inference with WebAssembly SIMD, all compared to the default TensorFlow Lite quantized kernels. As a (former) performance engineer, I like the sound of a 20X speedup.
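For readers who want to try this, below is a minimal post-training quantization sketch using the standard TensorFlow Lite converter API; the Keras model and calibration data are placeholders, and recent TFLite runtimes pick up the XNNPACK-accelerated kernels automatically where supported.

```python
import numpy as np
import tensorflow as tf

# Placeholder model; substitute your own trained Keras model.
model = tf.keras.applications.MobileNetV2(weights=None)

def representative_data_gen():
    # Placeholder calibration data; in practice, yield real preprocessed inputs.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

# Full integer post-training quantization, which the XNNPACK-backed
# quantized kernels can then accelerate at inference time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_quant_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_quant_model)
```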
Recreating Natural Voices for People with Speech Impairments
Google shared a recreation of former NFL player Steve Gleason’s voice generated by a machine learning (ML) model. Gleason’s voice recreation was developed in collaboration with Google’s Project Euphonia, which aims to empower people who have impaired speaking ability due to ALS to better communicate using their own voices. PnG NAT is a new text-to-speech (TTS) synthesis model that merges two state-of-the-art technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. It demonstrates significantly better quality and fluency than previous technologies, and it represents a promising approach that can be extended to a wider array of users.
The Transformers Effect
Primer: Train transformers more efficiently
So et al. from Google Research proposed a new approach that aims to reduce the cost of Transformers by searching for a more efficient variant. Primer significantly reduces training cost compared to the original Transformer used in auto-regressive language modeling. The improvements are attributed to two changes: squaring ReLU activations and adding a depthwise convolutional layer after each Q, K, and V projection in self-attention. Results show that Primer’s gains grow with compute scale, following a power law with respect to quality at optimal model sizes. On C4 auto-regressive language modeling, the T5 model’s training cost can be reduced by up to 4X. This also opens up other applications, such as matching the original Transformer’s one-shot performance with considerably less compute.
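To make the two modifications concrete, here is a rough PyTorch-style sketch of a squared ReLU and of a depthwise convolution applied after a Q/K/V projection; the dimensions, kernel size, and causal-cropping details are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squared_relu(x):
    # Primer's activation: ReLU followed by squaring.
    return F.relu(x) ** 2

class DepthwiseConvProjection(nn.Module):
    """Illustrative sketch: a Q/K/V projection followed by a depthwise
    convolution over the sequence dimension, in the spirit of Primer."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        # groups=d_model makes the convolution depthwise (one filter per channel).
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=kernel_size - 1, groups=d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        h = self.proj(x)
        h = h.transpose(1, 2)                 # (batch, d_model, seq_len) for Conv1d
        h = self.dw_conv(h)[..., :x.size(1)]  # crop so position i sees only positions <= i
        return h.transpose(1, 2)
```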
Scale transformers more efficiently
Recently, there have been several efforts to better understand the scaling properties of Transformers. A major motivation is to make better scaling decisions that reduce costs, both financially and environmentally. Tay et al. from Google Research and DeepMind proposed an effective scaling strategy that achieves quality similar to canonical model sizes with 50% fewer parameters while being 40% faster. The bonus: they publicly released over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
Dealing with long input (Fastformer)
In a well-deserved break from the relentless stream of Google updates, we call your attention to a paper by Wu et al. of Tsinghua University and Microsoft Research Asia with a cool title: Fastformer: Additive Attention Can Be All You Need. Instead of modeling the pairwise interactions between tokens, Fastformer first uses an additive attention mechanism to model global contexts, and then further transforms each token representation based on its interaction with the global context representations. In this way, Fastformer achieves effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.
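As a rough, single-head sketch of the additive-attention idea (based on the paper’s description; names and shapes are illustrative, and details such as multi-head splitting, scaling, and weight sharing are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttentionSketch(nn.Module):
    """Rough single-head sketch of Fastformer-style additive attention.
    Complexity is linear in sequence length: tokens interact only through
    pooled global query/key vectors, never pairwise."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.w_q = nn.Linear(d_model, 1)   # scores for pooling queries
        self.w_k = nn.Linear(d_model, 1)   # scores for pooling keys
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Pool all query vectors into a single global query.
        alpha = F.softmax(self.w_q(q), dim=1)            # (batch, seq_len, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)  # (batch, 1, d_model)
        # Mix the global query into each key, then pool into a global key.
        p = global_q * k
        beta = F.softmax(self.w_k(p), dim=1)
        global_k = (beta * p).sum(dim=1, keepdim=True)
        # Each value interacts with the global key; add a query residual.
        u = global_k * v
        return self.out(u) + q
```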
Transformers meet document parsing
Need to parse multimodal (text/image/layout) documents? Microsoft Research released LayoutLMv2 and its multilingual version, LayoutXLM, on Hugging Face.
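Both checkpoints are available through the transformers library; a minimal sketch for the base English model is below (the document image and label count are placeholders, and the processor’s built-in OCR relies on Tesseract):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

# Checkpoint name as published on the Hugging Face Hub.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7)  # num_labels is task-specific

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
# The processor runs OCR and builds the combined text + layout + image inputs.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch, seq_len, num_labels)
```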
CoAtNet: Transformers (and convolution) meet image recognition
Dai et al. from Google Research (yes, we are back to Google updates) observed that convolution often generalizes better (i.e., shows a smaller gap between training and evaluation performance) due to its inductive bias, while self-attention tends to have greater capacity (i.e., the ability to fit large-scale training data) thanks to its global receptive field. By combining convolution and self-attention, hybrid models can achieve both better generalization and greater capacity. Compared to previous results, CoAtNet models are 4-10x faster while achieving a new state-of-the-art 90.88% top-1 accuracy on the well-established ImageNet dataset. The source code and pretrained models are on the Google AutoML GitHub repository. CoAtNet itself was found with neural architecture search.
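Below is a toy sketch of this hybrid layout, with convolutional stages first and self-attention stages on the downsampled feature map; the stage depths, widths, and attention details are our simplifications, not CoAtNet’s actual architecture (which uses MBConv blocks and relative attention).

```python
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    """Toy sketch of the conv-then-attention idea behind CoAtNet:
    convolutional stages provide inductive bias and generalization,
    self-attention stages on the smaller feature map provide capacity."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.conv_stages = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.attn_stages = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, x):                          # x: (batch, 3, H, W)
        feats = self.conv_stages(x)                # (batch, dim, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)  # (batch, H/4 * W/4, dim)
        return self.attn_stages(tokens)
```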
Pix2Seq: Transformers meet object detection
If you have worked with object detection (YOLO, RetinaNet, and the like), you have probably had to deal with hacks such as non-maximum suppression. Chen et al. from Google Research cast object detection as language modeling conditioned on the pixels and trained it with, you guessed it, a Transformer architecture. They showed that this simple and generic approach achieves competitive results on the challenging COCO dataset compared to highly specialized and well-optimized detection algorithms.
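The core trick is serializing each object into discrete tokens that a language model can predict; here is a hedged sketch of that quantization step, where the bin count and token layout are illustrative assumptions rather than the paper’s exact vocabulary.

```python
def box_to_tokens(box, class_id, image_size, num_bins=1000):
    """Quantize a bounding box into discrete tokens, Pix2Seq-style.

    box: (ymin, xmin, ymax, xmax) in pixels; class_id: integer class label.
    Each coordinate is mapped to one of num_bins bins, so an object becomes
    five tokens: [ymin, xmin, ymax, xmax, class]. A sequence of objects is
    then just a concatenation of such tokens, which a Transformer can model
    autoregressively like text.
    """
    h, w = image_size
    ymin, xmin, ymax, xmax = box

    def quantize(value, size):
        return min(int(value / size * (num_bins - 1)), num_bins - 1)

    coord_tokens = [quantize(ymin, h), quantize(xmin, w),
                    quantize(ymax, h), quantize(xmax, w)]
    # Class tokens live in a separate range after the coordinate bins.
    return coord_tokens + [num_bins + class_id]

# Example: a 100x200 box at the top-left corner of a 640x640 image, class 3.
print(box_to_tokens((0, 0, 100, 200), 3, (640, 640)))  # [0, 0, 156, 312, 1003]
```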
Textless NLP
Facebook’s self-supervised learning (SSL) update for this month is Textless NLP: they train a Transformer language model from audio only. Facebook’s Jerome Pesenti tweeted: “This work opens up a new era of textless NLP: easier to deploy across multiple languages including low resource ones, and capturing the rich expressive content of speech (laughter, emotion, etc.).” Fascinating!
Foundation models/Mistral
In the second week of September, a debate broke out among AI luminaries on Twitter around the term “foundation models”, which are essentially LLMs. The term was introduced last month by the researchers (32 faculty and 117 students and postdocs) at Stanford’s Center for Research on Foundation Models (CRFM). Among the coauthors are Fei-Fei Li, Percy Liang, Chris Ré (Snorkel’s co-founder), Chris Manning, Stefano Ermon, and Matei Zaharia (Databricks’ CTO). They wrote a 212-page report. Some on Twitter, including Pedro Domingos, Gary Marcus, Tom Dietterich, and Judea Pearl, objected to the use of the “foundation” adjective. We personally don’t find this debate interesting. What’s more interesting is that CRFM started an effort, called Mistral, which is a “framework for transparent and accessible large-scale language model training, built with Hugging Face”. Why not Eleuther.ai? Because it’s somewhat opaque and runs only on Google’s TPUs. Why not just use Hugging Face? Because it’s not scalable yet, hence the need to build on top of it.
AI21 Studio
AI21 Studio is an alternative to OpenAI’s API. Pros: no waitlist. Cons: a lower token limit and a less developed ecosystem. Last month AI21 Labs announced their Jurassic-1 LLM with 178B parameters; for comparison, GPT-3 has 175B and Wu Dao 2.0 has 1.75 trillion. AI21 Labs has raised $35M so far.
SSL for anomaly detection
It’s an anomaly that they did not use Transformers. Perhaps Transformers cannot transform everything after all?
Updates from Hugging Face
GPT-J is now on Hugging Face. Eleuther.ai is a grassroots effort to replicate large models and make them accessible to the public. GPT-J is Eleuther’s largest model to date, with 6B parameters.
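Loading it is now a one-liner with the transformers library; a minimal generation sketch is below (note that the full-precision 6B checkpoint needs on the order of 24 GB of memory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name as published by EleutherAI on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The immune system is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```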
HF’s project Optimum. This open-source project helps with quantizing, pruning, and efficiently training transformers on top of Intel’s Low Precision Optimization Tool (LPOT), Qualcomm’s Snapdragon, and Graphcore’s Intelligence Processing Unit (IPU).
AI Startup Scene
A startup on a mission to make LLMs accessible and useful for everyone. One of the co-founders is a co-author of the famous attention paper.
An SDK that lets users create custom computer vision models fed with a small amount of their own training data, as an alternative to off-the-shelf tools which may not have the required specificity for a particular use case.
A startup that says it uses the “newest generation of large pre-trained deep learning models (akin to BERT and GPT-3) in a real-world enterprise product. This means they can deploy automated AI agents in as little as two weeks, where incumbent providers of voice assistants would take up to six months to deploy an older version of this technology.”
A startup offering self-serve data science by simply chatting with its all-in-one platform using controlled natural language.
Others