OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers @gabrielmongaras
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
Gabriel Mongaras 2024-02-18 | Sora: openai.com/sora Sora paper (Video generation models as world simulators): openai.com/research/video-generation-models-as-world-simulators DiTs - Scalable Diffusion Models with Transformers paper: arxiv.org/abs/2212.09748 My notes: drive.google.com/file/d/1h2pcgkrI0b6965f1xjf4kTyhhvxZNM3b/view?usp=drive_link
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt
Gabriel Mongaras 2024-07-31 | Null-text Inversion for Editing Real Images using Guided Diffusion Models: arxiv.org/abs/2211.09794 An Edit Friendly DDPM Noise Space: Inversion and Manipulations: arxiv.org/abs/2304.06140 Prompt-to-Prompt Image Editing with Cross Attention Control: arxiv.org/abs/2208.01626 00:00 Intro 01:24 Current image editing techniques 11:42 Deriving DDPM and DDIM 23:08 DDIM inversion 32:46 Null inversion 47:15 DDPM inversion 1:01:18 Prompt-to-prompt 1:10:52 Conclusion
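The DDIM derivation and inversion covered in this video can be sketched numerically. A minimal NumPy sketch under my own naming and with the noise prediction eps held fixed in place of a real network (in practice eps comes from the trained denoiser at each step):

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM step (eta = 0): predict x0, then jump to the previous noise level."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps

def ddim_invert_step(x_t, eps, abar_t, abar_next):
    """The same update run in reverse: map a (partially) clean sample back toward noise."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1 - abar_next) * eps
```

With eps fixed, inversion followed by the forward step is an exact round trip, which is what makes DDIM inversion deterministic; the error in practice comes from eps changing between the two passes.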
Attending to Topological Spaces: The Cellular Transformer
Gabriel Mongaras 2024-07-22 | Paper here: arxiv.org/abs/2405.14094 Notes: drive.google.com/file/d/12g_KkHqXD6mEDILJzYbCC08i8cDHITfC/view?usp=drive_link 00:00 Intro 01:39 Cellular complexes 07:26 K-cochain 13:26 Defining structure on the cell 20:28 Cellular transformer 34:18 Positional encodings and outro
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Gabriel Mongaras 2024-07-12 | Paper here: arxiv.org/abs/2407.04620 Code!: github.com/test-time-training/ttt-lm-pytorch Notes: drive.google.com/file/d/127a1UBm_IN_WMKG-DmEvfJ8Pja-9BwDk/view?usp=drive_link 00:00 Intro 04:40 Problem with RNNs 06:38 Meta learning and method idea 09:13 Update rule and RNN inner loop 15:07 Learning the loss function outer loop 21:21 Parallelizing training 30:05 Results
WARP: On the Benefits of Weight Averaged Rewarded Policies
Gabriel Mongaras 2024-07-06 | Paper here: arxiv.org/abs/2406.16768 Notes: drive.google.com/file/d/11UK7mEZwNVUMYuXwvOTfaqHhN8zSYm5M/view?usp=drive_link 00:00 Intro and RLHF 17:30 Problems with RLHF 21:08 Overview of their method 23:47 EMA 28:00 Combining policies with SLERP 37:34 Linear interpolation towards initialization 40:32 Code 44:16 Results
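The three weight-averaging operations in the chapter list (EMA anchor, SLERP merging, linear interpolation toward init) can be sketched on flat weight vectors standing in for full model parameters; names and shapes here are my own:

```python
import numpy as np

def ema(avg, new, beta=0.99):
    """Exponential moving average of policy weights (the anchor)."""
    return beta * avg + (1 - beta) * new

def slerp(w0, w1, t):
    """Spherical linear interpolation between two weight vectors."""
    cos = np.dot(w0, w1) / (np.linalg.norm(w0) * np.linalg.norm(w1))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if np.isclose(omega, 0.0):          # nearly parallel: fall back to LERP
        return (1 - t) * w0 + t * w1
    return (np.sin((1 - t) * omega) * w0 + np.sin(t * omega) * w1) / np.sin(omega)

def lerp_to_init(w, w_init, eta):
    """Linear interpolation back toward the initialization."""
    return (1 - eta) * w + eta * w_init
```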
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
Gabriel Mongaras 2024-06-25 | Paper: arxiv.org/abs/2308.07926 Paper page: qiuyu96.github.io/CoDeF Code: github.com/qiuyu96/CoDeF My notes: drive.google.com/file/d/10PMKdd5XBd6Y60HlRB9IW9naR2bWziDT/view?usp=drive_link 00:00 Intro 03:00 Method overview 08:40 Method details 15:24 Tricks done for training and how to actually train this thing 19:24 Flow loss and masking 25:10 Conclusion
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Gabriel Mongaras 2024-06-16 | Paper here: arxiv.org/abs/2405.21060 Code!: github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py Notes: drive.google.com/file/d/1--XGPFeXQyx4CPxgYjzR4qrLd-baLWQC/view?usp=sharing 00:00 Intro 01:45 SSMs 08:00 Quadratic form of an SSM 15:02 Expanded form of an SSM 24:00 Attention - it's all you need?? 29:55 Kernel attention 32:50 Linear attention 34:32 Relating attention to SSMs 38:35 Defining the M matrix 43:48 Splitting the M matrix 46:30 Off diagonal decomposition 54:00 Recurrent form of the off diagonal 1:03:30 Combining the M matrix blocks and code 1:06:22 Complexity and other analysis
CoPE - Contextual Position Encoding: Learning to Count What's Important
Gabriel Mongaras 2024-06-04 | Paper: arxiv.org/abs/2405.18719 My notes: drive.google.com/file/d/1y9VHZc7MLqc6t2SHHdlVTYeW3czmmRbl/view?usp=sharing 00:00 Intro 02:44 Background 09:58 CoPE 24:50 Code 32:16 Results
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Gabriel Mongaras 2024-05-28 | Paper: arxiv.org/abs/2403.03100 Demo: speechresearch.github.io/naturalspeech3 Code: huggingface.co/spaces/amphion/naturalspeech3_facodec My notes: drive.google.com/file/d/1xnzErd_86B6eLwqpLckhoEQKqkxFPyM_/view?usp=drive_link 00:00 Intro 05:34 Architecture overview 18:45 GRL and subspace independence 24:45 Discrete diffusion model 41:00 Factorized diffusion model 44:00 Conclusion and results
xLSTM: Extended Long Short-Term Memory
Gabriel Mongaras 2024-05-17 | Paper: arxiv.org/abs/2405.04517 My notes: drive.google.com/file/d/1wFYvU_1oUWcCNuQ91zTpSGAeNUsPjlt3/view?usp=drive_link 00:00 Intro 05:44 LSTM 13:38 Problems paper addresses 14:12 sLSTM 23:00 sLSTM Memory mixing 27:08 mLSTM 35:14 Results and stuff
KAN: Kolmogorov-Arnold Networks
Gabriel Mongaras 2024-05-04 | Paper: arxiv.org/abs/2404.19756 Spline Video: https://m.youtube.com/watch?v=qhQrRCJ-mVg My notes: drive.google.com/file/d/1twcIF13nG8Qc10_qeDqCZ4NaUh9tFsAH/view?usp=drive_link 00:00 Intro 00:45 MLPs and Intuition 05:12 Splines 19:02 KAN Formulation 28:00 Potential Downsides to KANs 32:09 Results
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Gabriel Mongaras 2024-04-29 | Paper: arxiv.org/abs/2403.12015 My notes: drive.google.com/file/d/1s1-nnWR_ZR26PNSAoZR1Xj3nuD9UZlvR/view?usp=sharing 00:00 Intro 01:31 Diffusion Models 08:08 Latent Diffusion Models 10:04 Distillation 12:02 Adversarial Diffusion Distillation (ADD) 17:06 Latent Adversarial Diffusion Distillation (LADD) 22:20 Results
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction
Gabriel Mongaras 2024-04-21 | Paper: arxiv.org/abs/2404.02905 Demo: https://var.vision/ Code: github.com/FoundationVision/VAR My notes: drive.google.com/file/d/1qym3JG-0xqEgQhdvkt9N17o-ZzUWy2sn/view?usp=drive_link 00:00 Intro 00:53 DiTs 04:06 Autoregressive Image Transformers 06:23 Tokenization problem with AR ViTs 08:43 VAE 10:47 Discrete Quantization - VQGAN 16:42 Visual Autoregressive Modeling 21:31 Causal Inference with VAR 24:02 Losses 25:16 Residual Modeling 33:26 Summary 34:11 Results
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Gabriel Mongaras 2024-04-14 | Paper: arxiv.org/abs/2404.07143 My notes: drive.google.com/file/d/1plWJDwHTZkRK9PDdvaLMnZjFR6fVvNLH/view?usp=drive_link 00:00 Intro 07:17 Model intuition 11:00 Memory retrieval operation 16:29 Hidden state updates 21:58 Delta update 24:10 Is it causal? 25:26 Combining local attention and RNN 27:26 Results 30:25 Sampling and Conclusion
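The memory retrieval and delta-rule update from the chapter list can be sketched with a linear-attention compressive memory. A toy sketch under assumptions of my own (ELU+1 feature map, small fixed dimensions, function names hypothetical):

```python
import numpy as np

def elu1(x):
    """ELU + 1 feature map, keeping features positive for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def retrieve(M, z, Q):
    """Read from the compressive memory M with normalizer z: normalized linear attention."""
    sQ = elu1(Q)                               # (n, d_k)
    return (sQ @ M) / (sQ @ z[:, None] + 1e-8) # (n, d_v)

def update_delta(M, z, K, V):
    """Delta-rule write: only store the part of V the memory does not already predict."""
    sK = elu1(K)
    V_pred = (sK @ M) / (sK @ z[:, None] + 1e-8)
    return M + sK.T @ (V - V_pred), z + sK.sum(axis=0)
```

Writing a key/value pair into an empty memory and then querying with the same key recovers the value, which is the sense in which the memory acts as an associative store.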
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Gabriel Mongaras 2024-04-08 | Paper: arxiv.org/abs/2404.02258 My notes: drive.google.com/file/d/1o4v5te1yfuK_FQPvvS8SR55Sysg04dYK/view?usp=drive_link 00:00 Intro 06:02 Mixture of Experts (MoE) 15:12 Mixture of Depths (MoD) 17:04 The gradients must flow! 22:40 Autoregressive Sampling 33:58 Results
Q* AGI Achieved (Apr Fools)
Gabriel Mongaras 2024-04-01 | Q* paper link: link.springer.com/content/pdf/10.1007/BF00992698.pdf April fools 😏
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Gabriel Mongaras 2024-03-28 | Website paper: stability.ai/news/stable-diffusion-3-research-paper Paper: arxiv.org/abs/2403.03206 My notes: drive.google.com/file/d/1n8rSM3OuOkzDBlXdK5VBrnADnEXp4xXv/view?usp=drive_link 00:00 Intro 01:58 DDPM 13:16 ODE/SDE formulation and score 18:09 ODE intuition 21:38 Rectified Flows 27:46 Sampling from a diffusion model 29:16 Going to the latent space 32:17 CLIP 37:53 Model architecture 56:18 Results and stuff
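The rectified-flow pieces from the chapter list (straight-line noising path, constant velocity target, ODE sampling) can be sketched in a few lines. A minimal sketch with my own naming, using a constant-velocity oracle in place of the trained model:

```python
import numpy as np

def rf_interpolate(x0, noise, t):
    """Straight-line path between data x0 (t=0) and noise (t=1)."""
    return (1 - t) * x0 + t * noise

def rf_velocity_target(x0, noise):
    """The constant velocity along that path; the model regresses onto this."""
    return noise - x0

def euler_sample(x1, v_fn, steps=10):
    """Integrate dx/dt = v from t=1 (noise) back to t=0 (data) with Euler steps."""
    x, dt = x1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * v_fn(x, t)
    return x
```

Because the path is a straight line, a perfect velocity model makes sampling exact in any number of Euler steps, which is the motivation for using rectified flows with few-step samplers.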
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Gabriel Mongaras 2024-03-21 | My notes: drive.google.com/file/d/1l2B4m8tDVchfsplIbps4-9533fcxqubF/view?usp=drive_link Paper: arxiv.org/abs/2403.03507 00:00 Intro 02:44 Intuition and proof of low rank 12:28 GaLore intuition 16:38 More GaLore intuition 21:20 GaLore algorithm 27:50 Algorithm analysis 33:00 Results
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
Gabriel Mongaras 2024-03-06 | My notes: BitNet: drive.google.com/file/d/1iA2tISamkfQq4jgZZBBSH1MN3Bgtc99_/view?usp=sharing Era of 1-bit LLMs: drive.google.com/file/d/1iNy91MTP53kTCSkeqHBqMOSePPyoYvCD/view?usp=sharing BitNet: Scaling 1-bit Transformers for Large Language Models: arxiv.org/abs/2310.11453 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits: arxiv.org/abs/2402.17764 00:00 Intro 03:10 BitLinear Intuition 08:05 Weight Quantization 10:35 Activation Quantization 16:30 Matrix Multiplication and Dequantizing 23:08 Model Parallelism with Group Quantization and Normalization 32:36 Other Training Stuff 37:11 BitNet Results 39:11 The Era of 1-Bit LLMs
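The weight and activation quantization chapters can be sketched as follows; a minimal sketch of absmean ternary weight quantization and absmax activation quantization (helper names and tolerances are my own):

```python
import numpy as np

def weight_quant_ternary(W, eps=1e-5):
    """Absmean quantization: scale by mean |W|, round, clip to the ternary set {-1, 0, 1}."""
    gamma = np.abs(W).mean() + eps
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma            # gamma rescales the output after the cheap ternary matmul

def activation_quant_absmax(x, bits=8, eps=1e-5):
    """Absmax quantization of activations to a signed integer range (127 for 8 bits)."""
    Q = 2 ** (bits - 1) - 1
    scale = Q / (np.abs(x).max() + eps)
    return np.clip(np.round(x * scale), -Q, Q), scale
```

With ternary weights the matrix multiply reduces to additions and subtractions, which is where the inference savings come from.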
DoRA: Weight-Decomposed Low-Rank Adaptation
Gabriel Mongaras 2024-02-23 | Paper: arxiv.org/abs/2402.09353 My notes: drive.google.com/file/d/1hA56lNtz7jxQPWIxBpnDUsiLFaFZlyyP/view?usp=sharing
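The DoRA reparameterization splits a weight update into a direction (adapted with a low-rank LoRA term) and per-column magnitudes. A minimal sketch, with shapes and names chosen for illustration:

```python
import numpy as np

def dora_reparam(W0, B, A, m):
    """DoRA-style weight: direction from W0 + BA, columns rescaled to learned magnitudes m."""
    V = W0 + B @ A                                   # low-rank adapted direction
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * (V / col_norm)                        # column j gets magnitude m[0, j]
```

With the low-rank term zero and m set to the column norms of W0, the reparameterization reproduces W0 exactly, so training starts from the pretrained weights.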
A Decoder-only Foundation Model For Time-series Forecasting
Gabriel Mongaras 2024-02-07 | Paper: arxiv.org/abs/2310.10688 Notes: drive.google.com/file/d/1fmk5Z5VJkqHvEbNXlq1OiIBP317NqNfN/view?usp=sharing
Lumiere: A Space-Time Diffusion Model for Video Generation
Gabriel Mongaras 2024-02-02 | Paper: arxiv.org/abs/2401.12945 Demo: lumiere-video.github.io Notes: drive.google.com/file/d/1fJl-ijVy6KML1YwM_9UVVU-MSfipDIqe/view?usp=sharing
Exphormer: Sparse Transformers for Graphs
Gabriel Mongaras 2024-01-29 | Paper here: arxiv.org/abs/2303.06147 Notes: drive.google.com/file/d/1eXoXtPgJYKBTKd7oN8StuBLW453yWJ3f/view?usp=drive_link
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Gabriel Mongaras 2024-01-24 | Paper here: arxiv.org/abs/2401.10774 demo: sites.google.com/view/medusa-llm Notes: drive.google.com/file/d/1eOminZIC4wrjjWIBnSroxBYduCXzs86E/view?usp=drive_link
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
Gabriel Mongaras 2024-01-18 | Paper here: arxiv.org/abs/2401.00935 Notes: drive.google.com/file/d/1eAiAhbmvczYQwHqHHv-GJeDlX-WlZBBI/view?usp=sharing
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Gabriel Mongaras 2024-01-04 | Paper here: arxiv.org/abs/2312.12742 Code here: github.com/annosubmission/GRC-Cache Notes: drive.google.com/file/d/1cgR14tZmrF3lQROMT_2RUig2dBfhqU9z/view?usp=sharing
Translatotron 3: Speech to Speech Translation with Monolingual Data
Gabriel Mongaras 2023-12-27 | Translatotron 3: arxiv.org/abs/2305.17547 Translatotron 2: arxiv.org/abs/2107.08661 Demo: google-research.github.io/lingvo-lab/translatotron3 Notes: Translatotron 3: drive.google.com/file/d/1EfOCuKp9yeLBzhxjsiTWuoYaVToBbgon/view?usp=sharing Translatotron 2: drive.google.com/file/d/1zPrIvZspMWpWPaFhvgpM2DEYzsvTL8R6/view?usp=drive_link
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gabriel Mongaras 2023-12-12 | Paper here: arxiv.org/abs/2312.00752 The annotated S4: srush.github.io/annotated-s4 Notes: drive.google.com/file/d/1aoaKj3kuTtpHi0OzinXZGyZIFxhqp514/view?usp=sharing
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Gabriel Mongaras 2023-12-06 | Paper Link: arxiv.org/abs/2310.04378 My Notes: drive.google.com/file/d/1aUDxMSWNAqMkg0P91Ms1vu4yCeTtzSsR/view?usp=sharing
Adversarial Diffusion Distillation
Gabriel Mongaras 2023-11-30 | Paper Link: arxiv.org/abs/2311.17042 Stability Link: stability.ai/research/adversarial-diffusion-distillation My Notes: drive.google.com/file/d/1a7EZpQ-4_jjt7Fic1EQlyGnHOX1xB9Af/view?usp=sharing
Unsupervised Discovery of Semantic Latent Directions in Diffusion Models
Gabriel Mongaras 2023-11-21 | Paper found here: arxiv.org/abs/2302.12469 My notes: drive.google.com/file/d/1_wFtrtxZk7ZYq6-FfUILET3Nga8KCzsz/view?usp=drive_link
DALL-E 3 - Improving Image Generation with Better Captions
Gabriel Mongaras 2023-11-20 | Blog post here: openai.com/dall-e-3 My notes: drive.google.com/file/d/1_lSM24dNSdzAvP8MKaKIyfbsASn4UfYe/view?usp=sharing
LRM: Large Reconstruction Model for Single Image to 3D
Gabriel Mongaras 2023-11-13 | Paper found here: arxiv.org/abs/2311.04400 My notes: drive.google.com/file/d/1_cI6cYIm8QZrv0lhfYBG7ULXc4szr8Hg/view?usp=sharing
CodeFusion: A Pre-trained Diffusion Model for Code Generation
Gabriel Mongaras 2023-11-06 | Paper found here: arxiv.org/abs/2310.17680v1 My chicken scratch: drive.google.com/file/d/1ErA6RsKW__uxmlprgdIO13PRCU69Q-U5/view?usp=drive_link
Matryoshka Diffusion Models Explained
Gabriel Mongaras 2023-10-30 | Paper found here: arxiv.org/abs/2310.15111
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Gabriel Mongaras 2023-10-22 | Paper: arxiv.org/abs/2310.00704 Code: github.com/yangdongchao/UniAudio Demo: https://dongchaoyang.top/UniAudio_demo/
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Gabriel Mongaras 2023-10-16 | Paper found here: arxiv.org/abs/2309.14717v2
StreamingLLM - Efficient Streaming Language Models with Attention Sinks Explained
Gabriel Mongaras 2023-10-07 | Paper found here: arxiv.org/abs/2309.17453 Code found here: github.com/mit-han-lab/streaming-llm
FreeU: Free Lunch in Diffusion U-Net Explained
Gabriel Mongaras 2023-09-24 | Paper found here: arxiv.org/abs/2309.11497
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation Explained
Gabriel Mongaras 2023-09-17 | Paper found here: arxiv.org/abs/2309.06380
Llama/Wizard LM Finetuning with Huggingface on RunPod
Gabriel Mongaras 2023-09-16 | A demo I made to show how to fine-tune a WizardLM model with Huggingface and peft. Presentation: docs.google.com/presentation/d/17TyDtImkcXnIXwd6CDoYxCXBprvtD_n1I3RlImJg8gQ/edit?usp=sharing Github: github.com/gmongaras/Wizard_QLoRA_Finetuning
2x Faster Language Model Pre-training via Masked Structural Growth
Gabriel Mongaras 2023-09-10 | Paper found here: arxiv.org/abs/2305.02869
Bayesian Flow Networks (BFN) Explained
Gabriel Mongaras 2023-09-03 | Paper found here: arxiv.org/abs/2308.07037
WizardLM: Empowering Large Language Models to Follow Complex Instructions Explained
Gabriel Mongaras 2023-08-27 | Paper found here: arxiv.org/abs/2304.12244 Code release: github.com/nlpxucan/WizardLM
From Sparse to Soft Mixtures of Experts Explained
Gabriel Mongaras 2023-08-21 | Paper found here: arxiv.org/abs/2308.00951
BK-SDM: Architecturally Compressed Stable Diffusion for Efficient T2I Generation Explained
Gabriel Mongaras 2023-08-16 | Paper found here: openreview.net/forum?id=bOVydU0XKC
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained
Gabriel Mongaras 2023-08-10 | Paper found here: arxiv.org/abs/2305.18290
Universal and Transferable Adversarial Attacks on Aligned Language Models Explained
Gabriel Mongaras 2023-08-06 | Paper found here: arxiv.org/abs/2307.15043 Demo here: llm-attacks.org
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis Explained
Gabriel Mongaras 2023-08-01 | Paper found here: arxiv.org/abs/2307.01952
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations Explained
Gabriel Mongaras 2023-07-30 | Paper found here: arxiv.org/abs/2108.01073