Audio and Speech Processing

2023 Transformers in Speech Processing: A Survey

A recent survey on applying Transformer models to speech processing tasks, including automatic speech recognition, speech synthesis, speech translation, speech paralinguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications.
- Applications overview
  1. Automatic Speech Recognition.
  2. Neural Speech Synthesis.
  3. Speech Translation.
  4. Speech Paralinguistics.
  5. Speech Enhancement and Separation.
  6. Spoken Dialogue Systems.
  7. Multi-Modal Applications.
- Challenges overview
  1. Training challenges: the self-attention mechanism is less well suited to speech sequences than to word sequences.
  2. Computational cost and efficiency: self-attention has quadratic complexity in the input sequence length and linear growth in memory consumption, and it needs dedicated techniques to parallelize and accelerate on different hardware platforms. It may also not be efficient for every downstream task because data distributions differ.
  3. Large data requirements: a large amount of data is required for effective training.
  4. Generalization and transferability: Transformers lack the built-in inductive biases that CNNs have. Transfer is hard because of the distribution gap between training data and real-world data.
  5. Multimodal training: the interaction between modalities can be explored further, and zero-shot classification remains difficult.
  6. Robustness: models are sensitive to domain shifts and noise in speech data, generalize poorly to other languages when trained solely on monolingual data, and lack prosody information.
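
To make the quadratic-cost point in challenge 2 concrete, the sketch below counts the pairwise attention scores one self-attention layer computes for a single sequence. The function name and the example numbers (a 10-second utterance at 100 frames per second, a 30-token transcript) are illustrative assumptions, not figures from the survey.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Number of pairwise attention scores one layer computes for one sequence."""
    return num_heads * seq_len * seq_len


# Illustrative numbers (assumptions): a 10-second utterance at 100 frames per
# second versus a 30-token transcript of the same utterance.
speech_frames = 10 * 100
text_tokens = 30

speech_cost = attention_score_entries(speech_frames)
text_cost = attention_score_entries(text_tokens)

print(speech_cost)               # 1000000
print(text_cost)                 # 900
print(speech_cost // text_cost)  # 1111
```

Doubling the sequence length quadruples the count, which is one reason long frame-level speech inputs are harder to handle than short word-level ones.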

Computer Vision and Pattern Recognition

2019 Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

Depth-estimation model used in StableDiffusionDepth2ImgPipeline by HuggingFace.

Vision Language Models

2023 Vision-Language Models for Vision Tasks: a Survey
2022 A Survey of Vision-Language Pre-Trained Models
2022 Clinical-BERT: Vision-Language Pre-training for Radiograph Diagnosis and Reports Generation
2024 Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
2023 M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
2023 Vision–Language Model for Visual Question Answering in Medical Imagery

A transformer encoder-decoder model for visual question answering (VQA).
- Pipeline:
  1. Extract image features with a vision transformer (ViT) and embed the question with a textual encoder transformer.
  2. Concatenate the visual and textual representations, feed them into a multimodal decoder, and generate the answer autoregressively.
- Evaluation: validated on two medical VQA datasets, VQA-RAD and PathVQA; metrics are accuracy and BLEU score.
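
The two evaluation metrics named above can be sketched as follows. This is a simplified illustration, not the paper's evaluation script: `exact_match_accuracy` and `bleu1` are hypothetical helper names, the answers are made-up examples rather than dataset entries, and `bleu1` is a sentence-level unigram BLEU, whereas full BLEU combines n-gram precisions up to 4-grams at corpus level.

```python
import math
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predicted answers equal to the reference (case-insensitive)."""
    pairs = list(zip(predictions, references))
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return hits / len(pairs)


def bleu1(prediction, reference):
    """Unigram BLEU for one pair: clipped unigram precision times brevity penalty."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0
    ref_counts = Counter(ref)
    # Each predicted word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in Counter(pred).items())
    precision = clipped / len(pred)
    # Predictions shorter than the reference are penalised.
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision


# Made-up answers for illustration only.
predictions = ["yes", "left lung"]
references = ["Yes", "right lung"]

print(exact_match_accuracy(predictions, references))  # 0.5
print(bleu1("left lung", "right lung"))                # 0.5
```

Accuracy suits closed-ended questions (for example yes/no), while BLEU gives partial credit to free-form answers that overlap with the reference.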

Medical

2017 ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases

Diffusers

2020 Denoising Diffusion Probabilistic Models

🌲 Foundational diffusion model, inspired by nonequilibrium thermodynamics.
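
As a rough illustration of the forward (noising) process in DDPM, the sketch below samples x_t directly from x_0 using the closed form q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I). The linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper; the function names and the three-value toy "image" are assumptions made for this example, and time steps are 0-indexed here.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one shot, without simulating earlier steps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]  # toy "image" with three pixel values (assumption)

print(q_sample(x0, 0, alpha_bars, rng))    # close to x0: almost no noise yet
print(q_sample(x0, 999, alpha_bars, rng))  # close to pure Gaussian noise
```

Training then amounts to teaching a network to predict the added noise from x_t and t, which is not shown here.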
2021 High-Resolution Image Synthesis with Latent Diffusion Models

🌲 The latent diffusion model behind Stable Diffusion, available through Hugging Face Diffusers for the text-to-image task.
2021 LoRA: Low-Rank Adaptation of Large Language Models

🌲 A useful component for efficient fine-tuning❓ Possibly worth learning through Hugging Face Diffusers.
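
A minimal sketch of why LoRA is cheap to train: the frozen weight W (d_out × d_in) gets a trainable low-rank update B·A, with B of shape d_out × r and A of shape r × d_in, scaled by alpha / r. The helper names, the layer width of 768, and the rank of 4 are illustrative assumptions; this is plain Python, not the Diffusers or PEFT API.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


print(full_update_params(768, 768))     # 589824
print(lora_update_params(768, 768, 4))  # 6144

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen toy weight
A = [[0.5, -0.5]]             # rank 1, shape 1 x 2
B_zero = [[0.0], [0.0]]       # B starts at zero, so the update is zero at first
x = [2.0, 3.0]
print(lora_forward(W, A, B_zero, x, alpha=2.0, rank=1))  # [2.0, 3.0]
```

Because B is initialized to zero, the adapted layer starts out identical to the pretrained one and only drifts as A and B are trained.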
2021 Score-Based Generative Modeling through Stochastic Differential Equations

🌲 Mathematical formulation of diffusion as an SDE ❓
2021 SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
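
As a starting point for the question above, the general form used in score-based modeling can be written as a forward SDE that gradually noises the data and a reverse-time SDE that generates samples. Here $f$ is the drift, $g$ the diffusion coefficient, $w$ and $\bar{w}$ are forward and reverse-time Wiener processes, and $p_t$ is the marginal density at time $t$; the symbol $s_\theta$ for the learned score network is notation assumed for this note.

$$
\begin{aligned}
\mathrm{d}x &= f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w && \text{(forward, data to noise)} \\
\mathrm{d}x &= \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w} && \text{(reverse, noise to data)} \\
s_\theta(x, t) &\approx \nabla_x \log p_t(x) && \text{(score learned by the network)}
\end{aligned}
$$

Sampling replaces the unknown score $\nabla_x \log p_t(x)$ with $s_\theta(x, t)$ and integrates the reverse SDE numerically from $t = T$ down to $t = 0$.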

The technique used by the Stable Diffusion image-to-image pipeline in Hugging Face Diffusers.

Breakdown: given a user-provided guide image (which may contain unnatural artifacts, e.g. a stroke painting), generate a realistic and faithful image with a model based on stochastic differential equations (SDEs) that is pre-trained on unlabeled data.
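
The breakdown above can be sketched as two steps: perturb the guide with Gaussian noise up to an intermediate time t0, then run the reverse SDE from t0 back to 0. The sketch assumes a variance-exploding noise scale σ(t) = σ_min (σ_max / σ_min)^t with illustrative values for σ_min and σ_max; `denoise_step` is a hypothetical callable standing in for the pretrained score model plus solver, and `noop_step` is only a placeholder so the example runs.

```python
import random


def ve_sigma(t, sigma_min=0.01, sigma_max=50.0):
    """Noise scale of a variance-exploding SDE at time t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t


def perturb_guide(guide, t0, rng):
    """First step: x(t0) = guide + sigma(t0) * z with z ~ N(0, I)."""
    sigma = ve_sigma(t0)
    return [x + sigma * rng.gauss(0.0, 1.0) for x in guide]


def sdedit(guide, t0, denoise_step, num_steps=50, seed=0):
    """Noise the guide up to time t0, then denoise from t0 down to 0."""
    rng = random.Random(seed)
    x = perturb_guide(guide, t0, rng)
    # Evenly spaced times from t0 down to 0 for the reverse-SDE solver.
    times = [t0 * (1 - i / num_steps) for i in range(num_steps + 1)]
    for t, t_next in zip(times, times[1:]):
        x = denoise_step(x, t, t_next, rng)
    return x


def noop_step(x, t, t_next, rng):
    """Placeholder only: a real step would use a pretrained score model."""
    return x


guide = [0.2, 0.8, 0.5]  # toy stand-in for a stroke painting (assumption)
result = sdedit(guide, t0=0.5, denoise_step=noop_step, num_steps=10)
print(len(result))  # 3
```

The choice of t0 controls the trade-off: a larger t0 adds more noise and gives more realistic but less faithful results, while a smaller t0 stays closer to the guide.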
Cao, P., Zhou, F., Song, Q., & Yang, L. (2024). Controllable Generation with Text-to-Image Diffusion Models: A Survey.

A recent survey on applying diffusion models to text-to-image tasks.
- Taxonomy of text-to-image diffusion models based on conditions:
  - Generation with specific conditions
  - Generation with multiple conditions
  - Universal controllable generation
- Applications:
  1. Image manipulation: text prompt, reference image, pretrained text-to-image model → edited image generation.
  2. Image completion and inpainting: masked regions, reference images → complete image generation.
  3. Image composition: several foreground object images → one composite image.
  4. Text/Image-to-3D generation: text or image or pairs → 3D representation.