Rolling Sink bridges the gap between the limited horizons seen during training and the open-ended horizons encountered at test time.
Built on Self Forcing (trained on only 5s clips), Rolling Sink scales autoregressive video synthesis to ultra-long durations (e.g., 5-30 minutes) at test time,
with consistent subjects, stable colors, and smooth motion.
Powered by large-scale training data curated by our panoramic data curation engine, together with SphereViT for handling the spherical distortions in panoramas,
DA2 predicts dense, scale-invariant distance from a single 360° panorama in an end-to-end manner,
with remarkable geometric fidelity and strong zero-shot generalization.
Lotus-2 is an advanced two-stage deterministic framework for monocular dense geometry estimation built upon FLUX.
Through a careful analysis of the DiT-based rectified-flow formulation, and by leveraging the pre-trained generative model as a deterministic world prior, Lotus-2 achieves SoTA performance while producing significantly finer details.
Lotus is a diffusion-based visual foundation model with a simple yet effective adaptation protocol,
designed to fully leverage the powerful visual priors of pre-trained diffusion models for dense prediction.
With minimal training data, Lotus achieves SoTA performance on two key geometry perception tasks: zero-shot monocular depth and normal estimation.
Characterized by its emphasis on interpreting subject-essential attributes, the proposed DisEnvisioner
effectively identifies and enhances subject-essential features while filtering out irrelevant information,
enabling exceptional image customization without cumbersome tuning or reliance on multiple reference images.
DIScene generates complex 3D scenes with decoupled objects and clear interactions. Leveraging a learnable Scene Graph and a hybrid Mesh-Gaussian representation, it produces 3D scenes of superior quality. DIScene can also flexibly edit a 3D scene by changing interactive objects or their attributes, benefiting diverse applications.
We present LucidDreamer, a text-to-3D generation framework that distills high-fidelity textures and shapes from pretrained 2D
diffusion models using a novel Interval Score Matching objective and an advanced 3D distillation pipeline.
Together, these yield superior 3D generation results with photorealistic quality in a short training time.