

Nvidia just revealed how Edify 3D works

By Matic Broz

Key takeaways

  • Edify 3D generates high-quality 3D models from text prompts or reference images.
  • It combines diffusion models and Transformers for detailed, customizable, and accurate 3D asset generation.
  • Partners like Shutterstock and Getty Images use Edify 3D for efficient and scalable virtual production.

What is Edify 3D?

Edify 3D is NVIDIA’s AI-powered platform for generating high-quality 3D assets. As part of the larger Edify multimodal architecture, Edify 3D specializes in creating customizable 3D models, textures, and lighting environments from simple text prompts or reference images.

This approach bypasses the traditional need for specialized skills and expensive software. In essence, it simplifies complicated design processes and makes high-quality 3D assets accessible to everyone.

Edify 3D is designed for high-quality 3D asset generation. With text prompts and/or a reference image, the model can generate a wide range of detailed 3D assets. Credit: Nvidia

Imagine generating a “steampunk robot turtle with rusty mechanical parts” with just a text description or transforming a 2D image of a coffee cup into a fully realized 3D model—Edify 3D makes this possible.

How it works

Edify 3D is built on a sophisticated architecture that combines the strengths of two powerful AI models: diffusion models and Transformers. This synergistic approach enables the platform to generate 3D assets with impressive detail and fidelity.

The process begins with generating multiple 2D images of the desired object from different viewpoints. These images are created using a multi-view diffusion model, similar to those used in video generation.

Pipeline of Edify 3D. Credit: Nvidia

This model takes a text prompt or a reference image, along with camera pose information, as input and generates a series of images showcasing the object from various angles. NVIDIA researchers discuss this in more detail in their GTC presentation. These multi-view images provide a rich visual representation of the object, capturing its shape and appearance from different perspectives.
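
NVIDIA hasn’t published Edify 3D’s conditioning interface, but the camera-pose input can be illustrated with a small sketch. The Python snippet below samples evenly spaced orbit cameras around an object; the orbit layout, radius, and helper names are assumptions for illustration, not the platform’s actual format:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world pose matrix looking from `eye` at `target`."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, true_up, -forward, eye
    return pose

def sample_orbit_poses(n_views=4, radius=2.5, elevation_deg=20.0):
    """Evenly spaced azimuths at a fixed elevation, orbiting the origin."""
    elev = np.radians(elevation_deg)
    poses = []
    for azim in np.linspace(0.0, 2 * np.pi, n_views, endpoint=False):
        eye = radius * np.array([np.cos(azim) * np.cos(elev),
                                 np.sin(azim) * np.cos(elev),
                                 np.sin(elev)])
        poses.append(look_at(eye))
    return np.stack(poses)  # (n_views, 4, 4), passed to the model as conditioning

print(sample_orbit_poses().shape)  # (4, 4, 4)
```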

Edify 3D employs three key multi-view diffusion models, chained together as sketched after this list:

  • Base model: This model generates the RGB appearance of the object based on the text prompt and camera poses.
  • ControlNet for surface normals: Conditioned on the RGB images and text prompt, this model generates surface normal maps, providing crucial information about the object’s 3D structure. NVIDIA’s blog post highlights how Edify allows for better control over AI-generated images.
  • ControlNet for upscaling: This model enhances the resolution of the generated RGB images, ensuring high-quality textures and details. It uses the rasterized texture and surface normals of the 3D mesh as conditioning information.
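
None of these models is publicly exposed, so the sketch below only shows how the three stages might chain together. Every function is a hypothetical stub; the names, signatures, and resolutions are invented for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the three diffusion stages listed above.
# Edify 3D's real models aren't public; names, signatures, and
# resolutions here are invented for illustration only.
def base_rgb_model(prompt: str, poses: torch.Tensor) -> torch.Tensor:
    """Stage 1 (stub): text + camera poses -> multi-view RGB images."""
    return torch.rand(poses.shape[0], 3, 256, 256)

def normals_controlnet(rgb: torch.Tensor, prompt: str) -> torch.Tensor:
    """Stage 2 (stub): RGB + text -> per-view unit-length normal maps."""
    n = torch.rand_like(rgb) * 2 - 1
    return n / n.norm(dim=1, keepdim=True)

def upscale_controlnet(rgb: torch.Tensor) -> torch.Tensor:
    """Stage 3 (stub): upscale RGB; the real model is also conditioned
    on the mesh's rasterized texture and surface normals."""
    return F.interpolate(rgb, scale_factor=4, mode="bilinear")

prompt = "steampunk robot turtle with rusty mechanical parts"
poses = torch.eye(4).repeat(4, 1, 1)  # four orbit cameras (see earlier sketch)
rgb = base_rgb_model(prompt, poses)
normals = normals_controlnet(rgb, prompt)
hi_res = upscale_controlnet(rgb)
print(rgb.shape, normals.shape, hi_res.shape)  # views stay aligned across stages
```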

The next stage transforms the 2D multi-view images into a 3D model using a Transformer-based reconstruction model. The model takes the RGB images and surface normal maps as input and predicts a neural representation of the 3D object in the form of latent tokens, which are then processed to generate a signed distance function (SDF) defining the object’s surface.
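
As a rough mental model of this step, here is a toy PyTorch module: a set of learned latent tokens cross-attends to multi-view image features, and a small MLP head decodes a signed distance for each query point. The dimensions and layer counts are illustrative; NVIDIA hasn’t released the actual architecture:

```python
import torch
import torch.nn as nn

class ReconstructionSketch(nn.Module):
    """Toy transformer reconstruction model: multi-view image features ->
    latent tokens -> an MLP head acting as a signed distance function."""

    def __init__(self, n_latent=256, dim=128):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(n_latent, dim))  # learned queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.attn = nn.TransformerDecoder(layer, num_layers=2)
        self.sdf_head = nn.Sequential(
            nn.Linear(3 + dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, view_tokens, query_xyz):
        # view_tokens: (B, T, dim) patch features from RGB and normal maps
        # query_xyz:   (B, Q, 3) points at which to evaluate the SDF
        tokens = self.attn(self.latent.expand(view_tokens.shape[0], -1, -1),
                           view_tokens)
        pooled = tokens.mean(dim=1, keepdim=True).expand(-1, query_xyz.shape[1], -1)
        return self.sdf_head(torch.cat([query_xyz, pooled], dim=-1))

model = ReconstructionSketch()
sdf = model(torch.randn(1, 64, 128), torch.rand(1, 1024, 3))
print(sdf.shape)  # (1, 1024, 1): one signed distance per query point
```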

Comparison of the number of sampled views. Credit: Nvidia

Finally, an isosurface extraction technique converts the SDF into a 3D mesh. The reconstruction model also generates texture and material maps based on the multi-view images and the predicted 3D shape.
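
Isosurface extraction itself is a standard technique, typically marching cubes. A minimal sketch using scikit-image, with an analytic sphere SDF standing in for the network’s prediction:

```python
import numpy as np
from skimage import measure

def sphere_sdf(xyz, radius=0.5):
    """Analytic SDF of a sphere: negative inside, positive outside."""
    return np.linalg.norm(xyz, axis=-1) - radius

# Sample the SDF on a regular grid, then extract the zero level set.
n = 64
grid = np.linspace(-1.0, 1.0, n)
xyz = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
volume = sphere_sdf(xyz)  # (64, 64, 64) signed distances

verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0)
print(verts.shape, faces.shape)  # mesh vertices and triangle indices
```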

The final stage involves refining the generated 3D mesh to optimize it for downstream applications. This includes retopologizing the mesh into a quadrilateral (quad) format, creating UV maps for texture mapping, and baking the generated textures and materials onto the mesh. These post-processing steps ensure that the final 3D asset is ready for integration into various 3D software and workflows.
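
Of these steps, texture baking is the simplest to illustrate. The toy function below splats per-vertex colors into a UV-space image; a real baker rasterizes whole triangles in UV space, and the inputs here are random stand-ins:

```python
import numpy as np

def bake_vertex_colors(uvs, colors, resolution=256):
    """uvs: (V, 2) in [0, 1]; colors: (V, 3) in [0, 1] -> (H, W, 3) texture."""
    texture = np.zeros((resolution, resolution, 3))
    # Nearest-texel splat; a real baker rasterizes whole triangles in UV space.
    texels = np.clip((uvs * (resolution - 1)).astype(int), 0, resolution - 1)
    texture[texels[:, 1], texels[:, 0]] = colors
    return texture

uvs = np.random.rand(500, 2)     # stand-in UV map from the unwrapping step
colors = np.random.rand(500, 3)  # stand-in albedo sampled from the multi-view images
print(bake_vertex_colors(uvs, colors).shape)  # (256, 256, 3)
```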

The success of Edify 3D relies heavily on the quality and quantity of its training data. The platform is trained on a vast dataset of images, pre-rendered multi-view images, and 3D shapes.

The 3D shape data undergoes a rigorous pre-processing pipeline that includes format conversion, quality filtering, canonical pose alignment, and Physically Based Rendering (PBR) rendering. These steps ensure that the training data is consistent, high-quality, and suitable for AI model training. The shapes are also captioned using a vision-language model to provide textual descriptions for training the text-to-3D generation capabilities.
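
Canonical pose alignment, one of those steps, can be sketched with a classic PCA alignment: center the shape and rotate its principal axes onto the coordinate axes. This is a generic technique, not necessarily the one NVIDIA uses:

```python
import numpy as np

def canonicalize(points):
    """points: (N, 3) -> centered copy with principal axes on x, y, z."""
    centered = points - points.mean(axis=0)
    # Eigenvectors of the covariance matrix are the principal axes.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    rotation = vecs[:, ::-1].copy()     # largest-variance axis first
    if np.linalg.det(rotation) < 0:     # keep it a proper rotation
        rotation[:, -1] = -rotation[:, -1]
    return centered @ rotation

cloud = np.random.randn(1000, 3) * np.array([3.0, 1.0, 0.2])  # elongated blob
aligned = canonicalize(cloud)
print(aligned.var(axis=0))  # variances now sorted, largest along the first axis
```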

The multi-view diffusion models are fine-tuned on rendered images of 3D objects. The reconstruction model is trained on a large dataset of images and 3D assets, with supervision on depth, normal, mask, albedo, and material channels. This extensive training process enables Edify 3D to generate high-quality 3D assets from diverse inputs.
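
A common way to implement that kind of multi-channel supervision is a weighted sum of per-channel reconstruction losses. A PyTorch sketch, with channel weights and shapes invented for illustration:

```python
import torch
import torch.nn.functional as F

# Channel weights are invented for illustration; NVIDIA hasn't published them.
WEIGHTS = {"depth": 1.0, "normal": 1.0, "mask": 0.5, "albedo": 1.0, "material": 0.5}

def reconstruction_loss(pred: dict, target: dict) -> torch.Tensor:
    """Weighted sum of per-channel L1 losses."""
    return sum(w * F.l1_loss(pred[k], target[k]) for k, w in WEIGHTS.items())

shapes = {"depth": (1, 1, 64, 64), "normal": (1, 3, 64, 64),
          "mask": (1, 1, 64, 64), "albedo": (1, 3, 64, 64),
          "material": (1, 2, 64, 64)}
pred = {k: torch.rand(s, requires_grad=True) for k, s in shapes.items()}
target = {k: torch.rand(s) for k, s in shapes.items()}

loss = reconstruction_loss(pred, target)
loss.backward()  # gradients flow back through every supervised channel
print(float(loss))
```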

What Edify 3D can and will do

The whole point of Edify 3D is text-to-3D and image-to-3D asset generation. It will enable non-tech-savvy users to create 3D models from both textual descriptions and 2D images.

For example, a designer could describe an object like “a futuristic sports car with glowing neon accents” and have Edify 3D generate the corresponding 3D model. Alternatively, they could use a photograph of a real-world object as a starting point and let Edify 3D transform it into a 3D representation.

Text-to-3D generation results. Credit: Nvidia

The generated 3D models boast high-resolution textures (up to 4K) and support PBR materials. The resulting models have clean geometry and are easily editable in standard 3D modeling software.

Shutterstock was quick to realize how useful this could be and partnered with Nvidia back in March. Its generative 3D model is currently available through its 3D API Toolkit and covers everything Edify 3D can do: models, HDRIs, and PBR materials.

Shutterstock is exploring the use of Edify 3D for virtual production in collaboration with WPP, and it is working with companies like HP, Blender, Dassault Systèmes, and Katana to integrate its Edify-powered 3D generation capabilities directly into their workflows.

Edify 3D is also designed for speed and efficiency. Shutterstock claims previews of single assets can be generated in as little as 10 seconds, and Getty Images boasts generating four images in about six seconds.

Edify provides capabilities for fine-tuning its models, allowing companies to customize the AI with their own data. Getty Images, for instance, allows businesses to tailor the AI to generate images that align with their specific brand style, ensuring consistency and visual identity across their creative output.


Meet your guide

Matic Broz

Matic Broz is a stock media licensing expert and photographer. He promotes proper and responsible licensing of stock photography, footage, and audio, and his writing has reached millions of creatives.