VoMP: Predicting Volumetric Mechanical Properties

Paper, Project Page

Description:

VoMP predicts physically accurate volumetric mechanical property fields (Young's modulus, Poisson's ratio, and density) throughout the volume of 3D objects across multiple representations including meshes, Gaussian Splats, NeRFs, and SDFs, enabling their use in realistic deformable simulations.

This model is ready for commercial/non-commercial use.

License/Terms of Use:

Use of the model is governed by the NVIDIA Open Model License Agreement.

Models at a Glance

Geometry Transformer

  • Predicts per-voxel 2D material latent codes from 3D geometry and multi-view images
  • Consumes aggregated DINOv2 visual features and sparse voxel grids; handles up to 32,768 non-empty voxels per object
  • Drives end-to-end material field prediction when paired with MatVAE for decoding into (E, ν, ρ)

MatVAE (Material VAE)

  • Decodes 2D material latents into physically valid mechanical property triplets (E, ν, ρ)
  • Enforces real-world physical plausibility learned from MTD (Material Triplet Dataset) via a VAE with a normalizing-flow posterior
  • Provides a compact, continuous material manifold used by the Geometry Transformer at inference

Deployment Geography:

Global

Use Case:

VoMP is expected to be used by researchers, simulation engineers, digital twin developers, and 3D content creators who need to automatically augment 3D assets with physically accurate material properties for realistic physics simulation. Key applications include Digital Twins (virtual replicas of real systems), Real-2-Sim workflows (generating digital simulations from the real world), Sim-2-Real transfer (training policies in simulation for real-world deployment), and robotics simulation.

Reference(s):

VoMP: Predicting Volumetric Mechanical Property Fields. Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji Tsang, Anka He Chen, Anita Hu, Gavriel State, David I.W. Levin, and Maria Shugrina. arXiv 2025.

Model Architecture:

Architecture Type: Transformer (Geometry Transformer) + Variational Autoencoder (MatVAE)

Network Architecture: VoMP (custom architecture) consisting of:

  • Geometry Transformer: Decoder-only Sparse Transformer with 3D Swin attention, 12 transformer blocks, 12 attention heads, MLP ratio of 4, window size 8, 768 model channels
  • MatVAE: Variational Autoencoder with 2-dimensional latent space, 3-layer encoder/decoder with 256 hidden dimensions, radial normalizing flow
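The radial normalizing flow used for MatVAE's posterior follows the standard form f(z) = z + β·h(α, r)·(z − z0) with h(α, r) = 1/(α + r) (Rezende & Mohamed, 2015). A minimal NumPy sketch, with illustrative (not trained) parameter values:

```python
import numpy as np

def radial_flow(z, z0, alpha, beta):
    """Apply one radial flow transform f(z) = z + beta * h(alpha, r) * (z - z0),
    with r = ||z - z0|| and h(alpha, r) = 1 / (alpha + r). Such transforms warp
    a simple Gaussian posterior into a more flexible one around the point z0."""
    r = np.linalg.norm(z - z0, axis=-1, keepdims=True)
    h = 1.0 / (alpha + r)
    return z + beta * h * (z - z0)

# Illustrative use on a batch of 2-D latents (MatVAE's latent dimension).
z = np.random.randn(256, 2)
z_flowed = radial_flow(z, z0=np.zeros(2), alpha=1.0, beta=0.5)
```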

This model was developed based on: DINOv2-ViT-L/14 for visual feature extraction; TRELLIS for Geometry Transformer backbone initialization

Number of model parameters: 85.81M (Geometry Transformer) + 304.3M (DINOv2) + 403.7K (MatVAE)

Per‑Model Details

Geometry Transformer

  • Role: Maps sparse voxel features (from multi‑view images + voxelization) to a per‑voxel 2D material latent
  • Backbone: 3D Swin attention, 12 blocks, 12 heads, model dim 768, window size 8, MLP ratio 4
  • Loss: ℓ2 loss on decoded properties via MatVAE

MatVAE (Material VAE)

  • Role: Defines the material manifold and decodes latents to (E, ν, ρ)
  • Design: 2‑D latent, 3‑layer MLP encoder/decoder (256 hidden), radial normalizing flow posterior
  • Objective: Modified β‑TC‑VAE with KL/MI/TC weighting and annealing

Both models are trained with the AdamW optimizer (learning rate 1×10^-4; weight decay 5×10^-2 for the Geometry Transformer, 1×10^-4 for MatVAE).

  • Geometry Transformer: FP16 precision, 200,000 steps, batch size 16 (4 per GPU on 4× A100), gradient clipping at 1.0, cosine annealing LR scheduler, EMA rate 0.9999, ℓ2 loss on decoded properties
  • MatVAE: FP32 precision, 850 epochs, batch size 256, gradient clipping at 5.0, cosine annealing to a final LR of 1×10^-5, modified β-TC-VAE objective with weights α=1.0 (KL), β=2.0 (TC), γ=1.0 (MI), free nats constraint δ=0.1, KL annealing over 200 epochs, and log min-max normalization of targets
  • Regularization: weight decay, dropout rate 0.05 in MatVAE, capacity constraints to prevent posterior collapse, and a radial normalizing flow for a flexible posterior distribution
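The cosine annealing schedule quoted above (for MatVAE, 1×10^-4 annealed to a final 1×10^-5) has the standard closed form, sketched here:

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-4, lr_min: float = 1e-5) -> float:
    """Cosine-annealed learning rate, decaying smoothly from lr_max at
    step 0 to lr_min at total_steps (the schedule form used here)."""
    t = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At step 0 this returns lr_max; at the final step it reaches lr_min, decreasing monotonically in between.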

Computational Load

We did not track compute load across all experiments; the figures below cover only the final training runs.

Cumulative Compute: Training performed on 4× NVIDIA A100 80GB GPUs for approximately 5 days (Geometry Transformer) and 12 hours (MatVAE). The estimated FLOPs for training is 2.4 x 10^20 FLOPs.

Estimated Energy and Emissions for Model Training: Not measured. Estimate assuming 4× GPUs at 450 W over the full training duration (~132 hours): 237.6 kWh.

Input:

Input Type(s): 3D Geometry (Mesh, Gaussian Splat, NeRF, SDF) + Multi-view Images

Input Format(s):

  • 3D representations: Triangle meshes (USD format), 3D Gaussian Splats, Neural Radiance Fields (NeRF), Signed Distance Functions (SDF)
  • Rendered images: RGB, 518×518 pixels per view

Input Parameters: 3D volumetric data (64³ voxel grid resolution), 2D multi-view images (150 views)

Other Properties Related to Input:

  • 3D objects are normalized to [-0.5, 0.5]³ bounding box
  • Multi-view rendering: 150 camera views sampled via Hammersley sequence on a sphere, radius 2 units, 40° field of view, 512×512 pixel resolution
  • Feature extraction: DINOv2-ViT-L/14 processes images at 518×518 pixels (37×37 patches), producing 1024-dimensional features per patch
  • Voxelization: 64³ resolution grid with volumetric interior sampling
  • Maximum 32,768 non-empty voxels processed per object (stochastically subsampled if exceeded)
  • Pre-processing: Voxelization (31ms for Gaussian Splats), rendering (2.11s), DINOv2 feature extraction (0.86s), feature aggregation to voxels (0.58s)
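The Hammersley camera sampling described above can be sketched as follows. The radical-inverse construction and equal-area sphere mapping are the standard ones; the radius matches the 2-unit value quoted here, but the exact axis conventions are an assumption:

```python
import numpy as np

def hammersley_sphere(n: int, radius: float = 2.0) -> np.ndarray:
    """Sample n camera positions on a sphere via the 2D Hammersley point
    set (i/n, radical inverse base 2), mapped to the sphere with the
    equal-area (Lambert) projection."""
    pts = np.empty((n, 3))
    for i in range(n):
        u = i / n
        # Radical inverse in base 2 (van der Corput sequence).
        v, f, x = 0.0, 0.5, i
        while x:
            v += f * (x & 1)
            x >>= 1
            f *= 0.5
        phi = 2.0 * np.pi * u            # azimuth from the first coordinate
        cos_theta = 1.0 - 2.0 * v        # equal-area mapping of inclination
        sin_theta = np.sqrt(max(0.0, 1.0 - cos_theta ** 2))
        pts[i] = radius * np.array([sin_theta * np.cos(phi),
                                    sin_theta * np.sin(phi),
                                    cos_theta])
    return pts

cams = hammersley_sphere(150)  # 150 views at radius 2, as quoted above
```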

Per‑Model Inputs

Geometry Transformer

  • Geometry: voxelized to 64³ using Open3D (volumetric flood‑fill of interiors) after normalization to [-0.5, 0.5]³
  • Images: 150 multi‑view renders, DINOv2‑ViT‑L/14 features aggregated to voxel centers
  • Sparse voxels: up to 32,768 non‑empty voxels processed per object

MatVAE

  • Tabular: normalized material triplets from MTD for training (log10(E), ν, log10(ρ))
  • Latent codes: 2‑D latents produced by Geometry Transformer during inference

Output:

Output Type(s): Volumetric material property fields (continuous numerical values)

Output Format: Per-voxel mechanical property triplets: Young's modulus E (Pascals), Poisson's ratio ν (dimensionless, range [0, 0.5]), density ρ (kg/m³)

Output Parameters: 3D volumetric fields at 64³ voxel grid resolution, with up to 32,768 voxels per object

Other Properties Related to Output:

  • Each voxel outputs a 2-dimensional material latent code decoded by MatVAE into (E, ν, ρ) triplet
  • Values guaranteed to lie within physically plausible ranges learned from 100,562 real-world material measurements
  • Properties normalized during training: log₁₀(E) and log₁₀(ρ) to [0,1], ν directly to [0,1]
  • Output can be transferred to original 3D representation via nearest-neighbor interpolation
  • Inference time: 3.59s average (8.2ms for Geometry Transformer, 0.32ms for MatVAE decoding)
  • Compatible with accurate FEM simulators (libuipc), sparse Simplicits for large-scale simulations
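The normalization quoted above (log₁₀ min-max for E and ρ, linear scaling for ν) can be sketched as below. The min/max bounds are taken from the GVM property ranges quoted later in this card; whether these exact constants were used in training is an assumption:

```python
import numpy as np

# Assumed bounds, from the GVM ranges in this card:
# E in [1.0e5, 2.8e11] Pa, rho in [50, 19300] kg/m^3, nu in [0, 0.5].
E_RANGE = (np.log10(1.0e5), np.log10(2.8e11))
RHO_RANGE = (np.log10(50.0), np.log10(19300.0))
NU_RANGE = (0.0, 0.5)

def normalize(E, nu, rho):
    """Map (E, nu, rho) into [0, 1]^3: log min-max for E and rho, linear for nu."""
    e = (np.log10(E) - E_RANGE[0]) / (E_RANGE[1] - E_RANGE[0])
    n = (nu - NU_RANGE[0]) / (NU_RANGE[1] - NU_RANGE[0])
    r = (np.log10(rho) - RHO_RANGE[0]) / (RHO_RANGE[1] - RHO_RANGE[0])
    return e, n, r

def denormalize(e, n, r):
    """Invert normalize(): map [0, 1]^3 back to physical (E, nu, rho)."""
    E = 10.0 ** (E_RANGE[0] + e * (E_RANGE[1] - E_RANGE[0]))
    nu = NU_RANGE[0] + n * (NU_RANGE[1] - NU_RANGE[0])
    rho = 10.0 ** (RHO_RANGE[0] + r * (RHO_RANGE[1] - RHO_RANGE[0]))
    return E, nu, rho
```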

Per‑Model Outputs

Geometry Transformer

  • Predicts per‑voxel 2‑D material latents; no direct physical units
  • Latents are passed to MatVAE for decoding

MatVAE

  • Decodes 2‑D latent into (E [Pa], ν [0..0.5], ρ [kg/m³])
  • Ensures outputs lie within real‑world distributions learned from MTD

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Not Applicable (custom PyTorch-based inference pipeline). DINOv2 feature extraction is accelerated with TensorRT via torch.compile.

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere (tested on A100 80GB)
  • NVIDIA RTX (tested on RTX A6000)

Preferred/Supported Operating System(s):

  • Linux (development and testing performed on Linux environments)
  • Expected to work on Windows and macOS, but not tested

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

The VoMP model can be integrated into an AI system by: (1) Using the voxelization pipeline to convert input 3D geometry into a 64³ voxel grid, (2) Rendering 150 multi-view images using the provided camera sampling, (3) Extracting DINOv2 features and aggregating them to voxel centers, (4) Running the Geometry Transformer to predict per-voxel material latent codes, (5) Decoding latents with MatVAE to obtain (E, ν, ρ) values, (6) Transferring voxel materials to the target representation via nearest-neighbor interpolation.
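Step (6) above, transferring voxel materials to the original representation, amounts to a nearest-neighbor lookup from query points (mesh vertices, Gaussian centers, etc.) to voxel centers. A brute-force NumPy sketch (function names are illustrative, not the released API; a KD-tree would be used at scale):

```python
import numpy as np

def transfer_materials(voxel_centers, voxel_props, query_points):
    """Assign each query point the (E, nu, rho) triplet of its nearest
    voxel center. voxel_centers: (V, 3); voxel_props: (V, 3);
    query_points: (Q, 3). Returns (Q, 3) material triplets."""
    # Pairwise squared distances, shape (Q, V).
    d2 = ((query_points[:, None, :] - voxel_centers[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return voxel_props[nearest]
```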

Model Version(s):

v1.0 - Initial research release with Geometry Transformer (85.81M parameters + 304.3M parameters for DINOv2) and MatVAE (403.7K parameters). Trained on GVM dataset (1,664 objects, 8,089 segments, 37M voxels) and MTD dataset (100,562 material triplets).

Training Dataset (Per Model):

For Geometry Transformer — GVM

Link: GVM (Geometry with Volumetric Materials) — NVIDIA assets with VLM‑assisted labels (Qwen 2.5 VL‑72B)

Data Modality:

  • 3D Geometry (meshes with part segmentation)
  • Image (multi-view rendered RGB images, 512×512 pixels)
  • Text (material names, semantic labels)
  • Tabular (material property triplets: E, ν, ρ)

For MatVAE — MTD

Link: MTD (Material Triplet Dataset) — Derived from public material science databases (MatWeb, Engineering Toolbox, Cambridge)

Data Modality:

  • Tabular (material property triplets: E, ν, ρ)

Image Training Data Size:

  • More than 100,000 Images (1,333 training objects × 150 views = ~200K training images; total across all splits ~250K images)

Non-Audio, Image, Text Training Data Size:

  • 3D Geometry: 1,333 training objects with 6,477 part segments
  • Volumetric Data: 28.7 million annotated voxels at 64³ resolution
  • Material Triplets (MTD for MatVAE): ~81,000 training samples (80% of 100,562)

Data Collection Method by dataset:

  • Hybrid: Human (3D asset creation, PBR texturing, part segmentation, material science measurements), Automated (VLM-assisted annotation with Qwen 2.5 VL-72B). We have not performed the human data collection ourselves, but the data is present in either public NVIDIA datasets or external public datasets.

Labeling Method by dataset:

  • Hybrid: Human (material names, part segmentation, reference material measurements), Automated (VLM-generated material property triplets guided by real-world constraints). We have not performed the human labeling ourselves, but the data is present in either public NVIDIA datasets or external public datasets.

Properties (Quantity, Dataset Descriptions, Sensor(s)): Training set contains 1,333 objects with 6,477 segments and 28.7M voxels. Data modalities include (i) 3D meshes (triangle meshes in USD format with part-level segmentation), multi-view RGB images (512×512 rendered views), text (English material names), and tabular material properties (E, ν, ρ triplets); (ii) Nature of content is non-personal, proprietary 3D assets (NVIDIA asset libraries) combined with public domain material science data, no copyright-protected creative content, machine-generated annotations (VLM) constrained by human-measured physical properties; (iii) Linguistic characteristics include English material names and semantic object labels. No sensors were used for data collection; 3D assets are human-modeled, material properties are from laboratory measurements (ASTM standard testing), and images are path-traced renders. Average 4.86 segments per object (±11.97 std dev), 21,537 voxels per object (±23,431 std dev). Material property ranges: E [1.0×10^5, 2.8×10^11 Pa], ν [0.16, 0.49], ρ [50, 19,300 kg/m³].

Testing Dataset:

Data Collection Method by dataset:

  • Hybrid: Human, Automated (VLM-assisted)

Labeling Method by dataset:

  • Hybrid: Human, Automated (VLM-guided)

Properties (Quantity, Dataset Descriptions, Sensor(s)): Test set contains 166 objects with 1,060 segments and 4.9M voxels (13.1% of total dataset). Same modalities, content nature, and linguistic characteristics as training data. Average 6.39 segments per object (±11.33 std dev), 29,571 voxels per object (±25,987 std dev). Held-out objects ensure no overlap with training data.

Evaluation Dataset:

  • Primary: GVM test split (166 objects, 4.9M voxel annotations)
  • Secondary: ABO-500 mass estimation benchmark - 500 objects with ground truth mass labels

Benchmark Score:

  • ALDE: Average Log Displacement Error
  • ADE: Average Displacement Error
  • ALRE: Average Log Relative Error
  • ARE: Average Relative Error
  • MnRE: Minimum Ratio Error
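For reference, these metrics are commonly defined as below; the exact formulas are an assumption based on prior mass-estimation benchmarks, not stated in this card:

```python
import numpy as np

def metrics(pred, gt):
    """Compute ADE, ALDE, ARE, MnRE between predicted and ground-truth
    positive scalar properties (e.g. per-object mass or per-voxel E)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ade = np.mean(np.abs(pred - gt))                  # mean absolute error
    alde = np.mean(np.abs(np.log(pred) - np.log(gt)))  # mean abs log error
    are = np.mean(np.abs(pred - gt) / gt)             # mean relative error
    mnre = np.mean(np.minimum(pred / gt, gt / pred))  # ratio metric, 1 is best
    return dict(ADE=ade, ALDE=alde, ARE=are, MnRE=mnre)
```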

On GVM test set (per-object averaged):

  • Young's Modulus: ALDE 0.3793 (±0.29), ALRE 0.0409 (±0.04)
  • Poisson's Ratio: ADE 0.0241 (±0.01), ARE 0.0818 (±0.03)
  • Density: ADE 142.69 kg/m³ (±166.90), ARE 0.0921 (±0.07)
  • Material Validity: log(E) 0.29% (±1.23), ν 0.00% (±0.00), ρ 11.75% (±4.02) relative error to nearest real-world material

On ABO-500 mass estimation benchmark:

  • ALDE: 0.631, ADE: 8.433, ARE: 0.887, MnRE: 0.576

Data Collection Method by dataset:

  • Hybrid: Human, Automated (for GVM test split); Human (for ABO-500)

Labeling Method by dataset:

  • Hybrid: Human, Automated (for GVM test split); Human (for ABO-500 ground truth mass)

Properties (Quantity, Dataset Descriptions, Sensor(s)): GVM test set: 166 objects, 4.9M annotated voxels with ground truth (E, ν, ρ) per voxel. Significantly larger evaluation benchmark than prior works (e.g., NeRF2Physics used 31 points across 11 objects). ABO-500: 500 Amazon objects with measured mass labels, used for density validation via volume integration. Both evaluation sets contain non-personal, non-sensitive 3D geometry and material/mass labels.

Inference:

Acceleration Engine: PyTorch with CUDA acceleration; DINOv2 optimized implementation (TensorRT)

Test Hardware:

  • NVIDIA A100 80GB GPU (primary development/testing)
  • NVIDIA RTX A6000 GPU (development/testing)

Average Inference Time: 3.59s (±1.36s) per object on A100 GPU for Gaussian Splats

  • Rendering: 2.11s
  • Voxelization: 0.03s (31ms for Gaussian Splats)
  • DINOv2 feature extraction: 0.86s
  • Feature aggregation: 0.58s
  • Geometry Transformer: 0.0082s
  • MatVAE decoding: 0.00032s

Model Limitations and Responsible Use:

(1) Material property predictions should be validated through physical testing before deployment in safety-critical applications (structural engineering, robotics, medical devices), as inaccurate predictions could lead to unsafe designs if used without verification. (2) The model is trained on common objects (furniture, architectural elements, vegetation) and may not generalize to specialized materials (advanced composites, metamaterials, biological tissues) outside the training distribution. (3) Fixed 64³ voxel resolution limits capture of fine-grained heterogeneous material structures; users should assess whether resolution is sufficient for their application. (4) Isotropic material assumption may not hold for anisotropic materials (wood grain, fiber composites); users should verify appropriateness for their use case. (5) Model outputs (E, ν, ρ) are compatible with accurate FEM simulators but may require adaptation for fast approximate simulators (XPBD, MPM) that use simulator-specific parameter scales.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or concerns at: https://app.intigriti.com/programs/nvidia/nvidiavdp/detail.

Bias

Field Response
Participation considerations from adversely impacted groups protected classes in model design and testing: None.
Measures taken to mitigate against unwanted bias: None.
Bias Metric (If Measured): None.

Explainability

Field Response
Intended Task/Domain: 3D Vision, Physical Simulation, Digital Twins, Real-2-Sim, Sim-2-Real
Model Type: Geometry Transformer (3D Sparse Transformer with Swin attention), MatVAE (VAE)
Intended Users: Researchers in Computer Vision and Graphics, Simulation Engineers, Digital Twin Developers, Robotics Researchers, Machine Learning Engineers, 3D Content Creators
Output: Volumetric mechanical property fields: Young's modulus (E, Pa), Poisson's ratio (ν), and density (ρ, kg/m³) for every voxel within a 3D object.
Describe how the model works: We present two models trained on two datasets. Geometry Transformer: Predicts per‑voxel 2‑D material latents from geometry + appearance cues; inputs are 64³ voxel grid and aggregated DINOv2 features from 150 renders; uses 3D Swin attention with 12 blocks, 12 heads, 768 channels; outputs latent field at up to 32,768 non‑empty voxels. MatVAE: Decodes 2‑D latents into (E, ν, ρ) and regularizes the latent manifold to real‑world physics; uses VAE with 2‑D latent, 3‑layer encoder/decoder, radial flow posterior, β‑TC‑VAE objective; outputs per‑voxel physically plausible material properties.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Not Applicable.
Technical Limitations & Mitigation: This model may not work well under the following conditions: (1) Highly heterogeneous internal material structures may be oversmoothed due to fixed 64³ voxel resolution; (2) Anisotropic materials (e.g., wood grain) are approximated as isotropic since training assumes part-level isotropy; (3) Very thin structures or fine geometric details below the voxel resolution may be lost during voxelization; (4) Objects significantly different from training distribution (commercial, residential, simready, vegetation datasets) may yield less accurate predictions; (5) Extremely large objects exceeding the 32,768 voxel processing limit are stochastically subsampled, potentially missing localized material variations; (6) The model predicts only E, ν, ρ and does not estimate additional properties like yield strength, shear modulus, or thermal expansion coefficients.
Verified to have met prescribed NVIDIA quality standards: Yes
Performance Metrics: Average Log Displacement Error (ALDE), Average Displacement Error (ADE), Average Log Relative Error (ALRE), Average Relative Error (ARE) for Young's modulus, Poisson's ratio, and density; Mass Estimation metrics (ALDE, ADE, ARE, MnRE); Material Validity (relative error to nearest real-world material range); Wall-clock inference time; Throughput; Simulation fidelity through visual inspection of elastodynamic simulations.
Potential Known Risks: If the model does not work as intended: (1) Inaccurate material property predictions may lead to unrealistic physics simulations with incorrect deformation behavior, affecting Digital Twin fidelity, Sim-2-Real transfer, and design validation workflows; (2) Overestimated stiffness (Young's modulus) may cause objects to appear overly rigid in simulation, while underestimation causes excessive deformation; (3) Incorrect density predictions impact mass distribution, gravitational effects, and momentum in simulations; (4) Invalid Poisson's ratio values could cause simulation instability or non-physical volume changes.
Licensing: Use of the model is governed by the NVIDIA Open Model License Agreement.

Privacy

Field Response
Generatable or reverse engineerable personal data? No.
Personal data used to create this model? No.
How often is dataset reviewed? During dataset creation, model training, evaluation, and before release
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? No.
Is there provenance for all datasets used in training? Yes. Per‑dataset provenance documented: MTD: Public material science databases (MatWeb, Wikipedia, Engineering Toolbox, Cambridge); GVM: NVIDIA asset libraries (commercial, residential, simready, vegetation) with VLM‑assisted labels
Does data labeling (annotation, metadata) comply with privacy laws? Yes.
Is data compliant with data subject requests for data correction or removal, if such a request was made? No. Not possible with externally-sourced data.
Applicable Privacy Policy https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety

Field Response
Model Application Field(s): Robotics; Media & Entertainment; Digital Twins; Design and Engineering
Describe the life critical impact (if present). Not Applicable
Use Case Restrictions: Abide by NVIDIA Open Model License Agreement.
Model and dataset restrictions: The Principle of Least Privilege (PoLP) was applied, limiting access during dataset generation and model development. Dataset access was restricted during training, and dataset license constraints were adhered to.

Citation

@article{dagli2025vomp,
    title={VoMP: Predicting Volumetric Mechanical Property Fields},
    author={Dagli, Rishit and Xiang, Donglai and Modi, Vismay and Loop, Charles and Tsang, Clement Fuji and 
            Chen, Anka He and Hu, Anita and State, Gavriel and Levin, David I.W. and Shugrina, Maria},
    journal={arXiv preprint},
    year={2025}
}
