View Synthesis of Three-Dimensional Spaces Using Artificial Intelligence
For my W-Seminar project, I explored an intriguing and highly relevant topic in artificial intelligence: the synthesis of new views of three-dimensional spaces. This field, known as Novel View Synthesis, aims to develop algorithms capable of generating new views of a 3D scene based on a limited number of existing views.
Objective and Application
The primary goal of my project was to propose and optimize a new method for Novel View Synthesis; in addition, I examined and compared established methods. One practical application of this technology is generating a bird's-eye view of a parking lot from side and top-down images, which could significantly enhance the functionality of parking sensors and automotive cameras. Such a technique could eventually give self-driving cars a 360° view of their surroundings, similar to what Tesla has successfully implemented in its Full Self-Driving Beta system.
Methodology and Testing
To evaluate the different approaches, I created a virtual 3D scene using the Unity engine, where cameras were placed at various positions. These cameras captured images of randomly positioned cars within a 5×5 grid. To test the effectiveness of the methods, I defined different levels of difficulty, which varied in the rotation intervals and positions of the cars (a sketch of this setup follows below).
- Easy Difficulty: Minimal rotation of the cars, no additional positional variation.
- Medium Difficulty: Random rotation of the cars between 0° and 295°.
- Highest Difficulty: In addition to random rotation, the positions of the cars varied slightly within a small range.
The dataset consisted of 5100 image pairs, with 4000 pairs used for training, 1000 for testing, and 100 for validation.
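To make the setup concrete, here is a minimal sketch of how the car poses could be sampled per difficulty level. The 5×5 grid, the 0°–295° rotation interval, the slight positional variation, and the dataset split come from the description above; the grid spacing, jitter range, and the exact "minimal" rotation interval for the easy level are my assumptions.

```python
import random

# Illustrative pose sampling per difficulty level. GRID_SIZE, the rotation
# interval for "medium"/"hard", and the split sizes follow the text above;
# CELL_SPACING, JITTER, and the easy-level interval are assumptions.
GRID_SIZE = 5          # cars are placed within a 5x5 grid
CELL_SPACING = 4.0     # assumed spacing between grid cells (Unity units)
JITTER = 0.5           # assumed range of the slight positional variation

def sample_car_poses(difficulty):
    """Return (x, z, y_rotation) tuples, one per cell of the grid."""
    poses = []
    for row in range(GRID_SIZE):
        for col in range(GRID_SIZE):
            x, z = col * CELL_SPACING, row * CELL_SPACING
            if difficulty == "easy":
                rotation = random.uniform(0.0, 5.0)      # minimal rotation (assumed interval)
            else:
                rotation = random.uniform(0.0, 295.0)    # random rotation, as in the text
            if difficulty == "hard":
                x += random.uniform(-JITTER, JITTER)     # slight positional variation
                z += random.uniform(-JITTER, JITTER)
            poses.append((x, z, rotation))
    return poses

# 5100 image pairs: 4000 training, 1000 test, 100 validation
splits = {"train": 4000, "test": 1000, "val": 100}
```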
Utilizing Conditional GANs
For view synthesis, I proposed using the image-to-image translation capabilities of Conditional Generative Adversarial Networks (cGANs), building on the groundbreaking pix2pix work of Isola et al. (2016). The cGANs were trained to learn the mapping between two views of a 3D scene, allowing them to generate a new view from an input view.
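To illustrate the approach, here is a minimal sketch of one pix2pix-style training step, assuming PyTorch. The generator `G` and discriminator `D` are stand-ins (the original paper uses a U-Net generator and a PatchGAN discriminator that judges (input, output) pairs), and the objective follows pix2pix: an adversarial term plus a λ-weighted L1 reconstruction term.

```python
import torch
import torch.nn as nn

# Minimal sketch of one pix2pix-style cGAN training step (PyTorch assumed).
# G maps the input view to the target view; D scores (input, output) pairs
# concatenated along the channel dimension and outputs raw logits.
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
LAMBDA_L1 = 100.0  # weight of the L1 term, as in the pix2pix paper

def train_step(G, D, opt_G, opt_D, input_view, target_view):
    fake_view = G(input_view)

    # --- discriminator: push real pairs toward 1, fake pairs toward 0 ---
    opt_D.zero_grad()
    pred_real = D(torch.cat([input_view, target_view], dim=1))
    pred_fake = D(torch.cat([input_view, fake_view.detach()], dim=1))
    loss_D = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    loss_D.backward()
    opt_D.step()

    # --- generator: fool D while staying close to the ground-truth view ---
    opt_G.zero_grad()
    pred_fake = D(torch.cat([input_view, fake_view], dim=1))
    loss_G = bce(pred_fake, torch.ones_like(pred_fake)) + \
             LAMBDA_L1 * l1(fake_view, target_view)
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```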
The tests demonstrated that cGANs could indeed learn this relationship and be successfully applied to view synthesis. However, in more complex scenes, the network struggled to capture details, resulting in artifacts. These issues might be addressed through fine-tuning the hyperparameters and further optimizations.
Results
This is a video of the learning process with the easy dataset:
And here is the video of the learning process with the medium dataset:
And here is the first training session with the highest-difficulty dataset:
The network seemed to struggle with the more complex scenes, as expected. This is why I tried to improve the results by implementing the least-squares GAN loss (LSGAN). First, I tried it with the easy dataset:
The results were promising, and the training process seemed more stable. I then tried the highest-difficulty dataset again:
Here is the final result with LSGAN and a special dataset with asphalt texture and shadows enabled:
And here is a side-by-side comparison between the cGAN and LSGAN:
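For reference, the only change LSGAN makes to the pix2pix-style training step sketched earlier is the adversarial loss: binary cross-entropy is swapped for a least-squares objective, which penalizes predictions by their distance from the target label and is known to stabilize GAN training. A minimal sketch of the swap, under the same PyTorch assumptions as before:

```python
import torch
import torch.nn as nn

# LSGAN replaces the cross-entropy adversarial loss with least squares;
# everything else in the training step stays the same. The discriminator
# then outputs raw scores, with no sigmoid at the end.
mse = nn.MSELoss()

def d_loss_lsgan(pred_real, pred_fake):
    # real pairs pushed toward 1, fake pairs toward 0
    return mse(pred_real, torch.ones_like(pred_real)) + \
           mse(pred_fake, torch.zeros_like(pred_fake))

def g_loss_lsgan(pred_fake):
    # the generator tries to make fake pairs score like real ones
    return mse(pred_fake, torch.ones_like(pred_fake))
```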
Conclusion
My project demonstrated that Novel View Synthesis using cGANs is a promising approach for generating new perspectives of 3D scenes. This technology could play a significant role in applications such as automated vehicle navigation in the future. The challenges encountered in rendering details and handling complex scenes also present exciting opportunities for future research and development.
The full document is published on figshare: