| Student Names & IDs: | Shariful Islam (223009012) Tajnova Jahan (223009212) Nafis Khan (223009712) |
|---|---|
| Section: | 3 |
| Course Code: | CSE 466 |
| Course Name: | Python-Based Project Development |
| Instructor: | Arshiana Shamir |
| Department: | Computer Science & Engineering |
| University: | East Delta University |
| Date of Submission: | 30-12-2025 |
1.1 Background and Motivation
2.2 Dataset Size & Structure
2.4 Data Quality & Challenges
3.2 Technologies & Libraries
4.1 Neural Artifacts & Qualitative Results
4.2 Comparative Object Output
4.3 Error Analysis and Observations
5.1 Impact of Preprocessing
5.4 Geometric Discontinuity
6.4 Lack of Object Isolation
This project presents a fully automated 3D vision pipeline that converts standard 2D photographs into high-quality digital models. Whereas traditional methods require expensive hardware and manual calibration, this system uses a transformer-based model that regresses 3D pointmaps directly from uncalibrated smartphone photographs, with no special hardware. The core of the project is an engineered architecture comprising a staged preprocessing layer for lighting normalization, a multi-view alignment system based on Minimum Spanning Tree (MST) optimization, and a probabilistic refinement stage using Statistical Outlier Removal (SOR). Results show that the system produces detailed point clouds suitable for real-world applications such as digital archiving and robotics. We conclude that this pipeline offers an effective way to convert raw pixels into accurate spatial information without specialized sensors.
Keywords: DUSt3R, Transformer, Encoder, Decoder, ViT, Pointmap Regression, Global Alignment, Minimum Spanning Tree, (X, Y, Z) Coordinates, Cross-Attention, Statistical Outlier Removal, Confidence Map, Confidence Score, Depth Map, Triangulation, PyTorch, Trimesh, Gradio, SciPy, Point Cloud, Mesh, Digital Twin
Computing is moving beyond flat images into the 3D world. Converting 2D images into 3D objects is becoming essential for virtual reality, digital archiving, and robotics. The task is challenging because photographs carry no explicit depth, so the computer must use artificial intelligence to estimate how far each part of an object lies from the lens.
The goal is to take a few pictures of an object and reconstruct its 3D geometry. This is a regression problem in which the model predicts the spatial coordinates of every point in the scene. The key challenges are handling noisy real-world photographs and ensuring that all camera views agree on a single 3D coordinate frame.
- High-level Objective: To reconstruct 3D objects from a simple set of 2D images.
- Staged Preprocessing: To normalize lighting and sharpen the images so that salient edges stand out for the neural attention mechanism.
- Neural Inference: To regress 3D pointmaps and confidence distributions for every image pair.
- Global Optimization: To align the different camera views using MST initialization and a linear learning-rate schedule.
- Visualization: To provide a clear interface for inspecting the 3D result and the Confidence Maps used to verify the model's decisions.
The project is built on the DUSt3R transformer model and the Trimesh library. It focuses on reconstructing stationary objects and is limited by the GPU VRAM available in standard desktop computers.
- Source: A custom dataset captured with a smartphone camera, so that the system is tested in real-world, unconstrained conditions rather than in a studio environment.
- Format: Standard RGB pictures in JPEG and PNG format.
- Total Samples: Four to ten images per target object, captured in sets to ensure sufficient viewpoint overlap.
- Image Dimensions: All raw images were rescaled to the neural backbone's input size of 512 × 512 pixels, 3-channel RGB.
- Class Attributes: The data contains high-resolution texture and geometric landmarks that the Transformer relies on when regressing pointmaps.
A representative grid of four images, shown below, illustrates the variety of the input data. The grid demonstrates that successful multi-angle reconstruction requires close to 360-degree coverage of the object.
Figure 1: 2D image of the test object showing the front, side, back, and top perspectives.
- Environmental Noise: The captures exhibit real-world problems such as motion blur and uneven lighting gradients across camera angles.
- Geometric Outliers: The algorithm tended to deposit background noise as stray points at random locations in 3D space, outside the primary object.
- Ground Plane Interference: The most significant issue; the floor or table plane was always present, and the model tended to merge it with the object rather than segment it away.
The system follows a modular workflow:
- Python 3.10
- PyTorch
- Trimesh
- SciPy
- OpenCV
- Gradio
- DUSt3R
- NVIDIA GPU (Google Colab)
To help the neural backbone perceive consistent features across varying viewpoints, I added a progressive preprocessing layer. This step was essential for standardizing the input data before it entered the model:
- Staged Normalization: I boosted image contrast by a factor of 1.2×.
- Edge Sharpening: I applied 1.5× sharpening to all pictures, so that the Transformer's attention mechanism could lock onto the object's geometric outlines more reliably.
- Spatial Standardization: All raw images were rescaled to 512 × 512 pixels.
We used the DUSt3R Transformer because it can handle images without prior knowledge of the camera settings.
- Architecture: A Transformer Encoder processes image pairs, and a Decoder regresses 3D points for each pixel.
- Training Setup: Global Alignment was run for 300 iterations to bring the various views into agreement.
- Optimization: MST initialization was used to find a good starting configuration in space.
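The MST initialization idea can be illustrated on a toy pairwise-view graph with SciPy (a conceptual sketch only, not DUSt3R's internal code; the confidence values below are hypothetical):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical pairwise confidence scores between 4 views (higher = better overlap).
conf = np.array([
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.8, 0.3],
    [0.2, 0.8, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
])

# An MST minimizes total edge weight, so use cost = 1 - confidence to keep
# the most reliable view pairs as the initialization backbone.
cost = np.where(conf > 0, 1.0 - conf, 0.0)
mst = minimum_spanning_tree(cost).toarray()

rows, cols = np.nonzero(mst)
edges = sorted(tuple(sorted((int(i), int(j)))) for i, j in zip(rows, cols))
print(edges)  # -> [(0, 1), (1, 2), (2, 3)]
```

Chaining the highest-confidence pairs like this gives the optimizer a stable starting pose graph before the 300 refinement iterations.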
My work is predominantly 3D reconstruction from ordinary photos. To check whether the system was working as intended, I inspected three images the AI generates during inference: the RGB image, the Depth Map, and the Confidence Heatmap.
- RGB Perspective: The image obtained after applying my cleaning and sharpening effects.
- Depth Map: A picture showing the distance from the camera to every part of the object. Darker and brighter colors encode distance, and during testing the map tracked the object's shape accurately.
- Confidence Heatmap: A particularly important output, since it indicates how certain the AI is about its predictions. Red regions mean the AI is confident about the shape, while blue regions mean it is essentially guessing, which mostly happens in the background.
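The per-view scaling behind these displays can be sketched in a few lines of NumPy (a minimal sketch matching the `d / d.max()` normalization in the appendix code; the function name is illustrative):

```python
import numpy as np

def normalize_for_display(arr: np.ndarray) -> np.ndarray:
    """Scale a depth or confidence map into [0, 1] by its per-view maximum."""
    return arr / arr.max()

depth = np.array([[0.5, 1.0], [2.0, 4.0]])  # toy predicted distances
disp = normalize_for_display(depth)         # values scaled so the farthest point is 1
```

A colormap (e.g. 'jet' for confidence) is then applied to the normalized array to produce the heatmap images shown in the figures.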
Two objects of differing geometry and texture were used to test the pipeline and gauge its robustness across materials.
Object 1: A figurine of a cat
Figure 2: Multi-view image of the cat figurine.
RGB Image
Figure 3: Input RGB image of Object 1 after staged preprocessing.
Depth Map
Figure 4: Depth Map showing predicted spatial distance of the cat figurine.
Confidence Heatmap
Figure 5: Confidence Heatmap visualizing model reliability (Red: High Confidence; Blue: Low Confidence).
Point Cloud
Figure 6: Reconstructed 3D Point Cloud of the cat figurine.
3D Mesh
Figure 7: Solid Mesh output of the cat figurine.
Object 2: Perfume Bottle
Figure 8: Multi-view image of the perfume bottle.
RGB Image
Figure 9: Input RGB image of Object 2 after staged preprocessing.
Depth Map
Figure 10: Depth Map showing predicted spatial distance of the perfume bottle.
Confidence Heatmap
Figure 11: Confidence Heatmap visualizing model reliability of perfume bottle (Red: High Confidence; Blue: Low Confidence).
Point Cloud
Figure 12: Reconstructed 3D Point Cloud of the perfume bottle.
3D Mesh
Figure 13: Solid Mesh output of the perfume bottle.
Examining the results for the two objects revealed several patterns in the system's behavior:
- Point Cloud vs Mesh: Both the cat and the perfume bottle showed the same issue; small details were well represented in the Point Cloud but became blocky in the Mesh.
- Material Issues: The AI mapped the cat figurine's matte surface with ease. However, the glossy finish of the perfume bottle confused it in places, leaving a small hole in the model where reflections misled the depth prediction.
- Confidence Logic: The Confidence Heatmaps correctly flagged the background and floor as uncertain (blue) while marking the objects themselves as reliable (red).
- Geometric Errors: In both cases the ground plane remained attached to the object, and photographic noise left some stray points floating in space.
- Alignment: Global Alignment performed well for both objects; the various camera angles were registered without distorting or tilting the models.
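The stray floating points noted above are exactly what the Statistical Outlier Removal (SOR) step mentioned in the abstract targets. A generic k-nearest-neighbour formulation can be sketched with SciPy (the parameters k=8 and std_ratio=2.0 are assumptions for illustration, not the pipeline's actual values):

```python
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points: np.ndarray, k: int = 8,
                                std_ratio: float = 2.0) -> np.ndarray:
    """Drop points whose mean distance to their k nearest neighbours exceeds
    the global mean by more than std_ratio standard deviations."""
    tree = cKDTree(points)
    # query k+1 neighbours because each point's nearest neighbour is itself
    dists, _ = tree.query(points, k=k + 1)
    mean_d = dists[:, 1:].mean(axis=1)
    threshold = mean_d.mean() + std_ratio * mean_d.std()
    return points[mean_d < threshold]

# A dense cluster plus one far-away stray point
rng = np.random.default_rng(0)
cloud = rng.normal(0.0, 0.01, size=(200, 3))
cloud = np.vstack([cloud, [[5.0, 5.0, 5.0]]])
filtered = statistical_outlier_removal(cloud)  # the stray point is discarded
```

Points near the object have short neighbour distances and survive; isolated noise points sit far above the statistical threshold and are removed.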
Normalizing and cleaning the data was key to the project's success. Boosting contrast and brightness ensured the AI could detect matching features in both the dark and bright regions of the images.
The number of alignment iterations was the most important hyperparameter. Running 300 iterations struck the best balance between speed and accuracy.
The black plastic was highly specular and introduced considerable noise. Dark and bright spots also confused the AI: it interpreted light reflections as geometric features, producing a hollow texture in the resulting mesh, as visible in the confidence heatmap.
The system failed to reproduce the negative space (the hole) around the nozzle. The depth map could not encode sharp interior edges against the dark surface, so it produced disjointed points instead of a clean void.
The system cannot handle dark or shiny plastic materials. Light reflections are mistaken for actual physical bumps, and the surface texture in the final mesh comes out disjointed and distorted.
The model does not reproduce negative space, such as the nozzle hole. Instead of a hole, it assigns phantom points or a solid surface.
Even slight errors in the first camera frame tip the entire coordinate system. This causes the object to detach from, or sink toward, the ground plane.
The AI does not distinguish the object from its surroundings. It reconstructs the table and background as part of the same 3D object, and manual cleaning is needed to isolate the model.
We implemented and demonstrated that 3D reconstruction from 2D images can be achieved with a deep learning pipeline. Preprocessing contributed significantly by stabilizing point matching under varying lighting conditions.
The system performed best on high-contrast, matte surfaces. Specular highlights remain the most important cause of reconstruction errors, producing holes and artifacts in the mesh.
Although the current pipeline generates correct spatial information, it still needs additional semantic processing to separate objects from their background. The project provides a solid foundation for a low-cost, image-based 3D scanner.
Appendix: Full Pipeline Source Code (Google Colab)

```python
import os
import sys

%cd /content
if not os.path.exists('/content/dust3r'):
    !git clone -b dev --recursive https://github.com/camenduru/dust3r

%cd /content/dust3r
sys.path.append('/content/dust3r')

!pip install -q roma gradio einops trimesh scipy
!pip install -q https://github.com/camenduru/wheels/releases/download/colab/curope-0.0.0-cp310-cp310-linux_x86_64.whl

import torch, numpy as np, tempfile, functools, trimesh, copy, gradio
from PIL import Image, ImageEnhance
from scipy.spatial.transform import Rotation
from torchvision import transforms
import matplotlib.pyplot as pl

# downloading pretrained weight from dust3r
!mkdir -p /content/dust3r/checkpoints
WEIGHTS_PATH = '/content/dust3r/checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth'
if not os.path.exists(WEIGHTS_PATH):
    !wget https://huggingface.co/camenduru/dust3r/resolve/main/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth -P /content/dust3r/checkpoints

import tqdm
from dust3r.image_pairs import make_pairs
from dust3r.utils.image import load_images, rgb
from dust3r.utils.device import to_numpy, to_cpu, collate_with_cat as collate
from dust3r.viz import add_scene_cam, CAM_COLORS, OPENGL, pts3d_to_trimesh, cat_meshes
from dust3r.model import AsymmetricCroCo3DStereo, inf
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

torch.backends.cuda.matmul.allow_tf32 = True
BATCH_SIZE = 1


def initialize(model_path, device):
    ckpt = torch.load(model_path, map_location='cpu', weights_only=False)
    args = ckpt['args'].model.replace("ManyAR_PatchEmbed", "PatchEmbedDust3R")

    if 'landscape_only' not in args:
        args = args[:-1] + ', landscape_only=False)'
    else:
        args = args.replace(" ", "").replace('landscape_only=True', 'landscape_only=False')

    net = eval(args)
    net.load_state_dict(ckpt['model'], strict=False)
    return net.to(device)


def interleave(img1, img2):
    res = {}
    for key, value1 in img1.items():
        value2 = img2[key]
        if isinstance(value1, torch.Tensor):
            value = torch.stack((value1, value2), dim=1).flatten(0, 1)
        else:
            value = [x for pair in zip(value1, value2) for x in pair]
        res[key] = value
    return res


def loss_function(batch, model, device):
    view1, view2 = batch
    for view in batch:
        for name in 'img pts3d valid_mask camera_pose camera_intrinsics'.split():
            if name in view:
                view[name] = view[name].to(device, non_blocking=True)

    view1, view2 = (interleave(view1, view2), interleave(view2, view1))

    with torch.cuda.amp.autocast(enabled=True):
        pred1, pred2 = model(view1, view2)

    return dict(view1=view1, view2=view2, pred1=pred1, pred2=pred2)


@torch.no_grad()
def inference(pairs, model, device, batch_size=1):
    result = []
    for i in tqdm.trange(0, len(pairs), batch_size):
        res = loss_function(collate(pairs[i:i + batch_size]), model, device)
        result.append(to_cpu(res))

    return collate(result, lists=True)


def preprocess(image_paths):
    cleaned_paths = []
    for i, path in enumerate(image_paths):
        img = Image.open(path).convert("RGB")

        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.2)

        enhancer = ImageEnhance.Sharpness(img)
        img = enhancer.enhance(1.5)

        save_path = os.path.join(tempfile.gettempdir(), f"input_refined_{i}.png")
        img.save(save_path)
        cleaned_paths.append(save_path)
    return cleaned_paths


def create_final(outdir, imgs, pts3d, mask, focals, cams2world, as_pointcloud=False):
    pts3d, imgs, focals, cams2world = map(to_numpy, [pts3d, imgs, focals, cams2world])
    scene = trimesh.Scene()

    if as_pointcloud:
        pts = np.concatenate([p[m] for p, m in zip(pts3d, mask)])
        col = np.concatenate([imgs[i][mask[i]] for i in range(len(imgs))])
        geometry = trimesh.PointCloud(pts.reshape(-1, 3), colors=col.reshape(-1, 3))
    else:
        meshes = [pts3d_to_trimesh(imgs[i], pts3d[i], mask[i]) for i in range(len(imgs))]
        geometry = trimesh.Trimesh(**cat_meshes(meshes))

    centroid = geometry.centroid
    geometry.apply_translation(-centroid)

    scene.add_geometry(geometry)

    flip_correction = np.eye(4)
    flip_correction[1, 1] = -1
    flip_correction[2, 2] = -1
    scene.apply_transform(flip_correction)

    outfile = os.path.join(outdir, 'object.glb')
    scene.export(file_obj=outfile)
    return outfile


def run_pipeline(outdir, model, device, img_size, filelist, niter, as_pc, refinement):
    processed_list = preprocess(filelist)

    imgs = load_images(processed_list, size=img_size)
    if len(imgs) == 1:
        imgs = [imgs[0], copy.deepcopy(imgs[0])]
        imgs[1]['idx'] = 1

    pairs = make_pairs(imgs, scene_graph="complete", prefilter=None, symmetrize=True)
    inference_output = inference(pairs, model, device, batch_size=BATCH_SIZE)

    mode = GlobalAlignerMode.PointCloudOptimizer if len(imgs) > 2 else GlobalAlignerMode.PairViewer
    scene_obj = global_aligner(inference_output, device=device, mode=mode)

    if mode == GlobalAlignerMode.PointCloudOptimizer:
        scene_obj.compute_global_alignment(init='mst', niter=niter, schedule='linear', lr=0.01)

    if refinement:
        scene_obj = scene_obj.clean_pointcloud()

    glb_path = create_final(outdir, scene_obj.imgs, scene_obj.get_pts3d(),
                            to_numpy(scene_obj.get_masks()), scene_obj.get_focals().cpu(),
                            scene_obj.get_im_poses().cpu(), as_pointcloud=as_pc)

    artifacts = []
    cmap = pl.get_cmap('jet')
    depths = to_numpy(scene_obj.get_depthmaps())
    confs = to_numpy([c for c in scene_obj.im_conf])

    for i in range(len(scene_obj.imgs)):
        artifacts.append((scene_obj.imgs[i], f"View {i+1}: RGB"))

        d_norm = depths[i] / depths[i].max()
        artifacts.append((rgb(d_norm), f"View {i+1}: Depth Map"))

        c_norm = cmap(confs[i] / confs[i].max())
        artifacts.append((rgb(c_norm), f"View {i+1}: Confidence Heatmap"))

    return scene_obj, glb_path, artifacts


import gradio as gr
import functools

css = """
footer {display: none !important;}
#gradio-menu, .built-with, .api-link, #settings-button {display: none !important;}

:root {
    --primary-500: #FFFFFF !important;
    --body-background-fill: #000000 !important;
    --block-background-fill: #000000 !important;
    --input-background-fill: #000000 !important;
    --border-color-primary: #333333 !important;
    --background-fill-secondary: #000000 !important;
}

.gradio-container {
    background-color: #000000 !important;
    color: #FFFFFF !important;
    font-family: 'Inter', system-ui, sans-serif !important;
}

button.primary {
    background-color: #FFFFFF !important;
    color: #000000 !important;
    border-radius: 0px !important;
    font-weight: 600 !important;
    text-transform: uppercase;
    letter-spacing: 1px;
}

button.primary:hover {
    background-color: #B2B2B2 !important;
}

.label { color: #808080 !important; text-transform: uppercase; font-size: 11px !important; }

#model-container:fullscreen {
    background-color: black;
    width: 100vw;
    height: 100vh;
}

.generating::after {
    content: "SYSTEM::RECONSTRUCTING_GEOMETRY";
    color: #808080;
    font-size: 10px;
    letter-spacing: 2px;
    animation: blink 1.2s infinite;
}
@keyframes blink { 50% { opacity: 0; } }
"""

fullscreen_js = """
() => {
    const el = document.getElementById('model-container');
    if (el.requestFullscreen) {
        el.requestFullscreen();
    } else if (el.webkitRequestFullscreen) { /* Safari */
        el.webkitRequestFullscreen();
    } else if (el.msRequestFullscreen) { /* IE11 */
        el.msRequestFullscreen();
    }
}
"""


def build(out_dir, vision_engine, dev_mode):
    pipeline = functools.partial(run_pipeline, out_dir, vision_engine, dev_mode, 512)

    with gr.Blocks(title="3D Object Reconstruction", css=css, theme=gr.themes.Base(), fill_width=True) as interface:
        gr.Markdown("# 3D Object Reconstruction")

        with gr.Row():
            with gr.Column(scale=1):
                input_files = gr.File(file_count="multiple", label="Images")
                run_btn = gr.Button("Run Inference", variant="primary")
                with gr.Accordion("Settings", open=False):
                    n_iterations = gr.Slider(100, 1000, 300, label="Alignment Iteration")
                    render_mode = gr.Checkbox(True, label="Render as Point Cloud")
                    post_proc = gr.Checkbox(True, label="Filter Background Points")
                    clean_depth = gr.Checkbox(True, label="Clean-up depthmaps")

            with gr.Column(scale=2):
                output_model = gr.Model3D(label="3D Output", height=600, elem_id="model-container")
                full_screen_btn = gr.Button("Toggle Full Screen ⛶", size="sm")

        gr.Markdown("---")

        with gr.Row():
            with gr.Column():
                gr.Markdown("## RGB | DEPTH | CONFIDENCE")
                artifact_gallery = gr.Gallery(columns=3, height="auto", label="Logs")

        full_screen_btn.click(None, None, None, js=fullscreen_js)

        saved_state = gr.State()
        run_btn.click(fn=pipeline,
                      inputs=[input_files, n_iterations, render_mode, post_proc],
                      outputs=[saved_state, output_model, artifact_gallery],
                      show_progress="minimal")

    interface.launch(share=True, show_api=False)


OUTPUT_DIR = '/content/dust3r/results'
os.makedirs(OUTPUT_DIR, exist_ok=True)
print("Loading Model...")
model_engine = initialize(WEIGHTS_PATH, 'cuda')

build(OUTPUT_DIR, model_engine, 'cuda')
```