[1909.12146v1] DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation
Abstract: We present a novel dataset for training and benchmarking semantic SLAM
methods. The dataset consists of 200 long sequences, each one containing
3000-5000 data frames. We generate the sequences from realistic home layouts:
we sample trajectories that simulate the motion of a simple home robot and then
render frames along those trajectories. Each data frame contains a)
RGB images generated using physically-based rendering, b) simulated depth
measurements, c) simulated IMU readings and d) ground truth occupancy grid of a
house. Our dataset serves a wider range of purposes compared to existing
datasets and is the first large-scale benchmark focused on the mapping
component of SLAM. The dataset is split into train/validation/test parts
sampled from different sets of virtual houses. We present benchmarking results
for both classical geometry-based and recent learning-based SLAM algorithms, a
baseline mapping method, and semantic and panoptic segmentation.
Fig. 1: DISCOMAN dataset provides realistic indoor sequences with ground truth annotation for odometry, mapping and semantic segmentation. (Introduction)

Fig. 2: Sample trajectories from outdoor KITTI (top row), indoor DISCOMAN (middle row) and TUM RGB-D (bottom row) benchmarks. The trajectories in DISCOMAN are slightly more difficult than those in KITTI, but less complex than those in TUM RGB-D. (Introduction)

Fig. 3: Samples of generated trajectories. Color coding: red = sampled keypoints, blue = final trajectory after smoothing, black = occupied areas, white = free area, grey = area where keypoints cannot be sampled. The panels show the effect of the number of keypoints per trajectory: (a) 10, (b) 30, (c) 100 keypoints. The more keypoints we add, the more curved the trajectory gets. (Related work)

Fig. 4: Example frames from the DISCOMAN dataset. From top to bottom: RGB image, depth with emulated sensor noise, pixel-wise semantic annotation. Notice the holes in the depth maps on reflective and black surfaces. (Related work)

Fig. 5: Qualitative results of trajectory estimation. The DISCOMAN dataset is difficult for sparse SLAM methods such as DSO (monocular) and ORB-SLAM2 (RGB-D), mainly because of the abundance of fast rotations and low-textured surfaces, e.g. white walls. The learning-based methods LS-VO (monocular) and Motion Maps (RGB-D) show higher robustness, but in most cases lower accuracy. (Trajectory estimation)

Fig. 6: Example of a mapping result obtained with Open3D using camera poses from the Motion Maps method. (a) occupancy grid of a 3D scene, (b) occupancy grid obtained with Open3D from ground truth camera poses and ground truth depth, which we take as the ground truth map, (c) map from ground truth camera poses and noisy depth, (d) map from Motion Maps camera poses and ground truth depth, (e) map from Motion Maps poses and noisy depth. (Mapping)

Fig. 7: Failure cases for semantic segmentation. First row: input image; second row: ground truth semantic labelling; third row: result of DeepLabV3+ RGB segmentation; fourth row: result of DeepLabV3+ RGB-D segmentation. In some cases adding depth information helps to resolve ambiguities, but overall the effect of using depth for semantic segmentation is not dramatic. (Semantic/panoptic segmentation)
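The trajectory-generation idea described above (sample keypoints in the free space of a home layout, then smooth them into a robot-like path) can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual generator: `sample_trajectory` is a hypothetical helper, the occupancy grid is a toy NumPy array, and simple iterative Laplacian smoothing stands in for whatever smoothing scheme the paper uses.

```python
import numpy as np

def sample_trajectory(occupancy, n_keypoints=30, smooth_iters=50, rng=None):
    """Sample waypoints in the free cells of a 2-D occupancy grid and
    smooth them into a trajectory. Illustrative sketch only: the paper's
    generator is not specified at this level of detail.

    occupancy: 2-D array, 0 = free cell, 1 = occupied cell.
    Returns an (n_keypoints, 2) array of (row, col) positions.
    """
    rng = np.random.default_rng(rng)
    free = np.argwhere(occupancy == 0)                 # coordinates of free cells
    idx = rng.choice(len(free), size=n_keypoints, replace=False)
    path = free[idx].astype(float)
    # Iterative Laplacian smoothing: pull each interior point toward the
    # midpoint of its neighbours, keeping the endpoints fixed. As in the
    # Fig. 3 caption, more keypoints yield a more curved trajectory.
    for _ in range(smooth_iters):
        path[1:-1] = 0.5 * path[1:-1] + 0.25 * (path[:-2] + path[2:])
    return path

# Toy layout: a 100x100 grid with an occupied wall on the left edge.
grid = np.zeros((100, 100), dtype=int)
grid[:, :10] = 1
traj = sample_trajectory(grid, n_keypoints=10, rng=0)
```

Because the keypoints are drawn only from free cells and the smoothing step forms convex combinations of them, the sketched path stays within the convex region spanned by its waypoints; a real generator would additionally need collision checks against walls and furniture.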