Software systems and computational methods
Reference:
Zakharov, A.A. (2024). A method for detecting objects in images based on neural networks on graphs and a small number of training examples. Software systems and computational methods, 4, 66–75. https://doi.org/10.7256/2454-0714.2024.4.72558
A method for detecting objects in images based on neural networks on graphs and a small number of training examples
DOI: 10.7256/2454-0714.2024.4.72558
EDN: UTTFCH
Received: 03-12-2024
Published: 11-12-2024

Abstract: The object of this research is computer vision systems. The subject of the study is a method for detecting objects in images based on graph neural networks and a small number of training examples. The use of a structural representation of the scene to improve object detection accuracy is discussed in detail. It is proposed to combine information about the scene structure, obtained with graph neural networks, with few-shot learning to increase detection accuracy. Relationships between classes are established using external semantic links; for this, a knowledge graph is created in advance. The method contains two stages. At the first stage, objects are detected using few-shot training. At the second stage, the detection accuracy is refined with a graph neural network. The developed method is based on convolution derived from spectral graph theory. Each vertex of the knowledge graph represents a category, and the edge weights are computed from conditional probabilities. The convolution combines information from neighboring vertices and edges to update the vertex values. The scientific novelty of the method lies in the joint use of graph convolutional networks and few-shot learning to increase object detection accuracy. The author's particular contribution is the use of a convolutional network over a knowledge graph to improve the results of few-shot object detection. The method was evaluated on standard computer vision benchmarks: on the PASCAL VOC and MS COCO datasets, the proposed method is shown to increase object detection accuracy by analyzing structural relationships. The mean average precision of the developed method increases by 1-5% compared to few-shot training without a structural representation.

Keywords: computer vision, object detection, convolutional networks, small data set, deep learning, limited annotation, graph, pattern recognition, artificial intelligence, structural representation of scenes

Introduction
Object detection is an important computer vision task: it consists of finding objects of interest in an input image and then classifying each of them. The most significant characteristics of detection results are the accuracy of localization and classification, as well as the detection speed. Object detection serves as the basis for many other areas: autonomous navigation, human-machine interfaces, process control, remote sensing of the Earth, medical diagnostics, biometrics, video surveillance, etc. In recent years, the rapid development of deep learning techniques has significantly advanced object detection, and a large number of neural-network-based detection methods have been developed. Object detection methods based on deep learning are usually divided into one-stage and two-stage [1]. One-stage methods include YOLO [2], SSD [3], RetinaNet [4], and others; two-stage methods include Fast R-CNN [5], Faster R-CNN [6], Mask R-CNN [7], and others. It should be noted that deep-learning-based detection methods face the following critical problems:

- The variety of real-world scenes. Many object detection methods work quite well in the laboratory: uniform lighting, a uniform static background, no camera movement, and so on. In real-world conditions, however, detection accuracy drops significantly because of complex textured backgrounds, low image contrast, extraneous moving objects, mutual occlusions, shadowed areas, and noise.

- The need to annotate large image sets manually. A key component of the deep learning revolution has been the availability of large annotated datasets. Although most computer vision datasets are crowdsourced, annotation remains expensive and time-consuming, which becomes a bottleneck when deploying deep learning systems. Many existing deep-learning detection methods show good results in the laboratory using large datasets, but they are difficult to deploy in real-world conditions where large annotated datasets cannot be created.

- The need to detect objects from categories with very few training instances. It is a common problem to detect objects of categories that have no instances in the training set, or whose number is limited. If the number of training examples is too small compared to all possible variations, the problem of learning from a small sample arises.

Thus, one of the main obstacles to deploying deep-learning-based detection methods is the need for a large amount of annotated data, which is not always feasible for economic and technical reasons. In recent years, object detection methods based on few-shot learning, which attempt to solve this problem using a small number of examples, have been actively developed [8]. The data for a new class is often limited to a few training examples. In this case, the model is pre-trained on a large-scale dataset from another domain, and the few available examples serve to adapt the learned representation to the target domain [9]. Significantly less labeled data is thus required to train the model. Known few-shot object detection methods include Multi-Scale Positive Sample Refinement for Few-Shot Object Detection (MPSR) [10], Frustratingly Simple Few-Shot Object Detection (TFA) [11], Few-Shot Object Detection via Feature Reweighting (MetaYOLO) [12], and others. However, the accuracy of detection methods trained on small amounts of data remains low. This work proposes using a structural representation of the scene to increase detection accuracy. Graphs will be used to describe this structure: graph-based features make it possible to evaluate the relationships between image elements.

Graphs are used in various areas of computer vision: image segmentation [13], detection of salient regions [14], clustering [15], and others. Graphs make it possible to analyze the structural relationships between scene objects, which yields more information than local data analysis. In recent years, research on graphs has progressed rapidly thanks to the availability of large datasets, powerful computing resources, and advances in machine learning and artificial intelligence [16]. Deep learning methods can efficiently encode and represent graph data as vectors, which can then be used in various downstream tasks. A graph neural network (GNN) is a deep learning architecture designed specifically for data described by graphs [17-20]. Unlike traditional deep learning algorithms, which were primarily developed for text and images, GNNs are designed for processing and analyzing structured datasets. The aim of this study is to develop a method that improves object detection accuracy. It is assumed that a structural representation will increase the accuracy of object detection based on a small number of training examples. The scientific novelty of the developed method lies in the joint use of graph neural networks and few-shot learning to increase object detection accuracy.
Development of a method for detecting objects in images based on neural networks on graphs and a small number of training examples
The developed method includes two stages. Stage 1: object detection using few-shot training [12]. Stage 2: refining the detection accuracy using a graph neural network. The block diagram of the method is shown in Figure 1.

Fig. 1. Block diagram of the method for detecting objects in images based on graph neural networks and a small number of training examples
At stage 1, objects are detected using few-shot training. From the detected bounding boxes, a probability matrix Y1 ∈ ℝ^(B×C) is constructed, where B is the number of detected bounding boxes, C is the number of classes, and Y1,bc is the probability of class c for bounding box b. At stage 2, the detection accuracy is refined using a graph neural network. Relationships between objects are defined using a knowledge graph, i.e., a semantic network that stores information about various classes of objects and the relationships between them. The knowledge graph describing the relationships between object classes is shown in Fig. 2.

Fig. 2. Knowledge graph describing the relationships between classes of objects
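The stage 1 output described above, a B×C matrix of class probabilities assembled from the detector's boxes, can be sketched as follows (a minimal illustration: the scores are random stand-ins, not real detector output):

```python
import numpy as np

# Sketch of the stage 1 output: a probability matrix Y1 of shape (B, C),
# where B is the number of detected boxes and C the number of classes.
# The scores below are random stand-ins for real detector confidences.
B, C = 3, 4
rng = np.random.default_rng(0)
scores = rng.random((B, C))
Y1 = scores / scores.sum(axis=1, keepdims=True)  # each row sums to 1

# Y1[b, c] = probability that bounding box b contains class c
print(Y1.shape)  # (3, 4)
```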
It is assumed that the joint presence in an image of objects such as a person and a bicycle, a person and a motorcycle, or a person and a boat will help to increase detection accuracy through the existing relationships (Fig. 3).
Fig. 3. The joint presence of objects of different classes in the images
In the knowledge graph, each vertex represents a class, and the weight of the edge from vertex V1 to vertex V2 is the conditional probability P(V2 | V1) [21]. For example, if a person and a bicycle appear together in the dataset 10 times, and a person appears in the dataset 20 times in total, then the edge from the "person" vertex to the "bicycle" vertex has the weight P(bicycle | person) = 10/20 = 0.5. The knowledge graph is described by the adjacency matrix A ∈ ℝ^(C×C), where C is the number of classes represented by the graph. The following vector is fed to the input of the graph convolutional network:

G1,c = max_{b=1,...,B} Y1,bc, c = 1, 2, ..., C,

where G1,c is the maximum probability of class c among all detected bounding boxes. In the node classification task, the GNN uses this information to create a vector representation of each node in the graph. This representation includes not only the initial attributes of a vertex but also information about the connections between vertices: instead of being limited to the source attributes, the GNN aggregates attributes from neighboring vertices and edges into the properties of the source vertex, which makes the representation much more complete and meaningful. The new vertex representations are then used to perform specific tasks such as vertex classification, regression, or link prediction. In particular, the GNN defines a graph convolution operation that combines information from neighboring vertices and edges to update the representations. This operation is performed iteratively, which allows the model to learn more complex relationships between vertices as the number of iterations increases (Fig. 4). In this work, the network contains four convolutional layers.
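The edge weights and the GNN input vector described above can be sketched from co-occurrence counts (the class names and counts below are toy values, not the paper's statistics):

```python
import numpy as np

# Toy co-occurrence statistics (illustrative only):
# counts[i][j] = number of images where classes i and j appear together;
# counts[i][i] = total number of images containing class i.
classes = ["person", "bicycle", "motorcycle"]
counts = np.array([
    [20.0, 10.0, 4.0],
    [10.0, 12.0, 2.0],
    [ 4.0,  2.0, 6.0],
])

# Edge weight from vertex i to vertex j: conditional probability P(j | i).
A = counts / np.diag(counts)[:, None]
np.fill_diagonal(A, 0.0)
print(A[0, 1])  # P(bicycle | person) = 10 / 20 = 0.5

# GNN input: G1[c] = max over all detected boxes of the class-c probability.
Y1 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3]])
G1 = Y1.max(axis=0)
print(G1)  # [0.7 0.6 0.3]
```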
Fig. 4. Classification of graph nodes based on a convolutional network
A graph convolutional network is described by the following layer-to-layer propagation rule [17]:

H^(l+1) = ReLU((aD^(-1)A + I)H^(l)W^(l)) + B^(l),

where I is the identity matrix, A is the adjacency matrix, H^(l) and H^(l+1) are the activation matrices of layers l and l+1, D is the degree matrix of the graph, W^(l) is the weight matrix, B^(l) is a term controlling the mean of the output signal, and a is a configurable parameter that determines the influence of related nodes (a = 0.5 by default). The result of the graph neural network is a matrix of corrective weights:

G2 = GNN(G1).

The final result is the element-wise product of the original matrix Y1 and the matrix of corrective weights G2:

Y2 = Y1 ⊙ G2.

Investigation of the method

Images from the PASCAL VOC dataset were used for the study. K instances were selected for each category: K = 1, 3, 5, 10. The mean average precision was calculated:

mAP = (1/n) SUM_{k=1...n} AP_k,

where n is the number of classes and AP_k is the average precision of class k. The experimental results are summarized in Table 1.
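The propagation rule and the element-wise correction above can be sketched as follows (a minimal numpy sketch under the stated reading of the formula; the matrices are toy values, and a single feature column and a single layer stand in for the full four-layer network):

```python
import numpy as np

def gcn_layer(H, A, W, b, alpha=0.5):
    """One propagation step: H_next = ReLU((alpha * D^-1 A + I) H W) + b.
    D is the degree matrix of the graph with adjacency matrix A."""
    D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1e-12))
    P = alpha * D_inv @ A + np.eye(A.shape[0])
    return np.maximum(P @ H @ W, 0.0) + b

# Toy 3-class knowledge graph and detector confidences (illustrative only)
A = np.array([[0.0, 0.5, 0.2],
              [0.8, 0.0, 0.1],
              [0.6, 0.3, 0.0]])
G1 = np.array([[0.7], [0.6], [0.3]])  # max class probabilities, one feature each
W = np.eye(1)                          # identity weights for the sketch

G2 = gcn_layer(G1, A, W, b=0.0)        # corrective weights from the graph
Y1 = np.array([[0.7, 0.1, 0.2]])       # one detection's class probabilities
Y2 = Y1 * G2.ravel()                   # element-wise product Y2 = Y1 (.) G2
```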
Table 1. Calculation of the mAP indicator for different number of instances
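The mAP values compared in Table 1 are the mean of the per-class average precisions, as in the formula above (the AP values below are illustrative, not the paper's results):

```python
import numpy as np

def mean_average_precision(ap_per_class):
    """mAP = (1/n) * sum of AP_k over the n classes."""
    return float(np.mean(ap_per_class))

# Illustrative per-class AP values (not the paper's measurements)
print(mean_average_precision([0.50, 0.62, 0.44]))  # approximately 0.52
```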
During training of the graph neural network, the loss was computed. The network was trained for 500 epochs (Fig. 5).
Fig. 5. Graph of the loss function during neural network training on graphs
The ADAM algorithm was used for optimization; ADAM adapts the learning rate for each parameter individually. A GeForce RTX 3060 GPU was used to train the model. Thus, the developed method increases the mean average precision by 1-5% through the analysis of structural relationships between objects.
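The per-parameter step-size adaptation mentioned above can be sketched with a single Adam update (a minimal sketch of the standard algorithm, not the paper's training code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first and second moment estimates
    give each parameter its own effective learning rate."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters with very different gradient magnitudes take steps of
# (almost) the same size, because each step is normalized per parameter.
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
theta, m, v = adam_step(theta, np.array([0.1, -3.0]), m, v, t=1)
print(theta)  # approximately [-0.001, 0.001]
```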
Conclusion

The article proposes a method that partially compensates for the shortcomings of few-shot training in object detection. The method is based on using a graph neural network to describe the structure of the analyzed scene. Experiments have shown the effectiveness of the developed method: the average precision of object detection increases by up to five percent. It was also shown that the proposed method achieves even greater accuracy as the amount of training data grows.

References
1. Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023). Object Detection in 20 Years: A Survey. Proceedings of the IEEE, 111(3), 257-276. doi:10.1109/JPROC.2023.3238524
2. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition, 779-788. doi:10.1109/CVPR.2016.91
3. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., & Berg, A.C. (2016). SSD: Single shot multibox detector. European Conference on Computer Vision, 21-37. doi:10.1007/978-3-319-46448-0_2
4. Lin, T.Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2018). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318-327. doi:10.1109/ICCV.2017.324
5. Girshick, R. (2015). Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), 1440-1448. doi:10.1109/ICCV.2015.169
6. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 91-99. doi:10.1109/TPAMI.2016.2577031
7. He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask R-CNN. IEEE International Conference on Computer Vision, 2961-2969. doi:10.1109/ICCV.2017.322
8. Köhler, M., Eisenbach, M., & Gross, H.M. (2024). Few-Shot Object Detection: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 35(9), 11958-11978. doi:10.48550/arXiv.2112.11699
9. Huang, G., Laradji, I., Vazquez, D., Lacoste-Julien, S., & Rodriguez, P. (2023). A Survey of Self-Supervised and Few-Shot Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4071-4089. doi:10.1109/TPAMI.2022.3199617
10. Wu, J., Liu, S., Huang, D., & Wang, Y. (2020). Multi-scale positive sample refinement for few-shot object detection. European Conference on Computer Vision, 456-472. doi:10.1007/978-3-030-58517-4_27
11. Wang, X., Huang, T.E., Gonzalez, J., Darrell, T., & Yu, F. (2020). Frustratingly simple few-shot object detection. Proceedings of the 37th International Conference on Machine Learning (ICML), 9919-9928. doi:10.48550/arXiv.2003.06957
12. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., & Darrell, T. (2019). Few-shot object detection via feature reweighting. IEEE/CVF International Conference on Computer Vision. doi:10.1109/ICCV.2019.00851
13. Zakharov, A.A., & Tuzhilkin, A.Y. (2018). Segmentation of satellite images based on superpixels and graph cuts. Software systems and computational methods, 1, 7-17. doi:10.7256/2454-0714.2018.1.25629 Retrieved from http://en.e-notabene.ru/itmag/article_25629.html
14. Zakharov, A.A., Titov, D.V., Zhiznyakov, A.L., & Titov, V.S. (2020). Visual attention method based on vertex ranking of graphs by heterogeneous image attributes. Computer Optics, 44(3), 427-435. doi:10.18287/2412-6179-CO-658
15. Barinov, A.E., & Zakharov, A.A. (2015). Clustering using a random walk on graph for head pose estimation. International Conference on Mechanical Engineering, Automation and Control Systems (MEACS 2015). doi:10.1109/MEACS.2015.7414876
16. Cao, P., Zhu, Z., Wang, Z., Zhu, Y., & Niu, Q. (2022). Applications of graph convolutional networks in computer vision. Neural Computing and Applications, 34, 13387-13405. doi:10.1007/s00521-022-07368-1
17. Kipf, T.N. (2020). Deep Learning with Graph-Structured Representations. Retrieved from https://pure.uva.nl/ws/files/46900201/Thesis.pdf
18. Li, W., Liu, X., & Yuan, Y. (2023). SIGMA++: Improved Semantic-Complete Graph Matching for Domain Adaptive Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 9022-9040. doi:10.1109/TPAMI.2023.3235367
19. Chen, C., Li, J., Zhou, H.Y., Han, X., Huang, Y., Ding, X., & Yu, Y. (2023). Relation matters: Foreground-aware graph-based relational reasoning for domain adaptive object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3677-3694. doi:10.48550/arXiv.2206.02355
20. Chen, T., Lin, L., Chen, R., Hui, X., & Wu, H. (2022). Knowledge-Guided Multi-Label Few-Shot Learning for General Image Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1371-1384. doi:10.1109/TPAMI.2020.3025814
21. Liu, Z., Jiang, Z., Feng, W., & Feng, H. (2020). OD-GCN: Object Detection Boosted by Knowledge GCN. IEEE International Conference on Multimedia & Expo Workshops (ICMEW). doi:10.1109/ICMEW46912.2020.9105952