Abstract: [Objectives] Waterbird monitoring plays a crucial role in understanding population dynamics and guiding conservation efforts, but it has traditionally been a time-consuming process. This study integrates unmanned aerial vehicle (UAV) remote sensing with convolutional neural networks (CNNs) to achieve rapid and accurate estimation of waterbird populations. [Methods] We used a DJI Mavic 2 Zoom UAV to capture high-resolution remote sensing images in the West Dongting Lake National Nature Reserve, Hunan. The UAV was flown at an altitude of 75 m with its camera pointing vertically downward, yielding images with a ground resolution of 1.2 cm/pixel; Table 1 lists the waterbirds captured in the images. We selected 503 images to construct a dataset comprising two categories, Anas crecca/A. falcata and Cygnus columbianus, with 3 778 and 395 samples respectively. The dataset comprises several training sets of different sizes (Table 2) and a validation set of 3 032 samples. For each training set, we independently trained Mask R-CNN and YOLOv3 models and evaluated their performance on the validation set. Evaluation metrics were average precision, recall, precision, and F1-score. [Results] For A. crecca/A. falcata, the Mask R-CNN model achieved a recall of 93.00% and a precision of 90.83% (Table 4, Fig. 4), while the YOLOv3 model achieved a recall of 93.00% and a precision of 88.79% (Table 5, Fig. 5). Once the training set contained 178 individuals of A. crecca/A. falcata, adding more training samples did not significantly improve the performance of either model. For C. columbianus, the performance of both models improved as the training set grew. The Mask R-CNN model achieved a recall of 84.00% and a precision of 84.38% (Table 6, Fig. 6), while the YOLOv3 model achieved a recall of 90.00% and a precision of 81.69% (Table 7, Fig. 7). 
The Mask R-CNN model processed approximately 12 images/s, while the YOLOv3 model processed 20–30 images/s. [Conclusion] This study proposes a potential solution for efficient and accurate waterbird population monitoring in natural habitats. Both models identified A. crecca/A. falcata with high accuracy, and the difference in recognition accuracy between Mask R-CNN and YOLOv3 was minimal. By integrating UAV remote sensing with CNNs, our approach shows the potential to train efficient and accurate waterbird identification models with minimal annotated data; our results suggest that fewer than 250 individuals per waterbird species may suffice.
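
The abstract reports recall and precision for each model and category but not the corresponding F1-scores. As an illustrative sketch only, the snippet below derives F1 from the reported figures, assuming the standard definition of F1 as the harmonic mean of precision and recall. The function name `f1_score` and the dictionary layout are hypothetical conveniences, not part of the study's code, and the derived values may differ slightly from those in Tables 4 to 7 because the inputs are rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the abstract, converted from percent to fractions.
reported = {
    ("Mask R-CNN", "A. crecca/A. falcata"): (0.9083, 0.9300),
    ("YOLOv3", "A. crecca/A. falcata"): (0.8879, 0.9300),
    ("Mask R-CNN", "C. columbianus"): (0.8438, 0.8400),
    ("YOLOv3", "C. columbianus"): (0.8169, 0.9000),
}

# Derived F1 per (model, category); illustrative, not the paper's tabulated values.
f1_by_model = {key: f1_score(p, r) for key, (p, r) in reported.items()}

for (model, category), f1 in f1_by_model.items():
    print(f"{model:10s} {category:22s} F1 = {f1 * 100:.2f}%")
```

Because F1 weights precision and recall equally, a model with higher recall but lower precision (as YOLOv3 shows for C. columbianus) can still score above a more balanced competitor, which is worth keeping in mind when comparing the two detectors.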