Abstract:
Timely diagnosis and identification of apple leaf diseases are imperative for preventing
the spread of diseases and ensuring the sound development of the apple industry.
Convolutional Neural Networks (CNNs) have achieved phenomenal success in the area
of leaf disease detection, which can greatly benefit the agriculture industry. However,
their large model size and intricate design continue to pose a challenge when it comes
to deploying these models on lightweight devices. Although several successful models
(e.g. EfficientNet and MobileNet) have been designed to adapt to resource-constrained
devices, they have not achieved strong results on the leaf disease
detection task, leaving a performance gap. This gap has motivated
us to develop an apple leaf disease detection model that can be deployed on
lightweight devices and can also outperform existing models. In this work, we propose
AppViT, a hybrid vision model that combines convolution and multi-head
self-attention to compete with the best-performing models. Specifically, we begin
by introducing convolution blocks that reduce the size of the feature maps
and help the model encode local features progressively. We then stack ViT blocks
in combination with convolution blocks, allowing the network to capture non-local dependencies
and spatial patterns. Equipped with these designs and a hierarchical structure,
AppViT demonstrates excellent performance on the apple leaf disease detection task.
Specifically, it achieves 96.38% accuracy on Plant Pathology 2021 - FGVC8 with about
1.3 million parameters, which is 11.3% and 4.3% more accurate than ResNet-50 and
EfficientNet-B3, respectively. The precision, recall, and F-score of our proposed model on apple leaf
disease detection and classification are 0.967, 0.959, and 0.963, respectively.
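
The three reported metrics can be cross-checked against one another. The sketch below assumes that the F-score is the F1 score (beta = 1) and that it is computed from the aggregate precision and recall; the abstract does not state the averaging scheme, so both points are assumptions, and the function name `f_score` is illustrative rather than taken from the authors' code.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, the weighted harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both inputs are zero, so the score is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Values reported in the abstract.
reported_precision = 0.967
reported_recall = 0.959

# Assumption: the reported F-score is F1 computed from the two values above.
f1 = f_score(reported_precision, reported_recall)
print(round(f1, 3))
```

Under these assumptions the result rounds to 0.963, which agrees with the reported F-score. If the metrics were instead averaged per class, the F-score would not in general be recoverable from the averaged precision and recall in this way.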