Abstract:
Person re-identification (Re-ID) is a key component of smart visual surveillance. A person Re-ID system uses the raw images captured by CCTV cameras and relies on people's appearance rather than specialist biometric technology to re-identify a person across several non-overlapping cameras. Convolutional Neural Networks (CNNs) are the most common foundation upon which Re-ID systems are built, learning global and/or part-based human representations. However, a major limitation of CNNs is their restricted ability to learn associations between far-flung regions of a person's image, an ability that plays a significant role in the human visual system.
In order to create reliable person representations, this thesis investigates the significance of local salient regions of a person's image and the role of local part associations across multiple images of the same identity (person). First, a discriminative hybrid person representation is proposed that combines hand-crafted, mid-level, and deep person features, followed by the use of metric learning techniques to strengthen the integrated feature space.
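For illustration only (the thesis body specifies the exact feature extractors and metric), such a hybrid representation can be sketched as a concatenation of normalised hand-crafted, mid-level, and deep feature vectors compared under a Mahalanobis-style metric; the random vectors and the identity metric matrix M below are placeholders, not the thesis's actual components.

```python
import numpy as np

def hybrid_descriptor(handcrafted, midlevel, deep):
    """Concatenate L2-normalised hand-crafted, mid-level and deep features."""
    parts = [f / (np.linalg.norm(f) + 1e-12) for f in (handcrafted, midlevel, deep)]
    return np.concatenate(parts)

def metric_distance(x, y, M):
    """Distance in the integrated feature space under a learned metric M."""
    d = x - y
    return float(d @ M @ d)

# Toy usage: random vectors stand in for real feature extractors.
rng = np.random.default_rng(0)
probe = hybrid_descriptor(rng.normal(size=128), rng.normal(size=256), rng.normal(size=512))
gallery = hybrid_descriptor(rng.normal(size=128), rng.normal(size=256), rng.normal(size=512))
M = np.eye(probe.size)  # placeholder; in practice M is learned by metric learning
print(metric_distance(probe, gallery, M))
```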
After establishing the significance of local part associations through hand-crafted and hybrid person features, deep architectures are then explored in this context for improved performance and scalability. A novel Hierarchically Refined local Saliency Associations-based Re-ID Network (HRSAN) is proposed. Additionally, a complete person Re-ID pipeline, the Saliency-based Simultaneous Person Detection and Re-identification system (SSPDR), is developed, with the proposed HRSAN embedded as its Re-ID module. The SSPDR system is designed to work on the raw surveillance images captured by CCTV cameras instead of cropped person images.
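As a minimal sketch of a detect-then-re-identify pipeline on raw frames (the detector and Re-ID network interfaces here are hypothetical stand-ins, not the SSPDR implementation), each detected person crop is embedded and matched against a gallery by cosine similarity:

```python
import torch

def reid_on_raw_frame(frame, detector, reid_net, gallery_feats, gallery_ids):
    """Detect people in a raw CCTV frame, then match each crop against a
    gallery of known identities via cosine similarity (illustrative only)."""
    boxes = detector(frame)                         # assumed: list of (x1, y1, x2, y2)
    results = []
    for (x1, y1, x2, y2) in boxes:
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)  # C x H x W -> 1 x C x H x W
        feat = torch.nn.functional.normalize(reid_net(crop), dim=1)
        sims = feat @ gallery_feats.t()             # 1 x N similarities
        best = sims.argmax(dim=1).item()
        results.append(((x1, y1, x2, y2), gallery_ids[best], sims[0, best].item()))
    return results
```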
An inherent characteristic of CNNs is the local receptive field at a particular layer, which prevents that layer from having a full view of the self-associations among distant salient regions. However, a salient local part, considered in the context of other distant salient regions, can play a significant role in person re-identification. For example, a person wearing a white hat and red shoes can be easily identified if the underlying network learns this distinctive combination at each layer.
Therefore, in order to capture self-attention among distant salient regions of a person's image and enhance local part associations, specialized self-attention architectures, i.e., Transformers, are studied and extended. The proposed baseline architecture is further improved by mapping the self-attention and self-contextual information among local and far-flung regions of a person's image. The proposed system, Performer, attains a holistic perspective at the early network layers and captures global dependencies by associating salient local parts of an image.
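To illustrate the underlying mechanism (a generic patch-level self-attention layer, not the Performer architecture itself), the sketch below shows how every image patch can attend to every other patch in a single layer, so that distant salient regions such as a hat and shoes are associated directly; the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Minimal sketch: every patch token attends to every other patch,
    so distant salient regions can be linked in one layer, unlike a
    CNN layer restricted to its local receptive field."""
    def __init__(self, patch_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(patch_dim)

    def forward(self, patch_tokens):                 # (B, N_patches, patch_dim)
        attended, weights = self.attn(patch_tokens, patch_tokens, patch_tokens)
        return self.norm(patch_tokens + attended), weights  # residual + attention map

# Toy usage: 128 patch embeddings for one person image.
tokens = torch.randn(1, 128, 768)
out, attn_map = PatchSelfAttention()(tokens)         # attn_map links distant patches
```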
Finally, a novel and specialized self-attention architecture, TransPose, is developed for pose-invariant person representation. The proposed system introduces multiple streams to learn global and local dependencies. The multi-stream architecture jointly learns global and local patch-based person embeddings through a Global Self-attention Module (GSM) and a Local Self-attention Module (LSM). The LSM is enhanced by stochastically grouping the local patches and then establishing alignments among them. Pose and viewpoint variations are handled by introducing an Attention Feature Learning Module (AFLM) in the LSM. The AFLM drops the semantically aligned region from the whole set of feature maps across a batch, learns the remaining patch-wise attentive local features, and is supervised by the GSM to learn pose-invariant local features.
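The two-stream idea can be sketched roughly as follows: a global self-attention stream over all patch tokens and a local stream over stochastically grouped patches. The module names echo the abstract (GSM, LSM), but the layer choices, dimensions, and grouping scheme below are illustrative assumptions, not the TransPose design.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Illustrative global + local self-attention sketch (GSM / LSM idea);
    not the thesis's exact TransPose architecture."""
    def __init__(self, dim=768, heads=8, groups=4):
        super().__init__()
        self.gsm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)  # global stream
        self.lsm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)  # local stream
        self.groups = groups

    def forward(self, tokens):                       # (B, N_patches, dim)
        global_feat = self.gsm(tokens).mean(dim=1)   # global person embedding
        # Stochastically group patches and run local attention within each group.
        perm = torch.randperm(tokens.size(1))
        local_feats = [self.lsm(tokens[:, chunk]).mean(dim=1) for chunk in perm.chunk(self.groups)]
        local_feat = torch.stack(local_feats, dim=1).mean(dim=1)
        return global_feat, local_feat

# Toy usage with random patch embeddings.
g, l = TwoStreamReID()(torch.randn(2, 128, 768))
```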
The proposed methods are evaluated on various person Re-ID datasets. The initial work is evaluated on small to medium scale Re-ID datasets, i.e., VIPeR, PRID450s, GRID and CUHK01, while the latest research is evaluated on large-scale Re-ID datasets, i.e., Market1501, DukeMTMC-ReID and MSMT-17.
The proposed self-attention-based Re-ID architecture (Performer) consistently outperforms the existing CNN-based counterparts on various Re-ID benchmarks: the Re-ID accuracies are improved by 5.5%, 4.6% and 17% on the Market1501, DukeMTMC-ReID and MSMT-17 datasets, respectively. Similarly, the proposed specialized self-attention architecture (TransPose) outperforms recent state-of-the-art customized vision transformer based Re-ID solutions as well. In future work, the potential of transformers will be explored for cross-dataset person Re-ID.