Abstract:
Person re-identification (Re-ID) is a key component of smart visual surveillance. A person Re-ID system uses the raw images captured by CCTV cameras and relies on people's appearance rather than specialist biometric technology to re-identify a person across several non-overlapping cameras. Convolutional Neural Networks (CNNs) are the most common foundation upon which Re-ID systems are built, learning global and/or part-based human representations. However, a major limitation of CNNs is their restricted ability to learn associations between far-flung regions of a person's image, an ability that plays a significant role in the human visual system.
In order to create reliable person representations, this thesis investigates the significance of local salient regions of a person's image and the role of local part associations across multiple images of the same identity (person). First, a discriminative hybrid person representation is proposed that combines hand-crafted, mid-level, and deep person features, followed by the use of metric learning techniques to strengthen the integrated feature space.
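For illustration only (the thesis body specifies the exact feature extractors and metric), such a hybrid representation can be sketched as a concatenation of normalised hand-crafted, mid-level, and deep feature vectors compared under a Mahalanobis-style metric; the random vectors and the identity metric matrix M below are placeholders, not the thesis's actual components.

```python
import numpy as np

def hybrid_descriptor(handcrafted, midlevel, deep):
    """Concatenate L2-normalised hand-crafted, mid-level and deep features."""
    parts = [f / (np.linalg.norm(f) + 1e-12) for f in (handcrafted, midlevel, deep)]
    return np.concatenate(parts)

def metric_distance(x, y, M):
    """Distance in the integrated feature space under a learned metric M."""
    d = x - y
    return float(d @ M @ d)

# Toy usage: random vectors stand in for real feature extractors.
rng = np.random.default_rng(0)
probe = hybrid_descriptor(rng.normal(size=128), rng.normal(size=256), rng.normal(size=512))
gallery = hybrid_descriptor(rng.normal(size=128), rng.normal(size=256), rng.normal(size=512))
M = np.eye(probe.size)  # placeholder; in practice M is learned by metric learning
print(metric_distance(probe, gallery, M))
```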
After establishing the significance of local part associations through hand-crafted and hybrid person features, deep architectures are then explored in this context for improved performance and scalability. A novel Hierarchically Refined local Saliency Associations-based Re-ID Network (HRSAN) is proposed. Additionally, a complete person Re-ID pipeline, the Saliency-based Simultaneous Person Detection and Re-identification system (SSPDR), is developed, with the proposed HRSAN embedded as its Re-ID module. The SSPDR system is designed to work on the raw surveillance images captured by CCTV cameras instead of cropped person images.
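As a minimal sketch of a detect-then-re-identify pipeline on raw frames (the detector and Re-ID network interfaces here are hypothetical stand-ins, not the SSPDR implementation), each detected person crop is embedded and matched against a gallery by cosine similarity:

```python
import torch

def reid_on_raw_frame(frame, detector, reid_net, gallery_feats, gallery_ids):
    """Detect people in a raw CCTV frame, then match each crop against a
    gallery of known identities via cosine similarity (illustrative only)."""
    boxes = detector(frame)                         # assumed: list of (x1, y1, x2, y2)
    results = []
    for (x1, y1, x2, y2) in boxes:
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)  # C x H x W -> 1 x C x H x W
        feat = torch.nn.functional.normalize(reid_net(crop), dim=1)
        sims = feat @ gallery_feats.t()             # 1 x N similarities
        best = sims.argmax(dim=1).item()
        results.append(((x1, y1, x2, y2), gallery_ids[best], sims[0, best].item()))
    return results
```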
An inherent characteristic of CNNs is the local receptive field at a particular layer, which prevents that layer from having a full view of the self-associations among distant salient regions. However, a salient local part, considered in the context of other distant salient regions, can play a significant role in person re-identification. For example, a person wearing a white hat and red shoes can be easily identified if the underlying network learns this distinctive combination at each layer.
Therefore, in order to capture self-attention among distant salient regions of a person's image and enhance local part associations, specialized self-attention architectures, i.e., Transformers, are studied and extended. The proposed baseline architecture is further improved by mapping the self-attention and self-contextual information among local and far-flung regions of a person's image. The proposed system, Performer, attains a holistic perspective at the early network layers and captures global dependencies by associating salient local parts of an image.
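To illustrate the underlying mechanism (a generic patch-level self-attention layer, not the Performer architecture itself), the sketch below shows how every image patch can attend to every other patch in a single layer, so that distant salient regions such as a hat and shoes are associated directly; the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Minimal sketch: every patch token attends to every other patch,
    so distant salient regions can be linked in one layer, unlike a
    CNN layer restricted to its local receptive field."""
    def __init__(self, patch_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(patch_dim)

    def forward(self, patch_tokens):                 # (B, N_patches, patch_dim)
        attended, weights = self.attn(patch_tokens, patch_tokens, patch_tokens)
        return self.norm(patch_tokens + attended), weights  # residual + attention map

# Toy usage: 128 patch embeddings for one person image.
tokens = torch.randn(1, 128, 768)
out, attn_map = PatchSelfAttention()(tokens)         # attn_map links distant patches
```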
Finally, a novel and specialized self-attention architecture, TransPose, is developed for pose-invariant person representation. The proposed system introduces multiple streams to learn global and local dependencies. The multi-stream architecture jointly learns global and local patch-based person embeddings through a Global Self-attention Module (GSM) and a Local Self-attention Module (LSM). The LSM is enhanced by stochastically grouping the local patches and then establishing alignments among them. Pose and viewpoint variations are handled by introducing an Attention Feature Learning Module (AFLM) in the LSM. The AFLM drops the semantically aligned region from the whole set of feature maps across a batch, learns the remaining patch-wise attentive local features, and is supervised by the GSM to learn pose-invariant local features.
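The two-stream idea can be sketched roughly as follows: a global self-attention stream over all patch tokens and a local stream over stochastically grouped patches. The module names echo the abstract (GSM, LSM), but the layer choices, dimensions, and grouping scheme below are illustrative assumptions, not the TransPose design.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Illustrative global + local self-attention sketch (GSM / LSM idea);
    not the thesis's exact TransPose architecture."""
    def __init__(self, dim=768, heads=8, groups=4):
        super().__init__()
        self.gsm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)  # global stream
        self.lsm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)  # local stream
        self.groups = groups

    def forward(self, tokens):                       # (B, N_patches, dim)
        global_feat = self.gsm(tokens).mean(dim=1)   # global person embedding
        # Stochastically group patches and run local attention within each group.
        perm = torch.randperm(tokens.size(1))
        local_feats = [self.lsm(tokens[:, chunk]).mean(dim=1) for chunk in perm.chunk(self.groups)]
        local_feat = torch.stack(local_feats, dim=1).mean(dim=1)
        return global_feat, local_feat

# Toy usage with random patch embeddings.
g, l = TwoStreamReID()(torch.randn(2, 128, 768))
```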
The proposed methods are evaluated on various person Re-ID datasets. The initial work is evaluated on small to medium scale Re-ID datasets, i.e., VIPeR, PRID450s, GRID and CUHK01, while the latest research is evaluated on large-scale Re-ID datasets, i.e., Market1501, DukeMTMC-ReID and MSMT-17.
The proposed self-attention-based Re-ID architecture (Performer) consistently outperforms the existing CNN-based counterparts on various Re-ID benchmarks: the Re-ID accuracies are improved by 5.5%, 4.6% and 17% on the Market1501, DukeMTMC-ReID and MSMT-17 datasets, respectively. Similarly, the proposed specialized self-attention architecture (TransPose) outperforms recent state-of-the-art customized vision transformer based Re-ID solutions as well. In future work, the potential of transformers will be explored for cross-dataset person Re-ID.