dc.description.abstract |
Person re-identification (re-ID) is the task of establishing correspondences between images of the same person captured at different places and times by different cameras. It is an essential task for visual surveillance systems. In this thesis, we propose a novel two-stream convolutional-recurrent model with attentive pooling. Each stream of the model is a Siamese network that learns different characteristics of the feature maps. To fully exploit the learned feature maps and capture shared features, we fuse the outputs of the two streams. Attentive pooling allows the model to select only the informative frames from the whole input video sequence and then learn spatial and temporal information from these selected frames.
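The frame-selective behaviour of attentive pooling can be sketched as follows. This is a minimal illustration in plain Python, not the thesis implementation; the function name, the dot-product scoring, and the softmax normalisation are assumptions about one common way such a layer is realised:

```python
import math

def attentive_pool(frame_features, score_weights):
    """Illustrative temporal attentive pooling over a video sequence.

    frame_features: T frame feature vectors, each of dimension D
    score_weights:  a D-dim vector scoring how informative a frame is
    Returns one D-dim video descriptor: a softmax-weighted average
    that emphasises informative frames and suppresses the rest.
    """
    # Score each frame with a dot product against the score vector.
    scores = [sum(f * w for f, w in zip(frame, score_weights))
              for frame in frame_features]
    # Softmax over time -> attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of the frame features.
    dim = len(frame_features[0])
    return [sum(attn[t] * frame_features[t][d] for t in range(len(attn)))
            for d in range(dim)]
```

With a strongly scored first frame, the pooled descriptor is dominated by that frame; with equal scores, pooling reduces to a plain temporal average.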
For the proposed model we used [1] as our base model, but we made substantial changes to this base code: (i) we extend it to a two-stream model; (ii) we add an attentive pooling layer, which until now has been used only for action recognition tasks; (iii) we add an extra dropout layer to the CNN base model; (iv) RGB frames and optical flow are treated as separate inputs so that spatial and temporal information are learned separately; (v) the two streams are fused into a single Siamese cost feature for person re-ID. Fusion is performed with a weighted function that gives more weight to the spatial features, because they are more discriminative than the temporal features.
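The weighted two-stream fusion described in (v) can be sketched as a convex combination of the per-stream features. The function name and the specific weight value below are illustrative assumptions; the abstract only states that the spatial stream receives the larger weight:

```python
def fuse_streams(spatial_feat, temporal_feat, alpha=0.7):
    """Illustrative weighted fusion of the two stream outputs.

    alpha is a hypothetical spatial-stream weight (> 0.5 so that the
    more discriminative appearance features dominate the motion
    features, as the thesis describes).
    """
    assert 0.5 < alpha <= 1.0  # spatial stream must dominate
    return [alpha * s + (1.0 - alpha) * t
            for s, t in zip(spatial_feat, temporal_feat)]
```

The fused vector then serves as the single feature on which the Siamese matching cost is computed.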
Experiments are performed on three publicly available person re-ID datasets: MARS, PRID-2011 and iLIDS-VID. Experimental results show that our proposed model is highly effective for feature extraction and outperforms existing state-of-the-art supervised models. Accuracy improves further when both RGB and optical flow are used as input rather than either of them alone. The proposed model achieves 14.6%, 14.0% and 16% higher rank-1 accuracy than the base model on iLIDS-VID, PRID-2011 and MARS, respectively. |
en_US |