[1] ZHANG L, WANG L, LIN W. Generalized biased discriminant analysis for content-based image retrieval[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2011, 42(1): 282-290.
[2] SHEN F M, SHEN C H, LIU W, et al. Supervised discrete Hashing[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 37-45.
[3] VOORHEES E M. Using WordNet to disambiguate word senses for text retrieval[C]//Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, Jun 27-Jul 1, 1993. New York: ACM, 1993: 171-180.
[4] ZHENG L, YANG Y, TIAN Q. SIFT meets CNN: a decade survey of instance retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(5): 1224-1244.
[5] HU W, XIE N, LI L, et al. A survey on visual content-based video indexing and retrieval[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2011, 41(6): 797-819.
[6] ZHU L, TIAN X M, CAO S N, et al. Subspace cross-modal retrieval based on high-order semantic correlation[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 84-91.
[7] LI D, DIMITROVA N, LI M, et al. Multimedia content processing through cross-modal association[C]//Proceedings of the 11th ACM International Conference on Multimedia, Berkeley, Nov 2-8, 2003. New York: ACM, 2003: 604-611.
[8] THOMPSON B. Canonical correlation analysis[M]//EVERITT B S, HOWELL D C. Encyclopedia of Statistics in Behavioral Science. Hoboken: John Wiley & Sons, 2005.
[9] ROSIPAL R, KRÄMER N. Overview and recent advances in partial least squares[C]//LNCS 3940: Proceedings of the International Statistical and Optimization Perspectives Workshop on Subspace, Latent Structure and Feature Selection, Bohinj, Feb 23-25, 2005. Berlin, Heidelberg: Springer, 2005: 34-51.
[10] ZHANG H, LIU Y, MA Z. Fusing inherent and external knowledge with nonlinear learning for cross-media retrieval[J]. Neurocomputing, 2013, 119: 10-16.
[11] ANDREW G, ARORA R, BILMES J, et al. Deep canonical correlation analysis[C]//Proceedings of the 30th International Conference on Machine Learning, Atlanta, Jun 16-21, 2013: 1247-1255.
[12] JIA Y, SALZMANN M, DARRELL T. Learning cross-modality similarity for multinomial data[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 2407-2414.
[13] RASIWASIA N, PEREIRA J C, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Oct 25-29, 2010. New York: ACM, 2010: 251-260.
[14] RASIWASIA N, MAHAJAN D, MAHADEVAN V, et al. Cluster canonical correlation analysis[C]//Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Apr 22-25, 2014: 823-831.
[15] RANJAN V, RASIWASIA N, JAWAHAR C V. Multi-label cross-modal retrieval[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Piscataway: IEEE, 2015: 4094-4102.
[16] GONG Y, KE Q, ISARD M, et al. A multi-view embedding space for modeling internet images, tags, and their semantics[J]. International Journal of Computer Vision, 2014, 106(2): 210-233.
[17] ZHANG L, MA B, LI G, et al. Generalized semi-supervised and structured subspace learning for cross-modal retrieval[J]. IEEE Transactions on Multimedia, 2017, 20(1): 128-141.
[18] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model[C]//Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 5-8, 2013. Red Hook: Curran Associates, 2013: 2121-2129.
[19] KIROS R, SALAKHUTDINOV R, ZEMEL R. Multimodal neural language models[C]//Proceedings of the 31st International Conference on Machine Learning, Beijing, Jun 21-26, 2014: 595-603.
[20] NGIAM J, KHOSLA A, KIM M, et al. Multimodal deep learning[C]//Proceedings of the 28th International Conference on Machine Learning, Bellevue, Jun 28-Jul 2, 2011. Madison: Omnipress, 2011: 689-696.
[21] SRIVASTAVA N, SALAKHUTDINOV R. Multimodal learning with deep Boltzmann machines[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 2231-2239.
[22] FENG F, WANG X, LI R. Cross-modal retrieval with correspondence autoencoder[C]//Proceedings of the 2014 ACM International Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 7-16.
[23] ZHANG H, YANG Y, LUAN H, et al. Start from scratch: towards automatically identifying, modeling, and naming visual attributes[C]//Proceedings of the 2014 ACM International Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 187-196.
[24] HU P, ZHEN L, PENG D, et al. Scalable deep multimodal learning for cross-modal retrieval[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, Jul 21-25, 2019. New York: ACM, 2019: 635-644.
[25] WANG C, YANG H J, MEINEL C. Deep semantic mapping for cross-modal retrieval[C]//Proceedings of the 27th IEEE International Conference on Tools with Artificial Intelligence, Vietri sul Mare, Nov 9-11, 2015. Washington: IEEE Computer Society, 2015: 234-241.
[26] YAN F, MIKOLAJCZYK K. Deep correlation for matching images and text[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 3441-3450.
[27] WEI Y, ZHAO Y, LU C, et al. Cross-modal retrieval with CNN visual features: a new baseline[J]. IEEE Transactions on Cybernetics, 2016, 47(2): 449-460.
[28] CASTREJÓN L, AYTAR Y, VONDRICK C, et al. Learning aligned cross-modal representations from weakly aligned data[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 2940-2949.
[29] ZHANG Y F, ZHOU W G, WANG M, et al. Deep relation embedding for cross-modal retrieval[J]. IEEE Transactions on Image Processing, 2021, 30: 617-627.
[30] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Washington: IEEE Computer Society, 2009: 248-255.
[31] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 1106-1114.
[32] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[33] SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, Feb 4-9, 2017. Menlo Park: AAAI, 2017: 4278-4284.
[34] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[35] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 2261-2269.
[36] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 7132-7141.
[37] CANZIANI A, PASZKE A, CULURCIELLO E. An analysis of deep neural network models for practical applications[J]. arXiv:1605.07678, 2016.
[38] HUANG Y, WANG W, WANG L. Instance-aware image and sentence matching with selective multimodal LSTM[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 7254-7262.
[39] WANG Z, LIU X, LI H, et al. CAMP: cross-modal adaptive message passing for text-image retrieval[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 5764-5773.
[40] ZHANG Q, LEI Z, ZHANG Z X, et al. Context-aware attention network for image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 3536-3545.
[41] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 2672-2680.
[42] GU J X, CAI J F, JOTY S R, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 7181-7189.
[43] WU F, JING X Y, WU Z, et al. Modality-specific and shared generative adversarial network for cross-modal retrieval[J]. Pattern Recognition, 2020, 104: 107335.
[44] WANG J, LIU W, KUMAR S, et al. Learning to Hash for indexing big data—a survey[J]. Proceedings of the IEEE, 2016, 104(1): 34-57.
[45] ZHANG D, WANG F, SI L. Composite Hashing with multiple information sources[C]//Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, Jul 25-29, 2011. New York: ACM, 2011: 225-234.
[46] ZHEN Y, YEUNG D Y. Co-regularized Hashing for multimodal data[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 1385-1393.
[47] ZHANG D Q, LI W J. Large-scale supervised multimodal Hashing with semantic correlation maximization[C]//Proceedings of the 28th AAAI Conference on Artificial Intelligence, Québec City, Jul 27-31, 2014. Menlo Park: AAAI, 2014: 2177-2183.
[48] DING G G, GUO Y C, ZHOU J. Collective matrix factorization Hashing for multimodal data[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 2083-2090.
[49] LIN Z J, DING G G, HU M Q, et al. Semantics-preserving Hashing for cross-view retrieval[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 3864-3872.
[50] LIU X, HU Z K, LING H B, et al. MTFH: a matrix tri-factorization Hashing framework for efficient cross-modal retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(3): 964-981.
[51] MENG M, WANG H T, YU J, et al. Asymmetric supervised consistent and specific Hashing for cross-modal retrieval[J]. IEEE Transactions on Image Processing, 2021, 30: 986-1000.
[52] JIANG Q Y, LI W J. Deep cross-modal Hashing[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3270-3278.
[53] ZHEN L L, HU P, WANG X, et al. Deep supervised cross-modal retrieval[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 10394-10403.
[54] LI C, DENG C, LI N, et al. Self-supervised adversarial Hashing networks for cross-modal retrieval[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 4242-4251.
[55] LIN Q B, CAO W M, HE Z H, et al. Semantic deep cross-modal Hashing[J]. Neurocomputing, 2020, 396: 113-122.
[56] SU S P, ZHONG Z S, ZHONG C. Deep joint-semantics reconstructing Hashing for large-scale unsupervised cross-modal retrieval[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 3027-3035.
[57] WU G S, LIN Z J, HAN J G, et al. Unsupervised deep Hashing via binary latent factor models for large-scale cross-modal retrieval[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Jul 13-19, 2018: 2854-2860.
[58] XIE D, DENG C, LI C, et al. Multi-task consistency-preserving adversarial Hashing for cross-modal retrieval[J]. IEEE Transactions on Image Processing, 2020, 29: 3626-3637.
[59] PLUMMER B A, WANG L W, CERVANTES C M, et al. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 2641-2649.
[60] CHUA T S, TANG J H, HONG R C, et al. NUS-WIDE: a real-world web image database from National University of Singapore[C]//Proceedings of the 8th ACM International Conference on Image and Video Retrieval, Santorini Island, Jul 8-10, 2009. New York: ACM, 2009: 1-9.
[61] PENG Y X, HUANG X, ZHAO Y Z. An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 28(9): 2372-2385.
[62] HUISKES M J, LEW M S. The MIR Flickr retrieval evaluation[C]//Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, Vancouver, Oct 30-31, 2008. New York: ACM, 2008: 39-43.
[63] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//LNCS 8693: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 740-755.
[64] PENG Y X, QI J W, HUANG X, et al. CCL: cross-modal correlation learning with multigrained fusion by hierarchical network[J]. IEEE Transactions on Multimedia, 2017, 20(2): 405-420.
[65] LAU J H, BALDWIN T. An empirical evaluation of doc2vec with practical insights into document embedding generation[J]. arXiv:1607.05368, 2016.
[66] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Long Beach, Dec 4-9, 2017. Red Hook: Curran Associates, 2017: 5998-6008.
[67] TAN Z, WANG M, XIE J, et al. Deep semantic role labeling with self-attention[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, Feb 2-7, 2018. Menlo Park: AAAI, 2018: 4929-4936.
[68] ZHANG H, GOODFELLOW I J, METAXAS D N, et al. Self-attention generative adversarial networks[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach, Jun 9-15, 2019: 7354-7363.
[69] ZHAO H S, JIA J Y, KOLTUN V. Exploring self-attention for image recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 10076-10085.
[70] MIECH A, ALAYRAC J B, LAPTEV I, et al. Thinking fast and slow: efficient text-to-visual retrieval with transformers[J]. arXiv:2103.16553, 2021.
[71] SALVADOR A, GUNDOGDU E, BAZZANI L, et al. Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning[J]. arXiv:2103.13061, 2021.