[1] KOH P W, NGUYEN T, TANG Y S, et al. Concept bottleneck models[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 5338-5348.
[2] LAMPERT C H, NICKISCH H, HARMELING S. Attribute-based classification for zero-shot visual object categorization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(3): 453-465.
[3] RUSSAKOVSKY O, LI F F. Attribute learning in large-scale datasets[C]//Proceedings of the 11th European Conference on Computer Vision Workshops: Trends and Topics in Computer Vision. Berlin, Heidelberg: Springer, 2012: 1-14.
[4] XU W J, XIAN Y Q, WANG J N, et al. Attribute prototype network for zero-shot learning[C]//Advances in Neural Information Processing Systems 33, 2020: 21969-21980.
[5] YUN T, BHALLA U, PAVLICK E, et al. Do vision-language pretrained models learn composable primitive concepts?[EB/OL]. [2025-03-02]. https://arxiv.org/abs/2203.17271.
[6] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 8748-8763.
[7] YUKSEKGONUL M, WANG M, ZOU J. Post-hoc concept bottleneck models[EB/OL]. [2025-03-05]. https://arxiv.org/abs/2205.15480.
[8] SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4444-4451.
[9] KIM E, JUNG D, PARK S, et al. Probabilistic concept bottleneck models[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 16521-16540.
[10] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255.
[11] MAJI S, RAHTU E, KANNALA J, et al. Fine-grained visual classification of aircraft[EB/OL]. [2025-03-06]. https://arxiv.org/abs/1306.5151.
[12] OIKARINEN T, DAS S, NGUYEN L M, et al. Label-free concept bottleneck models[C]//Proceedings of the 11th International Conference on Learning Representations, 2023.
[13] YANG Y, PANAGOPOULOU A, ZHOU S H, et al. Language in a bottle: language model guided concept bottlenecks for interpretable image classification[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19187-19197.
[14] YAN A, WANG Y, ZHONG Y W, et al. Learning concise and descriptive attributes for visual recognition[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 3067-3077.
[15] MARGELOIU A, ASHMAN M, BHATT U, et al. Do concept bottleneck models learn as intended?[EB/OL]. [2025-03-06]. https://arxiv.org/abs/2105.04289.
[16] MAHINPEI A, CLARK J, LAGE I, et al. Promises and pitfalls of black-box concept learning models[EB/OL]. [2025-03-06]. https://arxiv.org/abs/2106.13314.
[17] WEI J, WANG X Z, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Advances in Neural Information Processing Systems 35, 2022: 24824-24837.
[18] DAI W L, LI J N, LI D X, et al. InstructBLIP: towards general-purpose vision-language models with instruction tuning[C]//Advances in Neural Information Processing Systems 36, 2023.
[19] ALLARD J, KILPATRICK L, HEIDEL S, et al. GPT-3.5 turbo fine-tuning and API updates[EB/OL]. [2025-03-06]. https://openai.com/blog/gpt-3-5-turbo/.
[20] TISHBY N, PEREIRA F C, BIALEK W. The information bottleneck method[EB/OL]. [2025-03-06]. https://arxiv.org/abs/physics/0004057.
[21] COOK R D. Detection of influential observation in linear regression[J]. Technometrics, 1977, 19(1): 15-18.
[22] RIBEIRO M T, SINGH S, GUESTRIN C. Anchors: high-precision model-agnostic explanations[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 1527-1535.
[23] RIBEIRO M T, SINGH S, GUESTRIN C. “Why should I trust you?”: explaining the predictions of any classifier[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 1135-1144.
[24] LUNDBERG S M, LEE S I. A unified approach to interpreting model predictions[C]//Advances in Neural Information Processing Systems 30, 2017: 4768-4777.
[25] SHAPLEY L S. A value for n-person games[J]. Annals of Mathematics Studies, 1953, 28: 307-318.
[26] SHRIKUMAR A, GREENSIDE P, KUNDAJE A. Learning important features through propagating activation differences[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 3145-3153.
[27] SIMONYAN K, VEDALDI A, ZISSERMAN A. Deep inside convolutional networks: visualising image classification models and saliency maps[EB/OL]. [2025-03-06]. https://arxiv.org/abs/1312.6034.
[28] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[J]. International Journal of Computer Vision, 2020, 128(2): 336-359.
[29] BACH S, BINDER A, MONTAVON G, et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation[J]. PLoS One, 2015, 10(7): e0130140.
[30] CHEN C F, LI O, TAO C F, et al. This looks like that: deep learning for interpretable image recognition[C]//Advances in Neural Information Processing Systems 32, 2019: 8928-8939.
[31] LI O, LIU H, CHEN C F, et al. Deep learning for case-based reasoning through prototypes: a neural network that explains its predictions[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 3530-3537.
[32] CHEN Z, BEI Y J, RUDIN C. Concept whitening for interpretable image recognition[J]. Nature Machine Intelligence, 2020, 2(12): 772-782.
[33] FONG R, VEDALDI A. Net2Vec: quantifying and explaining how concepts are encoded by filters in deep neural networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8730-8738.
[34] GHORBANI A, WEXLER J, ZOU J, et al. Towards automatic concept-based explanations[C]//Advances in Neural Information Processing Systems 32, 2019: 9273-9282.
[35] BANG Y J, CAHYAWIJAYA S, LEE N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity[EB/OL]. [2025-03-09]. https://arxiv.org/abs/2302.04023.
[36] GUERREIRO N M, ALVES D M, WALDENDORF J, et al. Hallucinations in large multilingual translation models[J]. Transactions of the Association for Computational Linguistics, 2023, 11: 1500-1517.
[37] LI Y F, DU Y F, ZHOU K, et al. Evaluating object hallucination in large vision-language models[EB/OL]. [2025-04-20]. https://arxiv.org/abs/2305.10355.
[38] SUN Z Q, SHEN S, CAO S C, et al. Aligning large multimodal models with factually augmented RLHF[EB/OL]. [2025-04-20]. https://arxiv.org/abs/2309.14525.
[39] ZHANG H C, LI L X, LIU D J. Survey of multimodal data fusion research[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2501-2520.
[40] KRIZHEVSKY A, HINTON G. Learning multiple layers of features from tiny images[EB/OL]. [2025-04-20]. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[41] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset: CNS-TR-2011-001[R]. California Institute of Technology, 2011.
[42] BOSSARD L, GUILLAUMIN M, VAN GOOL L. Food-101: mining discriminative components with random forests[C]//Proceedings of the 13th European Conference on Computer Vision. Cham: Springer, 2014: 446-461.
[43] NILSBACK M E, ZISSERMAN A. Automated flower classification over a large number of classes[C]//Proceedings of the 2008 6th Indian Conference on Computer Vision, Graphics & Image Processing. Piscataway: IEEE, 2008: 722-729.
[44] DESAI S, RAMASWAMY H G. Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization[C]//Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2020: 972-980.
[45] JIANG P T, ZHANG C B, HOU Q B, et al. LayerCAM: exploring hierarchical class activation maps for localization[J]. IEEE Transactions on Image Processing, 2021, 30: 5875-5888.
[46] MUHAMMAD M B, YEASIN M. Eigen-CAM: class activation map using principal components[C]//Proceedings of the 2020 International Joint Conference on Neural Networks. Piscataway: IEEE, 2020: 1-7.
[47] BACH F. Convex analysis and optimization with submodular functions: a tutorial[EB/OL]. [2025-03-06]. https://arxiv.org/abs/1010.4207.
[48] ZHONG Y W, YANG J W, ZHANG P C, et al. RegionCLIP: region-based language-image pretraining[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 16772-16782.
[49] LI L H, ZHANG P C, ZHANG H T, et al. Grounded language-image pre-training[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10955-10965.
[50] LIU S L, ZENG Z Y, REN T H, et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection[C]//Proceedings of the 18th European Conference on Computer Vision. Cham: Springer, 2024: 38-55.
[51] REN T H, CHEN Y H, JIANG Q, et al. DINO-X: a unified vision model for open-world object detection and understanding[EB/OL]. [2025-03-06]. https://arxiv.org/abs/2411.14347.
[52] REN T H, LIU S L, ZENG A L, et al. Grounded SAM: assembling open-world models for diverse visual tasks[EB/OL]. [2025-03-06]. https://arxiv.org/abs/2401.14159.
[53] WANG H X, VASU P K A, FAGHRI F, et al. SAM-CLIP: merging vision foundation models towards semantic and spatial understanding[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2024: 3635-3647.
[54] LI F, ZHANG H, SUN P Z, et al. Segment and recognize anything at any granularity[C]//Proceedings of the 18th European Conference on Computer Vision. Cham: Springer, 2024: 467-484.