Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment. Academic Article

Overview

abstract

  • Self-supervised learning makes it possible to exploit the unlabeled chest X-rays and associated free-text reports that accumulate in clinical routine, without manual supervision. This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports. The model is pre-trained for visual-textual matching at both the global image-sentence level and the local image region-word level. Both levels are bidirectionally constrained by cross-entropy-based and ranking-based triplet matching losses. The region-word matching is computed with an attention mechanism, without direct supervision of the region-word mapping. The pre-trained multi-modal representation paves the way for downstream tasks involving image and/or text encoding. We demonstrate the quality of the learned representations through cross-modality retrieval and multi-label classification on two datasets: OpenI-IU and MIMIC-CXR. Our code is available at https://github.com/mshaikh2/JoImTeR_MLMI_2021.
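
The region-word matching and the ranking-based triplet loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes words and image regions are already embedded in a shared space, and the function names (`match_score`, `bidirectional_triplet_loss`), the cosine similarity, the softmax temperature, the margin, and the hardest-negative selection are all hypothetical choices made for illustration. The cross-entropy-based loss is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def match_score(words, regions, temperature=0.1):
    """Report-image score: each word attends over all regions (assumed form).

    No word-to-region labels are used; the attention weights come only
    from the similarities between embeddings.
    """
    scores = []
    for w in words:
        weights = softmax([cosine(w, r) / temperature for r in regions])
        # Attention-weighted region context vector for this word.
        context = [sum(a * r[k] for a, r in zip(weights, regions))
                   for k in range(len(w))]
        scores.append(cosine(w, context))
    return sum(scores) / len(scores)


def bidirectional_triplet_loss(reports, images, margin=0.2):
    """Hinge ranking loss in both directions over a batch of paired items.

    reports[i] and images[i] are a matched pair; the batch needs at least
    two pairs. The hardest in-batch negative is an assumption of this sketch.
    """
    n = len(reports)
    sim = [[match_score(reports[i], images[j]) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        neg_img = max(sim[i][j] for j in range(n) if j != i)  # text -> image
        neg_txt = max(sim[j][i] for j in range(n) if j != i)  # image -> text
        loss += max(0.0, margin - pos + neg_img)
        loss += max(0.0, margin - pos + neg_txt)
    return loss / n


# Toy batch: two reports (lists of word vectors) and two images (lists of
# region vectors), all in a hypothetical shared 2-D embedding space.
reports = [[[1.0, 0.0]], [[0.0, 1.0]]]
images = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
aligned_loss = bidirectional_triplet_loss(reports, images)
swapped_loss = bidirectional_triplet_loss(reports[::-1], images)
```

In this toy batch the correctly paired inputs should incur a lower loss than the deliberately swapped ones, which is the behavior a ranking loss is meant to enforce in both the text-to-image and image-to-text directions.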

publication date

  • September 21, 2021

Identity

PubMed Central ID

  • PMC9134785

Scopus Document Identifier

  • 85092194300

Digital Object Identifier (DOI)

  • 10.1007/978-3-030-58577-8_7

PubMed ID

  • 35647616

Additional Document Info

volume

  • 12966