Locating transcription factor binding sites by fully convolutional neural network

Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University

Corresponding authors: Qi Liu, Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Siping Road 1239, Shanghai 200092, China. E-mail: qiliu@tongji.edu.cn; De-Shuang Huang, Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Caoan Road 4800, Shanghai 201804, China. E-mail: dshuang@tongji.edu.cn.


Briefings in Bioinformatics, Volume 22, Issue 5, September 2021, bbaa435, https://doi.org/10.1093/bib/bbaa435

Received: 13 November 2020; Revision received: 11 December 2020; Accepted: 26 December 2020; Published: 26 January 2021


Qinhu Zhang, Siguo Wang, Zhanheng Chen, Ying He, Qi Liu, De-Shuang Huang, Locating transcription factor binding sites by fully convolutional neural network, Briefings in Bioinformatics, Volume 22, Issue 5, September 2021, bbaa435, https://doi.org/10.1093/bib/bbaa435


Abstract

Transcription factors (TFs) play an important role in regulating gene expression, so identifying the regions they bind has become a fundamental step in molecular and cellular biology. In recent years, an increasing number of deep learning (DL) based methods have been proposed for predicting TF binding sites (TFBSs) and have achieved impressive prediction performance. However, these methods mainly focus on predicting the sequence specificity of TF-DNA binding, which is equivalent to a sequence-level binary classification task, and fail to identify motifs and TFBSs accurately. In this paper, we developed a fully convolutional network coupled with global average pooling (FCNA), which by contrast treats the problem as a nucleotide-level binary classification task, to roughly locate TFBSs and accurately identify motifs. Experimental results on human ChIP-seq datasets show that FCNA outperforms other competing methods significantly. In addition, we find that the regions located by FCNA can be used by motif discovery tools to further refine the prediction performance. Furthermore, we observe that FCNA can accurately identify TF-DNA binding motifs across different cell lines and infer indirect TF-DNA bindings.

Introduction

Transcription factors (TFs) can activate or suppress the transcription of genes by binding to specific noncoding DNA regions, thereby playing an integral role in gene expression. Previous studies have confirmed that TF binding sites (TFBSs) are short DNA sequences that are relatively conserved over long-term evolution [1] and generally follow specific patterns, commonly called TF-DNA binding motifs. Identification of TFBSs and their corresponding motifs has become a fundamental step in molecular and cellular biology [2].

Owing to the rapid development of high-throughput sequencing technology over the last decades, Chromatin Immunoprecipitation sequencing (ChIP-seq) [3] in particular has provided a large amount of TF-DNA binding data and enabled new insights into gene regulation. These abundant binding data offer an unprecedented opportunity for developing computational methods to predict TFBSs and motifs, and a series of such methods have been proposed. For example, MEME (Multiple EM for Motif Elicitation) [4], based on expectation maximization (EM), predicted TF-DNA binding motifs by searching for repeated, ungapped sequence patterns that occur in biological sequences. DREME (Discriminative Regular Expression Motif Elicitation) [5] used a simpler, nonprobabilistic model (regular expressions) to describe the short binding motifs characteristic of single TFs, and is often used as a complement to MEME. MEME-ChIP [6] identified motifs from ChIP-seq peak regions by combining two complementary motif discovery tools, MEME and DREME. However, the high computational complexity of these motif discovery tools restricts the number of input sequences or the size of the search space, which may sacrifice accuracy in identifying motifs. Over the past 5 years, deep learning (DL) has achieved impressive performance in many fields, such as computer vision and natural language processing, inspiring researchers to design DL-based methods to predict TFBSs and motifs [7–9]. For example, DeepBind [10], one of the earliest and most well-verified DL-based algorithms, applied convolutional neural networks (CNNs) to predict the sequence specificity of TF-DNA binding. DeepSEA [11], another impressive DL-based algorithm, also used a deep CNN to predict TF-DNA binding motifs and the chromatin effects of sequence alterations from large-scale chromatin-profiling data. DanQ [12] predicted TF-DNA binding motifs and prioritized functional SNPs by combining a CNN with recurrent neural networks (RNNs). However, these DL-based methods mainly focus on predicting the sequence specificity of TF-DNA binding and fail to identify motifs and TFBSs accurately. Moreover, they treat motif discovery as a sequence-level binary classification task, so negative sequences must be carefully selected for the positive sequences (peak regions), and different selection strategies give rise to different predictions.

In this paper, we developed a novel motif discovery method that is mainly based on a fully convolutional neural network coupled with global average pooling, named FCNA. FCNA views motif discovery as a nucleotide-level binary classification task, which can (i) avoid generating negative sequences, (ii) locate short regions that contain TFBSs and (iii) predict TF-DNA binding motifs accurately. Specifically, (i) high-quality position counting matrices (PCMs) were collected from the HOCOMOCO motif database [13] and used to annotate each nucleotide in the DNA sequences; (ii) FCNA, which incorporates a fully convolutional neural network, global average pooling and a hard negative mining loss, was trained on the annotated TF-DNA binding data; (iii) the trained FCNA was used to locate TFBSs and predict motifs on the test data. Experimental results on the ChIP-seq datasets show that FCNA outperforms other competing methods significantly. In addition, FCNA was first used to locate short regions that contain TFBSs, and motif discovery tools were then applied to these regions to predict TF-DNA binding motifs. As a result, we find that the regions located by FCNA can further refine the performance of motif prediction. Furthermore, according to the predicted motifs, we observe that FCNA can accurately identify TF-DNA binding motifs across different cell lines and infer indirect TF-DNA bindings.
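The exact form of the hard negative mining loss is not given in this section. As a minimal sketch of the general idea, assuming one common formulation, the loss below averages binary cross-entropy over all positive nucleotides plus only the hardest (highest-loss) negative nucleotides. The function name `hard_negative_mining_bce` and the parameter `neg_pos_ratio` are hypothetical and are not taken from the FCNA source code.

```python
import math

def hard_negative_mining_bce(probs, labels, neg_pos_ratio=3, eps=1e-7):
    """Binary cross-entropy over all positives plus the hardest negatives.

    probs  : predicted probabilities of label 1, one per nucleotide
    labels : ground-truth 0/1 labels, one per nucleotide
    neg_pos_ratio is a hypothetical knob: negatives kept per positive.
    """
    def bce(p, y):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

    pos_losses = [bce(p, 1) for p, y in zip(probs, labels) if y == 1]
    # Sort negative losses so the most confidently wrong negatives come first.
    neg_losses = sorted(
        (bce(p, 0) for p, y in zip(probs, labels) if y == 0), reverse=True
    )
    keep = min(len(neg_losses), max(1, neg_pos_ratio * len(pos_losses)))
    selected = pos_losses + neg_losses[:keep]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)
```

With one positive and four negatives and `neg_pos_ratio=1`, only the single worst-predicted negative contributes, so easy negatives cannot dominate the average.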

Materials and methods

Data preparation

We collected 53 TF ChIP-seq datasets from the ENCODE project, drawn from three cell lines, A549 (20), GM12878 (21) and MCF7 (12), and downloaded high-quality PCMs (marked as A) from the HOCOMOCO motif database. For each TF dataset, 500 bp regions surrounding peaks were extracted, and the corresponding PCM was used to annotate each nucleotide in these regions as 0 or 1, where label ‘1’ means that the nucleotide belongs to a TFBS. Briefly, since a PCM provides not only the counts of the four nucleotides at each position but also the exact length of the corresponding motif, each window of the same length as the motif was scored by the PCM, and the window with the highest score was chosen as the positive data, under the assumption that it is the TFBS. TFBSs are known to be short sequences ranging from 5 bp to 22 bp. Therefore, the annotated data are extremely imbalanced, with a negative-to-positive ratio of about 32.
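The annotation step can be sketched as follows. This is an illustrative sketch only: the text does not state how a window is scored by the PCM, so the log-odds scoring with a pseudocount and a uniform background is an assumption, and the function names `pcm_to_log_odds` and `annotate` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def pcm_to_log_odds(pcm, pseudocount=1.0, background=0.25):
    """Turn a PCM (one [A, C, G, T] count row per motif position) into
    log-odds weights. Pseudocount and uniform background are assumptions."""
    weights = []
    for row in pcm:
        total = sum(row) + 4 * pseudocount
        weights.append(
            [math.log((count + pseudocount) / total / background) for count in row]
        )
    return weights

def annotate(sequence, pcm):
    """Label each nucleotide 1 if it lies in the best-scoring window, else 0."""
    weights = pcm_to_log_odds(pcm)
    width = len(weights)  # the PCM fixes the motif length
    best_start, best_score = 0, -math.inf
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in NUCLEOTIDES for base in window):
            continue  # skip windows with N or other ambiguous symbols
        score = sum(
            weights[i][NUCLEOTIDES.index(base)] for i, base in enumerate(window)
        )
        if score > best_score:
            best_start, best_score = start, score
    labels = [0] * len(sequence)
    if best_score > -math.inf:
        labels[best_start:best_start + width] = [1] * width
    return labels

# Toy PCM for the 4 bp pattern TGAC (hypothetical counts, not from HOCOMOCO).
toy_pcm = [[0, 0, 0, 10], [0, 0, 10, 0], [10, 0, 0, 0], [0, 10, 0, 0]]
toy_labels = annotate("AATGACGG", toy_pcm)
```

Because exactly one window per region is labelled positive, a 500 bp region with, for example, a 15 bp motif yields 485 negative and 15 positive nucleotides, which is consistent with the reported ratio of about 32.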

The framework of FCNA

The fully convolutional network (FCN) was originally applied to image segmentation [14]. It replaces all fully connected layers with convolutional layers, so it can take input of arbitrary size and produce correspondingly sized output with efficient inference and learning. An FCN is typically built by adapting a classification network (e.g. VGG [15], ResNet [16]); it uses deconvolution operations to gradually restore downsampled feature maps to the original image size, and skip connections to combine semantic information from a deep layer with appearance information from a shallow layer. Another typical segmentation model, U-Net [17], was applied to biomedical image segmentation; it is similar to FCN but adopts a symmetrical architecture, and uses upsampling operations and normal convolution operations to gradually restore downsampled feature maps to the original image size.
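The property that such networks return an output of the same size as an input of any size can be illustrated with a toy one-dimensional example. This is a simplified sketch rather than the FCN or U-Net architecture itself: it uses a single scalar channel, a fixed smoothing kernel instead of learned weights, one downsampling step and nearest-neighbour upsampling, and all function names are hypothetical.

```python
def conv_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length.
    Assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(weight * padded[i + j] for j, weight in enumerate(kernel))
        for i in range(len(signal))
    ]

def max_pool(signal, size=2):
    """Downsample by taking the maximum over non-overlapping windows."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]

def upsample(signal, target_length, factor=2):
    """Nearest-neighbour upsampling, cropped to the requested length."""
    stretched = [value for value in signal for _ in range(factor)]
    return stretched[:target_length]

def encode_decode(signal, kernel=(0.25, 0.5, 0.25)):
    """One encoder step, one decoder step and a skip connection."""
    shallow = conv_same(signal, kernel)          # full-resolution features
    deep = conv_same(max_pool(shallow), kernel)  # half-resolution features
    restored = upsample(deep, len(signal))       # back to the input length
    # Skip connection: combine deep (restored) and shallow features.
    return [r + s for r, s in zip(restored, shallow)]
```

Since no layer depends on a fixed input length, `encode_decode` returns one value per input position for sequences of any length, which is what makes per-position (here, per-nucleotide) prediction possible.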

Inspired by the original FCN and U-Net, we designed a novel motif discovery method, namely FCNA, for locating TFBSs and predicting TF-DNA binding motifs. As shown in Figure 1, FCNA has a symmetrical architecture, which consists of a top–down encoding process (left) and a bottom–up decoding process (right). The source code and data are available at: https://github.com/turningpoint1988/FCNA.

Figure 1. The framework of FCNA, which mainly contains a top–down encoding process (left) and a bottom–up decoding process (right).