How CTC Works


Blog: noahsnail.com  |  CSDN  |  简书



In practical sequence learning one frequently has to predict a sequence of labels from unsegmented, noisy data; in speech recognition, for example, the acoustic signal must be transcribed into words or sub-words.


Fig. 1. How CTC combines a word (source: )



Traditional approaches such as HMMs and CRFs have several drawbacks:

  1. They usually require substantial domain-specific knowledge, e.g. defining the label set or choosing input features for a CRF.
  2. They rely on explicit dependency assumptions, e.g. HMMs assume the observations in a sequence are independent of one another.
  3. HMM training is generative, even though sequence labelling is a discriminative problem.

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

This article mainly explains how CTC works.


In contrast, a recurrent network trained with CTC has the following advantages:

  1. It requires no prior knowledge about the data, beyond choosing the input and output representations.
  2. It has strong representational power, can model the problem well, and can express discriminative models.
  3. It is more robust to noise.
  4. It can exploit long-range sequence context for modelling.


Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

CTC stands for Connectionist Temporal Classification. It addresses the problem of a neural network's outputs not being aligned with the labels (the alignment problem), which arises in applications such as scene text recognition, speech recognition, and handwriting recognition. In the speech recognition example of Fig. 1, for instance, the network emits many w's and many r's; without any post-processing the output would be "wworrrlld" (blanks omitted here). A crude fix is to treat every run of repeated letters as a single letter, but then a word with genuine double letters, such as "happy", could never be recognised. This is where the blank symbol becomes essential: it separates repeated occurrences of the same letter, so a sequence like "hhh aaaa ppp ppp yyyy" (with a blank between the two runs of p) collapses to "happy".
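The collapsing rule sketched above (merge repeated symbols, then drop blanks) can be written in a few lines of Python; this is only an illustration of the decoding rule, not a full CTC decoder:

```python
def ctc_collapse(path, blank="-"):
    """Collapse a CTC path: merge runs of repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:  # keep only the first of each run, skip blanks
            out.append(sym)
        prev = sym
    return "".join(out)

# Repeats without a separating blank merge into one letter:
print(ctc_collapse("wwworrrlld"))        # -> "world"
# A blank between the two p-runs preserves the double letter in "happy":
print(ctc_collapse("hhhaaappp-pppyyy"))  # -> "happy"
```

Note that repeats are merged before blanks are removed, which is exactly why a blank can keep two identical letters apart.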





In mathematical terms, CTC defines a mapping X → Y that is monotonic and many-to-one: several inputs X map to the same Y. This last property has a troublesome consequence: Y can never be longer than X. Anyone who has used tf.nn.ctc_loss in TensorFlow will probably recognise the error message this produces.



1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, majority of the recent works related to deep neural networks have devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In real world, a stable of visual objects, such as scene text, handwriting and musical score, tend to occur in the form of sequence, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.

“No valid path found, Loss: inf ”

Temporal Classification



Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). It turns out a large trained model with a huge number of classes, which is difficult to be generalized to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the numbers of basic combinations of such kind of sequences can be greater than 1 million. In summary, current systems based on DCNN can not be directly used for image-based sequence recognition.


Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in both training and testing. However, a preprocessing step that converts an input object image into a sequence of image features, is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN can not be trained and optimized in an end-to-end fashion.


Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field. For example, Almaza`n et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.


The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much less parameters than a standard DCNN model, consuming less storage space.


This warning/error appears when Y is longer than X, so that CTC cannot compute any valid path. The remedy is to check that the labels in the training set actually match the content. In scene text recognition, for example, if an image shows "hello, how are you doing" while its label is just "hello", such a training set is bound to cause problems.
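One cheap pre-flight check (a hypothetical helper, not part of TensorFlow) follows from the length constraint directly: a label of length U with R adjacent repeated pairs needs at least U + R time-steps, because each repeated pair must be separated by a blank:

```python
def min_ctc_input_length(label):
    """Minimum number of time-steps CTC needs to emit `label`:
    one step per character plus one blank between each adjacent repeat."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

def has_valid_path(label, num_frames):
    """True if at least one CTC path of length `num_frames` maps to `label`."""
    return num_frames >= min_ctc_input_length(label)

print(min_ctc_input_length("happy"))  # 6: five characters plus one blank between "pp"
print(has_valid_path("hello", 4))     # False -> this pair would trigger "No valid path found"
```

Running such a check over the training set before calling the loss catches mismatched (input, label) pairs early.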


Let $S$ denote a set of training examples drawn from a distribution $\mathcal{D}_{\mathcal{X} \times \mathcal{Z}}$.
The input space $\mathcal{X} = (\mathbb{R}^m)^*$ is the set of all sequences of $m$-dimensional real-valued vectors.
The target space $\mathcal{Z} = L^*$ is the set of all sequences over the (finite) alphabet $L$ of labels.
We refer to elements of $L^*$ as label sequences or labellings.

For each example $s \in S$, $s = (x, z)$,
the target sequence is $z = (z_1, z_2, \ldots, z_U)$
and the input sequence is $x = (x_1, x_2, \ldots, x_T)$,
where $U \le T$.

The goal is to use $S$ to train a temporal classifier $h : \mathcal{X} \to \mathcal{Z}$.

2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.


Figure 1

Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.


Label Error Rate




At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (eg. CNN and RNN), it can be jointly trained with one loss function.


Source: Ref [1]


Given a test set $S' \subset \mathcal{D}_{\mathcal{X} \times \mathcal{Z}}$,
define the label error rate (LER) of $h$ as the mean normalised edit distance between its classifications and the targets on $S'$:

$$LER(h, S') = \frac{1}{|S'|} \sum_{(x,z) \in S'} \frac{ED(h(x), z)}{|z|}$$
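The LER formula can be implemented directly with a standard dynamic-programming edit distance (a sketch; `pairs` holds hypothetical (prediction, target) tuples):

```python
def edit_distance(a, b):
    """Levenshtein distance ED(a, b), computed with a rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def label_error_rate(pairs):
    """LER: edit distance between prediction h(x) and target z,
    normalised by |z|, averaged over the test set."""
    return sum(edit_distance(h, z) / len(z) for h, z in pairs) / len(pairs)

# One error in five characters on the first pair, a perfect second pair:
print(label_error_rate([("word", "world"), ("happy", "happy")]))  # 0.1
```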


2.1. Feature Sequence Extraction

In CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). Such component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to single pixel.
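The column-wise slicing can be illustrated with plain Python lists standing in for feature maps (shapes here are made up for the example; the real model operates on CNN tensors):

```python
def maps_to_sequence(feature_maps):
    """Turn feature maps of shape (channels, height, width) into a sequence of
    `width` vectors; the i-th vector concatenates the i-th column of every map."""
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [feature_maps[c][h][i] for c in range(channels) for h in range(height)]
        for i in range(width)
    ]

# Two 3x4 feature maps -> a sequence of 4 frames, each of dimension 2*3 = 6.
maps = [[[c * 100 + h * 10 + w for w in range(4)] for h in range(3)] for c in range(2)]
seq = maps_to_sequence(maps)
print(len(seq), len(seq[0]))  # 4 6
```

Each of the 4 frames then becomes one input vector for the recurrent layers.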

This figure contrasts the framewise approach with CTC. The framewise method needs a label for every phoneme (independent labelling of each time-step, or frame, of the input sequence), whereas CTC does not: CTC simply predicts a series of spikes, separated by possible blanks that keep the letters apart. The framewise method also suffers from misaligned segment boundaries; when the probability distributions of two labels lie too close together, e.g. while pronouncing "dh", the predictions for "dh" and "ax" overlap noticeably, which does not happen with CTC.

Connectionist Temporal Classification

This section describes how to represent the RNN outputs so they can be used for CTC. The most direct idea is to interpret the RNN output as a conditional probability distribution over label sequences, given the input.



As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangle region of the original image (termed the receptive field), and such rectangle regions are in the same order to their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.


Figure 2

Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.




Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy with its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.


The CTC loss is fairly involved to compute; the reference links give a detailed derivation, so the explanation here mainly relies on the formulas from paper [1]. All formulas and figures below are taken from [1].


The output layer has $|L| + 1$ output units.

2.2. Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on the top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1,...,x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagates error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends.

The CTC computation uses a softmax output layer, with one extra label (the blank).



The probability of a path is computed as follows.


Given an input sequence $x$ with $|x| = T$, an RNN with $m$ inputs, $n$ outputs ($n = |L| + 1$) and weight vector $w$ defines a continuous map:
$$\mathcal{N}_w : (\mathbb{R}^m)^T \to (\mathbb{R}^n)^T$$
Let $y = \mathcal{N}_w(x)$ be the network output, and let $y_k^t$ denote the activation of output unit $k$ at time $t$,
i.e. the probability of label $k$ at time $t$. The probability of a length-$T$ path $\pi$ over $L' = L \cup \{blank\}$ is then:
$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^t, \quad \forall \pi \in (L')^T$$
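In code, the path probability is just a product of one softmax entry per time-step (the matrix y below is a made-up example with T = 3 and n = 3: two labels plus the blank at index 2):

```python
def path_probability(y, path):
    """p(pi | x) = product over t of y[t][pi_t], where y[t] is the softmax
    output at time t and path[t] is the output unit chosen at time t."""
    p = 1.0
    for t, k in enumerate(path):
        p *= y[t][k]
    return p

y = [[0.6, 0.3, 0.1],   # t = 1
     [0.2, 0.7, 0.1],   # t = 2
     [0.1, 0.1, 0.8]]   # t = 3
print(path_probability(y, [0, 1, 2]))  # 0.6 * 0.7 * 0.8 ≈ 0.336
```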



A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both current input $x_t$ and past state $h_{t−1}$ as its inputs: $h_t = g(x_t, h_{t−1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $\lbrace x_{t'} \rbrace_{t' < t}$ are captured and utilized for prediction. Traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store, and adds burden to the training process. Long-Short Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.
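The gating described above can be sketched as a toy scalar LSTM step (hypothetical weight layout, pure Python for clarity; real layers use vector states and learned weight matrices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step; W maps each gate name to (w_x, w_h, b) weights."""
    gate = lambda name: sigmoid(W[name][0] * x_t + W[name][1] * h_prev + W[name][2])
    i = gate("input")    # input gate: how much new information to admit
    f = gate("forget")   # forget gate: how much old memory to clear
    o = gate("output")   # output gate: how much of the cell state to expose
    g = math.tanh(W["cell"][0] * x_t + W["cell"][1] * h_prev + W["cell"][2])
    c_t = f * c_prev + i * g          # the memory cell carries long-range context
    h_t = o * math.tanh(c_t)          # hidden state fed to the next time-step
    return h_t, c_t

W = {name: (1.0, 0.5, 0.0) for name in ("input", "forget", "output", "cell")}
h, c = lstm_step(1.0, 0.0, 0.0, W)    # first frame of a sequence
```

A bidirectional layer would run a second copy of this recurrence over the reversed sequence and concatenate the two hidden states at every time-step.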


Figure 3

Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.



LSTM is directional, it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows higher level of abstractions than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].


In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers.




Define $\beta : (L')^T \to L^{\le T}$, where $L^{\le T}$ is the set of possible labellings (sequences over $L$ of length at most $T$). $\beta$ removes first the repeated labels and then the blanks from a path;
for example, $\beta(a{-}ab{-}) = \beta({-}aa{-}{-}abb) = aab$, where '$-$' denotes the blank.

2.3. Transcription

Transcription is the process of converting the per-frame predictions made by RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exists two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. A lexicon is a set of label sequences that prediction is constraint to, e.g. a spell checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability.

Here x is the input and y the output, both of them sequences. L is the set of sequence labels, and L′ is that set plus the blank. The formula states that, given an input, multiplying the per-time-step probabilities from t = 1 to T yields the probability of the corresponding path.

The conditional probability of a labelling $l \in L^{\le T}$ is obtained by summing the probabilities of all paths that collapse onto it:

$$p(l \mid x) = \sum_{\pi \in \beta^{-1}(l)} p(\pi \mid x)$$
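For tiny T this sum can be checked by brute force, enumerating every path and keeping those that collapse onto l (exponential in T, so purely illustrative; practical implementations use the forward-backward algorithm instead):

```python
from itertools import product

def collapse(path, blank):
    """beta: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return tuple(out)

def labelling_probability(y, l, blank):
    """p(l | x) = sum of p(pi | x) over all paths pi with beta(pi) = l."""
    T, n = len(y), len(y[0])
    total = 0.0
    for pi in product(range(n), repeat=T):        # all n**T candidate paths
        if collapse(pi, blank) == tuple(l):
            p = 1.0
            for t, k in enumerate(pi):
                p *= y[t][k]
            total += p
    return total

# Labels {0, 1} plus blank = 2, T = 2, uniform softmax outputs:
y = [[1/3, 1/3, 1/3]] * 2
# Sanity check: probabilities over all reachable labellings sum to 1.
reachable = [(), (0,), (1,), (0, 1), (1, 0)]
print(sum(labelling_probability(y, l, blank=2) for l in reachable))  # ~1.0
```

With T = 2, for instance, three of the nine paths ((0,0), (0,2), (2,0)) collapse onto the single-label sequence (0,), so its probability is 3/9.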













