
CRNN Paper Translation (Chinese-English)

Date: 2020-01-23  Editor: Computer News

How CTC Works

Overview

Author: Tyan
Blog: noahsnail.com  |  CSDN  |  简书


Background

In practical sequence-learning settings we often have to predict a sequence of labels from unsegmented, noisy data; in speech recognition, for example, an acoustic signal must be converted into words or sub-words.
The paper proposes a new RNN training method that maps unsegmented sequences directly to the corresponding labels.

Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation

Fig. 1. How CTC combines a word (source: )

Problems with Traditional Methods

Traditional models such as HMMs and CRFs can handle this kind of sequence-labelling problem reasonably well, but they have drawbacks of their own:

  1. They usually require a large amount of domain knowledge, e.g. to define the labels or to choose the input features of a CRF.
  2. They rely on dependency assumptions, e.g. that the observations in an HMM are mutually independent.
  3. HMM training is generative, even though sequence labelling is a discriminative problem.

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

This article mainly explains how CTC works.

Advantages of RNNs

  1. They need no prior knowledge about the data, beyond the choice of input and output format.
  2. They have strong representational power, model the problem well, and can express discriminative models.
  3. They are more robust to noise.
  4. They can model long-range sequence dependencies.

Abstract

Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

CTC stands for Connectionist Temporal Classification. The method addresses the misalignment between network outputs and labels (the alignment problem), which arises in applications such as scene text recognition, speech recognition and handwriting recognition. In the speech example of Fig. 1, the network emits many w's and many r's; without post-processing the output would read "wworrrlld" (ignoring blanks). A crude fix is to merge every run of repeated letters into one, but then words with genuine double letters, such as "happy", could never be recognized. This is where the blank becomes essential: it separates repeated letters, so "happy" can be produced as "hhh aaaa ppp ppp yyyy" (with blanks between the p-runs) --> "happy".
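The two decoding rules above can be sketched in a few lines of Python (a toy illustration; `naive_merge` and `ctc_collapse` are my own names, and "-" stands for the blank):

```python
def naive_merge(s):
    # crude rule: collapse every run of repeated characters into one
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def ctc_collapse(s, blank="-"):
    # CTC rule: first merge repeats, then drop the blanks
    return "".join(ch for ch in naive_merge(s) if ch != blank)

print(naive_merge("wworrrlld"))               # world (works here)
print(naive_merge("hhaappppyy"))              # hapy  (double letters are lost)
print(ctc_collapse("hhh-aaaa-ppp-ppp-yyyy"))  # happy (blanks keep the pp)
```

The blank between the two p-runs is exactly what lets the merge rule survive double letters.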

Results

CTC outperforms both HMM and hybrid HMM-RNN models.


In mathematical terms, CTC defines a mapping X -> Y that is monotonic, and several X's can map to the same Y. The awkward consequence of the last property is that Y can never be longer than X. Anyone who has used tensorflow's tf.nn.ctc_loss has probably seen the resulting error message.

The Basic Idea

Interpret the network outputs as a probability distribution over all possible label sequences conditioned on the input; the objective can then be defined as directly maximizing the probability of the correct label sequence.
Because this objective is differentiable, the network can be trained with standard backpropagation.

1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, majority of the recent works related to deep neural networks have devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In real world, a stable of visual objects, such as scene text, handwriting and musical score, tend to occur in the form of sequence, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.

“No valid path found, Loss: inf ”

Temporal Classification


Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). It turns out a large trained model with a huge number of classes, which is difficult to be generalized to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the numbers of basic combinations of such kind of sequences can be greater than 1 million. In summary, current systems based on DCNN can not be directly used for image-based sequence recognition.


Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in both training and testing. However, a preprocessing step that converts an input object image into a sequence of image features, is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN can not be trained and optimized in an end-to-end fashion.


Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field. For example, Almaza`n et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.


The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much less parameters than a standard DCNN model, consuming less storage space.


This warning/error appears when Y is longer than X, so CTC cannot compute a valid path. The fix is to check whether the labels in the training set match the actual content. In scene text recognition, for example, if the image reads "hello, how are you doing" but the label is just "hello", such a training set is bound to cause problems.
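A quick way to catch such bad examples before training is to check the CTC length constraint: the input must supply at least one frame per label, plus one blank frame between each pair of identical adjacent labels (a sketch; the function names are mine):

```python
def min_input_length(label):
    # one frame per symbol, plus one mandatory blank between equal neighbours
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

def ctc_feasible(num_frames, label):
    # True iff CTC can find at least one valid path for this example
    return num_frames >= min_input_length(label)

print(min_input_length("happy"))  # 6: 'pp' needs a separating blank
print(ctc_feasible(4, "happy"))   # False -> "No valid path found"
```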

Notation

$S$ denotes a set of training examples drawn from a distribution $D_{X \times Z}$.
The input space $X = (\mathbb{R}^m)^*$ is the set of all sequences of $m$-dimensional real vectors.
The target space $Z = L^*$ is the set of all sequences over the (finite) label alphabet $L$.
We call the elements of $L^*$ label sequences, or labellings.

For any example $s \in S$, $s = (x, z)$:
the target sequence is $z = (z_1, z_2, \dots, z_U)$,
the input sequence is $x = (x_1, x_2, \dots, x_T)$,
where $U \le T$.

The goal is to use $S$ to train a temporal classifier $h : X \to Z$.

2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.


Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.


Label Error Rate


At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (eg. CNN and RNN), it can be jointly trained with one loss function.


Source: Ref [1]

Error Measure

Given a test set $S' \subset D_{X \times Z}$,
define the label error rate (LER) of $h$ as the normalized edit distance between the classifications and the targets on $S'$:
\begin{equation}
LER(h, S') = \frac{1}{|S'|} \sum_{(x, z) \in S'} \frac{ED(h(x), z)}{|z|}
\label{eq:defLER}
\end{equation}

where $ED(p, q)$ is the edit distance (insertions, substitutions and deletions) between sequences $p$ and $q$.
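The LER is straightforward to compute once an edit distance is available (a self-contained sketch; `pairs` holds (prediction, target) tuples and the function names are mine):

```python
def edit_distance(p, q):
    # Levenshtein distance (insertions, substitutions, deletions), one-row DP
    dp = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        prev, dp[0] = dp[0], i
        for j, b in enumerate(q, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (a != b))  # substitution / match
    return dp[-1]

def label_error_rate(pairs):
    # mean normalized edit distance over (h(x), z) pairs
    return sum(edit_distance(h, z) / len(z) for h, z in pairs) / len(pairs)

print(label_error_rate([("wworld", "world"), ("happy", "happy")]))  # 0.1
```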

2.1. Feature Sequence Extraction

In CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). Such component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to single pixel.
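The column-slicing step can be illustrated directly (a sketch using plain nested lists to stand in for C feature maps of size H x W; the function name is mine):

```python
def feature_maps_to_sequence(maps):
    # maps: C feature maps, each an H x W grid (list of rows).
    # Returns W feature vectors, left to right; the i-th vector is the
    # concatenation of the i-th column of every map (dimension C * H).
    height, width = len(maps[0]), len(maps[0][0])
    return [[m[r][i] for m in maps for r in range(height)]
            for i in range(width)]

# two 2x3 maps -> a sequence of 3 vectors, each of dimension 4
maps = [[[1, 2, 3],
         [4, 5, 6]],
        [[7, 8, 9],
         [10, 11, 12]]]
print(feature_maps_to_sequence(maps))  # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```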

The figure contrasts the framewise approach with CTC. The framewise approach requires an independent label for each time-step (frame) of the input sequence, e.g. a phoneme label per frame; CTC does not. CTC simply predicts a series of spikes, separated by stretches of (possibly blank) output that mark the letter boundaries. The framewise approach also suffers from misaligned segment boundaries: the probability curves of neighbouring labels can overlap badly (around the sound "dh", for instance, "dh" and "ax" overlap markedly), whereas CTC's predictions do not.

Connectionist Temporal Classification

This section describes how the RNN outputs can be used for CTC. The most direct idea is to interpret the RNN output as a conditional probability distribution over label sequences given the input.


As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangle region of the original image (termed the receptive field), and such rectangle regions are in the same order to their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.


Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.


Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy with its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.


The computation of the CTC loss is fairly involved, and the reference links give a detailed derivation. The explanation here therefore relies on the formulas from paper [1]; all formulas and figures below are taken from paper [1].
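The core of that derivation, the forward (alpha) recursion over the blank-extended label, can be written down compactly. This is a didactic sketch in plain Python, not the paper's implementation; labels are integer indices and `blank` is the blank index:

```python
def ctc_prob(y, label, blank=0):
    # y[t][k]: probability of label k at frame t; label: target indices.
    # Extended label l' interleaves blanks: (-, l1, -, l2, ..., lU, -).
    ext = [blank]
    for s in label:
        ext += [s, blank]
    S, T = len(ext), len(y)
    alpha = [0.0] * S
    alpha[0] = y[0][blank]        # paths may start with a blank ...
    if S > 1:
        alpha[1] = y[0][ext[1]]   # ... or with the first label
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                                  # stay on same symbol
            if s >= 1:
                a += alpha[s - 1]                         # advance one step
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]                         # skip over a blank
            new[s] = a * y[t][ext[s]]
        alpha = new
    # p(l|x): valid paths end on the last label or the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# two frames over the alphabet {0: blank, 1: 'a'}
y = [[0.2, 0.8], [0.3, 0.7]]
print(ctc_prob(y, [1]))  # ~0.94 = p(a,a) + p(-,a) + p(a,-)
```

The loss for one example is then simply the negative log of this probability.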

From Network Outputs to Label Sequences

The CTC network has a softmax output layer with $|L| + 1$ output units, where $|L|$ is the number of labels and the extra unit represents the blank, i.e. a non-label output.

2.2. Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on the top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1,...,x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagates error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends.

The CTC computation includes a softmax output layer, with one extra label (the blank).

Formal Definition


The probability of a single path is computed as follows.

First, define the underlying mapping.

Given an input sequence $x$ with $|x| = T$,
an RNN with $m$ inputs, $n$ outputs ($n = |L| + 1$) and weight vector $w$ defines a continuous map:
$\mathcal{N}_w : (\mathbb{R}^m)^T \to (\mathbb{R}^n)^T$
Let $y = \mathcal{N}_w(x)$ be the network output, and let $y_k^t$ be the activation of output unit $k$ at time $t$,
i.e. the probability of label $k$ at time $t$. The probability of a length-$T$ path over $L' = L \cup \{blank\}$ is then:
\begin{equation}
p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \quad \forall \pi \in (L')^T
\label{eq:ctcSeqProb}
\end{equation}
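The path probability is just a product of per-frame probabilities, e.g. (toy numbers; the function name is mine):

```python
def path_prob(y, path):
    # p(pi|x) = product over t of y[t][pi_t]
    p = 1.0
    for t, k in enumerate(path):
        p *= y[t][k]
    return p

# two frames over L' = {0: blank, 1: 'a'}
y = [[0.2, 0.8], [0.3, 0.7]]
print(path_prob(y, (1, 1)))  # 0.8 * 0.7
```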


A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both current input $x_t$ and past state $h_{t−1}$ as its inputs: $h_t = g(x_t, h_{t−1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $lbrace x_{tprime} rbrace _{t prime < t}$ are captured and utilized for prediction. Traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store, and adds burden to the training process. Long-Short Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.


Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.
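For reference, the cell and gate interactions in Fig. 3(a) are usually written as the standard LSTM updates (a common textbook formulation; the paper does not state the equations explicitly):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)        && \text{input gate}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)        && \text{forget gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)        && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t  && \text{cell update}\\
h_t &= o_t \odot \tanh(c_t)                       && \text{hidden output}
\end{aligned}
```

where $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.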



LSTM is directional, it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows higher level of abstractions than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].


In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers.



Define a Many-to-One Mapping

The mapping has the form:
$\mathcal{B} : (L')^T \to L^{\le T}$, where $L^{\le T}$ is the set of possible labellings of length at most $T$.
Concretely, $\mathcal{B}$ first merges repeated labels and then removes the blanks from a path, e.g.
$\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$, where "$-$" denotes the blank.

2.3. Transcription

Transcription is the process of converting the per-frame predictions made by RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. A lexicon is a set of label sequences that prediction is constrained to, e.g. a spell checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability.

Here, x is the input and y the output, both sequences. L is the set of sequence labels and L' is L plus the blank. The formula says that, given an input, multiplying the per-frame probabilities from time t = 1 to T yields the probability of the corresponding path.

The conditional probability of a labelling $l \in L^{\le T}$ is computed as:

\begin{equation}
p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)
\label{eq:binverseprob}
\end{equation}
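For tiny inputs this sum can be checked by brute force, enumerating every path in $(L')^T$. This is exponential in $T$; real implementations use the forward-backward dynamic program instead, so the following is only a sanity-check sketch (function name is mine):

```python
from itertools import product

def label_prob(y, label, blank=0):
    # p(l|x) = sum of p(pi|x) over all paths pi with B(pi) = l
    T, n = len(y), len(y[0])
    total = 0.0
    for path in product(range(n), repeat=T):
        out, prev = [], None
        for k in path:                    # collapse B: merge repeats,
            if k != prev and k != blank:  # then drop the blanks
                out.append(k)
            prev = k
        if out == list(label):
            p = 1.0
            for t, k in enumerate(path):  # p(pi|x) as a per-frame product
                p *= y[t][k]
            total += p
    return total

# two frames over L' = {0: blank, 1: 'a'}; paths (a,a), (-,a), (a,-) map to "a"
y = [[0.2, 0.8], [0.3, 0.7]]
print(label_prob(y, [1]))  # ~0.94
```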

