Attention-based Image Captioning with Keras

In image captioning, an algorithm is given an image and tasked with producing a sensible caption. It is a challenging task for several reasons, not the least being that it involves a notion of saliency or relevance. This is why recent deep learning approaches mostly include some “attention” mechanism (sometimes even more than one) to help focus on relevant image features.

In this post, we demonstrate a formulation of image captioning as an encoder-decoder problem, enhanced by spatial attention over image grid cells. The idea comes from a recent paper on Neural Image Caption Generation with Visual Attention (Xu et al. 2015), and employs the same kind of attention algorithm as detailed in our post on machine translation.

We’re porting Python code from a recent Google Colaboratory notebook, using Keras with TensorFlow eager execution to simplify our lives.

Prerequisites

The code shown here will work with the current CRAN versions of tensorflow, keras, and tfdatasets.
Check that you’re using at least version 1.9 of TensorFlow. If that isn’t the case, as of this writing, this

will get you version 1.10.
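Version strings are easy to compare incorrectly as plain text ("1.10" sorts before "1.9" alphabetically). Here is a minimal sketch of the "at least 1.9" check using base R's utils::compareVersion. The helper name tf_version_ok is hypothetical, and the version strings are typed in by hand rather than queried from an installation:

```r
# Hypothetical helper (not part of the tensorflow package):
# TRUE when `installed` is at least `minimum`.
tf_version_ok <- function(installed, minimum = "1.9") {
  # compareVersion compares component by component, numerically
  utils::compareVersion(installed, minimum) >= 0
}

tf_version_ok("1.10.0")
tf_version_ok("1.8.0")
```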

When loading libraries, please make sure you’re executing the first 4 lines in this exact order.
We need to make sure we’re using the TensorFlow implementation of Keras (tf.keras in Python land), and we have to enable eager execution before using TensorFlow in any way.

There is no need to copy-paste any code snippets; the complete code (in the order necessary for execution) is available here: eager-image-captioning.R.

The dataset

MS-COCO (“Common Objects in Context”) is one of, perhaps the, reference dataset in image captioning (object detection and segmentation, too).
We’ll be using the training images and annotations from 2014. Be warned: depending on your location, the download can take a long time.

After unpacking, let’s define where the images and captions are.

annotation_file <- "train2014/annotations/captions_train2014.json"
image_path <- "train2014/train2014"

The annotations are in JSON format, and there are 414113 of them! Luckily for us, we didn’t have to download that many images: every image comes with 5 different captions, for better generalizability.

annotations <- fromJSON(file = annotation_file)
annot_captions <- annotations[[4]]

num_captions <- length(annot_captions)

We store both annotations and image paths in lists, for later loading.

all_captions <- vector(mode = "list", length = num_captions)
all_img_names <- vector(mode = "list", length = num_captions)

for (i in seq_len(num_captions))
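The body of this loop is not shown here. As a rough sketch of what filling the two lists could look like, the following runs on a two-element toy list instead of the real annotations. The field names caption and image_id and the zero-padded COCO_train2014_ file naming are assumptions based on the MS-COCO 2014 layout, and the start and end markers are those used later in this post:

```r
# Toy stand-in for annotations[[4]]; field names are assumed.
annot_captions <- list(
  list(image_id = 42L, caption = "A dog on a beach."),
  list(image_id = 7L,  caption = "Two cats sleeping.")
)
image_path <- "train2014/train2014"
num_captions <- length(annot_captions)

all_captions  <- vector(mode = "list", length = num_captions)
all_img_names <- vector(mode = "list", length = num_captions)

for (i in seq_len(num_captions)) {
  # wrap each caption in start/end markers for the decoder
  all_captions[[i]] <- paste0(
    "<start> ", annot_captions[[i]][["caption"]], " <end>"
  )
  # assumed file naming: image id zero-padded to 12 digits
  all_img_names[[i]] <- sprintf(
    "%s/COCO_train2014_%012d.jpg",
    image_path, annot_captions[[i]][["image_id"]]
  )
}
```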

Depending on your computing environment, you will for sure want to restrict the number of examples used.
This post will use 30000 captioned images, chosen randomly, and set aside 20% for validation.

Below, we take random samples and split them into training and validation parts. The companion code will also store the indices on disk, so you can pick up on verification and analysis later.

num_examples <- 30000

random_sample <- sample(1:num_captions, size = num_examples)
train_indices <- sample(random_sample, size = length(random_sample) * 0.8)
validation_indices <- setdiff(random_sample, train_indices)

sample_captions <- all_captions[random_sample]
sample_images <- all_img_names[random_sample]
train_captions <- all_captions[train_indices]
train_images <- all_img_names[train_indices]
validation_captions <- all_captions[validation_indices]
validation_images <- all_img_names[validation_indices]
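To see what the split does, here is the same logic on made-up, much smaller numbers. The seed and the sizes are arbitrary choices for illustration:

```r
set.seed(777)          # arbitrary seed, for reproducibility only
num_captions <- 1000   # toy value
num_examples <- 100    # toy value

random_sample      <- sample(1:num_captions, size = num_examples)
train_indices      <- sample(random_sample, size = length(random_sample) * 0.8)
validation_indices <- setdiff(random_sample, train_indices)

length(train_indices)
length(validation_indices)
```

Note that the split is over captions, not images: since every image comes with several captions, the same image may end up in both parts with different captions.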

Interlude

Before really diving into the technical stuff, let’s take a moment to reflect on this task.
In typical image-related deep learning walk-throughs, we’re used to seeing well-defined problems, even if in some cases the solution may be hard. Take, for example, the stereotypical dog vs. cat problem. Some dogs may look like cats and some cats may look like dogs, but that’s about it: All in all, in the usual world we live in, it should be a more or less binary question.

If, on the other hand, we ask people to describe what they see in a scene, it’s to be expected from the outset that we’ll get different answers. Still, how much consensus there is will very much depend on the concrete dataset we’re using.

Let’s take a look at some picks from the very first 20 training items sampled randomly above.

Figure from MS-COCO 2014

Now this image doesn’t leave much room for deciding what to focus on, and received a very factual caption indeed: “There is a plate with one slice of bacon a half of orange and bread.” If the dataset were all like this, we’d think a machine learning algorithm should do pretty well here.

Picking another one from the first 20:

What would be salient information to you here? The caption provided goes “A smiling little boy has a checkered shirt.”
Is the look of the shirt as important as that? You might as well focus on the scenery, or even something on a completely different level: the age of the photo, or it being an analog one.

Let’s take a final example.

What would you say about this scene? The official label we sampled here is “A group of people posing in a funny way for the camera.” Well …

Please don’t forget that for each image, the dataset includes 5 different captions (although our n = 30000 samples probably won’t).
So this is not saying the dataset is biased, not at all. Instead, we want to point out the ambiguities and difficulties inherent in the task. Actually, given those difficulties, it’s all the more amazing that the task we’re tackling here (having a network automatically generate image captions) should be possible at all!

Now let’s see how we can do this.

For the encoding part of our encoder-decoder network, we will make use of InceptionV3 to extract image features. In principle, which features to extract is up to experimentation; here we just use the last layer before the fully connected top:

image_model <- application_inception_v3(
  include_top = FALSE,
  weights = "imagenet"
)

For an image size of 299×299, the output will be of size (batch_size, 8, 8, 2048), that is, we’re making use of 2048 feature maps.

InceptionV3 being a “big model,” where every pass through the model takes time, we want to precompute features in advance and store them on disk.
We’ll use tfdatasets to stream images to the model. This means all our preprocessing has to employ tensorflow functions: That’s why we’re not using the more familiar image_load from keras below.

Our custom load_image will read in, resize and preprocess the images as required for use with InceptionV3:

load_image <- function(image_path)
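As far as the preprocessing step is concerned, what InceptionV3 expects amounts to rescaling pixel values from [0, 255] to [-1, 1]. Below is a sketch of just that arithmetic in plain R, on a toy array rather than a decoded JPEG. The helper name inception_scale is hypothetical, and the actual load_image relies on TensorFlow functions for reading, resizing and preprocessing:

```r
# Hypothetical helper illustrating the rescaling only:
# maps 0 -> -1, 127.5 -> 0, 255 -> 1
inception_scale <- function(x) x / 127.5 - 1

# a single toy "pixel" with three channel values
pixels <- array(c(0, 127.5, 255), dim = c(1, 1, 3))
scaled <- inception_scale(pixels)
```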

Now we’re ready to save the extracted features to disk. The (batch_size, 8, 8, 2048)-sized features will be flattened to (batch_size, 64, 2048). The latter shape is what our encoder, soon to be discussed, will receive as input.

preencode <- unique(sample_images) %>% unlist() %>% sort()
num_unique <- length(preencode)

# adapt this based on your system's capacities  
batch_size_4save <- 1
image_dataset <-
  tensor_slices_dataset(preencode) %>%
  dataset_map(load_image) %>%
  dataset_batch(batch_size_4save)
  
save_iter <- make_iterator_one_shot(image_dataset)
  
until_out_of_range()
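The flattening from (8, 8, 2048) to (64, 2048) mentioned above can be illustrated for a single image with a plain R array. This is only a sketch of the shape change: R stores arrays in column-major order, so the order of the 64 grid cells differs from what a TensorFlow reshape would give, but each row still holds the 2048 features of one grid cell:

```r
# toy feature maps for one image, filled with running numbers
feature_maps <- array(seq_len(8 * 8 * 2048), dim = c(8, 8, 2048))

flat <- feature_maps
dim(flat) <- c(64, 2048)   # merge the two spatial axes into one

# grid cell (row 1, column 2) ends up in flattened row 1 + 8 * (2 - 1) = 9
flat[9, 5] == feature_maps[1, 2, 5]
```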

Before we get to the encoder and decoder models though, we need to take care of the captions.

Processing the captions

We’re using keras text_tokenizer and the text processing functions texts_to_sequences and pad_sequences to transform ascii text into a matrix.

# we'll use the 5000 most frequent words only
top_k <- 5000
tokenizer <- text_tokenizer(
  num_words = top_k,
  oov_token = "<unk>",
  filters = '!"#$%&()*+.,-/:;=?@[]^_`~ ')
tokenizer$fit_on_texts(sample_captions)

train_captions_tokenized <-
  tokenizer %>% texts_to_sequences(train_captions)
validation_captions_tokenized <-
  tokenizer %>% texts_to_sequences(validation_captions)

# pad_sequences will use 0 to pad all captions to the same length
tokenizer$word_index["<pad>"] <- 0

# create a lookup dataframe that allows us to go in both directions
word_index_df <- data.frame(
  word = tokenizer$word_index %>% names(),
  index = tokenizer$word_index %>% unlist(use.names = FALSE),
  stringsAsFactors = FALSE
)
word_index_df <- word_index_df %>% arrange(index)

decode_caption <- function(text) {
  paste(map(text, function(number)
    word_index_df %>%
      filter(index == number) %>%
      select(word) %>%
      pull()),
    collapse = " ")
}

# pad all sequences to the same length (the maximum length, in our case)
# could experiment with shorter padding (truncating the very longest captions)
caption_lengths <- map(
  all_captions[1:num_examples],
  function(c) str_split(c," ")[[1]] %>% length()
  ) %>% unlist()
max_length <- fivenum(caption_lengths)[5]

train_captions_padded <-  pad_sequences(
  train_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)

validation_captions_padded <- pad_sequences(
  validation_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)
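To make the text-to-matrix step more concrete, here is a simplified base R imitation of what tokenizing and post-padding do, on two toy captions. It is not the keras implementation: the helper names to_ids, pad_post and decode are hypothetical, and reserving index 1 for unknown words is an assumption made for this sketch:

```r
captions <- c("<start> a dog runs <end>", "<start> a cat <end>")
tokens <- strsplit(captions, " ", fixed = TRUE)

# vocabulary ordered by frequency; slot 1 is reserved for unknown words
freq  <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- c("<unk>", names(freq))

to_ids <- function(words) {
  ids <- match(words, vocab)
  ids[is.na(ids)] <- 1L   # out-of-vocabulary words map to <unk>
  ids
}

pad_post <- function(seqs, maxlen) {
  t(vapply(seqs, function(s) {
    s <- head(s, maxlen)                # truncate at the end
    c(s, rep(0L, maxlen - length(s)))   # pad with 0 at the end
  }, integer(maxlen)))
}

sequences  <- lapply(tokens, to_ids)
max_length <- max(lengths(sequences))
padded     <- pad_post(sequences, max_length)

# going back from indices to words, skipping the padding
decode <- function(ids) paste(vocab[ids[ids > 0]], collapse = " ")
decode(padded[2, ])
```

The result is one row per caption, with the shorter caption filled up with zeros on the right.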

Loading the data for training

Now that we’ve taken care of pre-extracting the features and preprocessing the captions, we need a way to stream them to our captioning model. For that, we’re using tensor_slices_dataset from tfdatasets, passing in the list of paths to the images and the preprocessed captions. Loading the images is then performed as a TensorFlow graph operation (using tf$py_func).

The original Colab code also shuffles the data on every iteration. Depending on your hardware, this may take a long time, and given the size of the dataset it is not strictly necessary to get reasonable results. (The results reported below were obtained without shuffling.)

batch_size <- 10
buffer_size <- num_examples

map_func <- function(img_name, cap) {
  p <- paste0(img_name$decode("utf-8"), ".npy")
  img_tensor <- np$load(p)
  img_tensor <- tf$cast(img_tensor, tf$float32)
  list(img_tensor, cap)
}

train_dataset <-
  tensor_slices_dataset(list(train_images, train_captions_padded)) %>%
  dataset_map(
    function(item1, item2) tf$py_func(map_func, list(item1, item2), list(tf$float32, tf$int32))
  ) %>%
  # optionally shuffle the dataset
  # dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)

Captioning model

The model is basically the same as that discussed in the machine translation post. Please refer to that article for an explanation of the concepts, as well as a detailed walk-through of the tensor shapes involved at every step. Here, we provide the tensor shapes as comments in the code snippets, for quick overview/comparison.

However, if you develop your own models, with eager execution you can simply insert debugging/logging statements at arbitrary places in the code, even in model definitions. So you can have a function

maybecat <- function(context, x) {
  if (debugshapes) {
    name <- enexpr(x)
    dims <- paste0(dim(x), collapse = " ")
    cat(context, ": shape of ", name, ": ", dims, "\n", sep = "")
  }
}

And if you now set

you can trace not only tensor shapes, but actual tensor values through your models, as shown below for the encoder. (We don’t display any debugging statements after that, but the sample code has many more.)
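To get a feeling for this kind of flag-controlled logging without running a model, here is a base R variant applied to a plain array. The name maybecat_demo is hypothetical, and deparse(substitute(x)) stands in for enexpr so that the sketch needs no extra packages:

```r
debugshapes <- TRUE

maybecat_demo <- function(context, x) {
  if (debugshapes) {
    name <- deparse(substitute(x))          # the expression passed as x
    dims <- paste0(dim(x), collapse = " ")
    cat(context, ": shape of ", name, ": ", dims, "\n", sep = "")
  }
}

feats <- array(0, dim = c(10, 64, 2048))
maybecat_demo("encoder input", feats)
# prints: encoder input: shape of feats: 10 64 2048
```

With debugshapes set to FALSE, the same call prints nothing.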

Encoder

Now it’s time to define some sizing-related hyperparameters and housekeeping variables:

# for encoder output
embedding_dim <- 256
# decoder (GRU) capacity
gru_units <- 512
# for decoder output
vocab_size <- top_k
# number of feature maps gotten from Inception V3
features_shape <- 2048
# shape of attention features (flattened from 8x8)
attention_features_shape <- 64

The encoder in this case is just a fully connected layer, taking in the features extracted from Inception V3 (in flattened form, as they were written to disk), and embedding them in 256-dimensional space.

cnn_encoder <- function(embedding_dim, name = NULL) {
    
  keras_model_custom(name = name, function(self) {
      
    self$fc <- layer_dense(units = embedding_dim, activation = "relu")
      
    function(x, mask = NULL) {
      # input shape: (batch_size, 64, features_shape)
      maybecat("encoder input", x)
      # shape after fc: (batch_size, 64, embedding_dim)
      x <- self$fc(x)
      maybecat("encoder output", x)
      x
    }
  })
}
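What the fully connected layer does to one image's features can be written out with plain matrices. This is a sketch with random, untrained weights. The names W, b and relu are made up for illustration and do not correspond to anything in the keras model:

```r
set.seed(1)
features_shape <- 2048
embedding_dim  <- 256

# one image: 64 grid cells, each described by 2048 features
x <- matrix(rnorm(64 * features_shape), nrow = 64)
# random stand-ins for the dense layer's weights and bias
W <- matrix(rnorm(features_shape * embedding_dim, sd = 0.01),
            nrow = features_shape)
b <- rep(0, embedding_dim)

relu <- function(z) pmax(z, 0)

# the same weights are applied to every grid cell
encoded <- relu(sweep(x %*% W, 2, b, `+`))
dim(encoded)
```

Every one of the 64 grid cells is thus mapped to its own 256-dimensional embedding.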

Attention module

Unlike in the machine translation post, here the attention module is separated out into its own custom model.
The logic is the same though:

attention_module <- function(gru_units, name = NULL) {
  
  keras_model_custom(name = name, function(self) {
    
    self$W1 = layer_dense(units = gru_units)
    self$W2 = layer_dense(units = gru_units)
    self$V = layer_dense(units = 1)
      
    function(inputs, mask = NULL) {
      features <- inputs[[1]]
      hidden <- inputs[[2]]
      # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)
      # hidden shape == (batch_size, gru_units)
      # hidden_with_time_axis shape == (batch_size, 1, gru_units)
      hidden_with_time_axis <- k_expand_dims(hidden, axis = 2)
        
      # score shape == (batch_size, 64, 1)
      score <- self$V(k_tanh(self$W1(features) + self$W2(hidden_with_time_axis)))
      # attention_weights shape == (batch_size, 64, 1)
      attention_weights <- k_softmax(score, axis = 2)
      # context_vector shape after sum == (batch_size, embedding_dim)
      context_vector <- k_sum(attention_weights * features, axis = 2)
        
      list(context_vector, attention_weights)
    }
  })
}
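The computation in the attention module can be traced for a single image with ordinary matrices. The following is a sketch with small toy sizes and random weights. Bias terms are left out for brevity, and all names are illustrative:

```r
set.seed(2)
n_cells <- 64; embedding_dim <- 8; units <- 4   # toy sizes

feats  <- matrix(rnorm(n_cells * embedding_dim), nrow = n_cells)
hidden <- rnorm(units)
W1 <- matrix(rnorm(embedding_dim * units), nrow = embedding_dim)
W2 <- matrix(rnorm(units * units), nrow = units)
v  <- rnorm(units)

softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }

# one score per grid cell: v' tanh(W1 f_i + W2 h)
hidden_proj <- as.vector(hidden %*% W2)
score <- tanh(sweep(feats %*% W1, 2, hidden_proj, `+`)) %*% v

# normalize over the 64 cells, then take the weighted sum of the features
attention_weights <- softmax(as.vector(score))
context_vector <- colSums(attention_weights * feats)
```

The weights form a probability distribution over grid cells, and the context vector has the same dimensionality as a single cell's features.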

Decoder

The decoder at each time step calls the attention module with the features it got from the encoder and its last hidden state, and receives back an attention vector. The attention vector gets concatenated with the current input and further processed by a GRU and two fully connected layers, the last of which gives us the (unnormalized) probabilities for the next word in the caption.

The current input at each time step here is the previous word: the correct one during training (teacher forcing), the last generated one during inference.

rnn_decoder <- function(embedding_dim, gru_units, vocab_size, name = NULL) {
    
  keras_model_custom(name = name, function(self) {
      
    self$gru_units <- gru_units
    self$embedding <- layer_embedding(input_dim = vocab_size, 
                                      output_dim = embedding_dim)
    self$gru <- if (tf$test$is_gpu_available()) {
      layer_cudnn_gru(
        units = gru_units,
        return_sequences = TRUE,
        return_state = TRUE,
        recurrent_initializer = 'glorot_uniform'
      )
    } else {
      layer_gru(
        units = gru_units,
        return_sequences = TRUE,
        return_state = TRUE,
        recurrent_initializer = 'glorot_uniform'
      )
    }
      
    self$fc1 <- layer_dense(units = self$gru_units)
    self$fc2 <- layer_dense(units = vocab_size)
      
    self$attention <- attention_module(self$gru_units)
      
    function(inputs, mask = NULL) {
      x <- inputs[[1]]
      features <- inputs[[2]]
      hidden <- inputs[[3]]
        
      c(context_vector, attention_weights) %<-% 
        self$attention(list(features, hidden))
        
      # x shape after passing through embedding == (batch_size, 1, embedding_dim)
      x <- self$embedding(x)
        
      # x shape after concatenation == (batch_size, 1, 2 * embedding_dim)
      x <- k_concatenate(list(k_expand_dims(context_vector, 2), x))
        
      # passing the concatenated vector to the GRU
      c(output, state) %<-% self$gru(x)
        
      # shape == (batch_size, 1, gru_units)
      x <- self$fc1(output)
        
      # x shape == (batch_size, gru_units)
      x <- k_reshape(x, c(-1, dim(x)[[3]]))
        
      # output shape == (batch_size, vocab_size)
      x <- self$fc2(x)
        
      list(x, state, attention_weights)
        
    }
  })
}
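Teacher forcing, as described above, boils down to shifting the caption by one position: at every step the decoder is fed the correct previous word and asked to predict the next one. Here is a small illustration on a toy caption, using words instead of indices for readability:

```r
caption <- c("<start>", "a", "dog", "runs", "<end>")
n <- length(caption)

# what the decoder sees and what it should predict at each step
steps <- data.frame(
  decoder_input = caption[-n],
  target        = caption[-1],
  stringsAsFactors = FALSE
)
steps
```

During inference there is no correct previous word to feed in, so the word generated in the previous step takes its place.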

Loss function, and instantiating it all

Now that we’ve defined our model (built of three custom models), we still need to actually instantiate it (being precise: the two classes we will access from outside, that is, the encoder and the decoder).

We also need to instantiate an optimizer (Adam will do), and define our loss function (categorical crossentropy).
Note that tf$nn$sparse_softmax_cross_entropy_with_logits expects raw logits instead of softmax activations, and that we’re using the sparse variant because our labels are not one-hot-encoded.

encoder <- cnn_encoder(embedding_dim)
decoder <- rnn_decoder(embedding_dim, gru_units, vocab_size)

optimizer = tf$train$AdamOptimizer()

cx_loss <- function(y_true, y_pred) {
  mask <- 1 - k_cast(y_true == 0L, dtype = "float32")
  loss <- tf$nn$sparse_softmax_cross_entropy_with_logits(
    labels = y_true,
    logits = y_pred
  ) * mask
  tf$reduce_mean(loss)
}
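The effect of the mask is easiest to see on a tiny example. Below is a base R re-implementation of the same idea, a sketch rather than the TensorFlow function itself. The names log_softmax and masked_cx_loss are hypothetical, and labels are 0-based with 0 marking padding, as above:

```r
log_softmax <- function(z) {
  z <- z - max(z)            # subtract the maximum for numerical stability
  z - log(sum(exp(z)))
}

masked_cx_loss <- function(y_true, logits) {
  # y_true: 0-based word indices, 0 marks padding
  # logits: matrix with one row of raw scores per example
  per_example <- vapply(seq_along(y_true), function(i) {
    -log_softmax(logits[i, ])[y_true[i] + 1]
  }, numeric(1))
  mask <- as.numeric(y_true != 0)
  mean(per_example * mask)
}

# two examples, vocabulary of 4, all logits equal: each loss is log(4)
logits <- matrix(0, nrow = 2, ncol = 4)
loss <- masked_cx_loss(c(2L, 0L), logits)
```

The second example is padding and contributes zero, but it still counts in the mean, so the result here is log(4) / 2.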

Training

Training the captioning model is a time-consuming process, and you will for sure want to save the model’s weights!
How does this work with eager execution?

We create a tf$train$Checkpoint object, passing it the objects to be saved: In our case, the encoder, the decoder, and the optimizer. Later, at the end of each epoch, we will ask it to write the respective weights to disk.

restore_checkpoint <- FALSE

checkpoint_dir <- "./checkpoints_captions"
checkpoint_prefix <- file.path(checkpoint_dir, "ckpt")
checkpoint <- tf$train$Checkpoint(
  optimizer = optimizer,
  encoder = encoder,
  decoder = decoder
)

As we’re just starting to train the model, restore_checkpoint is set to false. Later, restoring the weights will be as easy as

if (restore_checkpoint) {
  checkpoint$restore(tf$train$latest_checkpoint(checkpoint_dir))
}

The training loop is structured just like in the machine translation case: We loop over epochs, batches, and the training targets, feeding in the correct previous word at every timestep.
Again, tf$GradientTape takes care of recording the forward pass and calculating the gradients, and the optimizer applies the gradients to the model’s weights.
As each epoch ends, we also save the weights.

num_epochs <- 20

if (!restore_checkpoint) {
  for (epoch in seq_len(num_epochs)) {
    
    total_loss <- 0
    progress <- 0
    train_iter <- make_iterator_one_shot(train_dataset)
    
    until_out_of_range({
      
      batch <- iterator_get_next(train_iter)
      loss <- 0
      img_tensor <- batch[[1]]
      target_caption <- batch[[2]]
      
      dec_hidden <- k_zeros(c(batch_size, gru_units))
      
      dec_input <- k_expand_dims(
        rep(list(word_index_df[word_index_df$word == "<start>", "index"]), 
            batch_size)
      )
      
      with(tf$GradientTape() %as% tape, {
        
        features <- encoder(img_tensor)
        
        for (t in seq_len(dim(target_caption)[2] - 1)) {
          c(preds, dec_hidden, weights) %<-%
            decoder(list(dec_input, features, dec_hidden))
          loss <- loss + cx_loss(target_caption[, t], preds)
          dec_input <- k_expand_dims(target_caption[, t])
        }
        
      })
      
      total_loss <-
        total_loss + loss / k_cast_to_floatx(dim(target_caption)[2])
      
      variables <- c(encoder$variables, decoder$variables)
      gradients <- tape$gradient(loss, variables)
      
      optimizer$apply_gradients(purrr::transpose(list(gradients, variables)),
                                global_step = tf$train$get_or_create_global_step()
      )
    })
    cat(paste0(
      "\n\nTotal loss (epoch): ",
      epoch,
      ": ",
      (total_loss / k_cast_to_floatx(buffer_size)) %>% as.double() %>% round(4),
      "\n"
    ))
    
    checkpoint$save(file_prefix = checkpoint_prefix)
  }
}

Peeking at results

Just like in the translation case, it’s interesting to look at model performance during training. The companion code has that functionality built-in, so you can watch model progress for yourself.

The basic function here is get_caption: It gets passed the path to an image, loads it, obtains its features from Inception V3, and then asks the encoder-decoder model to generate a caption. If at any point the model produces the end symbol, we stop early. Otherwise, we continue until we hit the predefined maximum length.

get_caption <-
  function(image) {
    attention_matrix <-
      matrix(0, nrow = max_length, ncol = attention_features_shape)
    temp_input <- k_expand_dims(load_image(image)[[1]], 1)
    img_tensor_val <- image_model(temp_input)
    img_tensor_val <- k_reshape(
      img_tensor_val,
      list(dim(img_tensor_val)[1], -1, dim(img_tensor_val)[4])
    )
    features <- encoder(img_tensor_val)
    
    dec_hidden <- k_zeros(c(1, gru_units))
    dec_input <-
      k_expand_dims(
        list(word_index_df[word_index_df$word == "<start>", "index"])
      )
    
    result <- ""
    
    for (t in seq_len(max_length - 1)) {
      
      c(preds, dec_hidden, attention_weights) %<-%
        decoder(list(dec_input, features, dec_hidden))
      attention_weights <- k_reshape(attention_weights, c(-1))
      attention_matrix[t,] <- attention_weights %>% as.double()
      
      # the pipe has to end the line, otherwise R stops parsing the expression here
      pred_idx <- tf$multinomial(exp(preds), num_samples = 1)[1, 1] %>%
        as.double()
      pred_word <-
        word_index_df[word_index_df$index == pred_idx, "word"]
      
      if (pred_word == "<end>") {
        result <-
          paste(result, pred_word)
        attention_matrix <-
          attention_matrix[1:length(str_split(result, " ")[[1]]), , 
                           drop = FALSE]
        return (list(result, attention_matrix))
      } else {
        result <-
          paste(result, pred_word)
        dec_input <- k_expand_dims(list(pred_idx))
      }
    }
    
    list(str_trim(result), attention_matrix)
  }
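The control flow of get_caption (sample a word, stop at the end symbol, otherwise feed the word back in) can be tried out without any trained model. In the sketch below, the decoder is replaced by a made-up function toy_step that always favors one particular next word. The names generate, toy_step and softmax are hypothetical:

```r
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }

generate <- function(step_fn, vocab, max_length,
                     start = "<start>", end = "<end>") {
  words <- character(0)
  prev <- start
  for (t in seq_len(max_length - 1)) {
    probs <- softmax(step_fn(prev))
    # sample the next word according to the predicted distribution
    nxt <- vocab[sample.int(length(vocab), 1, prob = probs)]
    words <- c(words, nxt)
    if (nxt == end) break    # stop early on the end symbol
    prev <- nxt              # otherwise feed the word back in
  }
  paste(words, collapse = " ")
}

vocab <- c("a", "dog", "<end>")

# toy stand-in for the decoder: returns logits that put all
# probability mass on a fixed successor of the previous word
toy_step <- function(prev) {
  nxt <- switch(prev, "<start>" = "a", "a" = "dog", "dog" = "<end>")
  ifelse(vocab == nxt, 0, -Inf)
}

caption <- generate(toy_step, vocab, max_length = 10)
caption
```

With a real decoder the distribution is of course not degenerate, so repeated calls on the same image may well produce different captions.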

Three picks from the training set

This really tells us that the network has been able to generalize over, let’s not call them concepts, but mappings between visual and textual entities, say. It’s true that it will have seen some of these images before, because images come with several captions. You could be more strict in setting up your training and validation sets, but here we don’t really care about objective performance scores, so it does not really matter.

Attention over image areas

Anderson et al. (2017) use object detection techniques to isolate interesting objects bottom-up, and an LSTM stack wherein the first LSTM computes top-down attention guided by the output word generated by the second.

Another interesting approach involving attention is using a multimodal attentive translator (Liu et al. 2017), where the image features are encoded and presented in a sequence, such that we end up with sequence models both on the encoding and the decoding sides.

Another alternative is to add a learned topic to the information input (Zhu, Xue, and Yuan 2018), which again is a top-down feature found in human cognition.

If you find one of these, or yet another, approach more convincing, an eager execution implementation, in the style of the above, will likely be a sound way of implementing it.

Anderson, Peter, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. “Bottom-Up and Top-Down Attention for Image Captioning and VQA.” CoRR abs/1707.07998. http://arxiv.org/abs/1707.07998.

Liu, Chang, Fuchun Sun, Changhu Wang, Feng Wang, and Alan L. Yuille. 2017. “A Multimodal Attentive Translator for Image Captioning.” CoRR abs/1702.05658. http://arxiv.org/abs/1702.05658.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.

Zhu, Zhihao, Zhan Xue, and Zejian Yuan. 2018. “A Topic-Guided Attention for Image Captioning.” CoRR abs/1807.03514v1. https://arxiv.org/abs/1807.03514v1.
