
The enhanced RCNN model used for sentence similarity classification


Sentence Similarity

(mainly based on Enhanced-RCNN model and other baselines)

Getting Started

To clone this project, make sure git-lfs is installed.

Please use the following command to clone this project:


Clone repo without downloading real files with GitLFS

Quick Execute All

# Data preprocessing
# Train & Evaluate
./ [model name]

# Test Ant Submission functionality
bash raw_data/competition_train.csv ant_test_pred.csv
# pack the Ant Submission files
zip -r . -i \*.py \*.sh -i data/stopwords.txt


# Data preprocessing
## Ant
python3 [word/char] train
## PiPiDai

# Train & Evaluate
## Chinese
python3 --dataset [Ant/CCKS/PiPiDai] --model [model name] --word-segment [word/char]
# to train all the models at once, use ./
## English
python3 --dataset Quora --model [model name]

# Use Tensorboard
tensorboard --logdir log/same_as_model_log_dir
## Remote connection: forward a local port to the remote port (run on the local machine)
## You should then be able to access TensorBoard at http://localhost:$LOCAL_PORT
ssh -NfL $LOCAL_PORT:localhost:$REMOTE_PORT $REMOTE_USER@$REMOTE_IP > /dev/null 2>&1
### To close the connection, kill the ssh process running in the background
ps aux | grep "ssh -NfL" | grep -v grep | awk '{print $2}' | xargs kill


  • ERCNN (default)
  • Transformer
    • ERCNN + replace the BiRNN with Transformer
  • Baseline
    • Siamese Series
      • SiameseCNN
        • Convolutional Neural Networks for Sentence Classification
        • Character-level Convolutional Networks for Text Classification
      • SiameseRNN
      • SiameseLSTM
        • Siamese Recurrent Architectures for Learning Sentence Similarity
      • SiameseRCNN
        • Siamese Recurrent Architectures for Learning Sentence Similarity
      • SiameseAttentionRNN
        • Text Classification Research with Attention-based Recurrent Neural Networks
    • Multi-Perspective Series
      • MPCNN
        • Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks
        • essentially SiameseCNN with more sentence-similarity measurements (it also uses a Siamese network to encode the sentences)
        • TODO: the model is too big to run (it consumes too much GPU memory) => use a smaller batch size
      • MPLSTM: skip
      • BiMPM
        • Bilateral Multi-Perspective Matching for Natural Language Sentences
    • ESIM


  • Ant - Chinese
  • CCKS - Chinese
  • PiPiDai - Chinese (encoded)
  • Quora - English


  • train
    • uses 70% of the data for training
    • k-fold cross-validation (k == training epochs)
    • evaluates on the validation set at the end of each epoch and saves the model
  • test
    • uses the remaining 30% of the data for testing
    • loads the latest model saved with the same settings
  • both (train, then test)
  • predict
    • loads the latest model saved with the same settings
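
The split and fold handling described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `split_train_test` and `fold_for_epoch` are hypothetical, and holding out a different fold in each epoch is only one reading of "k == training epochs". The defaults mirror `--test-split 0.3` and `--seed 16` from the help text.

```python
import random


def split_train_test(samples, test_split=0.3, seed=16):
    """Shuffle deterministically and hold out `test_split` of the data."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_split))
    return shuffled[n_test:], shuffled[:n_test]


def fold_for_epoch(train_samples, k, epoch):
    """Return (train_part, valid_part), holding out fold `epoch % k`."""
    fold = epoch % k
    valid = [s for i, s in enumerate(train_samples) if i % k == fold]
    train = [s for i, s in enumerate(train_samples) if i % k != fold]
    return train, valid


train_set, test_set = split_train_test(range(100))
for epoch in range(10):
    train_part, valid_part = fold_for_epoch(train_set, k=10, epoch=epoch)
    # train on train_part, evaluate on valid_part, then save a checkpoint
```

With this scheme every training sample is used for validation exactly once over the k epochs.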


  • random (original): keeps the skewed class distribution of the data (the ratios are listed below)
  • balance: positive and negative samples are drawn in equal numbers
    • generate-train
    • generate-test
$ python3 --help
usage: [-h] [--dataset dataset] [--mode mode] [--sampling mode]
              [--generate-train] [--generate-test] [--model model]
              [--word-segment WS] [--batch-size N] [--test-batch-size N]
              [--k-fold N] [--lr N] [--beta1 N] [--beta2 N] [--epsilon N]
              [--no-cuda] [--seed N] [--test-split N] [--log-interval N]
              [--test-interval N] [--not-save-model]

Enhanced RCNN on Sentence Similarity

optional arguments:
  -h, --help             show this help message and exit
  --dataset dataset      Chinese: Ant, CCKS; English: Quora (default: Ant)
  --mode mode            script mode [train/test/both/predict/submit(Ant)]
                         (default: both)
  --sampling mode        sampling mode during training (default: random)
  --generate-train       use generated negative samples when training (used in
                         balance sampling)
  --generate-test        use generated negative samples when testing (used in
                         balance sampling)
  --model model          model to use [ERCNN/Transformer/Siamese(CNN/RNN/LSTM/R
                         CNN/AttentionRNN)] (default: ERCNN)
  --word-segment WS      chinese word split mode [word/char] (default: char)
  --chinese-embed embed  chinese embedding (default: cw2vec)
  --not-train-embed      whether to freeze the embedding parameters
  --batch-size N         input batch size for training (default: 256)
  --test-batch-size N    input batch size for testing (default: 1000)
  --k-fold N             k-fold cross validation i.e. number of epochs to train
                         (default: 10)
  --lr N                 learning rate (default: 0.001)
  --beta1 N              beta 1 for Adam optimizer (default: 0.9)
  --beta2 N              beta 2 for Adam optimizer (default: 0.999)
  --epsilon N            epsilon for Adam optimizer (default: 1e-08)
  --no-cuda              disables CUDA training
  --seed N               random seed (default: 16)
  --test-split N         test data split (default: 0.3)
  --logdir path          set log directory (default: ./log)
  --log-interval N       how many batches to wait before logging training
  --test-interval N      how many batches to test during training
  --not-save-model       for not saving the current model
  --load-model name      load the specific model checkpoint file
  --submit-path path:    submission file path (currently for Ant dataset)

Related Additional Datasets



  • raw_data/competition_train.csv - Ant Financial

  • raw_data/train.csv - Quora Question Pairs

  • word2vec/substoke_char.vec.avg - Ant Financial

  • word2vec/substoke_word.vec.avg - Ant Financial

  • data/stopwords.txt - Ant Financial

  • word2vec/glove.word2vec.txt - Quora Question Pairs

  • raw_data/task3_train.txt - CCKS 2018

  • raw_data/task3_dev.txt - CCKS 2018

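    # shell: extract the GloVe vectors
    unzip glove.840B.300d
    # python3: convert the GloVe format to the word2vec format
    from gensim.scripts.glove2word2vec import glove2word2vec
    _ = glove2word2vec('glove.840B.300d.txt', 'word2vec/glove.word2vec.txt')
    # shell: remove the original GloVe files
    rm glove.840B*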


  • data/sentence_char_train.csv - Ant Financial
  • data/sentence_word_train.csv - Ant Financial
  • word2vec/Ant_char_tokenizer.pickle - Ant Financial
  • word2vec/Ant_char_embed_matrix.pickle - Ant Financial
  • word2vec/Ant_word_tokenizer.pickle - Ant Financial
  • word2vec/Ant_word_embed_matrix.pickle - Ant Financial
  • word2vec/Quora_tokenizer.pickle - Quora Question Pairs
  • word2vec/Quora_embed_matrix.pickle - Quora Question Pairs
  • model/*
  • log/*


ANT Financial Competition

Goal: classify whether two question sentences are asking the same thing => predict true or false

Evaluation: f1-score


  • Positive data: 18.23%
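
With so few positive pairs, accuracy is misleading, which is presumably why the competition scores by F1. The sketch below is a plain-Python illustration on toy labels (2 positives out of 11, close to the reported skew). The function name `f1_score` and the data are illustrative and not taken from the repository.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always answers "not the same question"
all_negative = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
f1_all_negative = f1_score(y_true, all_negative)

# A model that finds one of the two positives and raises one false alarm
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
f1_mixed = f1_score(y_true, mixed)
```

The always-negative model reaches about 82% accuracy but an F1 of 0, while the second model scores an F1 of 0.5.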

Quora Question Pairs

kaggle competitions download -c quora-question-pairs
unzip test.csv -d raw_data
unzip train.csv -d raw_data
rm *.zip

Goal: classify whether question pairs are duplicates or not => predict the probability that the questions are duplicates (a number between 0 and 1)

Evaluation: log loss between the predicted values and the ground truth
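
As a rough illustration of this metric, binary log loss can be computed as below. The clipping constant `eps` is an assumption to keep `log` finite, and the function name `log_loss` is illustrative.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


uninformed = log_loss([1, 0], [0.5, 0.5])       # always predicts 0.5
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

Predicting 0.5 everywhere gives ln 2 (about 0.693). Confident correct predictions score lower, and confident wrong ones are penalised heavily.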


  • Positive data: 36.92%
  • 400K rows in train set and about 2.35M rows in test set
  • 6 columns in train set but only 3 of them are in test set
    • train set
      • id - the id of a training set question pair
      • qid1, qid2 - unique ids of each question (only available in train.csv)
      • question1, question2 - the full text of each question
      • is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise
    • test set
      • test_id
      • question1, question2
  • about 63% non-duplicate questions and 37% duplicate questions in the training data set

CCKS 2018

CCKS: China Conference on Knowledge Graph and Semantic Computing


  • Positive data: 50%
  • Data amount: 100000

CHIP 2018




  • Positive data: 52%
  • Data amount: 254386


  • More evaluation metrics: recall & f1-score
  • Continue training?!
  • Potential multi-class classification
    • num_class input
    • sigmoid => softmax
    • (but how about siamese model??)
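
On the "sigmoid => softmax" item: a single sigmoid output is equivalent to a two-class softmax in which one logit is fixed at 0, so moving to `num_class` outputs generalises the binary case rather than replacing it. The check below is a plain-Python sketch, and the helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = 1.3
p_sigmoid = sigmoid(z)
p_softmax = softmax([0.0, z])[1]  # probability of the "positive" class
```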


Notes for unbalanced data

Balance data generator

In, the class BalanceDataHelper
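
The repository's `BalanceDataHelper` is not reproduced here. The idea of balanced sampling, drawing the same number of positive and negative pairs for each batch, might be sketched as follows. The function `balanced_batch` and the toy data are hypothetical.

```python
import random


def balanced_batch(positives, negatives, batch_size, rng):
    """Draw half of the batch from each class (sampling with replacement)."""
    half = batch_size // 2
    batch = [(x, 1) for x in rng.choices(positives, k=half)]
    batch += [(x, 0) for x in rng.choices(negatives, k=batch_size - half)]
    rng.shuffle(batch)
    return batch


rng = random.Random(16)
positives = ["pos_%d" % i for i in range(18)]   # minority class
negatives = ["neg_%d" % i for i in range(82)]   # majority class
batch = balanced_batch(positives, negatives, batch_size=8, rng=rng)
n_pos = sum(label for _, label in batch)
```

Sampling with replacement means minority-class pairs are repeated more often than majority-class pairs, which is the usual trade-off of oversampling.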

Use different loss

  • Dice loss

    • Dice Loss PR · Issue #1249 · pytorch/pytorch

    • other approach

      # Fragment of a loss method; assumes `import numpy as np`, `import torch`
      # and `from torch import nn` at module level.
      if weight is None:
          weight = torch.ones(
              y_pred.shape[-1], dtype=torch.float).to(device=y_pred.device)  # (C)
      if not mode:
          return self.simple_cross_entry(y_pred, golden, seq_mask, weight)
      probs = nn.functional.softmax(y_pred, dim=2)  # (B, T, C)
      B, T, C = probs.shape
      golden_index = golden.unsqueeze(dim=2)  # (B, T, 1)
      golden_probs = torch.gather(
          probs, dim=2, index=golden_index)  # (B, T, 1)
      probs_in_package = golden_probs.expand(B, T, T).transpose(1, 2)
      packages = np.array([np.eye(T)] * B)  # (B, T, T)
      probs_in_package = probs_in_package * \
          torch.tensor(packages, dtype=torch.float).to(device=probs.device)
      max_probs_in_package, _ = torch.max(probs_in_package, dim=2)
      golden_probs = golden_probs.squeeze(dim=2)
      golden_weight = golden_probs / (max_probs_in_package)  # (B, T)
      golden_weight = golden_weight.view(-1)
      golden_weight = golden_weight.detach()  # treat the weight as a constant
      y_pred = y_pred.view(-1, C)
      golden = golden.view(-1)
      seq_mask = seq_mask.view(-1)
      negative_label = torch.tensor(
          [0] * (B * T), dtype=torch.long, device=y_pred.device)
      golden_loss = nn.functional.cross_entropy(
          y_pred, golden, weight=weight, reduction='none')
      negative_loss = nn.functional.cross_entropy(
          y_pred, negative_label, weight=weight, reduction='none')
      loss = golden_weight * golden_loss + \
          (1 - golden_weight) * negative_loss  # (B * T)
      # masked mean over the valid positions
      loss = torch.dot(loss, seq_mask) / (torch.sum(seq_mask) + self.epsilon)
  • Triplet-Loss

  • N-pair Loss
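
Of the alternative losses listed above, Dice loss is the simplest to sketch for binary similarity. The version below is a generic soft Dice loss written for illustration. It is not the code from the linked PyTorch issue, and the function name `soft_dice_loss` and the `smooth` term are assumptions.

```python
import torch


def soft_dice_loss(logits, targets, smooth=1.0):
    """1 - Dice coefficient between sigmoid(logits) and binary targets."""
    probs = torch.sigmoid(logits)
    intersection = torch.sum(probs * targets)
    denominator = torch.sum(probs) + torch.sum(targets)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


logits = torch.tensor([4.0, -4.0, -4.0, 4.0])
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])
good = soft_dice_loss(logits, targets)    # predictions agree with the targets
bad = soft_dice_loss(-logits, targets)    # predictions are inverted
```

Because the score depends on the overlap with the positive class rather than on per-sample accuracy, the large number of easy negatives contributes less than it does with cross-entropy.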

Notes about Virtualenv

# this will create a env_name folder in current directory
virtualenv --python=/path/to/python3.x env_name

# activate the environment
source ./env_name/bin/activate

Add aliases in bashrc

  • Go to the working directory and activate the environment
    • alias davidlee="cd /home/username/working_dir; source env_name/bin/activate"
  • Use a pip mirror (index URL) when installing packages
    • alias pipp="pip install -i"

Use the virtualenv as a Jupyter notebook kernel

  1. Make sure the environment is activated
  2. pip3 install jupyterlab
  3. python3 -m ipykernel install --user --name=python3.6virtualenv
  4. Launch Jupyter Notebook as usual
  5. Go to Kernel > Change kernel > select python3.6virtualenv






Related Project


Model Source Code



Candidate Set


Siamese Models

Siamese-CNN, Siamese-RNN, Siamese-LSTM, Siamese-RCNN, Siamese-Attention-RNN

Contrastive Loss
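
A minimal sketch of a contrastive loss for a Siamese encoder follows, assuming Euclidean distance between the two sentence embeddings and a label of 1 for similar pairs. The function name `contrastive_loss` and the `margin` default are illustrative and may differ from what the repository uses.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    distance = F.pairwise_distance(emb_a, emb_b)  # (batch,)
    positive_term = is_similar * distance.pow(2)
    negative_term = (1 - is_similar) * torch.clamp(margin - distance, min=0).pow(2)
    return 0.5 * torch.mean(positive_term + negative_term)


emb = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
loss_similar = contrastive_loss(emb, emb, torch.ones(2))      # identical, labelled similar
loss_dissimilar = contrastive_loss(emb, emb, torch.zeros(2))  # identical, labelled dissimilar
```

Identical embeddings cost almost nothing when the pair is labelled similar, and about `0.5 * margin ** 2` when it is labelled dissimilar.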

Troubleshooting

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

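nn.Module instances stored in a plain Python list are not registered as submodules, so model.to(device) does not move their parameters to the device. Store them in an nn.ModuleList instead.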
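
The error above typically appears when layers are kept in a plain Python list. The short check below, with illustrative class names, shows that such layers are invisible to `parameters()`, and therefore to `.to(device)`, while the same layers in an `nn.ModuleList` are registered.

```python
import torch.nn as nn


class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain list: these layers are NOT registered as submodules
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer, so .to(device) moves them too
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


n_plain = len(list(PlainList().parameters()))
n_registered = len(list(Registered().parameters()))
```

Here `n_plain` is 0 and `n_registered` is 6 (a weight and a bias for each of the three layers).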


Because of the Git LFS bandwidth quota, you may run into problems when cloning this project.

git lfs clone --depth=1


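import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class Attention(nn.Module):
    def __init__(self,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 attn_dim: int):
        super().__init__()

        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        # bidirectional encoder output (2 * enc_hid_dim) concatenated with the decoder state
        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim

        self.attn = nn.Linear(self.attn_in, attn_dim)

    def forward(self,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tensor:
        # decoder_hidden: (batch, dec_hid_dim)
        # encoder_outputs: (src_len, batch, enc_hid_dim * 2)
        src_len = encoder_outputs.shape[0]

        repeated_decoder_hidden = decoder_hidden.unsqueeze(
            1).repeat(1, src_len, 1)  # (batch, src_len, dec_hid_dim)

        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # (batch, src_len, enc_hid_dim * 2)

        energy = torch.tanh(self.attn(torch.cat((
            repeated_decoder_hidden,
            encoder_outputs),
            dim=2)))  # (batch, src_len, attn_dim)

        attention = torch.sum(energy, dim=2)  # (batch, src_len)

        return F.softmax(attention, dim=1)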

import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed module-level settings; adjust MAX_LENGTH to the longest input sentence
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_LENGTH = 10


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # attention weights over the (max_length) encoder positions
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # weighted sum of the encoder outputs
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)

        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0

        weight = torch.zeros(feature_dim, 1)
        self.weight = nn.Parameter(weight)

        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))

    def forward(self, x, mask=None):
        # x: (batch, step_dim, feature_dim)
        feature_dim = self.feature_dim
        step_dim = self.step_dim

        # one score per time step: (batch * step_dim, 1) -> (batch, step_dim)
        eij = torch.mm(
            x.contiguous().view(-1, feature_dim),
            self.weight
        ).view(-1, step_dim)

        if self.bias:
            eij = eij + self.b

        eij = torch.tanh(eij)
        a = torch.exp(eij)

        if mask is not None:
            a = a * mask

        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)

