To run the ParaTest system, a machine with at least one GPU is recommended but not required; ParaTest also provides a mode that does not depend on GPU-backed models.
- Step 1: Whether or not GPUs are available, you need to create and activate the computation environment using `requirements.txt`:

  ```bash
  # create a base conda environment
  conda create --name paratest python==3.9
  # activate the conda environment
  conda activate paratest
  # install dependencies
  pip install -r requirements.txt
  pip install -e .
  ```
- Step 2 (Optional): Download `checkpoints.tar` (the validity checkers, about 7.1 GB) from here to the `checkpoints/` directory. After extracting it, the directory should look like the following, where each folder corresponds to a specification mentioned in the paper:

  ```
  checkpoints/
  ├── 01
  ...
  ├── 14
  └── 15
  ```
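  If you prefer to script this step, the extraction can also be done with Python's standard `tarfile` module. The sketch below assumes `checkpoints.tar` has been downloaded into the repository root; adjust the target path if the archive already contains a top-level `checkpoints/` folder.

  ```python
  import tarfile

  # a minimal sketch, assuming checkpoints.tar sits in the repository root
  with tarfile.open("checkpoints.tar") as archive:
      archive.extractall("checkpoints/")
  ```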
- Step 3: Put your OpenAI API key in `config.json`. If you do not have an OpenAI key, you can apply for one here; you will receive a free 20 USD quota, which is sufficient to replicate the experiments in the paper.

  ```json
  {
      "key": "<OPENAI-API-KEY>",
      ...
  }
  ```
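  Before starting a long generation run, you may want to sanity-check the configuration. The snippet below is a minimal sketch that only assumes `config.json` is plain JSON with a top-level `"key"` field, as shown above.

  ```python
  import json

  # a minimal sketch: confirm config.json carries a real API key
  with open("config.json") as f:
      config = json.load(f)

  assert config.get("key") and config["key"] != "<OPENAI-API-KEY>", \
      "put a valid OpenAI API key in config.json"
  ```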
After the preparation steps, all you need to do is work with the `run.py` file. For example, if you would like to generate test cases for the specification "Temporal: Before something vs. after something", which is numbered 13, you could:
- Without a GPU: This requires you to annotate the validity of the generated test cases with a human annotator in the loop. You will be prompted with a CLI-based interface that asks for your label.

  ```bash
  python run.py --specification 13
  ```
- With a GPU: The validity checkers save the step of manually annotating the validity of generated test cases; this makes the pipeline fully automatic (hence the `--automatic` flag).

  ```bash
  python run.py --specification 13 --automatic
  ```
You will find a `labeling/13` directory automatically created; it stores all the annotated data used for testing NLP models.
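To see what was produced, you can list the files under `labeling/13`. The sketch below makes no assumption about the file format, which is not documented here.

```python
from pathlib import Path

# a minimal sketch: list the annotated data generated for specification 13
for path in sorted(Path("labeling/13").iterdir()):
    print(path.name, path.stat().st_size, "bytes")
```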
The test cases generated in the previous step are ready for testing NLP models. The example below tests the most downloaded sentiment classifier on the HuggingFace model hub, `gchhablani/bert-base-cased-finetuned-sst2`, and generates a report of its error rates.
```python
from paratest.suite import TestSuite
from paratest.classifier import TransformersSequenceClassifier

# wrap the HuggingFace checkpoint in a ParaTest classifier
clf = TransformersSequenceClassifier(
    model_name="gchhablani/bert-base-cased-finetuned-sst2",
    model_type="bert",
    num_labels=2,
)

# build a suite from specifications 1 and 3 and report error rates
suite = TestSuite(specifications=[1, 3])
suite.test(clf)
```
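Because the suite is independent of the classifier, the same specifications can be reused across models. The sketch below assumes `suite.test` can be called repeatedly and that `distilbert` is a supported `model_type`; both are assumptions rather than documented behavior, and the second checkpoint is just an arbitrary example.

```python
from paratest.suite import TestSuite
from paratest.classifier import TransformersSequenceClassifier

suite = TestSuite(specifications=[1, 3])

# a sketch: reuse one suite to compare several sentiment classifiers
for name, mtype in [
    ("gchhablani/bert-base-cased-finetuned-sst2", "bert"),
    ("distilbert-base-uncased-finetuned-sst-2-english", "distilbert"),
]:
    clf = TransformersSequenceClassifier(
        model_name=name,
        model_type=mtype,
        num_labels=2,
    )
    suite.test(clf)
```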
It is possible to apply the ParaTest system to specifications that suit your own application but are not considered in the paper. All you need to do is add a row to `dataset/initial_test_cases.json`. For example, if you would like to test a new specification numbered 123, you just need to run:

```bash
python run.py --specification 123
```
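For reference, adding the row programmatically might look like the sketch below. The field names (`"specification"`, `"description"`, `"test_cases"`) are illustrative assumptions rather than the documented schema, so mirror the structure of an existing row in the real file instead.

```python
import json

# a hypothetical sketch of appending a new specification; field names are
# assumptions -- copy the structure of an existing row in the actual file
with open("dataset/initial_test_cases.json") as f:
    rows = json.load(f)

rows.append({
    "specification": 123,
    "description": "<describe the linguistic capability to test>",
    "test_cases": ["<seed test case 1>", "<seed test case 2>"],
})

with open("dataset/initial_test_cases.json", "w") as f:
    json.dump(rows, f, indent=2)
```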