
Articles

Comparing LLM Performance Against Prompt Techniques & Domain Specific Datasets

COBUS GREYLING
September 19, 2023
·
4 min read

This study from August 2023 considers 10 different prompt techniques across six LLMs and six data types.

The study compared 10 zero-shot prompt reasoning strategies across six LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-XXL & Cohere command-xlarge) on six QA datasets ranging from the scientific to the medical domain.
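
To make that comparison grid concrete, here is a minimal Python sketch of how such a strategy-by-model accuracy evaluation could be run. It is not the paper's code: call_model() is a hypothetical stand-in for whichever LLM API is under test, and only two of the ten zero-shot CoT triggers are shown.

MODELS = ["davinci-002", "davinci-003", "gpt-3.5-turbo", "gpt-4",
          "flan-t5-xxl", "command-xlarge"]

# Two of the ten zero-shot CoT triggers evaluated in the study.
TRIGGERS = {
    "kojima": "Let's think step by step.",
    "zhou": "Let's work this out in a step by step way to be sure we have the right answer.",
}

INSTRUCTION = "Answer the following multiple-choice question."

def call_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a real API call for the model under test.
    raise NotImplementedError

def accuracy(model: str, trigger: str, dataset: list) -> float:
    # Fraction of multiple-choice items the model answers correctly.
    correct = 0
    for item in dataset:
        prompt = f"{INSTRUCTION}\n\n{item['question']}\n\n{item['answer_choices']}\n\n{trigger}"
        correct += int(call_model(model, prompt).strip() == item["label"])
    return correct / len(dataset)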

Some notable findings were:

  1. As is visible in the graphed data below, some models are optimised for specific prompting strategies and data domains.
  2. Chain-of-Thought (CoT) reasoning strategies yield gains across domains and LLMs.
  3. GPT-4 has the best performance across data domains and prompt techniques.

The header image depicts the overall performance of each of the six LLMs used in the study.

The image below shows the 10 prompt techniques used in the study, with an example of each prompt and the score it achieved. The scores shown here relate specifically to the GPT-4 model.

Adapted From Source

The prompt template structure used in the study is shown below.

The {instruction} is placed before the question and answer choices.

The {question} is the multiple-choice question the model is expected to answer.

The {answer_choices} are the options provided for the multiple-choice question.

The {cot_trigger} is placed after the question.

{instruction}

{question}

{answer_choices}

{cot_trigger}
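
As a rough illustration (the instruction, question and trigger below are invented examples, not items from the study's datasets), filling the template in Python might look like this:

template = "{instruction}\n\n{question}\n\n{answer_choices}\n\n{cot_trigger}"

prompt = template.format(
    instruction="Answer the following multiple-choice question.",
    question="Which organ is primarily responsible for filtering blood?",
    answer_choices="A) Heart  B) Kidney  C) Lung  D) Liver",
    cot_trigger="Let's think step by step.",  # Kojima-style zero-shot CoT trigger
)
print(prompt)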

The image below depicts the performance of the various prompting techniques (vertical axis) against the LLMs (horizontal axis).

Something I found interesting is that Google's Flan-T5-XXL model does not follow the trend of improved performance with the Zhou prompting technique.

The Cohere model also seems to show a significant degradation in performance with the Kojima prompting technique.

Source (Table 14: Accuracy of prompts per model averaged over datasets.)

The table below, taken from the paper, shows the six datasets with a description of each.

Source

The paper also shows the performance of each LLM on the six datasets. The toughest datasets for the LLMs to navigate were MedQA, MedMCQA and, arguably, OpenBookQA.

Throughout the study it is evident that GPT-4's performance is stellar. Also noticeable is Google's strong performance on OpenBookQA.

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
