
Articles

Comparing LLM Performance Against Prompt Techniques & Domain Specific Datasets

COBUS GREYLING
September 19, 2023 · 4 min read

This study from August 2023 compares 10 different prompt techniques across six LLMs and six data types.

Specifically, it compared 10 zero-shot prompt reasoning strategies across six LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl & Cohere command-xlarge) on six QA datasets spanning scientific and medical domains.

Some notable findings were:

  1. As is visible in the graphed data below, some models are optimised for specific prompting strategies and data domains.
  2. Gains from Chain-of-Thought (CoT) reasoning strategies hold across domains and LLMs (see the trigger examples after this list).
  3. GPT-4 has the best performance across data domains and prompt techniques.
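
To make the CoT point concrete: most of the zero-shot strategies compared in the study differ mainly in the reasoning trigger sentence appended to the question. As a rough sketch, the two triggers below correspond to the Kojima and Zhou techniques discussed further down; the wordings are taken from those original papers and are my assumption of what the study used, not quotes from its templates.

```python
# Illustrative chain-of-thought trigger sentences for two of the zero-shot
# strategies compared in the study. The wordings follow the original Kojima
# and Zhou (APE) papers and may differ from the study's exact templates.
COT_TRIGGERS = {
    "kojima": "Let's think step by step.",
    "zhou": "Let's work this out in a step by step way to be sure we have the right answer.",
}
```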

The header image depicts the overall performance of each of the six LLMs used in the study.

The image below shows the 10 prompt techniques used in the study, with an example of each prompt and the score achieved by each technique. The scores shown here are for the GPT-4 model.

Adapted From Source

The prompt template structure used is shown below. The {instruction} is placed before the question and answer choices, the {question} is the multiple-choice question that the model is expected to answer, the {answer_choices} are the options provided for the multiple-choice question, and the {cot_trigger} is placed after the question.

{instruction}

{question}

{answer_choices}

{cot_trigger}
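
As an illustration, here is a minimal sketch of how this template might be assembled in code. The placeholder names come from the study; the concrete instruction, question, answer choices, and trigger text are illustrative assumptions.

```python
# Minimal sketch of how the study's prompt template might be assembled.
# The placeholder names ({instruction}, {question}, {answer_choices},
# {cot_trigger}) come from the study; the concrete values below are
# illustrative assumptions, not taken from the paper.

PROMPT_TEMPLATE = "{instruction}\n\n{question}\n{answer_choices}\n\n{cot_trigger}"


def build_prompt(instruction, question, answer_choices, cot_trigger):
    # Label the options A), B), C), ... before slotting them into the template.
    formatted_choices = "\n".join(
        f"{label}) {choice}" for label, choice in zip("ABCDEF", answer_choices)
    )
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        question=question,
        answer_choices=formatted_choices,
        cot_trigger=cot_trigger,
    )


print(build_prompt(
    instruction="Answer the following multiple-choice question.",
    question="Which vitamin deficiency causes scurvy?",
    answer_choices=["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    cot_trigger="Let's think step by step.",
))
```

Labelling the answer choices with letters is just one reasonable convention; the paper's exact formatting may differ.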

The image below depicts the performance of the various prompting techniques (vertical axis) for each LLM (horizontal axis).

Something I found interesting is that Google’s Flan-T5-XXL model does not follow the trend of improved performance with the Zhou prompting technique.

The Cohere model also seems to show a significant degradation in performance with the Kojima prompting technique.

Source (Table 14: Accuracy of prompts per model averaged over datasets.)
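
For clarity on what that table reports: each cell is a prompt-and-model pair's accuracy computed per dataset and then averaged over the six datasets. A minimal sketch of that aggregation, with hypothetical counts, might look like this:

```python
# Sketch of the aggregation behind "accuracy of prompts per model averaged
# over datasets": score each dataset separately, then take the unweighted
# mean. The counts below are hypothetical, not results from the paper.

def mean_accuracy(per_dataset_results):
    """per_dataset_results maps dataset name -> (num_correct, num_questions)."""
    accuracies = [correct / total for correct, total in per_dataset_results.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical counts for a single (model, prompt technique) pair:
results = {
    "MedQA": (410, 1000),
    "MedMCQA": (450, 1000),
    "OpenBookQA": (620, 1000),
}
print(f"Accuracy averaged over datasets: {mean_accuracy(results):.3f}")
```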

The table below, taken from the paper, shows the six datasets with a description of each.

Source

The next image shows the performance of each LLM on the six datasets. The toughest datasets for the LLMs to navigate were MedQA, MedMCQA and arguably OpenBookQA.

Throughout the study it is evident that GPT-4’s performance is stellar. Also noticeable is Google’s strong performance on OpenBookQA.

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
