Can Gen AI create learning questions better than a human?

AI questions can match human-written ones — under the right conditions

Lam et al. (2024) found that multiple-choice questions generated by Gen AI were comparable in quality to human-generated questions. The catch: this was only when the AI was given explicit prompt instructions, including a discipline-relevant framework and examples of questions. Generic prompting didn't get there.

Elkins et al. (2024) found that teachers preferred AI-generated questions built around Bloom's taxonomy over questions produced from simple, generic prompts. The pedagogy in the prompt mattered more than the model itself.

It probably won't save you time — at least not the first time

This one surprised us. Across teachers in the Elkins et al. (2024) study, the average time to write quizzes was roughly the same whether teachers used controlled AI generation, simple AI generation, or wrote the questions by hand.

If your reason for reaching for AI is purely speed, you may be disappointed. But if your reason is quality, consistency across Bloom's levels, or breaking through blank-page syndrome — the picture looks much better.

One exception: differentiation.

Differentiation is the practice of adjusting learning material so it suits learners working at different levels in the same classroom — extension for some, scaffolding for others, grade-level material for the middle.

Done well, it's powerful.

Done at all, it's time-consuming, which is why it often gets dropped from the planning week.

Akdeniz, Clark and Roberts (2025) found that AI-generated questions aligned well with lower-order cognitive tasks and remained consistent across subjects, with promise at higher-order levels too. The authors describe AI as a tool to support teachers in differentiating instruction, not to replace their expertise.

In a higher education context, differentiation might mean contextualising the same underlying concept for students from different disciplines in a shared course: framing a statistics question through a public health lens for one cohort and an economics lens for another. It might also mean producing two versions of a tutorial question set: one pitched at second-year students still consolidating fundamentals, and one stretched for third-years ready to apply the concept in less familiar territory.

Producing one version with AI takes roughly the same time as producing one by hand. Producing several tailored versions does not. This is the use case where the time argument finally lands in AI's favour.

Students notice the difference between AI and human questions

Kusam et al. (2025) compared student perceptions of AI-generated and manually written quizzes. Their findings:

Students rated AI quizzes as clearer and fairer (higher clarity and accuracy ratings).
Students found manual quizzes more challenging and engaging, and often better aligned to lecture emphasis.
In open comments, students described AI quizzes as straightforward review, while manual quizzes pushed deeper reasoning but were occasionally ambiguous.

In other words: AI is good at clean, fair, well-formed questions. Humans are still better at the kind of cognitively demanding, context-aware questions that stretch a learner. That's a useful split to keep in mind when you decide what to delegate.

The failure modes are predictable — which means you can plan for them

Across the studies, three problems with AI-generated questions kept coming up:

Scope drift — some AI items covered out-of-scope topics.
Uneven topic coverage — over-representation of certain topics; gaps in others.
Ambiguity and missing context — some AI questions lacked the detail students needed to answer them, which hurt student performance.
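One way to catch the first two problems before students do is to label each generated question with a topic and count the labels. The sketch below is a minimal illustration, not something taken from the studies: the function name `coverage_report`, the hand-assigned topic labels and the example syllabus are all assumptions.

```python
from collections import Counter


def coverage_report(question_topics, intended_topics):
    """Summarise how generated questions spread across the intended topics."""
    counts = Counter(question_topics)
    intended = list(intended_topics)
    return {
        # How many questions landed on each topic you meant to cover.
        "per_topic": {topic: counts.get(topic, 0) for topic in intended},
        # Intended topics with no questions at all (uneven coverage).
        "gaps": [topic for topic in intended if counts.get(topic, 0) == 0],
        # Topics that appear in the questions but not in your plan (scope drift).
        "out_of_scope": sorted(set(counts) - set(intended)),
    }


# Hypothetical example: topic labels assigned by hand to six AI-generated questions.
labels = ["sampling", "sampling", "sampling", "p-values", "regression", "sampling"]
syllabus = ["sampling", "p-values", "confidence intervals"]

report = coverage_report(labels, syllabus)
print(report)
```

A check like this will not detect ambiguity or missing context. That still needs a human read of each question.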

Elkins et al. also noted a useful detail: "Empirical results from preliminary experimentation showed that generating all of the questions together produced more diverse outputs, whereas generating them separately produced duplicate questions." Batch your generations.

What the literature recommends

Pulling the threads together, here's what the studies converge on:

Use prompt engineering. Give the model a clear role, goal and constraints. Include examples (few-shot) to control style and taxonomic level.
Prefer controlled prompts tied to pedagogy. Reference frameworks like Bloom's taxonomy rather than asking generically. Generate one question per Bloom level to reduce duplication.
Generate in batches. Ask for all questions at once to get more diverse outputs.
Keep context short and focused. Use targeted passages or knowledge points so generations stay on-topic and answerable.
Provide domain examples. Three to five human-crafted exemplars worked well in experiments.

A quick note on few-shot prompting

Few-shot prompting is a strategy where you give the model a small number (typically one to five) of example input → output pairs inside the prompt, so the model can follow that pattern when generating new outputs. The examples teach the model the desired format, style or level without fine-tuning the model weights.

A quick checklist for a good few-shot prompt:

Provide a role and a goal (e.g. "You are a quiz writer. Generate a Bloom's-level question.").
Include three to five clean examples covering the variation you want.
Supply the target passage or knowledge points.
Specify output format (MCQ, open-ended, length, whether the answer is included).
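If you assemble prompts in a script rather than by hand, the checklist maps onto a small helper. This is a sketch under stated assumptions: the function name `build_few_shot_prompt`, the "Input/Output" labels and the sample questions are hypothetical, and the resulting text can be pasted into whichever AI tool you use.

```python
def build_few_shot_prompt(role, goal, examples, passage, output_format):
    """Assemble a few-shot prompt from the four checklist items."""
    if not 1 <= len(examples) <= 5:
        raise ValueError("few-shot prompts typically carry one to five examples")

    lines = [f"{role} {goal}", "", "Examples:"]
    # Each example is an (input, output) pair the model should imitate.
    for number, (source, question) in enumerate(examples, start=1):
        lines.append(f"Example {number}")
        lines.append(f"Input: {source}")
        lines.append(f"Output: {question}")
    lines += ["", f"Passage: {passage}", f"Output format: {output_format}"]
    return "\n".join(lines)


# Hypothetical exemplars; replace with your own human-written questions.
exemplars = [
    ("Definition of the median", "Which value splits an ordered data set in half?"),
    ("Effect of outliers", "Why does one extreme value shift the mean more than the median?"),
    ("Choosing a measure", "Which measure of centre would you report for skewed income data, and why?"),
]

prompt = build_few_shot_prompt(
    role="You are a quiz writer.",
    goal="Generate a Bloom's-level question.",
    examples=exemplars,
    passage="[paste source material]",
    output_format="One multiple-choice question with four options and the answer marked.",
)
print(prompt)
```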

Approach 1: Use a tool with a built-in quiz generator

If you want to get a feel for AI-generated questions without writing a single prompt, start here. Tools like Mentimeter, NotebookLM (free to use with a Google account) and Thea have quiz generation built directly into the product. You can upload your source material, and the tool produces questions grounded in that content.

Best for: educators new to Gen AI, quick formative checks, and getting unstuck when you're staring at a blank page.

Watch for: limited control over Bloom's level and question style. You'll likely get clear, fair questions, but they may sit at lower cognitive levels by default. Also be careful about uploading information or sources you don't own.

Approach 2: Build your own question generator with a custom prompt

This is where the research findings really pay off. Using a general-purpose AI tool like Claude, Copilot or RMIT’s Val, you write a structured prompt that locks in the pedagogy: a clear role, the Bloom's level you want, two or three exemplar questions, and the source material. You save the prompt, and reuse it every time you need a new quiz.

A starting prompt looks something like this:

You are an experienced [subject] educator writing quiz questions for [year level / cohort]. Generate six multiple-choice questions based on the passage below: one at each level of Bloom's taxonomy (Remember, Understand, Apply, Analyse, Evaluate, Create). Each question should have four options and a clearly marked correct answer. Match the tone and style of these examples: [paste 2–3 of your own questions]. Passage: [paste source material].
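If you would rather keep the saved prompt in a file or script than in a notes app, the bracketed slots can become named placeholders. The sketch below uses Python's standard `string.Template`. The names `QUIZ_PROMPT` and `fill_quiz_prompt` and the sample values are assumptions for illustration.

```python
from string import Template

# The starting prompt, with the bracketed slots turned into $placeholders.
QUIZ_PROMPT = Template(
    "You are an experienced $subject educator writing quiz questions for $cohort. "
    "Generate six multiple-choice questions based on the passage below: one at each "
    "level of Bloom's taxonomy (Remember, Understand, Apply, Analyse, Evaluate, Create). "
    "Each question should have four options and a clearly marked correct answer. "
    "Match the tone and style of these examples:\n$examples\nPassage: $passage"
)


def fill_quiz_prompt(subject, cohort, examples, passage):
    """Fill every slot of the saved prompt; substitute() fails if one is missing."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(examples, start=1))
    return QUIZ_PROMPT.substitute(
        subject=subject, cohort=cohort, examples=numbered, passage=passage
    )


# Hypothetical values; swap in your own course details and questions.
filled = fill_quiz_prompt(
    subject="statistics",
    cohort="second-year students",
    examples=[
        "Which measure of centre is least affected by outliers?",
        "Why might a median be reported instead of a mean?",
    ],
    passage="[paste source material]",
)
print(filled)
```

Because `substitute()` raises a `KeyError` when a placeholder is left unfilled, a half-completed prompt cannot slip through unnoticed.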

Best for: educators who want consistency across quizzes, control over cognitive level, and the ability to refine a prompt over time so it gets better at producing your style of question.

Watch for: the first version of your prompt won't be the best version. Plan to iterate. Run the AI's questions past your learning outcomes, and ask the AI to answer its own questions — if it stumbles, your question probably needs more context.

Approach 3: Vibe-code your own interactive question experience

If you're comfortable letting AI generate code as well as content, you can go further: a custom interactive quiz, a scenario-based escape room, a self-marking practice tool, a discussion-prompt generator your students use directly. You describe what you want in plain English, and the AI builds it. This is what people mean by "vibe coding" — you don't need to write the code yourself, but you do need to be able to test it, give feedback, and know what good looks like.

Best for: educators wanting to experiment with format, build assets they can reuse for years, or design learning experiences that simply don't exist as off-the-shelf products.

Watch for: the highest effort of the three. Worth it for an asset you'll use repeatedly; probably not worth it for next Tuesday's lesson check-in.
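To give a sense of scale, the core of a self-marking practice tool can be very small. The sketch below is a hypothetical illustration of the kind of code an AI assistant might draft from a plain-English description. The names `Question` and `mark` and the sample questions are assumptions, and a real tool would add an interface on top.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    options: list
    answer: int  # index of the correct option in `options`


def mark(questions, responses):
    """Return (score, feedback) for one learner's chosen option indexes."""
    if len(questions) != len(responses):
        raise ValueError("expected exactly one response per question")

    score = 0
    feedback = []
    for question, chosen in zip(questions, responses):
        correct = chosen == question.answer
        if correct:
            score += 1
        verdict = "Correct" if correct else "Incorrect"
        # Always show the right answer so the learner can self-check.
        feedback.append(
            f"{verdict}: {question.text} Answer: {question.options[question.answer]}"
        )
    return score, feedback


# Hypothetical two-question quiz.
quiz = [
    Question(
        "Which measure is most affected by outliers?",
        ["Median", "Mean", "Mode", "Interquartile range"],
        1,
    ),
    Question(
        "Which plot best shows the distribution of a single numeric variable?",
        ["Histogram", "Scatter plot", "Line chart", "Pie chart"],
        0,
    ),
]

score, feedback = mark(quiz, [1, 2])
print(f"{score}/{len(quiz)}")
for line in feedback:
    print(line)
```

Testing a tool like this means trying right answers, wrong answers and incomplete submissions, which is the "know what good looks like" part of vibe coding.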
