What is ToetsTester?
ToetsTester is a platform that supports teachers in grading both handwritten and digital tests with open and closed questions. The teacher provides the marking scheme and student answers, after which the ToetsTester AI provides a score and feedback. The primary focus is grading open questions. In addition, ToetsTester provides insights by performing various taxonomic analyses.
How does ToetsTester work?
The grading process within ToetsTester is designed to combine teacher expertise with the efficiency of artificial intelligence. The process is divided into four consecutive phases: scanning, marking scheme, grading answers, and teacher review. The leading principle is human oversight: throughout the process, the teacher retains full control and final responsibility for the assessment. The AI functions exclusively as a supporting instrument.
Use of ToetsTester
For handwritten tests, the process starts with scanning the files. Physical student tests are digitized and uploaded into the ToetsTester environment. The AI supports the teacher by automatically analyzing scans, splitting them, and assigning them to the correct students.
Once tests are in the system, the marking scheme phase follows. The teacher provides the system with the necessary assessment criteria or answer models. The AI reads these instructions and uses them as the framework for assessing student answers. The quality and clarity of this scheme are decisive for the accuracy of the final AI assessment. Teachers can adjust the marking scheme before or after grading.
In the third phase, grading answers, technical processing takes place. ToetsTester first converts handwritten answers into digital text through handwriting recognition. These answers are then weighed by the AI against the rules in the marking scheme. The result is a draft score and draft feedback for each question.
The final phase is teacher review. The teacher validates the scores and feedback proposed by the AI. Transcriptions can be corrected and scores can be adjusted or overwritten. Results are only finalized after explicit teacher approval.
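The review rule described above (AI output stays a draft until the teacher explicitly approves it) can be illustrated with a minimal sketch. The class and method names below are hypothetical and chosen for illustration only; they do not describe ToetsTester's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftAssessment:
    """AI proposal for one answer (hypothetical structure, for illustration)."""
    transcription: str
    ai_score: float
    ai_feedback: str
    teacher_score: Optional[float] = None  # set only if the teacher overrides
    approved: bool = False

    def override_score(self, score: float) -> None:
        # The teacher can replace the AI's draft score at any time.
        self.teacher_score = score

    def approve(self) -> None:
        # Explicit teacher approval is the only way a result becomes final.
        self.approved = True

    def final_score(self) -> float:
        if not self.approved:
            raise PermissionError("Result is not final until the teacher approves it")
        # A teacher override takes precedence over the AI draft.
        return self.ai_score if self.teacher_score is None else self.teacher_score
```

In this sketch, requesting a final score before approval raises an error, which mirrors the principle that results are never finalized by the AI alone.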
Sub-processors and Data Protection
For analysis and text processing, ToetsTester uses specialized AI models through the sub-processors Gemini and Glama (router for the Claude model). Data storage by ToetsTester is strictly limited to the European Economic Area (EEA), using Microsoft Azure infrastructure combined with a central database in the Netherlands.
In line with signed data processing agreements, input data, including student scans and marking schemes, is explicitly not used to (re)train the AI models named above. Data remains the property of the educational institution at all times.
Impact
The teacher is the user of ToetsTester. Teachers use it to grade student tests, and its purpose is to support them in that work. The aims are time savings, better feedback, and more insight into student progress.
Grading with AI is classified as a high-risk application. In addition to the benefits mentioned above, there are also risks when grading with ToetsTester.
Impact on the teacher
Using ToetsTester has the following consequences for teaching practice:
- Time savings and feedback quality. The teacher can significantly reduce grading time per test while providing extensive, high-quality feedback without extra time investment.
- Insight into progress. Through faster data analysis, the teacher gains more immediate insight into learning performance, allowing lessons to be adapted more effectively to class needs.
- Shift in workload. The focus shifts from execution of grading to the front end of the process; creating an accurate marking scheme is essential for a reliable AI result.
- Automation bias and cognitive surrender. There is a risk that the teacher blindly trusts AI suggestions or becomes less sharp during review, which can undermine assessment quality.
- Biases in AI models. Models may contain unintended biases (for example based on writing style), which can lead to unfair student assessments.
Impact on the student
For students, we see impact in the following areas:
- Explainability. Students have the right to transparent substantiation of their results. AI involvement in this process must be traceable for them.
- Faster results. By using ToetsTester, students receive grades and feedback faster, strengthening the direct connection between performance and learning moment.
- Equality and biases. Students may experience effects of potential biases in AI models.
Risk mitigation
System mitigations
ToetsTester includes multiple safeguards to minimize risks:
- Check system. With ToetsTester, teachers are encouraged to review AI scores and feedback. This can be made mandatory at school level.
- Confidence percentages. The ToetsTester AI reports a confidence level for each assessment, derived from multiple AI graders that consult with one another.
- Reference to source file. In ToetsTester, AI transcription can be checked against the student's original text.
- Explainability of AI. ToetsTester aims to facilitate Explainable AI (XAI). AI feedback is an important instrument in this.
- Feedback. In line with post-market monitoring obligations, a direct feedback button is integrated into the application. Users can report incidents or hallucinations immediately.
- Training. Users are supported in critically validating AI output and reminded of their role as final decision-makers.
If serious incidents or structural hallucinations occur, ToetsTester has a protocol for manual shutdown. Processing can be centrally stopped immediately.
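The source states that confidence comes from multiple AI graders but does not describe how it is calculated. As a purely illustrative assumption, one simple approach is to report the share of graders that agree with the most common score. The function name and the agreement rule below are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_confidence(scores: Sequence[int]) -> Tuple[int, float]:
    """Return the most common score and the fraction of graders that gave it.

    Assumption: confidence is modelled as simple agreement between
    independent graders. This is a sketch, not ToetsTester's method.
    """
    if not scores:
        raise ValueError("At least one grader score is required")
    score, count = Counter(scores).most_common(1)[0]
    return score, count / len(scores)
```

For example, three graders returning scores 2, 2, and 3 would yield a proposed score of 2 with a confidence of about 67%, which could be used to flag the answer for closer teacher review.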
Human final responsibility
The core principle of ToetsTester is that the teacher retains control in all process steps.
- Input information. It is the teacher's responsibility to ensure the quality of entered information, such as the marking scheme.
- Outcome review. The teacher checks the system outcome. The teacher is always the one who must assess whether the system output is correct.
- Communication to students. ToetsTester proposes feedback and an assessment, but the teacher remains ultimately responsible for communicating results.
Limitations and risks
Below, the limitations and risks are specified for each of the four phases of working with ToetsTester. In addition, as an AI system, ToetsTester has the following general risks:
- Biases. AI models may adopt patterns from their training data that lead to unconscious bias. In the context of grading, this can mean that the AI inadvertently assesses a student differently based on, for example, language use, sentence structure, or spelling, even when this is not part of the official marking scheme.
- Hallucinations. The model may make statements that are factually incorrect but sound very credible and convincing. This can result in feedback or a scoring justification based on information that the student did not write at all.
| Phase | Description | Limitations | Risks |
|---|---|---|---|
| Scanning & conversion | To use ToetsTester, teachers need to upload student answers. This can consist of one large PDF, multiple PDF files per student, or multiple images. When a single large PDF contains multiple students, AI is used to split it. | The ability to scan tests and the way that process works differs per school. A poor scanning facility within the school organization can limit the effectiveness of the system. In addition, errors can occur when splitting a single PDF into multiple students. | Poor organization of scanning facilities at the school may cause this function to take extra time. If the student split is done incorrectly and the teacher does not verify and/or correct it, the AI assessment will be inaccurate. |
| Marking scheme | ToetsTester uses the teacher's marking scheme to grade student answers. AI is used to read and process the marking scheme. | The accuracy of the AI assessment depends directly on the quality of the marking scheme. This requires unambiguous formulation of questions, answers, and scoring instructions. Noordhoff is responsible for standard schemes; when users make manual adjustments, responsibility for correct functioning shifts to that user. | An AI interpretation error in the marking scheme affects scores across the entire class. Modifications to the marking scheme can affect the quality of grading. |
| Grading answers by AI | In this step, ToetsTester first converts handwritten answers into digital text. Individual answers are then graded by an AI, comparing the answer to the marking scheme. | The effectiveness of handwriting recognition can differ per student. In addition, the AI works on the basis of probabilities, so a correct assessment cannot be guaranteed 100%. | For handwriting that is read poorly, ToetsTester may interpret a word incorrectly. AI can hallucinate. |
| Teacher review | The AI in ToetsTester acts as an assistant that produces an initial proposal of scores. The teacher reviews these scores, the feedback, and the transcription. Results are then shared with students. | ToetsTester cannot guarantee that a teacher actually reviews everything. | Automation bias can occur: the risk that the teacher blindly trusts AI suggestions. In addition, cognitive surrender may occur: critical thinking declines as the teacher becomes familiar with the system. |
Transparency and data processing
ToetsTester is used to recognize and preliminarily assess answers. The AI generates a proposal, but its outcomes are not binding: the teacher retains final responsibility. The system's behavior is made explicit to users, including the limitation that the application supports decisions but does not make them autonomously.
Data processing is based on the Education Privacy Covenant. Under this framework, the educational institution is the data controller and ToetsTester acts as the processor. Because ToetsTester depends on the current capabilities of external AI services, output quality is validated continuously.
Technical documentation
ToetsTester maintains technical documentation on design, architecture, sub-processors, and risk management. This documentation is available to educational institutions on request.
Logging and monitoring
Monitoring takes place through dashboards and log files, which are retained for at least 6 months. Logging is limited to events such as login errors and system errors during the grading process.
Benchmarking and reliability
ToetsTester targets an inter-rater reliability (IRR) of at least 0.75, measured as Cohen's kappa. A value of 0.70 is generally accepted in the Automated Essay Scoring (AES) literature as the standard for reliable supporting systems.
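For readers unfamiliar with the metric, the sketch below computes unweighted Cohen's kappa from two lists of scores using the standard definition, kappa = (p_o - p_e) / (1 - p_e). It assumes the two raters are a teacher and the AI; the source does not specify which pairs of raters are compared or whether a weighted variant is used, and the example scores are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical scores.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical score")
    return (p_o - p_e) / (1 - p_e)


# Invented example scores for eight answers (0, 1, or 2 points each).
teacher_scores = [0, 1, 2, 2, 1, 0, 2, 1]
ai_scores = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = cohens_kappa(teacher_scores, ai_scores)
meets_target = kappa >= 0.75
```

In this invented example the raters agree on six of eight answers, giving a kappa of roughly 0.62, which would fall below the 0.75 target even though raw agreement is 75%.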
Transcription accuracy is validated using an internal dataset consisting of diverse handwritten texts. Models are configured with low temperature to minimize variability, so identical answers generate virtually the same score.
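The effect of a low temperature can be shown with a generic temperature-scaled softmax, the standard way sampling temperature is applied to model outputs. This is a general illustration with invented numbers, not a description of how the sub-processors' models are configured internally.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert raw model scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("Temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented raw scores for three candidate outputs.
logits = [2.0, 1.0, 0.5]
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.1)
```

At a temperature of 0.1 nearly all probability mass lands on the top candidate, so repeated runs almost always pick the same output, whereas at 1.0 the alternatives keep a meaningful share.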
Human oversight
Human final responsibility is a core principle and a hard requirement for high-risk AI. AI acts only as an assistant. Teachers always retain the ability to adjust AI scores or stop using the system. Materials remain the property of the educational institution at all times.
Glossary
Automation bias: the tendency to trust an AI judgment too quickly, even when it is wrong.
Cognitive offloading: outsourcing thinking tasks to AI, so you need to remember or compare less yourself.
Cognitive surrender: the moment when you effectively abandon your own judgment and follow AI without critical reflection.
Hallucinations: AI errors where the system invents something that sounds plausible but is incorrect.
Benchmarking: structured and objective evaluation and measurement of AI model performance by comparison with a fixed dataset.