Accuracy was a critical requirement. A response that
sounds correct is not sufficient in a real estate analytics context.
To address this, we built a dedicated
evaluation pipeline that validates the agent’s behaviour at multiple levels:
- correctness of parameter extraction,
- correctness of tool selection,
- consistency of the final analytical output.
Synthetic validation datasets were generated to simulate realistic user queries and edge cases, allowing the system to be tested statistically rather than anecdotally. This made it possible to detect errors in reasoning flow even when the final answer was phrased differently but contained the same information.