The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls them, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem-proving method that achieves superior sample efficiency by equipping proof-generation LLMs with automation methods at different granularities via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, which we use to build an efficient recursive proving (ERP) module that further enhances performance. We validate our method on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after dataset curation (61.9% on the original version), setting a new SOTA for Isabelle NTPs with a total sample budget of only 2100. We also implement a Lean version of ProofAug that improves the performance of Kimina-Prover-Preview-Distill-1.5B by 6.1%. Our code is available at https://github.com/haoxiongliu/ProofAug.
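To make the idea concrete, below is a minimal Python sketch of applying automation at multiple granularities over a parsed proof proposal. The `Block` structure and the `check`/`automate` callables are hypothetical stand-ins for the repository's actual proof parsing, Isabelle checking, and automation interfaces (built-in tactics, Sledgehammer), not its real API, and the sketch omits many details of the algorithm in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Block:
    """One node of a (simplified) proof-proposal tree parsed from the LLM output."""
    statement: str                    # goal of this step, e.g. an Isabelle `have`/`show` line
    proof: str = ""                   # the model's own justification for a leaf step
    children: List["Block"] = field(default_factory=list)  # finer sub-steps, if any


def repair(block: Block,
           check: Callable[[str, str], bool],        # check(statement, proof) -> verified?
           automate: Callable[[str], Optional[str]]  # automate(statement) -> proof or None
           ) -> Optional[str]:
    """Try to close `block.statement`, moving from fine to coarse granularity:
    keep the model's own step proofs where they verify, patch failing steps with
    automation, and as a last resort ask automation to close the whole block's goal.
    Returns a verified proof text, or None if every level fails.
    """
    if block.children:                               # fine granularity: repair each sub-step
        patched = []
        for child in block.children:
            sub = repair(child, check, automate)
            if sub is None:
                break
            patched.append(f"{child.statement}\n{sub}")
        else:                                        # every sub-step was closed
            assembled = "\n".join(patched)
            if check(block.statement, assembled):
                return assembled
    elif block.proof and check(block.statement, block.proof):
        return block.proof                           # the model's own proof already verifies
    # coarse granularity: hand the whole goal to automation (built-in tactics, Sledgehammer, ...)
    return automate(block.statement)
```

In the real pipeline, `check` would query an Isabelle process on the candidate proof and `automate` would try a sequence of automation tools; roughly speaking, the ERP module mentioned above goes one step further by recursively asking the model itself to prove subgoals that automation cannot close.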
The following flowchart illustrates our method. ProofAug improves on the Draft, Sketch, and Prove (DSP) method by applying automation tools at different granularity levels through fine-grained analysis of the proof structure.
| Method | Model | Sample Budget | miniF2F-test |
|---|---|---|---|
| **Methods using Isabelle** | | | |
| Draft, Sketch, and Prove | CodeX | 100 | 39.3% |
| Subgoal-XL | Fine-tuned Llama-8B | 64 | 39.3% |
| | | 16384† | 56.1% |
| LEGO-Prover | mixed GPTs | 100 | 50.0% |
| Lyra | GPT-4 | 100 | 47.1% |
| | | 200 | 51.2% |
| POETRY | Fine-tuned ProofGPT (1.3B) | 1 × 32 × 128 | 42.2% |
| **Our Experiments (using Isabelle)** | | | |
| DSP baseline | deepseek-math-7b-base | 1 | 28.7% |
| | | 10 | 40.6% |
| | | 100 | 49.2% |
| ProofAug | deepseek-math-7b-base | 1 | 36.5% (+7.8%) |
| | | 10 | 44.7% (+4.1%) |
| | | 100 | 52.5% (+3.3%) |
| ProofAug (0-shot) | deepseek-math-7b-base | 500 | 54.5% |
| ProofAug (0-shot) w/ ERP | deepseek-math-7b-base | 500 | 56.1% |
| ProofAug + Mixed Strategy | deepseek-math-7b-base | 1400† | 61.9% |
| ProofAug + Mix. + Dataset Curation | deepseek-math-7b-base | 2100† | 66.0% |
| **Methods using Lean** | | | |
| HTPS | Evariste (600M) | 64 × 5000 | 41.0% |
| RMaxTS | DeepSeek-Prover-V1.5-RL (7B) | 32 × 6400† | 63.5% |
| Best-First Search + CG | InternLM2.5-StepProver (7B) | 256 × 32 × 600 | 65.9% |
Please cite the paper and star this repo if you use ProofAug. Thanks!
@article{liu2025efficientneuraltheoremproving,
  title={Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis},
  author={Haoxiong Liu and Jiacheng Sun and Zhenguo Li and Andrew C Yao},
  year={2025},
  eprint={2501.18310},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.18310},
}