Efficient Neural Theorem Proving
via Fine-grained Proof Structure Analysis

Tsinghua University, Huawei Noah's Ark Lab, Shanghai Qi Zhi Institute

Abstract

The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls for them, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem-proving method that achieves superior sample efficiency by equipping proof-generation LLMs with automation methods at different granularity levels via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. We validate our method on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after dataset curation (61.9% on the original version), setting a new SOTA among Isabelle NTPs with a total sample budget of only 2100. We also implement a Lean version of ProofAug that improves the performance of Kimina-Prover-Preview-Distill-1.5B by 6.1%. Our code is available at https://github.com/haoxiongliu/ProofAug.

Method

The following flowchart illustrates our method. It improves the Draft, Sketch, and Prove method by applying automation tools at different granularity levels through fine-grained proof structure analysis.

[Figure: flowchart of the ProofAug pipeline]

Each box in the flowchart corresponds to a proof step. (a) The initial proof proposal encounters an error when proving Conjecture 1-0. (b) We replace the original proof of Conjecture 1-0 with a sorry and continue until a syntactic error occurs at Conjecture 1-1. (c) The proof of Conjecture 1 is replaced by sorry, yielding a semi-proof with two sorry placeholders. (d) Automation tools are called to prove Conjecture 0 and Conjecture 1, succeeding on the former but failing on the latter. (e) The efficient recursive proving (ERP) module sends a completion request at Conjecture 1. (f) The generated ERP proof still fails at Conjecture 1. (g) We finally fall back to a coarser proof level and successfully find a proof with automation tools. Refer to Algorithm 1 for a pseudo-code description of ProofAug.
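The coarse-to-fine fallback described in steps (a)-(g) can be sketched in a few lines of Python. This is an illustrative mock, not the authors' implementation: `check_step` and `try_automation` are hypothetical stand-ins for real Isabelle interaction (step checking and automation calls such as built-in tactics or ATPs), and the ERP completion step (e)-(f) is omitted for brevity.

```python
# Illustrative sketch of ProofAug's coarse-to-fine fallback loop.
# NOT the authors' code: check_step and try_automation are mocks that
# stand in for real Isabelle step checking and automation calls.

def check_step(stmt):
    """Mock verifier: a proof step fails iff it is tagged 'bad'."""
    return "bad" not in stmt

def try_automation(goal):
    """Mock automation call (e.g. a built-in tactic or ATP):
    only goals tagged 'easy' can be closed automatically."""
    return "easy" in goal

def prove(goal, steps):
    """Verify the fine-grained proof `steps` of `goal`.
    Each step is (statement, nested_sub_steps_or_None).
    On a failing step, first try automation at that step's
    granularity; if that fails too, fall back to trying
    automation on the whole (coarser) enclosing goal."""
    repaired = []
    for stmt, sub in steps:
        if sub:  # a conjecture with its own sub-proof: recurse
            ok, body = prove(stmt, sub)
            if not ok:  # sub-proof unrepairable: coarser fallback
                return try_automation(goal), [goal]
            repaired.append((stmt, body))
        elif check_step(stmt) or try_automation(stmt):
            repaired.append((stmt, None))
        else:  # step-level repair failed: coarser fallback
            return try_automation(goal), [goal]
    return True, repaired
```

In this toy setting, a failing step that automation can close is repaired in place, while an unrepairable step discards the fine-grained sub-proof and retries automation one level up, mirroring the fallback from (d) to (g) above.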

Main Results

Table 2: Performance on miniF2F-test

Comparison of different methods on miniF2F-test. For BFS methods, the sample budget N × S × T corresponds to N attempts, each running T iterations of search with S expansions per iteration. For tree-search methods, it becomes N × T, with the symbols carrying the same meanings. Results marked with `†` are obtained using a mixed strategy.
Method | Model | Sample Budget | miniF2F-test

Methods using Isabelle:
Draft, Sketch, and Prove | Codex | 100 | 39.3%
Subgoal-XL | Fine-tuned Llama-8B | 64 | 39.3%
Subgoal-XL | Fine-tuned Llama-8B | 16384† | 56.1%
LEGO-Prover | mixed GPTs | 100 | 50.0%
Lyra | GPT-4 | 100 | 47.1%
Lyra | GPT-4 | 200 | 51.2%
POETRY | Fine-tuned ProofGPT (1.3B) | 1 × 32 × 128 | 42.2%

Our Experiments (using Isabelle):
DSP baseline | deepseek-math-7b-base | 1 | 28.7%
DSP baseline | deepseek-math-7b-base | 10 | 40.6%
DSP baseline | deepseek-math-7b-base | 100 | 49.2%
ProofAug | deepseek-math-7b-base | 1 | 36.5% (+7.8%)
ProofAug | deepseek-math-7b-base | 10 | 44.7% (+4.1%)
ProofAug | deepseek-math-7b-base | 100 | 52.5% (+3.3%)
ProofAug (0-shot) | deepseek-math-7b-base | 500 | 54.5%
ProofAug (0-shot) w/ ERP | deepseek-math-7b-base | 500 | 56.1%
ProofAug + Mixed Strategy | deepseek-math-7b-base | 1400† | 61.9%
ProofAug + Mix. + Dataset Curation | deepseek-math-7b-base | 2100† | 66.0%

Methods using Lean:
HTPS | Evariste (600M) | 64 × 5000 | 41.0%
RMaxTS | DeepSeek-Prover-V1.5-RL (7B) | 32 × 6400† | 63.5%
Best-First Search + CG | InternLM2.5-StepProver (7B) | 256 × 32 × 600 | 65.9%

Citation

Please cite the paper and star this repo if you use ProofAug. Thanks!

@article{liu2025efficientneuraltheoremproving,
    title={Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis}, 
    author={Haoxiong Liu and Jiacheng Sun and Zhenguo Li and Andrew C Yao},
    year={2025},
    eprint={2501.18310},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2501.18310}, 
}