Similarity Check Guide | CNKI, VIP, Wanfang Detection Principles and Reduction Techniques
AcademicIdeas covers thesis similarity detection: how CNKI/VIP/Wanfang work, acceptable similarity rates, common detection traps, and effective reduction methods.
What this page helps you do first
- How CNKI, VIP, and Wanfang detection works, and which scenarios each platform suits
- Acceptable similarity standards and red lines at universities
- Practical techniques for bringing a high similarity rate down to a passing one
How similarity detection works
The core principle of similarity detection is comparing your paper's text against database records to find repeated characters or phrases. When a passage exceeds the system's threshold for consecutive matching characters (typically 8-15), it is flagged as duplicated content.
Important: Detection systems measure the "text copy ratio," not "semantic similarity." Even if you paraphrase a sentence, it will still be flagged when the paraphrased text remains highly similar to a database source.
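The consecutive-match rule above can be illustrated with a minimal sketch. It is not how any real platform is implemented (their algorithms are not public); `flag_matches`, its parameters, and the sample strings are hypothetical names chosen for illustration, and the default `min_run` of 13 is simply one value from the 8-15 range mentioned above.

```python
def flag_matches(text: str, source: str, min_run: int = 13) -> list[tuple[int, int]]:
    """Return merged (start, end) spans of `text` covered by a window of
    `min_run` consecutive characters that also occurs verbatim in `source`."""
    # Index every min_run-length window of the source for fast lookup.
    windows = {source[i:i + min_run] for i in range(len(source) - min_run + 1)}
    spans: list[tuple[int, int]] = []
    for i in range(len(text) - min_run + 1):
        if text[i:i + min_run] in windows:
            if spans and i <= spans[-1][1]:
                # Overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], i + min_run)
            else:
                spans.append((i, i + min_run))
    return spans


source = "the quick brown fox jumps over the lazy dog"
copied = "I wrote: quick brown fox jumps, indeed"
reworded = "a fast brown fox"

print(flag_matches(copied, source))    # one span covering the copied run
print(flag_matches(reworded, source))  # no shared run reaches 13 characters
```

The sketch only looks at character sequences, which is why a reworded sentence escapes it while a lightly edited copy does not.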
CNKI, VIP, Wanfang: platform comparison
- CNKI: Most widely used by Chinese universities, with the most comprehensive database including historical theses. Detection triggers at "13+ consecutive characters repeated"
- VIP: More stringent algorithm than CNKI, triggering at "8+ consecutive characters repeated." Database covers Chinese journals, theses, conferences, and some internet resources
- Wanfang: Faster detection and a relatively lower price, but its database coverage is less comprehensive than CNKI's. Some schools use Wanfang only for initial checks and require CNKI for final verification
- Recommendation: use VIP or Wanfang for initial checks (lower cost) and CNKI for final verification (most comprehensive). Results for the same paper may differ by 5%-15% across systems
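One reason results can differ between platforms is the trigger length itself. The sketch below assumes the only difference is the consecutive-character trigger quoted above (13 for CNKI, 8 for VIP); database coverage and the platforms' real algorithms are ignored, and `longest_common_run`, `would_trigger`, and `TRIGGERS` are hypothetical names.

```python
def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Trigger lengths as quoted in the list above; an assumption for illustration.
TRIGGERS = {"CNKI": 13, "VIP": 8}


def would_trigger(passage: str, source: str) -> dict[str, bool]:
    """Report, per platform, whether the longest shared run reaches its trigger."""
    run = longest_common_run(passage, source)
    return {name: run >= threshold for name, threshold in TRIGGERS.items()}


# A 10-character shared run reaches the 8-character trigger but not the 13-character one.
print(would_trigger("abcdefghijXYZ", "00abcdefghij11"))
```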
Acceptable similarity rate standards by university type
- Undergraduate thesis: The ceiling is typically set between 15% and 30%, depending on the school; an excellent-thesis designation may require below 10%. Above 30% requires revision and a recheck; above 50% may delay the defense
- Master's thesis: The ceiling is typically set between 10% and 20%; some "Double First-Class" universities require below 10%
- Journal submission: Core journals usually require below 15%; manuscripts above 20% are rejected outright
- Standards vary by school. Check the notices from your academic affairs office or graduate school for specific requirements; some schools also limit the citation rate and self-citation rate
High-similarity zones and common detection traps
- Literature review: Major high-similarity zone! Descriptions of the research status easily match existing literature. Suggestion: limit the description of each source to 3 sentences at most, and add your own evaluation after citations
- Definitions: Definitions of professional terms and descriptions of classic theories often cannot be paraphrased and are easily flagged. Solution: add personal analysis after definitions, e.g., "this definition is widely used in XXX field, but XXX debates remain"
- Methods descriptions: Engineering/CS content such as experimental steps, parameters, and formula derivations tends to show extremely high similarity. Suggestion: convert text descriptions to flowcharts or tables
- Conclusion templates: Phrases like "this research has important theoretical and practical significance" are repeated very often. Delete them or replace them with specific descriptions
Effective similarity reduction methods
- Synonym replacement: replace professional terms with synonyms or hypernyms, e.g., "analysis method" → "analytical approach"
- Sentence structure transformation: active to passive, affirmative to double negative, split long sentences, merge short sentences
- Add personal analysis: after citations or definitions, add your own understanding and applications. This effectively reduces the text copy ratio
- Convert expression forms: convert text to charts, formulas, or flowcharts, which typically are not included in detection
- Translation method: translate flagged passages into English (e.g., with Google Translate), then back into Chinese to obtain differently worded expressions
- Similarity reduction is iterative: it typically takes 3-5 rounds of revision to reach an acceptable rate, so do not expect to finish in one pass
AIGC detection: new challenge since 2024
With the spread of AI tools like ChatGPT, domestic detection systems have added AIGC detection modules (both CNKI and VIP offer one). AIGC detection targets AI-generated text that has not been human-edited, and it relies on the probability distribution characteristics of language models rather than on text copying.
The core strategy for passing AIGC detection is "adding human writing traces": integrate personal research experiences, use more colloquial and personalized expressions, add critical analysis of existing conclusions, and supplement AI-generated content with first-hand data or interview content.
Before and after similarity check operations
- Before check: remove all personal information (name, student ID, advisor info); ensure the file is in Word format (some schools require conversion to PDF before submission); do not include personal info in the filename
- During check: school-provided free checks are limited, so use VIP/Wanfang for initial tests and reduction, and CNKI for final verification
- After check: download the report, verify each red-marked passage, and confirm whether the duplication was caused by an "incorrect citation" or a "citation format error"
- Citation ≠ no similarity: improperly formatted citations (missing quotes, no source attribution) still count toward similarity ratio
Frequently asked questions
- How is similarity rate calculated?
- Similarity rate = (duplicate character count / total character count) × 100%. Systems automatically identify suspected duplicate passages (consecutive matches above the threshold) and calculate their proportion of the total text. Note: improperly formatted citations still count toward similarity.
- Does citing my own published papers count as similarity?
- Yes, this is "self-plagiarism." Copying your previous research into the current paper, even with proper citation, still counts toward similarity rate. You need to rephrase or add new analysis to reduce self-citation rate.
- Do formulas and charts participate in similarity checking?
- Most detection systems do not check formulas and charts, but system rules may update. Confirm in your similarity report. If your thesis's formula derivations and chart citations heavily reference literature, also paraphrase text descriptions appropriately.
- How far before defense should I do similarity checking?
- Complete final similarity checking at least 2-3 weeks before defense to allow sufficient time for reduction and format adjustment. If your school only offers 1-2 free checks, test with other platforms first, confirm revisions are complete, then use school checks for final verification.
- What does "suspected plagiarism" in the report mean?
- "Suspected plagiarism" means more than a high similarity rate: the passage is highly similar to a specific existing document, which potentially involves academic misconduct. Thoroughly revise such passages and ensure proper source attribution in citations.
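As a worked example of the similarity-rate formula from the first question above, the following minimal sketch computes the rate from flagged character spans. The function name `similarity_rate` and the sample numbers are hypothetical, and the sketch assumes the spans do not overlap; real reports may weight or categorize matches differently.

```python
def similarity_rate(total_chars: int, flagged_spans: list[tuple[int, int]]) -> float:
    """Similarity rate in percent: duplicate characters / total characters x 100.

    `flagged_spans` holds non-overlapping (start, end) character offsets,
    with `end` exclusive.
    """
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    duplicate_chars = sum(end - start for start, end in flagged_spans)
    return 100 * duplicate_chars / total_chars


# Hypothetical paper: 2,000 characters with two flagged passages of 150 characters each.
rate = similarity_rate(2000, [(0, 150), (400, 550)])
print(f"{rate:.1f}%")
```

With these sample numbers, 300 of 2,000 characters are flagged, which gives a rate of 15%.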