Defended PhD Thesis: Authorship Analysis in Military Intelligence

Research on the Application of Computational Linguistics-Based Authorship Analysis in Military Intelligence

Doctor of Philosophy (PhD) in Linguistics — Successfully Defended

Section Purpose: This section provides a high-level executive summary of the completed and successfully defended doctoral research by Dr. Ang Li (currently PI at Phaenarete ASI Lab). It outlines the core problem, the sponsoring entity, and the primary objective that the thesis successfully addressed, serving as an introduction to the intelligence gap closed by this research.

Abstract Summary

Traditional authorship analysis falls short in military intelligence contexts where communications are ultra-short and intentionally disguised by adversaries. This successfully defended thesis developed and validated a novel framework bridging traditional stylometry and modern deep learning to attribute highly obfuscated short-text intercepts reliably.

Project Metadata

Final Length: ~52,000 words (Complete)
Institution: The University of Edinburgh
Sponsorship: MI6 (Secret Intelligence Service, UK)
Core Output: Validated Authorship Analysis Framework

⚠️ The Intelligence Gap

Section Purpose: Here we define the specific problems faced by analysts when dealing with intercepted communications. It highlights the dual challenge of "Short Texts" and "Adversarial Disguise," explaining why current commercial and academic models fail in operational intelligence environments.

✂️

The Short-Text Problem

Military commands, forum posts, and intercepted messages typically run to only 50-200 words. Traditional stylometry, which relies on large corpora to build an "idiolect" profile, loses accuracy rapidly when applied to such micro-texts.

🎭

Adversarial Disguise

Targets know they may be intercepted. They use synonym substitution, syntactic alterations, and inserted zero-width characters to hide their identity or to spoof someone else's. Standard Transformer models tend to overfit on clean data, and their accuracy collapses on disguised inputs.
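As a small illustration of the zero-width tactic, the sketch below counts and strips invisible code points from a message. It is a minimal example under stated assumptions: the `ZERO_WIDTH` set and both function names are hypothetical, the list of code points is not exhaustive, and the summary does not say how HSTAR itself handles these characters.

```python
import unicodedata

# Illustrative, non-exhaustive set of invisible code points (assumption):
# zero-width space, non-joiner, joiner, word joiner, and byte-order mark.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def count_zero_width(text: str) -> int:
    """Count invisible characters; the count could itself serve as a disguise signal."""
    return sum(ch in ZERO_WIDTH for ch in text)


def strip_zero_width(text: str) -> str:
    """Drop zero-width characters, then apply NFKC compatibility normalization."""
    cleaned = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFKC", cleaned)


disguised = "mo\u200bve at da\u200dwn"
print(count_zero_width(disguised))
print(strip_zero_width(disguised))
```

One caveat on the design: the joiner and non-joiner characters are legitimate in some scripts and in emoji sequences, so blanket removal is only reasonable where the expected text does not rely on them.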

🧠 The HSTAR Framework

Section Purpose: This section breaks down the technical solution developed in the thesis: the Hybrid Stylometric-Transformer with Adversarial Resilience (HSTAR). The architecture flow below walks through the components of the model, from feature engineering to the explainability layer.

HSTAR Architecture Flow

1. Raw Intelligence Intercepts

Short text data (50-200 words). Simulated military orders, anonymised forum communications.

⬇️

2a. Stylometric Feature Extraction

Extracts the frequencies of 500+ function words, punctuation habits, and POS tags. Because these features ignore content words, they are resilient to simple synonym swaps.
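The idea behind this step can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the thesis implementation: the ten-word `FUNCTION_WORDS` list stands in for the 500+ words mentioned above, the `PUNCTUATION` list and the function name are hypothetical, and POS tag frequencies are left out because they would need an external tagger.

```python
import re
from collections import Counter

# Tiny stand-in for a 500+ function-word list (assumption for illustration).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "for"]
# Punctuation marks whose per-character rate is tracked (assumption).
PUNCTUATION = [",", ".", ";", ":", "!", "?", "-"]


def stylometric_vector(text: str) -> list[float]:
    """Return relative function-word frequencies followed by punctuation rates."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    counts = Counter(tokens)
    function_word_rates = [counts[w] / n_tokens for w in FUNCTION_WORDS]
    punctuation_rates = [text.count(p) / n_chars for p in PUNCTUATION]
    return function_word_rates + punctuation_rates


original = "Move to the ridge, and hold it; wait for the signal."
swapped = original.replace("ridge", "crest")  # content-word synonym swap

# Content words are not part of the feature set, so the vectors match.
print(stylometric_vector(original) == stylometric_vector(swapped))
```

The comparison at the end shows why such features resist simple synonym swaps: replacing one content word with another of the same length changes neither the function-word rates nor the punctuation rates.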

2b. Contextual Embeddings

RoBERTa/BERT models fine-tuned on domain-specific text, capturing deeper semantic and syntactic context.

⬇️

3. Adversarial Training Module

Automatically generates disguised versions of training texts during the learning phase, forcing the model to rely on robust features rather than easily spoofed markers.
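A rough sketch of this kind of augmentation follows. Everything in it is hypothetical: the `SYNONYMS` table, the probabilities, and the function names are chosen for illustration, and only two of the disguise tactics (synonym swaps and zero-width insertion) are covered, not syntactic alterations.

```python
import random

# Hypothetical synonym table; a real system would need a far larger resource.
SYNONYMS = {
    "move": ["advance", "proceed"],
    "hold": ["keep", "secure"],
    "wait": ["pause"],
}
ZERO_WIDTH_SPACE = "\u200b"


def disguise(text: str, rng: random.Random,
             swap_prob: float = 0.5, zw_prob: float = 0.2) -> str:
    """Return a disguised copy of `text` (original capitalization is not preserved)."""
    out = []
    for word in text.split():
        core = word.strip(".,;:!?")
        key = core.lower()
        # Synonym substitution on known words.
        if key in SYNONYMS and rng.random() < swap_prob:
            word = word.replace(core, rng.choice(SYNONYMS[key]))
        # Insert an invisible character somewhere inside longer words.
        if len(word) > 2 and rng.random() < zw_prob:
            pos = rng.randrange(1, len(word))
            word = word[:pos] + ZERO_WIDTH_SPACE + word[pos:]
        out.append(word)
    return " ".join(out)


def augment(dataset, rng: random.Random, copies: int = 2):
    """Yield each (text, author) pair plus `copies` disguised variants, same label."""
    for text, author in dataset:
        yield text, author
        for _ in range(copies):
            yield disguise(text, rng), author


rng = random.Random(0)  # seeded so runs are repeatable
for text, author in augment([("Move to the ridge and hold.", "author_a")], rng):
    print(author, repr(text))
```

Keeping the author label fixed across the disguised variants is what pushes a classifier toward features that survive the disguise, which is the stated goal of the module.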

⬇️

4. SHAP Explainability Layer & Output

Outputs an attribution probability. Crucially, it uses SHAP values to explain *why* a decision was made (e.g., "Attributed to Author X due to use of specific function word Y"), which supports legal and analytical admissibility.
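To make the notion of a SHAP value concrete, the sketch below computes exact Shapley values by brute force for a toy linear scorer. It is an illustration only: the weights, feature names, and baseline are invented, the enumeration is exponential in the number of features, and the summary does not state which SHAP explainer the thesis used.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values: features outside a coalition take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical features and weights for a toy "Author X" score.
FEATURES = ["rate_of_the", "comma_rate", "semicolon_rate"]
WEIGHTS = [2.0, -1.0, 0.5]


def author_x_score(z):
    return sum(w * v for w, v in zip(WEIGHTS, z))


x = [1.0, 3.0, 0.0]         # features of the text being explained
baseline = [0.0, 1.0, 2.0]  # reference values, e.g. a corpus average
for name, phi in zip(FEATURES, shapley_values(author_x_score, x, baseline)):
    print(f"{name}: {phi:+.2f}")
```

For a linear scorer each value reduces to weight times (feature minus baseline), and the values sum to the difference between the score for the text and the score for the baseline. That additive property is what lets an analyst read the output as per-feature contributions.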

📊 Experimental Results (Simulated)

Section Purpose: This section presents the quantitative validation of the thesis. It shows how the HSTAR framework outperformed traditional baselines (SVM, basic BERT) across the two main challenge areas: text length and adversarial disguise.

Short-Text Attribution Accuracy

Context: Accuracy comparison across different document lengths.

Key Takeaway: HSTAR maintains >92% accuracy even when the intercept length drops to 50 words, outperforming baselines that rely solely on broad stylometry or untuned contextual embeddings.

Model Robustness Under Adversarial Disguise

Context: How far does accuracy drop when targets actively try to hide their identity? A smaller drop indicates higher robustness.

Key Takeaway: Under heavy adversarial disguise (synonym swaps, syntax changes), baseline accuracy falls below 50%. HSTAR's adversarial training module holds accuracy at ~85%.
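The robustness comparison rests on a simple quantity: accuracy on clean text minus accuracy on the disguised version of the same text. The sketch below shows that calculation on made-up toy labels; the function names are hypothetical and the numbers are not the thesis results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true author labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def robustness_report(clean_preds, disguised_preds, labels):
    """Compare accuracy on clean inputs with accuracy on their disguised versions."""
    clean = accuracy(clean_preds, labels)
    disguised = accuracy(disguised_preds, labels)
    return {"clean": clean, "disguised": disguised, "drop": clean - disguised}


# Toy data: four texts, three authors (invented for illustration).
labels = ["A", "B", "A", "C"]
clean_preds = ["A", "B", "A", "C"]
disguised_preds = ["A", "B", "C", "C"]
print(robustness_report(clean_preds, disguised_preds, labels))
```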

🎯 Operational Integration & Impact

Section Purpose: A successful PhD must demonstrate a novel, significant contribution. This section outlines how Dr. Li's research successfully translates from academic theory into practical, deployable tools for MI6 and GCHQ workflows.

⚡

Real-Time Triage

HSTAR was validated for automatic flagging of high-probability target matches in streaming, short-burst intercepted communications.
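Triage of this kind can be pictured as a threshold filter over a message stream. The sketch below is a hypothetical illustration: `triage`, the 0.9 threshold, and the stand-in scorer are all assumptions, and a real deployment would call the trained model where the stub is used here.

```python
from typing import Callable, Iterable, Iterator, Tuple


def triage(stream: Iterable[str],
           score: Callable[[str], float],
           threshold: float = 0.9) -> Iterator[Tuple[str, float]]:
    """Lazily yield (message, probability) for messages at or above the threshold."""
    for message in stream:
        probability = score(message)
        if probability >= threshold:
            yield message, probability


def stub_score(message: str) -> float:
    """Stand-in for a model's attribution probability (assumption)."""
    return 0.95 if ";" in message else 0.10


intercepts = ["hold position; await signal", "see you later", "move now; no delay"]
for message, probability in triage(intercepts, stub_score):
    print(f"{probability:.2f}  {message}")
```

Writing the filter as a generator means messages are scored one at a time as they arrive, which suits streaming input.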

⚖️

Explainable Intelligence

The SHAP layer provided analysts with human-readable reasoning, which proved essential for building legally sound intelligence dossiers.

🌐

Ongoing Lab Work

Under Dr. Li's direction at Phaenarete ASI Lab, the framework is being expanded to cross-lingual attribution and wider network metadata analysis.

📑 Thesis Format & Compliance

Section Purpose: Confirms alignment with the University of Edinburgh's strict formatting and submission guidelines, while maintaining the confidentiality required by the research sponsor during the final defense.

🎓

✔️ Electronic Submission: As per University policy, the final approved thesis was submitted digitally via ERA (Edinburgh Research Archive).
✔️ Sensitivity Handling: The final title page bears the mandated “Restricted Access – Approved by University Committee” note. No real classified intelligence data was used; all data was synthetic or open-source.
✔️ Bibliography & Contents: Author-date formatting (APA 7th) is consistent throughout. A full table of contents follows the abstract, including a list of tables and figures, in line with the formatting guidance document.