Stylometry

From Conservapedia
In 1996, computer style analysis at the Shakespeare Clinic at Claremont McKenna College in California compared Shakespeare's writing to that of a corpus of twenty-six Elizabethan authors, including Bacon, Marlowe and Oxford.[1]

According to the article Introduction to stylometry with Python by François Dominic Laramée:

Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognizable and unique ways. For example:

- Each person has their own unique vocabulary, sometimes rich, sometimes limited. Although a larger vocabulary is usually associated with literary quality, this is not always the case. Ernest Hemingway is famous for using a surprisingly small number of different words in his writing, which did not prevent him from winning the Nobel Prize for Literature in 1954.

- Some people write in short sentences, while others prefer long blocks of text consisting of many clauses.

- No two people use semicolons, em-dashes, and other forms of punctuation in the exact same way.

The ways in which writers use small function words, such as articles, prepositions and conjunctions, has proven particularly telling. In a survey of historical and current stylometric methods, Efstathios Stamatatos points out that function words are “used in a largely unconscious manner by the authors, and they are topic-independent.” For stylometric analysis, this is very advantageous, as such an unconscious pattern is likely to vary less, over an author’s corpus, than his or her general vocabulary. (It is also very hard for a would-be forger to copy.) Function words have also been identified as important markers of literary genre and of chronology.[2]
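The features described above (vocabulary richness, sentence length, punctuation habits and function-word use) can each be turned into a number. The following is a minimal sketch rather than an established method: the function names, the short function-word list and the choice of the semicolon as the punctuation mark are assumptions made for illustration.

```python
import re
from collections import Counter

# A short, illustrative function-word list (an assumption, not a standard set).
FUNCTION_WORDS = ("the", "a", "an", "of", "in", "to", "and", "but", "or", "for")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_profile(text):
    """Return a few simple stylometric measurements for one text."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    # Treat runs of ., ! or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n = len(words)
    return {
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(counts) / n,
        # Average number of words per sentence.
        "mean_sentence_length": n / len(sentences),
        # Punctuation habit: semicolons per 1000 words.
        "semicolons_per_1000_words": 1000 * text.count(";") / n,
        # Relative frequency of each function word.
        "function_word_freq": {w: counts[w] / n for w in FUNCTION_WORDS},
    }
```

Profiles computed this way for a disputed text and for samples by candidate authors could then be compared. Real studies use much longer word lists and more careful sentence splitting.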

Stylometry concepts

Temple University Libraries describes stylometry as follows:

Like many strategies in the digital humanities, stylometry combines traditional close reading alongside more distant reading.

Close reading

In literary criticism, close reading describes a sustained attention to the text, looking at word choice, syntax, and particular images, to reveal meaning. This methodology came out of New Criticism and does not rely on historical or biographical research, intending to analyze the complexities of the individual text instead.

Distant reading

This is a concept coined by Franco Moretti that suggests that looking at the wider scope of literature, through larger computational or archival methods, can help us see larger trends and reveal previously obscured systems within literary study. It’s up to the researcher to look at other relevant information related to authorship using traditional research such as archives, historical background, close textual analysis, etc.

Key Point: Stylometry can only offer statistical probability, not definitively claim authorship.[3]

Journal article on stylometry

The abstract for the journal article A Complex Network Approach to Stylometry by Diego Raphael Amancio states:

Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.[4]
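To give a rough idea of what "topological properties extracted from texts" can mean, the sketch below builds a word-adjacency network from a text and measures two common network properties, degree and clustering coefficient. It is an illustration only: the function names are hypothetical, and the network construction, measures and fuzzy classifier used in the paper itself are not reproduced here.

```python
import re
from collections import defaultdict


def cooccurrence_network(text):
    """Link every pair of adjacent words with an undirected edge."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore immediate repetitions of the same word
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def degree(graph, node):
    """Number of distinct words that appear next to `node`."""
    return len(graph.get(node, ()))


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbors = list(graph.get(node, ()))
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))
```

Values such as these, computed for frequent words, could be added to a traditional feature set such as word frequencies, which is the kind of hybrid description the abstract refers to.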

Author obfuscation methods and tools

There are a number of free online article rewriters, as well as paid programs, that rewrite articles. The most rudimentary programs simply replace words with synonyms. More advanced programs change the complexity of the sentence structures.
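The simplest approach mentioned above, replacing words with synonyms, can be sketched in a few lines. The synonym table and the function name here are hypothetical. A real rewriter would draw on a full thesaurus and would need to consider word sense and grammar.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"big": "large", "quick": "fast", "begin": "start"}


def replace_synonyms(text, synonyms=SYNONYMS):
    """Replace every word found in `synonyms`, keeping initial capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # no synonym known: leave the word unchanged
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)
```

Because this kind of substitution leaves sentence structure, punctuation and function words untouched, it changes only a small part of what stylometric methods measure.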

One of the most advanced author obfuscation tools developed is Mutant-X.[5][6]

The abstract for the 2019 journal article entitled A Girl Has No Name: Automated Authorship Obfuscation using Mutant-X published in Proceedings on Privacy Enhancing Technologies indicates:

...a genetic algorithm based random search framework called Mutant-X which can automatically obfuscate text to successfully evade attribution while keeping the semantics of the obfuscated text similar to the original text. Specifically, Mutant-X sequentially makes changes in the text using mutation and crossover techniques while being guided by a fitness function that takes into account both attribution probability and semantic relevance. While Mutant-X requires black-box knowledge of the adversary’s classifier, it does not require any additional training data and also works on documents of any length. We evaluate Mutant-X against a variety of authorship attribution methods on two different text corpora. Our results show that Mutant-X can decrease the accuracy of state-of-the-art authorship attribution methods by as much as 64% while preserving the semantics much better than existing automated authorship obfuscation approaches.[7]
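The general idea in the abstract (mutation and crossover guided by a fitness function that balances attribution probability against semantic similarity) can be illustrated with a toy genetic search. This is not the Mutant-X implementation. The synonym table, the stand-in "classifier", the similarity measure, the weight and all function names are assumptions chosen so that the example is self-contained.

```python
import random

# Hypothetical synonym table used by the mutation step.
SYNONYMS = {"big": ["large", "huge"], "quick": ["fast", "rapid"],
            "said": ["stated", "remarked"]}
MARKERS = set(SYNONYMS)  # words the toy classifier treats as the author's tells


def attribution_probability(words):
    """Toy stand-in for the adversary's black-box classifier."""
    return sum(w in MARKERS for w in words) / len(words)


def semantic_similarity(words, original):
    """Toy stand-in for semantic relevance: share of unchanged positions."""
    return sum(a == b for a, b in zip(words, original)) / len(original)


def fitness(words, original, weight=0.75):
    """Reward low attribution probability and high similarity."""
    return (weight * (1 - attribution_probability(words))
            + (1 - weight) * semantic_similarity(words, original))


def mutate(words, rng):
    """Replace one randomly chosen replaceable word with a synonym."""
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    child = list(words)
    if candidates:
        i = rng.choice(candidates)
        child[i] = rng.choice(SYNONYMS[child[i]])
    return child


def crossover(a, b, rng):
    """Single-point crossover of two equal-length word lists."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def obfuscate(text, generations=20, population_size=8, seed=0):
    """Evolve variants of `text` (at least two words) and return the fittest."""
    rng = random.Random(seed)
    original = text.split()
    population = [original] + [mutate(original, rng)
                               for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, original), reverse=True)
        parents = population[: population_size // 2]  # fittest half survives
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return " ".join(max(population, key=lambda w: fitness(w, original)))
```

For example, `obfuscate("the big dog said a quick hello")` returns a variant of the sentence in which marker words have been swapped for synonyms. In the system described in the abstract, the two toy scoring functions would be replaced by a real attribution classifier and a real measure of semantic similarity.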

Mutant-X was developed in 2019, and its developers have indicated that a number of improvements are planned.[8]

References

  1. Ward Elliott and Robert Valenza, "Was the Earl of Oxford the true Shakespeare? A Computer-aided analysis" (1991)
    Elliott and Valenza, "And Then There Were None: Winnowing the Shakespeare Claimants", Computers and the Humanities, v30 n3 p191-245 1996.
    Elliott and Valenza, "Oxford by the Numbers: What Are the Odds That the Earl of Oxford Could Have Written Shakespeare's Poems and Plays?" (2004). "The odds that either could have written the other’s work are much lower than the odds of getting hit by lightning."
  2. Introduction to stylometry with Python by François Dominic Laramée
  3. Stylometry Methods and Practices, Temple University Libraries
  4. Amancio DR (2015) A Complex Network Approach to Stylometry. PLoS ONE 10(8): e0136076. https://doi.org/10.1371/journal.pone.0136076
  5. A Girl Has No Name: Automated Authorship Obfuscation using Mutant-X by Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar, Proceedings on Privacy Enhancing Technologies; 2019 (4):54–71
  6. Mutant-X, GitHub
  7. A Girl Has No Name: Automated Authorship Obfuscation using Mutant-X by Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar, Proceedings on Privacy Enhancing Technologies; 2019 (4):54–71
  8. A Girl Has No Name: Automated Authorship Obfuscation using Mutant-X by Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar, Proceedings on Privacy Enhancing Technologies; 2019 (4):54–71