ALANN – Auto Lit Analysis Neural Net

ALANN, the automated literary analysis neural network idea posed here some months ago (https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/), “might” be realizable to some degree without the need for a DeepMind neural network.

No doubt certain aspects of writing (ones this author is slowly being made aware of) can be extracted as metrics from any text.

Here are a few.

  • Word counts and the ratio of those counts
    • Verb count
    • Adverb count
    • Adjective count
    • Proper name counts
    • Single word counts
  • Comma count
  • Character length of words
  • Sentence length and sentence complexity
  • Quote counts and their dispersion throughout the text
  • Certain word usages, active vs passive voice
  • Jaggedness: how choppy is the dialog vs the narrative
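
A few of the simpler items in this list can be counted with nothing more than regular expressions. The following is only a rough sketch, not the method used for the figures in this post: the function name basic_metrics is made up for illustration, and treating ".", "!" and "?" as sentence enders is a simplifying assumption.

```python
import re

def basic_metrics(text):
    """Return a few surface metrics for a block of prose (hypothetical helper)."""
    # Words: runs of letters, optionally with one internal apostrophe (don't, it's)
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Assumption: '.', '!' and '?' end sentences; abbreviations will be miscounted
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "commas": text.count(","),
        "sentences": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

sample = "It was cold. The wolf ran, and the man followed."
print(basic_metrics(sample))
```

Verb, adverb and adjective counts would need a part-of-speech tagger, which this sketch does not attempt.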

Regarding word counts, what are the ratios of some word counts to others? What about common literary words vs the total count? Filter words, decorative, embellishment words vs the total?


Here’s some data (the means to acquire this data is below).

Let’s consider comma count to sentence count (Comma/Sent) as a measurement of “literary” intent. The higher the number the more lofty the writing (or the more Victorian…)

Charles Dickens’ Great Expectations had a Comma/Sent ratio of 200%. There were twice as many commas as periods.

Jack London’s White Fang, on the other hand, had a ratio of only 101%: there were about as many commas as periods.
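
As a rough sketch of how the Comma/Sent figure could be computed straight from a text file, the snippet below counts commas and periods and divides one by the other. The function name comma_per_sentence is hypothetical, and using the period count as the sentence count is the same simplification made in this post (ellipses and abbreviations such as "Mr." will inflate it).

```python
def comma_per_sentence(text):
    """Commas per period, as a percentage (periods stand in for sentences)."""
    periods = text.count(".")
    if periods == 0:
        raise ValueError("text contains no periods")
    return 100.0 * text.count(",") / periods

# Three commas over two periods
print(comma_per_sentence("Yes, sir. No, sir, not today."))  # 150.0
```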

If we examine the other writers and their works, this simple metric *seems* to correlate with our expectations. HG Wells and Burroughs have lower “literary” quotients than Jane Austen or Herman Melville.

So, are there other factors that we can use to investigate the literary vs genre vs popular vs what-have-you aspects of novels? And, primarily, can we build a system that can judge them?

ALANN
Title                           Author                Comma/Sent  Excl/Sent  Semi/Sent  Dial/Sent  Sing/Word
Adventures of Huckleberry Finn  Mark Twain            164.99%     10.41%     31.99%     32.51%     2.50%
Great Expectations              Charles Dickens       200.07%     11.56%     14.75%     46.07%     2.02%
Blue Across the Sea             Dave Cline            93.06%      2.44%      0.76%      32.48%     4.02%
Pride and Prejudice             Jane Austen           147.77%     8.07%      24.89%     28.58%     2.05%
Moby Dick                       Herman Melville       256.49%     23.57%     56.09%     19.49%     na
Tarzan of the Apes              Edgar Rice Burroughs  129.84%     3.99%      6.82%      22.90%     3.86%
Sense and Sensibility           Jane Austen           200.85%     11.32%     32.03%     31.40%     2.07%
Island of Dr. Moreau            HG Wells              115.16%     6.78%      12.83%     24.77%     6.15%
White Fang                      Jack London           101.10%     1.98%      4.81%      10.27%     4.15%
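
To see how the works line up on this one measure, the Comma/Sent column can be sorted from highest to lowest. The snippet below is just a convenience sketch; the values are copied from the table above and the variable names are made up.

```python
# Comma/Sent values (percent) copied from the table above
comma_sent = {
    "Adventures of Huckleberry Finn": 164.99,
    "Great Expectations": 200.07,
    "Blue Across the Sea": 93.06,
    "Pride and Prejudice": 147.77,
    "Moby Dick": 256.49,
    "Tarzan of the Apes": 129.84,
    "Sense and Sensibility": 200.85,
    "Island of Dr. Moreau": 115.16,
    "White Fang": 101.10,
}

# Highest ratio first
ranked = sorted(comma_sent, key=comma_sent.get, reverse=True)
for title in ranked:
    print(f"{comma_sent[title]:7.2f}%  {title}")
```

Sorted this way, Melville, Austen and Dickens sit at the top, while Wells, London and Burroughs fall toward the bottom.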

 

Here’s a site I found to help kickstart this concept:

https://www.online-utility.org/text/analyzer.jsp

If we go to Project Gutenberg and pick some books, let’s start with Adventures of Huckleberry Finn: https://www.gutenberg.org/ebooks/76

How does the count of the word “was” compare to the total word count?

Order Unfiltered word count Occurrences Percentage
1. and 6350 5.6714
2. the 4779 4.2683
3. i 3270 2.9206
4. a 3150 2.8134
5. to 2934 2.6205
6. it 2326 2.0774
7. was 2069 1.8479
8. he 1676 1.4969
9. of 1633 1.4585
10. in 1433 1.2799
11. you 1360 1.2147
12. that 1083 0.9673
13. but 1035 0.9244
14. so 961 0.8583
15. on 880 0.7860
16. up 861 0.7690
17. all 852 0.7610
18. we 848 0.7574
19. for 843 0.7529
20. me 823 0.7351
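
A frequency list like the one above can be approximated locally. This is a minimal sketch and not how the online tool works internally: the function name word_shares and the tokenizing rule (lowercased runs of letters with an optional straight apostrophe) are assumptions, so totals may differ a little from the tool's.

```python
import re
from collections import Counter

def word_shares(text, top=20):
    """Return (word, occurrences, percent of all words) for the most common words."""
    # Assumption: a word is a run of letters, optionally with one internal apostrophe
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words)
    total = len(words)
    return [(w, n, 100.0 * n / total) for w, n in counts.most_common(top)]

for word, n, share in word_shares("The cat was here and the dog was there", top=3):
    print(f"{word:10s} {n:5d} {share:8.4f}")
```

Looking up a single word such as “was” is then just a matter of scanning the returned rows for it.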

Now, what of the total single use words?

There were 2752 words used exactly once out of a total of 110016 words, which gives us a percentage of 2.5014%.
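
The single-use figure (Sing/Word) is the number of words that appear exactly once, divided by the total word count. Here is a small sketch of that calculation; single_use_percent is a hypothetical name and the tokenizing rule is an assumption, so results on a real text may not match the online tool exactly.

```python
import re
from collections import Counter

def single_use_percent(text):
    """Percent of the total word count made up of words used exactly once."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    singles = sum(1 for n in counts.values() if n == 1)
    return 100.0 * singles / len(words)

print(single_use_percent("a b b c"))  # 'a' and 'c' are used once: 2 of 4 words = 50.0

# The same arithmetic with the Huckleberry Finn counts quoted above (about 2.5%)
print(100.0 * 2752 / 110016)
```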

Using this tool: https://jumk.de/wortanalyse/word-analysis.php

General:
110016 words
554985 characters (with space)
450511 characters (without space)
104474 spaces
421151 letters
14 numbers
29346 others
2509 blank lines
11430 line breaks

Punctuation marks:
4870 times . (dot)
8035 times , (comma) commas to periods ratio percentage: 164.99%
729 times ? (question mark)
507 times ! (exclamation mark) written energy bangs per period: 10.41%
426 times : (colon)
1558 times ; (semicolon) semicolons as a % of sentence count: 31.99%
2973 times – (hyphen)
0 times / (slash)
3166 times ” (quote) dialog statements as a % of sentence count: 32.51%
5004 times ‘ (single quote)
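
The four punctuation ratios can be reproduced from the counts listed above. In this sketch the period count stands in for the sentence count, and the quote count is halved on the assumption that each dialog statement has an opening and a closing quote; that halving is inferred from the figures, and the names are made up for illustration.

```python
def pct(part, whole):
    """part as a percentage of whole."""
    return 100.0 * part / whole

# Huckleberry Finn counts from the punctuation list above
periods = 4870
commas = 8035
bangs = 507
semicolons = 1558
quotes = 3166

ratios = {
    "Comma/Sent": pct(commas, periods),
    "Excl/Sent": pct(bangs, periods),
    "Semi/Sent": pct(semicolons, periods),
    # Assumption: two quote marks per dialog statement
    "Dial/Sent": pct(quotes / 2, periods),
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}%")
```

Rounded to two decimals these come out to 164.99%, 10.41%, 31.99% and 32.51%.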



Now, let’s test another literary work… Great Expectations

Order Unfiltered word count Occurrences Percentage
1. the 8145 4.3638
2. and 7092 3.7996
3. i 6475 3.4690
4. to 5152 2.7602
5. of 4437 2.3772
6. a 4047 2.1682
7. in 3026 1.6212
8. that 2986 1.5998
9. was 2836 1.5194
10. it 2670 1.4305
11. he 2206 1.1819
12. you 2186 1.1712
13. had 2093 1.1213
14. my 2069 1.1085
15. me 1998 1.0704
16. his 1860 0.9965
17. as 1774 0.9504
18. with 1760 0.9429
19. at 1637 0.8770
20. on 1420 0.7608

Single-use word count: 3723 of 184378 words = 2.0192%

General:
184378 words
973823 characters (with space)
805308 characters (without space)
168515 spaces
761634 letters
5 numbers
43669 others
4125 blank lines
20011 line breaks

Punctuation marks:
8522 times . (dot)
17050 times , (comma) commas to periods ratio percentage: 200.07%
1216 times ? (question mark)
985 times ! (exclamation mark) written energy bangs per period: 11.56%
105 times : (colon)
1257 times ; (semicolon) semicolons as a % of sentence count: 14.75%
3483 times – (hyphen)
0 times / (slash)
7852 times ” (quote) dialog statements as a % of sentence count: 46.07%
2512 times ‘ (single quote)



And here’s the novel I wrote, Blue Across the Sea:

Order Unfiltered word count Occurrences Percentage
1. the 6772 7.8213
2. and 2641 3.0502
3. to 2353 2.7176
4. a 1808 2.0881
5. of 1807 2.0870
6. he 1028 1.1873
7. you 1001 1.1561
8. tillion 974 1.1249
9. in 962 1.1111
10. i 839 0.9690
11. it 839 0.9690
12. his 796 0.9193
13. that 695 0.8027
14. with 650 0.7507
15. they 632 0.7299
16. as 623 0.7195
17. from 613 0.7080
18. her 588 0.6791
19. we 534 0.6167
20. up 514 0.5936

(“was” came in further down, at around 280 instances…)

Single-use word count: 3469 of 86219 words = 4.0234%

General:
86219 words
475623 characters (with space)
391243 characters (without space)
84380 spaces
370677 letters
4 numbers
20562 others
156 blank lines
2460 line breaks

Punctuation marks:
6595 times . (dot)
6137 times , (comma) commas to periods ratio percentage: 93.06%
644 times ? (question mark)
161 times ! (exclamation mark) written energy bangs per period: 2.44%
11 times : (colon)
50 times ; (semicolon) semicolons as a % of sentence count: 0.76%
348 times – (hyphen)
1 times / (slash)
4284 times ” (quote) dialog statements as a % of sentence count: 32.48%
1912 times ‘ (single quote)

Additional literary works will be added here:

https://docs.google.com/spreadsheets/d/1Xop9GaBhjvXgA7dnupLlTin4VUNwNwPPrJWfGEMwxww/edit?usp=sharing


One response to “ALANN – Auto Lit Analysis Neural Net”

  • Anony Mole

    https://aeon.co/essays/how-ai-is-revolutionising-the-role-of-the-literary-critic

    Regarding the article, however, the author misses a huge aspect of literary analysis that continues to be ignored in this day and age of deep-wide neural network analysis: manuscript evaluation.

    Both this:

    https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/
    and this:
    https://anonymole.wordpress.com/2016/12/04/alann-auto-lit-analysis-neural-net/

    get into the concept that *tens of thousands* of new novels are written every year and must be evaluated by literary agents and publishers. There is a huge opportunity lurking here. The team that solves this issue is the team that can claim the right of best new author, hottest new best-sellers, NYTimes top-of-the-list novels for years to come.

    Currently, the process of manuscript submission for evaluation is as archaic as they come. It’s pure alchemy performed by cloaked agents in tall towers protected by obscure ramparts and digital moats. This process must change. And the researchers in this article are among the teams who could change it.

    Who cares if computers will ever write Dickens or Austen? We have thousands of authors / today / who need the services of deep AI for evaluating their work. Myself included.
