Monday, 5 May 2014

Classifying - prose and cons


Why do people classify? It's partly to make storage and retrieval efficient. If you know where to put something, you'll be able to find it again. However, there are many ways to classify. For example, where should you keep dinner forks? In a cupboard where you keep all things whose name begins with F? Or along with garden forks? If they have red handles you could store them with other red items. Using any of these strategies the forks would be easy to find. Usually, however, they're kept with things that have a similar function (i.e. cutlery), where they're most often needed (the kitchen), and in appropriate conditions.

Classifying is also descriptive, aiding understanding. Once you know that something is classified as cutlery, you know its purpose. Another advantage of keeping cutlery together is that you may find a more suitable item than the one you were initially looking for. Classifying brings into play other issues -

  • Partitioning - some classifications segregate, splitting items into non-overlapping sets - an item in one class can't be in another. Classifying all humans as "male" or "female", or all texts as "fiction" or "non-fiction", is problematic, though there may be good reasons for such a restrictive classification (in law or for sport, for example).
  • Primary/Secondary qualities - Knowing that E.coli is a bacterium tells you much about it. On the other hand, that my hair is brown is contingent, not central to identity.
  • Noun/Adjective - Adjectives are less threatening than nouns when classifying; they're roughly equivalent to secondary qualities. They tend to be less segregational. Describing something as "a poem" isn't the same as describing it as "poetic" - a poem can't be a painting, but a painting can be poetic.

Classifying can become a habit, but much of the time it has a purpose. In the UK, the law says that you have to be at least 17 years old to drive. This isn't descriptive - it's not claiming that on their 17th birthday people suddenly become responsible enough to drive. Putting tomatoes in the supermarket's vegetable section isn't asserting that tomatoes are vegetables, any more than finding "By Grand Central Station I Sat Down and Wept" in the poetry section of a library means that it's poetry.

How to classify

In statistics there are some techniques to help with classification. The basic ideas are worth knowing about. Let's start with a simple case. How could you train an alien or a computer to distinguish between dogs and family cars? The tried and trusted method is to take measurements (age, max speed, height, weight, etc) of known samples, then see which of those factors best discriminate between the 2 types of objects. In this situation perhaps one factor - weight - suffices. Anything weighing more than a certain amount is a car.
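As a rough sketch of the single-factor case, the Python below tries each midpoint between neighbouring weights as a cut-off and keeps the one that misclassifies the fewest known samples. The weights, labels and function names are all invented for illustration; they aren't real measurements.

```python
# Hypothetical training samples: (weight in kg, label). The numbers are made up.
samples = [
    (8.0, "dog"), (22.0, "dog"), (35.0, "dog"), (60.0, "dog"),
    (950.0, "car"), (1200.0, "car"), (1500.0, "car"), (1800.0, "car"),
]


def best_threshold(samples):
    """Return (threshold, errors) for the rule 'heavier than threshold => car'."""
    weights = sorted(w for w, _ in samples)
    # Candidate cut-offs: midpoints between neighbouring sorted weights.
    candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]
    best = None
    for t in candidates:
        # Count samples where the rule disagrees with the known label.
        errors = sum((w > t) != (label == "car") for w, label in samples)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


def classify(weight, threshold):
    """Apply the learned rule to a new, unlabelled item."""
    return "car" if weight > threshold else "dog"


threshold, errors = best_threshold(samples)
```

With samples this well separated, any cut-off between the heaviest dog and the lightest car works equally well. The same loop could be run over each measured factor in turn to see which one discriminates best.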

In fact, it's not necessary to "train" the system on already-classified samples. Factor analysis will be able to notice that there are 2 families of items, and it will identify the best discriminating factor(s). A more complex situation is determining whether a human face looks male or female. It's more complex partly because

  • it's unclear which easily measurable factors might be important - humans don't consciously analyse faces very much.
  • unlike the dog/car situation, the faces don't form 2 disjoint sets.

But the methodology used in the first example can still be applied here. First, decide on what to measure - it's best to play safe and measure many features whether you think they're relevant or not. Then get some humans to assess the gender of the faces. Having done that, the factors can be analysed automatically. Some factors can be eliminated as irrelevant, or because they duplicate information. Other factors might be only slightly skewed towards one gender. By taking a weighted combination of factors, a single number can be determined which optimally reflects the femininity of the face. In the image presented here, from InTech, 2 factors are being used, and the green line shows the axis along which the red and blue samples are most clearly separated.
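One standard way of finding such a weighted combination is Fisher's linear discriminant. The post doesn't say which method produced the image, so treat the following as a minimal sketch of one possible approach, using just 2 factors. The data points and names are invented for illustration.

```python
# Hypothetical data: two measurements per face, already labelled by human judges.
group_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]


def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Entries (sxx, sxy, syy) of the symmetric 2x2 scatter matrix about mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_weights(a, b):
    """Weights w = S^-1 (mean_b - mean_a), where S is the pooled within-group scatter."""
    ma, mb = mean(a), mean(b)
    axx, axy, ayy = scatter(a, ma)
    bxx, bxy, byy = scatter(b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy  # assumed non-zero for this sketch
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Multiply the difference of means by the inverse of the 2x2 matrix.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    return wx, wy


def score(point, w):
    """Collapse the two measurements into a single number."""
    return w[0] * point[0] + w[1] * point[1]


w = fisher_weights(group_a, group_b)
scores_a = [score(p, w) for p in group_a]
scores_b = [score(p, w) for p in group_b]
```

For this toy data every score in group_b comes out higher than every score in group_a, so the single number separates the two groups. Real faces overlap, so the scores would overlap too.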

Why bother with all this? Well, just as our eyes are limited (we need microscopes and telescopes to see nature the way other creatures do), so our ability to perceive patterns is limited.

  • Factor analysis might reveal the importance of factors that humans didn't think were important.
  • The analysis might reveal hidden factors (perhaps the distance between the eyes compared to the distance between the ears) that are significant.
  • The analysis might identify clumps of similar samples that humans missed.

Importantly, this process of identifying the factors useful for classifying can be automated and can deal with dozens of factors. See Wikipedia's Factor analysis page for details. Reducing variation to a single number is sometimes over-reductionist - it rather depends on the context, and the maths will tell us how reductionist our choice of parameters is. Other problems are to do with interpretation. In the example above -

  • "femininity" is in the eye of the beholder and changes according to fashion. That's not a problem for the method, which is merely trying to efficiently emulate the human's classifications - it's not trying to determine the Idealized notion of gender or beauty.
  • In the list of measurements taken, some significant metrics might have been left out. It's safer to take too many measurements and leave it to the methodology to eliminate the redundant values.
  • The results might be abused - males with (according to this method) very feminine faces might be picked on, for example.

In practice, humans classify lazily. They don't perform an in-depth study then classify. More often they try to make do with a glance. If that doesn't suffice, then they have another, longer (or more targeted) look.

Poetry and Prose

All that's a prequel to the discussion of poetry versus prose. I think authors are well advised to be aware of how readers are likely to process their texts. Faced with a text that looks like (or is described as) poetry, readers are likely to adopt a different initial reading strategy to the one they use for prose. In the course of reading the text they might switch strategies, in particular regarding reading-speed and linearity of reading. Such switches might be part of the writer's plans, though they might well irritate readers.

Sometimes - e.g. in a general literary magazine - readers will quickly identify the recommended reading mode of a text without the aid of classification. But most of the time the classification matters a lot. I know many avid prose readers who never go near poetry, and want libraries to have segregated "Fiction" and "Poetry" sections. If they have to read a paragraph of an unclassified book before discarding it as "poetry" they'll become grumpy.

An automated classification of texts can be performed by taking many measurements (number of words, frequency of line-breaks, frequency of adjectives, etc), and various sets of judges can be asked to classify the texts. If a text scores highly as "poetic" it may nevertheless contain some very unpoetic features, but the text probably won't be a newspaper article. I suspect that the resulting classification of texts into prose or poetry would be plausible - better than many a human could achieve (they're likely to try to make do with a glance), and probably no more prone to "misjudgements" than a human would be. Even a quick, lazy classification would work pretty well. For example, to a first approximation, literary readers will treat short texts as poems, and anything with line-breaks as poems.
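The quick, lazy classification might be sketched like this in Python. The measurements are a small subset of those listed above, and the two cut-off values (60 words, 12 words per line) are arbitrary assumptions of mine rather than anything derived from data.

```python
def measure(text):
    """Take a few crude measurements of a text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # ignore blank lines
    words = text.split()
    return {
        "word_count": len(words),
        "line_count": len(lines),
        "words_per_line": len(words) / len(lines) if lines else 0.0,
    }


def lazy_guess(text, max_poem_words=60, max_line_words=12):
    """First-approximation rule: short texts and texts with line-breaks are poems.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    m = measure(text)
    is_short = m["word_count"] < max_poem_words
    has_line_breaks = m["line_count"] > 1 and m["words_per_line"] <= max_line_words
    return "poem" if is_short or has_line_breaks else "prose"


# Invented sample texts.
short_text = "The rain came down\nall afternoon\nand nobody minded"
long_broken_text = "\n".join(["a short line of words"] * 40)
long_unbroken_text = " ".join(["word"] * 200)

guesses = [lazy_guess(t) for t in (short_text, long_broken_text, long_unbroken_text)]
```

A fuller version would pass the measurements, along with the judges' verdicts, to a statistical method instead of relying on hand-picked rules.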


If the poetry/prose classification is a user-friendly description for the customer's benefit, why do writers complain about their work being "pigeonholed"?

  • Misunderstanding the purpose of the classification can, like many other aspects of marketing, lead to difficulties.
  • If the classification partitions in a situation where it's unnecessary, it's restrictive. It may polarise in a way that affects production of texts - people won't write in genres that won't be bought or read. Prose-poems and pensées may suffer.
  • If classification promotes a secondary characteristic to a primary one, impressions may become distorted and out-of-date. Maybe line-breaks are becoming a secondary characteristic for some audiences.
  • Descriptions aren't neutral. Describing someone as "a woman poet", or a piece of work as "gay" rather than a love-story is more than mere description.

If an author's supposedly prose book goes in the poetry section of the library, what are the consequences? I presume there'll be fewer people browsing its spine. Describing an author's putative poem as "prose with line-breaks" is sometimes offered as adverse criticism, but the writer needn't receive it that way. It may indicate that public opinion is shifting, that genre-identification has become more reader-centred. If that leads to the piece being presented alongside other prose pieces, maybe it will be read by far more people than before.

