It’s time to give an update on what I’ve been up to.
Around Christmas I sent off for publication a revised and expanded version of my thesis, based on the research I’ve been doing here. The review process can take up to six months, assuming they even accept it, so it will be off my plate for a while. In the meantime, I’ve been aiming to collect my own data about the particular grammatical phenomenon I’m interested in.
My initial, unformed plan was to simply record some conversations in Macedonian between native speakers and play them back later for transcription, hoping I’d hit paydirt. But the structure I’m looking for is rare enough that this sort of amoeba approach to data gathering wouldn’t guarantee me much success for the amount of work involved.
On the recommendation of one of the professors I spoke with at UCSB, Prof. Carol Genetti, I decided to do an elicitation instead. She had had some success with writing a text in English containing contextual triggers that would elicit the structure she was looking for and then having bilingual informants first translate the text into the target language, then tell the story of the text without reference to either the original or their translation.
This sounded appropriate for what I’m interested in, especially because it might help me firm up my conclusions about the register differences between the standard and the spoken language with regard to this grammatical structure, and thus help me explain why it’s so rare in writing. However, I needed the correct contextual triggers.
Truth is, I have only the vaguest idea what those would be. Since I don’t have a native’s intuition about what kind of sentences in Macedonian I’d like to turn out, I don’t know what kind of sentences in English to put in. My own knowledge points me in the wrong direction for attacking the problem.
So that left me back at square one, without any of my own examples of this particular grammatical pattern. Then I remembered that several months back a professor in the United States, Prof. George Mitrevski, had offered me access to his ongoing Macedonian corpus project.
The only problem is that those texts are in raw, untagged, unencoded form. I had to download the 368 texts by hand, re-encode them into something readable on a Mac, and then group them into manageable chunks. That required me to supplement my very limited command-line and scripting knowledge with a lot of Google searching and FAQ reading.
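For the curious, here is a rough sketch of what that re-encode-and-group step could look like in Python. It is not the script I actually used. The source encoding (`cp1251`, a common legacy encoding for Cyrillic text), the chunk size, and the function names `reencode_file` and `group_into_chunks` are all assumptions made for illustration.

```python
from pathlib import Path


def reencode_file(src: Path, dst: Path, source_encoding: str = "cp1251") -> None:
    """Read src in a legacy encoding and write it back out as UTF-8.

    The default source encoding is an assumption; the real corpus files
    may well use something else.
    """
    text = src.read_bytes().decode(source_encoding)
    dst.write_text(text, encoding="utf-8")


def group_into_chunks(paths, chunk_size):
    """Split a collection of paths into sorted lists of at most chunk_size items."""
    ordered = sorted(paths)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
```

With 368 files and a hypothetical chunk size of 50, `group_into_chunks` would produce eight groups, the last one holding the remaining 18 files.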
Then I needed to do the search itself. This grammatical structure involves two sets of grammatical particles, a couple in each set, appearing together in one sentence in a certain order. Both had to be there, in the right order, and without certain other features like prepositions that would muddy the results. That sort of search can’t really be done with your standard “text in a searchbox and a wildcard here or there,” so I also had to brush up on regular expressions. Advantage of regular expressions: they let you do incredibly precise, efficient searches. Disadvantage of regular expressions: well… just take a look at my latest revision of the algorithm…
(?>((((\b[^.?!]*)(?>\bго\b|\bги\b|\bја\b)[^.?!]*)(?=(\b[^.?!]*)(?>(?
(?
(?
(?>(?
\bедна\b|(?
\bедни\b)[^.?!]*)(?=(\b[^.?!]*)(?>\bго\b|\bги\b|\bја\b)[^.?!]*)[^.?!]*)))
Not very readable, is it? Everything in regular expressions is tight, like a fine-tuned watch. If you have every punctuation mark, every little period and parenthesis and backslash in the right place, then the thing ticks away and returns you the dozen sentences you wanted out of several hundred thousand. One typo, though, and the thing explodes.
Explodes!
Okay, not that dramatic, it simply doesn’t work, but if you’ve been hacking away for hours at a stream of punctuation marks that would make even the most obscene cartoon character blush, you might want to make something explode.
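If you’d rather see the general idea without the punctuation soup, here is a much simplified sketch in Python. It is not a translation of my actual expression. It only shows the shape of the search: split the text into sentences, then keep a sentence if a word from one set turns up before a word from another set, unless that second word directly follows an excluded word. The two particle sets are the ones visible in the regex above; the excluded prepositions, the direction of the ordering, the function names, and the sample sentences are all assumptions made up for illustration.

```python
import re

# Particle sets taken from the regular expression above.
CLITICS = {"го", "ги", "ја"}
INDEFINITES = {"една", "едни"}
# Hypothetical list of prepositions to exclude; the real list may differ.
EXCLUDED_BEFORE = {"во", "на", "со"}

SENTENCE_RE = re.compile(r"[^.?!]+[.?!]?")  # naive sentence splitter
WORD_RE = re.compile(r"\w+")


def split_sentences(text):
    """Split text on ., ? and ! and drop empty pieces."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def matches(sentence, first=CLITICS, second=INDEFINITES, excluded=EXCLUDED_BEFORE):
    """True if a word from `first` precedes a word from `second`
    that is not immediately preceded by a word from `excluded`."""
    words = [w.lower() for w in WORD_RE.findall(sentence)]
    first_positions = [i for i, w in enumerate(words) if w in first]
    if not first_positions:
        return False
    earliest = first_positions[0]
    for i, w in enumerate(words):
        if w in second and i > earliest and words[i - 1] not in excluded:
            return True
    return False


def find_candidates(text):
    """Return the sentences worth reading by hand."""
    return [s for s in split_sentences(text) if matches(s)]


sample = "Ја видов една жена. Го видов во една куќа. Една жена дојде."
print(find_candidates(sample))
```

Working word by word like this is slower than a single regular expression, but each rule sits on its own line, which makes it far easier to see which one is responsible for a false positive.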
Anyway, my regular expression works, with occasional fine-tuning to reduce false positives. But because the corpus is raw, brute-force selection and subtraction only get me so far. After that, the computer hands the task over to me. So I’ve been sitting and reading sentences, mostly ruling them out as irrelevant. The corpus is a 1.8-gigabyte text file. Even with the severe pruning that the search algorithm does for me, that’s a lot of reading.
Much of the text so far has been gathered from transcriptions of Macedonian government debates. I am becoming uncomfortably familiar with local parliamentary rhetoric. And I’m torn between feeling exhausted with this busywork and lazy because I’m sitting in front of the computer all day, instead of IN THE FIELD gathering DATA from THE LOCALS like a REAL SOCIAL SCIENTIST.
But, fingers crossed, I’ll come away from this with a much more refined idea of exactly what I’m looking for and how to find it.
sewing circles are not soley made in trades of cloth
there’s spinsters all around us taking notes reporting on us