...making Linux just a little more fun!
Felipe Sanchez Martinez [fsanchez at dlsi.ua.es]
Hi,
Jimmy, very good explanation, :D
> Also; the rules that a-t-t generates are for the 'transfer only' mode
> of apertium-transfer: this example uses the chunk mode - most language
> pairs, unless the languages are very closely related, would really
> be best served with chunk mode. Converting a-t-t to support this is on
> my todo list, and though doing it properly may take a while, I can
> probably get a crufty, hacked version together fairly quickly. With a
> couple of sed scripts and an extra run of GIZA++ etc., we can also
> generate rules for the interchunk module.
We could exchange some ideas about that, and future improvements such as the use of context-dependent lexicalized categories. This would give a-t-t better generalization capabilities and make the set of inferred rules smaller.
> The need for the bilingual dictionary seemed a little strange to me at
> first, but Mikel, Apertium's BDFL, explained that it really helps to
> reduce bad alignments. This probably means that a-t-t can't generate
> rules for things like the Polish to English 'coraz piękniejsza' ->
> 'prettier and prettier', but I haven't checked that yet.

The bilingual dictionary is used to derive a set of restrictions that prevent an alignment template (AT) from being applied in conditions under which it would generate a wrong translation. Restrictions refer to the target language (TL) inflection information of the non-lexicalized words in the AT. For example, suppose that you want to translate the following phrase from English into Spanish:
"the narrow street", with the following morphological analysis (after tagging): "^the<det><def><sp>$ ^narrow<adj><sint>$ ^street<n><sg>$"
The bilingual dictionary says:

^narrow<adj><sint>$ -------> estrecho<adj><f><ND>$
^street<n><sg>$ -------> calle<n><f><sg>$

Suppose that you want to apply this AT:

SL: the<det><def><sp> <adj><sint> <n><sg>
TL: el<det><def><f><sg> <n><f><sg> <adj><f><sg>
Alignment: 1:1 2:3 3:2
Restrictions (indexes refer to the TL part of the AT): w_2 = n.f.*, w_3 = adj.*
* Note: "the" and "el" are lexicalized words
This AT generalizes:
1. The reordering rule that moves the adjective after the noun, the correct order in Spanish.
2. The gender agreement rule that propagates the gender from the noun to the adjective (and also to the article).

The restrictions mean:

w_2 --> the translation of the noun must be a feminine noun, whatever its number.
w_3 --> the translation of the adjective must be an adjective (obviously), but we do not care about the rest of the inflection information (gender and number).
Finally, suppose that you want to translate "the white car" into Spanish. After applying the AT we have the following transfer output:
^el<det><def><f><sg>$ ^coche<n><f><sg>$ ^blanco<adj><f><sg>$
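But the generator does not know how to inflect ^coche<n><f><sg>$, since that form does not exist in Spanish: "coche" is a masculine noun. The restriction w_2 = n.f.* is what prevents the AT from being applied here, because the bilingual dictionary translates "car" as a masculine noun.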
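The restriction check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how a-t-t or the transfer module actually implement it: the names `BIDIX`, `satisfies` and `at_applicable` are hypothetical, and the dictionary entries for "white" and "car" are assumed for the sake of the example (only "narrow" and "street" come from the text).

```python
# Hypothetical bilingual dictionary: SL lemma -> (TL lemma, TL tags).
# "narrow" and "street" follow the example in the text; "white" and
# "car" are assumed entries added for illustration.
BIDIX = {
    "narrow": ("estrecho", ["adj", "f", "ND"]),
    "street": ("calle", ["n", "f", "sg"]),
    "white": ("blanco", ["adj", "m", "sg"]),
    "car": ("coche", ["n", "m", "sg"]),
}

# The AT from the text: SL position -> TL position (1:1 2:3 3:2).
ALIGNMENT = {1: 1, 2: 3, 3: 2}

# Restrictions are indexed by TL position; position 1 ("el") is
# lexicalized and therefore has no restriction.
RESTRICTIONS = {2: "n.f.*", 3: "adj.*"}


def satisfies(restriction, tags):
    """Check a tag list against a restriction such as 'n.f.*'."""
    parts = restriction.split(".")
    if parts[-1] == "*":
        prefix = parts[:-1]
        return tags[:len(prefix)] == prefix
    return tags == parts


def at_applicable(sl_lemmas):
    """Return True if every restricted TL word passes its restriction."""
    for sl_pos, tl_pos in ALIGNMENT.items():
        restriction = RESTRICTIONS.get(tl_pos)
        if restriction is None:
            continue  # lexicalized word, nothing to check
        _tl_lemma, tl_tags = BIDIX[sl_lemmas[sl_pos - 1]]
        if not satisfies(restriction, tl_tags):
            return False
    return True


print(at_applicable(["the", "narrow", "street"]))  # True
print(at_applicable(["the", "white", "car"]))      # False
```

With these assumed entries, "the narrow street" passes both restrictions, while "the white car" is rejected because "coche" comes out of the dictionary as a masculine noun.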
The bilingual dictionary is also used to discard those bilingual phrase pairs that cannot be reproduced using that bilingual dictionary (what Mikel explained to you); otherwise, we would be inferring a set of restrictions making no sense at all.
Anyway, a bilingual dictionary can also be inferred from the parallel corpus, as Caseli does. But take care: the bilingual dictionary needs to explicitly code only the inflection information (after the part-of-speech) that changes from SL to TL. For example:

Spanish        Catalan
coche<n> ----> cotxe<n>
but
calle<n><f> ----> carrer<n><m>
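As a rough sketch of that convention, assuming that both sides carry the same number of tags and that everything up to the last differing tag is kept, an inferred entry could be trimmed like this. The function name `minimal_entry` is hypothetical and not part of any Apertium tool.

```python
def minimal_entry(sl_tags, tl_tags):
    """Trim a dictionary entry to the part-of-speech plus whatever
    inflection information differs between SL and TL.

    Assumption: both tag lists have the same length, and tags are kept
    up to (and including) the last position where the two sides differ.
    """
    last_diff = 0  # index 0 is the part-of-speech, which is always kept
    for i, (sl_tag, tl_tag) in enumerate(zip(sl_tags, tl_tags)):
        if sl_tag != tl_tag:
            last_diff = i
    return sl_tags[:last_diff + 1], tl_tags[:last_diff + 1]


# coche (m) -> cotxe (m): nothing changes, so only <n> is coded.
print(minimal_entry(["n", "m", "sg"], ["n", "m", "sg"]))
# calle (f) -> carrer (m): the gender changes, so it is coded.
print(minimal_entry(["n", "f", "sg"], ["n", "m", "sg"]))
```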
I hope to have been successful in explaining how the bilingual dictionary is used by apertium-transfer-tools.
good weekend.
--
Felipe Sánchez Martínez <fsanchez@dlsi.ua.es>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
https://www.dlsi.ua.es/~fsanchez
Jimmy O'Regan [joregan at gmail.com]
2008/7/11 Felipe Sánchez Martínez <fsanchez@dlsi.ua.es>:
> Hi,
>
> Jimmy, very good explanation, :D
Thanks
[snip]
> The bilingual dictionary is also used to discard those bilingual phrase
> pairs that cannot be reproduced using that bilingual dictionary (what
> Mikel explained to you); otherwise, we would be inferring a set of
> restrictions making no sense at all.
Yeah; that makes sense - what's more, it looks familiar - like something I read in a paper somewhere[1]
> Anyway, from the parallel corpus a bilingual dictionary can also be
> inferred as Caseli does. But, take care. The bilingual dictionary needs
> to explicitly code only the inflection information (after
> part-of-speech) that changes from SL to TL. For example:
>
> Spanish Catalan
> coche<n> ----> cotxe<n>
>
> but
>
> calle<n><f> ----> carrer<n><m>
Ok, that's true, but it hurts nothing to add the gender information when it's the same (provided, of course, it's added on both sides), and it's generally better practice to always include the gender information - to build dictionaries with the assumption that they will later be crossed. (Too few developers, too little time, too many languages :|)
[1] Felipe Sánchez-Martínez, Mikel L. Forcada. Automatic induction of shallow-transfer rules for open-source machine translation. In Proceedings of TMI, The Eleventh Conference on Theoretical and Methodological Issues in Machine Translation, p. ??-??, September 7-9, 2007, Skövde, Sweden. https://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf
Felipe Sanchez Martinez [fsanchez at dlsi.ua.es]
> Ok, that's true, but it hurts nothing to add the gender information
> when it's the same (provided, of course, it's added on both sides),
> and it's generally better practice to always include the gender
> information - to build dictionaries with the assumption that they will
> later be crossed. (Too few developers, too little time, too many
> languages :|)
If the bilingual dictionary explicitly codes all the inflection information, two changes are needed:
* In a-t-t: the method that derives the restrictions. I think this would be easy to do.
* In the transfer module: to check the restrictions (at translation time), the transfer module gets the equivalent from the bilingual dictionary as it is coded (with only the part-of-speech and the inflection information that changes from SL to TL); therefore, if all the inflection information were coded, this step would fail. This change is not obvious, at least to me.
regards.
Mikel L. Forcada [mlf at dlsi.ua.es]
Hi all,
as Felipe said, I couldn't have explained it better. Thank you, guys. The material looks a lot like something that could be added to the wiki, for instance in connection with the new pair stuff. I encourage you to do so.
> The need for the bilingual dictionary seemed a little strange to me
> at first, but Mikel, Apertium's BDFL,
To be honest, I am not dictating much recently, a lot goes on without me even noticing (and I love that). To be a dictator means having time! And thanks for the "benevolent" part, heheh.
> explained that it really helps to
> reduce bad alignments. This probably means that a-t-t can't generate
> rules for things like the Polish to English 'coraz piękniejsza' ->
> 'prettier and prettier', but I haven't checked that yet.

Won't work, you're right. Which of the "prettier" will align with the single instance found in the Polish sentence?
Mikel
Jimmy O'Regan [joregan at gmail.com]
2008/7/12 Mikel L. Forcada <mlf@dlsi.ua.es>:
> Hi all,
> as Felipe said, I couldn't have explained it better. Thank you, guys.
> The material looks a lot like something that could be added to the
> wiki, for instance in connection with the new pair stuff. I encourage
> you to do so.
>
>> The need for the bilingual dictionary seemed a little strange to me
>> at first, but Mikel, Apertium's BDFL,
>
> To be honest, I am not dictating much recently, a lot goes on without
> me even noticing (and I love that). To be a dictator means
> having time! And thanks for the "benevolent" part, heheh.
>
>> explained that it really helps to
>> reduce bad alignments. This probably means that a-t-t can't generate
>> rules for things like the Polish to English 'coraz piękniejsza' ->
>> 'prettier and prettier', but I haven't checked that yet.
>
> Won't work, you're right. Which of the "prettier" will align with the
> single instance found in the Polish sentence?
Yes, but, for Arky and any interested LG reader, this can be done using a rule. Well, a pair of rules, one for each direction. In the Polish to English direction, this is a good example of what we mean by a 'lexicalised rule': the 'coraz' part (the lemma) has to be considered as a whole, not just by the type of word. (This glosses over a lot of detail, but I'll be writing about writing rules in my article after next)
First, we need our categories, so we can match the pattern:
<section-def-cats>
  <def-cat n="coraz">
    <cat-item lemma="coraz" tags="preadv"/>
  </def-cat>
  <def-cat n="sint"><!--BCN-->
    <cat-item tags="adj.sint"/>
  </def-cat>
</section-def-cats>
Polish and English both have synthetic adjectives, but they don't always match: piękny - pretty, piękniejszy - prettier, but słynny - famous, słynniejszy - more famous. So, we need to be able to test this, so we add an attribute for adjective type (in the <section-def-attrs> part):
<def-attr n="a_adj">
  <attr-item tags="adj"/>
  <attr-item tags="adj.sint"/>
  <attr-item tags="adj.sint.comp"/>
  <attr-item tags="adj.sint.sup"/>
</def-attr>
Now, we're (more or less) ready to write a rule:
<rule comment="coraz sint">
  <pattern>
    <pattern-item n="coraz"/>
    <pattern-item n="sint"/>
  </pattern>
  <action>
    <!-- Here, we normally call a few macros, to check if we're at the
         start of a sentence, etc. -->
    <choose>
      <when>
        <test>
          <equal>
            <clip pos="2" side="tl" part="a_adj"/>
            <lit-tag v="adj.sint"/>
          </equal>
        </test>
        <out>
          <chunk name="sint_and_sint">
            <tags><tag><lit-tag v="SN"/></tag></tags>
            <lu>
              <clip pos="2" side="tl" part="whole"/>
            </lu>
            <b pos="1"/>
            <lu>
              <lit v="and"/>
              <lit-tag v="cnjcoo"/>
            </lu>
            <!-- the pattern contains only one blank (pos="1"), so any
                 further space is a plain <b/> -->
            <b/>
            <lu>
              <clip pos="2" side="tl" part="whole"/>
            </lu>
          </chunk>
        </out>
      </when>
      <otherwise>
        <out>
          <chunk name="more_and_more_adj">
            <tags><tag><lit-tag v="SN"/></tag></tags>
            <lu>
              <lit v="more"/>
              <lit-tag v="adv"/>
            </lu>
            <b pos="1"/>
            <lu>
              <lit v="and"/>
              <lit-tag v="cnjcoo"/>
            </lu>
            <b/>
            <lu>
              <lit v="more"/>
              <lit-tag v="adv"/>
            </lu>
            <b/>
            <lu>
              <clip pos="2" side="tl" part="whole"/>
            </lu>
          </chunk>
        </out>
      </otherwise>
    </choose>
  </action>
</rule>
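Stripped of the Apertium machinery, the decision this rule makes is small. The Python sketch below only mirrors the choice between the two output chunks; the function name `coraz_to_english` and its `synthetic` flag are hypothetical stand-ins for the test on the a_adj attribute, and it works on plain strings rather than on lexical units with tags.

```python
def coraz_to_english(adjective, synthetic):
    """Mirror the rule's two branches for 'coraz' + comparative.

    adjective: the English adjective as it comes out of the dictionary
               (e.g. "prettier" for a synthetic comparative, "famous"
               when English needs "more").
    synthetic: stand-in for the rule's test that the target-language
               adjective is tagged adj.sint.
    """
    if synthetic:
        # "sint_and_sint" chunk: adjective + "and" + adjective
        return f"{adjective} and {adjective}"
    # "more_and_more_adj" chunk: "more and more" + adjective
    return f"more and more {adjective}"


print(coraz_to_english("prettier", synthetic=True))
print(coraz_to_english("famous", synthetic=False))
```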
To speed things along, we'll pretend that the categories, etc. are all done on the English side:
<rule comment="sint and sint">
  <pattern>
    <pattern-item n="sint"/>
    <pattern-item n="and"/>
    <pattern-item n="sint"/>
  </pattern>
  <action>
    <choose>
      <when>
        <test>
          <equal>
            <!-- when both adjectives are exactly the same -->
            <clip pos="1" side="tl" part="whole"/>
            <clip pos="1" side="tl" part="whole"/>
          </equal>
        </test>
        <out>
          <chunk name="coraz_sint">
            <tags>
              <tag><lit-tag v="SN"/></tag>
              <!-- here, we would have tags for gender, etc. to be
                   filled in interchunk -->
            </tags>
            <lu>
              <lit v="coraz"/>
              <lit-tag v="preadv"/>
            </lu>
            <b pos="1"/>
            <lu>
              <clip pos="1" side="tl" part="whole"/>
              <!-- we would have more attributes here, linked to chunk
                   tags but this isn't the time or place to explain
                   that -->
            </lu>
          </chunk>
        </out>
      </when>
      <otherwise>
        <!-- we'll gloss over all the other possibilities -->
      </otherwise>
    </choose>
  </action>
</rule>
Jimmy O'Regan [joregan at gmail.com]
2008/7/12 Jimmy O'Regan <joregan@gmail.com>:
> <!-- when both adjectives are exactly the same -->
> <clip pos="1" side="tl" part="whole"/>
> <clip pos="1" side="tl" part="whole"/>
> </equal>
Argh! That, of course, should be:
<clip pos="1" side="tl" part="whole"/>
<clip pos="3" side="tl" part="whole"/>