transliteration for url_ålias

Mikhail Chekanov

Thursday 18 March 2004 4:00:12 am

Is it possible to add a url_alias manager to the administration interface?
The idea is to edit url_aliases within the admin interface, without manually changing the database.
This feature would be useful for non-English sites (by default, non-English characters are converted to _ ).

Currently, I've added transliteration to the convertToAlias function for my Russian site, in kernel/classes/ezurlalias.php:

function convertToAlias( $urlElement, $defaultValue = false )
{
//transliteration of one-character phonemes:
//russian A -> ascii a, russian a -> ascii a

$urlElement = strtr( $urlElement, "AaCc...;","AaSs..." );

//transliteration of multi-character phonemes
//norwegian Ø -> oe, ø -> oe; or russian "ч"=>"ch" ...
$urlElement = strtr( $urlElement, 
array( "ø"=>"oe", "Ø"=>"oe", "å"=>"aa", "Å"=>"aa", ...... ) );
....

So, the question to the eZ crew is: could this little hack be useful enough to be included in the source? Of course, it would have to be done in a smarter way:
1. Add an optional [transliteration] section to the locale files with the appropriate rules: two strings for one-character transliteration and one array for multi-character transliteration.
2. The system checks whether the rules exist in the appropriate locale file and transliterates $urlElement according to these rules.
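
A minimal sketch of how such rules might be applied once they are loaded. The array layout and the function name applyTransliterationRules are hypothetical (nothing in this thread shows such a function in the kernel), and the byte escapes assume an ISO-8859-1 site:

```php
<?php
// Hypothetical rule set, as it might be read from an optional
// [transliteration] section of a locale file.
// Bytes are ISO-8859-1: \xC6 = Æ, \xE6 = æ, \xD8 = Ø, \xF8 = ø, \xC5 = Å, \xE5 = å
$rules = array(
    'from'  => "\xC6\xE6",   // one-character rules: source characters...
    'to'    => "Aa",         // ...and their replacements, position by position
    'multi' => array( "\xD8" => "oe", "\xF8" => "oe", "\xC5" => "aa", "\xE5" => "aa" ),
);

function applyTransliterationRules( $urlElement, $rules )
{
    // One-character rules: byte-for-byte translation, only safe in 8-bit charsets
    if ( isset( $rules['from'] ) && isset( $rules['to'] ) )
        $urlElement = strtr( $urlElement, $rules['from'], $rules['to'] );

    // Multi-character rules: substring replacement via the array form of strtr
    if ( !empty( $rules['multi'] ) )
        $urlElement = strtr( $urlElement, $rules['multi'] );

    return $urlElement;
}

// "blåbær" in ISO-8859-1
echo applyTransliterationRules( "bl\xE5b\xE6r", $rules ), "\n";
```

With an empty rule set the element is returned unchanged, so a locale without a [transliteration] section would behave as it does today.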

--
mike
#6595551

Gunnstein Lye

Thursday 18 March 2004 8:02:45 am

I agree that this is something we need! I have done the same thing as you for the Norwegian characters. It's a good idea to have settings for this in the locale files. INI files are somewhat slow at the moment, but that should not be much of a problem in this case. I'll do some research.

Trond Åge Kvalø

Thursday 18 March 2004 1:19:01 pm

Hello Gunnstein!

Is this something you are willing to share? I have the exact same problem on a large portal we're making at the moment. The Norwegian characters become _ in the URLs.

I could try to write a function like the one above, but I have a very bad feeling about tampering with the kernel, and if you already have invented the wheel...

best regards
trondåge

Jan Borsodi

Friday 19 March 2004 12:34:48 am

It's a good idea,

However there are several problems with the implementation.

1. Character sets/encodings
The code positions of characters vary from charset to charset, so this needs to be integrated with the i18n library to be handled properly. There's also the problem of non-8-bit charsets (e.g. UTF-8), which use multiple bytes per character.

A simple way of solving this now is to turn the string into a Unicode array using

$codec =& eZTextCodec::instance( false, 'unicode' );
$urlElementArray = $codec->convertString( $urlElement );

Then replace the characters using their Unicode values and convert the array back:

$reverseCodec =& eZTextCodec::instance( 'unicode', false );
$urlElement = $reverseCodec->convertString( $urlElementArray );

However it should be noted that this is not very fast.

2. Unicode
A good implementation should provide conversion for all characters in Unicode. For instance, a site could be running UTF-8 and have articles in multiple languages.

Actually, this type of conversion is similar to lowercasing, uppercasing and search normalization, all of which should be handled by the i18n system some day.
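
For what it's worth, the replace-by-Unicode-value step from point 1 can be sketched without eZTextCodec on newer PHP versions. This is only an illustration, not the i18n library's API: transliterateUtf8 and the contents of $map are made up, the input is assumed to be UTF-8, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical map from Unicode characters (written by code point) to ASCII
$map = array(
    "\u{00C6}" => "AE", "\u{00E6}" => "ae",   // Æ, æ
    "\u{00D8}" => "OE", "\u{00F8}" => "oe",   // Ø, ø
    "\u{00C5}" => "AA", "\u{00E5}" => "aa",   // Å, å
);

function transliterateUtf8( $utf8, array $map )
{
    // Split into whole characters instead of bytes; the /u modifier makes
    // PCRE treat the subject as UTF-8
    $chars = preg_split( '//u', $utf8, -1, PREG_SPLIT_NO_EMPTY );
    if ( $chars === false )
        return $utf8;   // not valid UTF-8: leave the input untouched

    $result = '';
    foreach ( $chars as $char )
        $result .= isset( $map[$char] ) ? $map[$char] : $char;

    return $result;
}

echo transliterateUtf8( "bl\u{E5}b\u{E6}r", $map ), "\n";   // input is "blåbær" in UTF-8
```

Characters without an entry in the map pass through untouched, so the remaining non-ASCII characters would still have to be handled by the existing alias conversion.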

--
Amos

Documentation: http://ez.no/ez_publish/documentation
FAQ: http://ez.no/ez_publish/documentation/faq

Gunnstein Lye

Friday 19 March 2004 5:04:08 am

Hi Trond,

For fixing just the Norwegian characters, I suggest you use the solution by Mikhail Chekanov above. It is cleaner than mine. (Note: This will not work if you use UTF-8.) Currently, there is no way around tampering with the kernel.

Trond Åge Kvalø

Friday 19 March 2004 9:27:56 am

Ok, just to make sure I don't f**k up too much;

All I have to do is add the following line at the top of the convertToAlias function:

function convertToAlias( $urlElement, $defaultValue = false )
{
//transliteration of one-character phonemes:
//norwegian Æ -> ascii A, norwegian æ -> ascii a

$urlElement = strtr( $urlElement, "ÆæØøÅå;","AaOoAa" );

Am I correct or is there something I've misunderstood?

best regards
trondåge

Georg Franz

Friday 19 March 2004 9:47:04 am

Hi,

just another comment to the url_alias conversion:

I've talked with two search engine experts. They say that Google likes "-" in URLs more than "_".

Or, to put it another way:
In a Google search result

/this-is/a-test-url

will be ranked higher than

/this_is/a_test_url

Can anybody confirm this?

Kind regards,
Emil.

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Mikhail Chekanov

Monday 22 March 2004 4:00:52 am

Trond Åge Kvalø wrote:
>All I have to do is add the following line at the top of the convertToAlias function:

function convertToAlias( $urlElement, $defaultValue = false )
{
$urlElement = strtr( $urlElement, "ÆæØøÅå;","AaOoAa" );
...

>Am I correct or is there something I've misunderstood?
In case you want to replace "Æ" with "a", not with "Aa", this is correct, but you need to remove one semicolon:

$urlElement = strtr( $urlElement, "ÆæØøÅå","AaOoAa" );

---
Emil Webber wrote:
>just another comment to the url_alias conversion:
>I've talked with two search engine experts. They say that Google likes "-" in URLs more than "_".
>Can anybody confirm this?

Maybe they are right: AFAIK, Google counts "word1-word2" as two words in its page-ranking formula, whereas "word1_word2" counts as one meaningless word.
---

I think there are 2 possible solutions to deal with i18n of the aliases:
1st: the transliteration approach described above, but there is a problem with UTF-8. Do you think some slowdown is critical? This is a one-time operation, isn't it?
2nd: a special text field in the admin interface to submit/edit the url_alias manually.
---
Jan Borsodi wrote:
>A simple way of solving this now is to turn the string into a Unicode array...

Or something more usual, isn't it?
At one of my sites I'm using cp1251 within my templates and UTF-8 for the database/site for some historical reasons ;), so I've tested this code:

    function convertToAlias( $urlElement, $defaultValue = false )
    {
        include_once( 'lib/ezi18n/classes/eztextcodec.php' );
        $codec =& eZTextCodec::instance( false, 'cp1251' );
        $urlElementArray = $codec->convertString( $urlElement );
        $urlElementArray = strtr( $urlElementArray, "Aa...", "aa..." );
        $urlElementArray = strtr( $urlElementArray, array( "z"=>"zh" ));
        $reverseCodec =& eZTextCodec::instance( 'cp1251', false );
        $urlElement = $reverseCodec->convertString( $urlElementArray );
        ...

This works well enough... but it becomes too complicated to be included as self-tuning code, because we would have to detect charsets, not only transliteration strings/arrays.

--
mike
#6595551

Trond Åge Kvalø

Monday 22 March 2004 5:52:44 am

> >Am I correct or is there something I've misunderstood?
> In case you want to replace "Æ" with "a", not with "Aa",
> this is correct, but you need to remove one semicolon:
> $urlElement = strtr( $urlElement, "ÆæØøÅå","AaOoAa" );

Ok, thanks Mikhail.

Just one question, though. The way I've written it now, wouldn't "Æ" be replaced with "A" and "æ" with "a" etc.?

I did remove the semicolon as well, but it doesn't seem to work. Any ideas why?

best regards
trondåge

Mikhail Chekanov

Tuesday 23 March 2004 2:17:01 am

>Just one question, though. The way I've written it now, wouldn't "Æ" be replaced with "A" and "æ" with "a" etc.?
Yes, exactly.

>I did remove the semi-colon also, but it doesn't seem to work. Any ideas why?
What charset do you use? As you can see above, there is a problem with multi-byte encodings (e.g. UTF-8)...
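
To illustrate the multi-byte problem: the three-argument form of strtr translates single bytes, so it cannot match a character that is stored as two bytes in UTF-8, while the array form replaces whole substrings. A small sketch, assuming UTF-8 input and PHP 7 or later for the "\u{...}" escapes (the variable names are just for illustration):

```php
<?php
$ae   = "\u{00E6}";                     // æ encoded as UTF-8
$aa   = "\u{00E5}";                     // å encoded as UTF-8
$word = "bl" . $aa . "b" . $ae . "r";   // "blåbær"

// Each of these characters occupies two bytes in UTF-8
echo strlen( $ae ), "\n";               // 2

// Three-argument form: works on bytes, so the two-byte characters get mangled
$byteWise = strtr( $word, $ae . $aa, "aa" );

// Array form: replaces whole substrings, so multi-byte characters are matched intact
$pairWise = strtr( $word, array( $ae => "a", $aa => "a" ) );

var_dump( $byteWise === "blabar" );     // bool(false)
var_dump( $pairWise === "blabar" );     // bool(true)
```

On a site that really is ISO-8859-1 everywhere, each of these characters is a single byte and the three-argument form behaves as expected.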

--
mike
#6595551

Trond Åge Kvalø

Tuesday 23 March 2004 3:46:51 am

>> Just one question, though. The way I've written it now, wouldn't "Æ" be replaced
>> with "A" and "æ" with "a" etc..?
> Yes, exactly.

Ok, got that one then :-)

>> I did remove the semi-colon also, but it doesn't seem to work. Any ideas why?
> What charset do you use? As you can see above, there is a problem with multi-byte
> encodings (e.g. UTF-8)...

This is my charset in my pagelayout.tpl
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But on second thought you probably mean the charset in my site.ini.append file, right?

Now let's see.... I have #?ini charset="iso-8859-1"
and in the database settings I have Charset=iso-8859-1

Any other place I should look?

best regards
trondåge

Mikhail Chekanov

Wednesday 24 March 2004 2:35:29 am

> Any other place I should look?

Sorry, I haven't any useful ideas... If you have ISO everywhere, I don't see any errors... :(

--
mike
#6595551

Gunnstein Lye

Wednesday 24 March 2004 8:01:02 am

Try running update/common/scripts/updateniceurls.php

Trond Åge Kvalø

Wednesday 24 March 2004 10:09:28 am

> Try running update/common/scripts/updateniceurls.php

Total updates 0/99 ??

(After some tweaking of the code so that it found the includes, and moving the
$argv = $_SERVER['argv'];
to the top so the argv variable in line 124 wasn't undefined)

And nothing happens with my URLs. I am using nice URLs when I don't see the content/view/full/xyz, right?

trondåge

Gunnstein Lye

Friday 26 March 2004 3:30:45 am

> And nothing happens with my URL's. I am using nice urls when I don't see the
> content/view/full/xyz, right?

Yes. Well, I'm out of suggestions now.

Georg Franz

Sunday 28 March 2004 12:15:28 pm

Hi Trond,

I've altered the kernel/classes/ezurlalias.php:

function convertToAlias( $urlElement, $defaultValue = false )
{
    include_once ( 'path/to/gwf_textutils.php' );
    $urlElement = gwf_TextUtils::convertToAlias ( $urlElement );
    
    if ( strlen( $urlElement ) == 0 )
    {
        if ( $defaultValue === false )
            $urlElement = '-1';
        else
        {
            $urlElement = $defaultValue;
            $urlElement = gwf_TextUtils::convertToAlias ( $urlElement );
        }
    }
    return $urlElement;
}

You need my "text util" class which can be found at
http://ez.no/community/contributions/hacks/gwf_textutils

in gwf_TextUtils::convertToAlias

the main conversion is done with
$specialChars = array ( "à", "á", "â", "ã", "ä", "å", "æ", "è", "é", "ê", "ß", " ", "'", "´", "`",
"ë", "Ç", "í", "ì", "ò", "ó", "ô", "õ", "ö", "ù", "ú", "û", "ü");
$normalChars = array ( "a", "a", "a", "a", "ae", "a", "ae", "e", "e", "e", "ss", "-", "", "", "",
"e", "c", "i", "i", "o", "o", "o", "o", "oe", "u", "u", "u", "ue");

So if you have additional characters that should be "converted", you have to add them to both arrays, at the same position in each.
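
How two parallel arrays like these are typically applied can be sketched with str_replace. This is a guess at the general technique, not the actual gwf_TextUtils code; the shortened arrays and the function name sketchConvertToAlias are made up for the example:

```php
<?php
function sketchConvertToAlias( $text )
{
    // Shortened, hypothetical versions of the two arrays; entry N of
    // $specialChars is replaced by entry N of $normalChars
    $specialChars = array( "ä", "ö", "ü", "ß", " " );
    $normalChars  = array( "ae", "oe", "ue", "ss", "-" );

    // str_replace matches whole substrings, so this also works for
    // multi-byte characters as long as file and data share one encoding
    return str_replace( $specialChars, $normalChars, $text );
}

echo sketchConvertToAlias( "größe ändern" ), "\n";
```

Note that str_replace works through the pairs in order, so a replacement can itself be hit by a later pair. If the second array is shorter than the first, the surplus search strings are replaced by an empty string.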

After doing the "hack", you have to run
update/common/scripts/updateniceurls.php

Kind regards,
Emil
(alias Georg :-)

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Gunnstein Lye

Wednesday 05 May 2004 8:12:15 am

I have made a locale-based fix for this that should work well with Unicode.

http://ez.no/community/contributions/hacks/url_alias_transliteration