transliteration for url_ålias

Mikhail Chekanov

Thursday 18 March 2004 4:00:12 am

Is it possible to add a url_alias manager to the administration interface?
The idea is to change URL aliases within the admin interface, without manually editing the database.
This feature would be useful for non-English sites (by default, non-English characters are converted to _).

For now, I've added transliteration to the convertToAlias function in kernel/classes/ezurlalias.php for my Russian site:

function convertToAlias( $urlElement, $defaultValue = false )
{
//transliteration of one-character phonemes:
//Russian A -> ascii a, Russian a -> ascii a

$urlElement = strtr( $urlElement, "AaCc...;","AaSs..." );
    
//transliteration of multi-character phonemes
//Norwegian Ø -> oe, ø -> oe; or Russian "ч"=>"ch" ...
$urlElement = strtr( $urlElement, 
array( "ø"=>"oe", "Ø"=>"oe", "å"=>"aa", "Å"=>"aa", ...... ) );
....

So, the question to the eZ crew is: could this little hack be useful enough to be included in the source? Of course, it would have to be done in a smarter way:
1. Add an optional [transliteration] section to the locale files, with the appropriate rules: two strings for one-character transliteration and one array for multi-character transliteration.
2. The system checks whether the rules exist in the appropriate locale file and transliterates $urlElement according to them.
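
A rough sketch of how the two steps above might fit together. The rule layout (keys `from`, `to`, `map`) and the function name `transliterateUrlElement` are hypothetical, not part of eZ Publish, and the rules are hard-coded here instead of being read from a locale file. It assumes a single-byte charset (ISO-8859-1, written as byte escapes), because the string form of strtr works byte by byte.

```php
<?php
// Hypothetical rule set, standing in for a [transliteration] section of a locale file.
// Bytes are ISO-8859-1: \xC6=Æ \xE6=æ \xD8=Ø \xF8=ø \xC5=Å \xE5=å
$rules = array(
    'from' => "\xC6\xE6",          // one-character rules: source characters
    'to'   => "Aa",                // one-character rules: replacements, same order
    'map'  => array(               // multi-character rules
        "\xD8" => 'oe', "\xF8" => 'oe',
        "\xC5" => 'aa', "\xE5" => 'aa',
    ),
);

function transliterateUrlElement( $urlElement, $rules )
{
    // Apply the one-character rules only if both strings are present.
    if ( isset( $rules['from'], $rules['to'] ) )
        $urlElement = strtr( $urlElement, $rules['from'], $rules['to'] );
    // Apply the multi-character rules only if the array is non-empty.
    if ( !empty( $rules['map'] ) )
        $urlElement = strtr( $urlElement, $rules['map'] );
    return $urlElement;
}

// "blåbær" in ISO-8859-1
echo transliterateUrlElement( "bl\xE5b\xE6r", $rules ), "\n";   // prints "blaabar"
```

If a locale defines no rules, the function returns the element unchanged, so the existing fallback to _ would still apply.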

--
mike
#6595551

Gunnstein Lye

Thursday 18 March 2004 8:02:45 am

I agree that this is something we need! I have done the same thing as you for the Norwegian characters. It's a good idea to have settings for this in the locale files. INI files are somewhat slow at the moment, but that should not be much of a problem in this case. I'll do some research.

Trond Åge Kvalø

Thursday 18 March 2004 1:19:01 pm

Hello Gunnstein!

Is this something you are willing to share? I have the exact same problem on a large portal we're building at the moment. The Norwegian characters become _ in the URLs.

I could try to write a function like the one above, but I have a very bad feeling about tampering with the kernel, and if you have already invented the wheel...

best regards
trondåge

Jan Borsodi

Friday 19 March 2004 12:34:48 am

It's a good idea. However, there are several problems with the implementation.

1. Character sets/encodings
The positions of characters vary from charset to charset, so this needs to be integrated with the i18n library to be handled properly. There is also the problem of non-8-bit charsets (e.g. utf8), which use multiple bytes per character.

A simple way of solving this now is to turn the string into a Unicode array using

$codec =& eZTextCodec::instance( false, 'unicode' );
$urlElementArray = $codec->convertString( $urlElement );

Then replace the characters using their Unicode values and convert the result back.

$reverseCodec =& eZTextCodec::instance( 'unicode', false );
$urlElement = $reverseCodec->convertString( $urlElementArray );

However it should be noted that this is not very fast.
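
The same idea can be illustrated without eZTextCodec. The sketch below is not the eZ Publish API: it assumes UTF-8 input and PHP 7 or later (for the "\u{...}" escapes), splits the string into whole characters with PCRE's /u modifier, and looks each character up in a table keyed by code point escapes. The function name `transliterateUtf8` and the table are hypothetical.

```php
<?php
// Hypothetical table, keyed by Unicode code point (PHP 7+ "\u{...}" escapes).
$table = array(
    "\u{00C6}" => 'AE', "\u{00E6}" => 'ae',   // Æ æ
    "\u{00D8}" => 'OE', "\u{00F8}" => 'oe',   // Ø ø
    "\u{00C5}" => 'AA', "\u{00E5}" => 'aa',   // Å å
    "\u{0447}" => 'ch',                        // Cyrillic ч
);

function transliterateUtf8( $urlElement, $table )
{
    // Split into whole characters instead of bytes; requires valid UTF-8.
    $chars = preg_split( '//u', $urlElement, -1, PREG_SPLIT_NO_EMPTY );
    if ( $chars === false )
        return $urlElement; // invalid UTF-8: leave the input untouched

    foreach ( $chars as $i => $char )
    {
        if ( isset( $table[$char] ) )
            $chars[$i] = $table[$char];
    }
    return implode( '', $chars );
}

// "blåbær" in UTF-8
echo transliterateUtf8( "bl\u{00E5}b\u{00E6}r", $table ), "\n";   // prints "blaabaer"
```

Because the lookup works on whole characters, the same table can serve articles in several languages on one UTF-8 site.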

2. Unicode
A good implementation should provide conversion for all characters in Unicode. For instance, a site could be running utf8 and have articles in multiple languages.

Actually, this type of conversion is similar to lowercasing, uppercasing and search normalization, all of which should be handled by the i18n system some day.

--
Amos

Documentation: http://ez.no/ez_publish/documentation
FAQ: http://ez.no/ez_publish/documentation/faq

Gunnstein Lye

Friday 19 March 2004 5:04:08 am

Hi Trond,

For fixing just the Norwegian characters, I suggest you use the solution by Mikhail Chekanov above. It is cleaner than mine. (Note: This will not work if you use UTF-8.) Currently, there is no way around tampering with the kernel.

Trond Åge Kvalø

Friday 19 March 2004 9:27:56 am

Ok, just to make sure I don't f**k up too much;

All I have to do is add the following line at the top of the convertToAlias function:

function convertToAlias( $urlElement, $defaultValue = false )
{
//transliteration of one-character phonemes:
//Norwegian Æ -> ascii A, Norwegian æ -> ascii a

$urlElement = strtr( $urlElement, "ÆæØøÅå;","AaOoAa" );

Am I correct or is there something I've misunderstood?

best regards
trondåge

Georg Franz

Friday 19 March 2004 9:47:04 am

Hi,

just another comment to the url_alias conversion:

I've talked with two search engine experts. They say that Google likes "-" in URLs more than "_".

Or to put it another way:
In a Google search result

/this-is/a-test-url

will be ranked higher than

/this_is/a_test_url

Can anybody confirm this?

Kind regards,
Emil.

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Mikhail Chekanov

Monday 22 March 2004 4:00:52 am

Trond Åge Kvalø wrote:
>All I have to do is to add the following line at the top of the convertToAlias-function:

function convertToAlias( $urlElement, $defaultValue = false )
{
$urlElement = strtr( $urlElement, "ÆæØøÅå;","AaOoAa" );
...

>Am I correct or is there something I've misunderstood?
If you want to replace "Æ" with a single "A" rather than with "Aa", this is correct, but you need to remove the semicolon:

$urlElement = strtr( $urlElement, "ÆæØøÅå","AaOoAa" );
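
For reference, PHP's three-argument strtr pairs the characters of the two strings by position and, when the strings differ in length, ignores the extra characters of the longer one. A small check, using plain ASCII so that it does not depend on the charset (the variable names are only for illustration):

```php
<?php
// "abc;" is one character longer than "xyz": the trailing ";" has no partner
// and is ignored, so both calls give the same result.
$withSemicolon    = strtr( "a;b;c", "abc;", "xyz" );
$withoutSemicolon = strtr( "a;b;c", "abc", "xyz" );

echo $withSemicolon, "\n";      // prints "x;y;z"
echo $withoutSemicolon, "\n";   // prints "x;y;z"
```

So a stray trailing semicolon should be harmless here, though removing it keeps the two strings visibly aligned.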

---
Emil Webber wrote:
>just another comment to the url_alias conversion:
>I've talked with two search engine experts. They say that Google likes "-" in URLs more than "_".
>Can anybody confirm this?

Maybe they are right: AFAIK, Google counts "word1-word2" as two words in its page-ranking formula, whereas "word1_word2" counts as one meaningless word.
---

I think there are 2 possible solutions to deal with i18n of the aliases:
1st: transliteration, as described above, although there is a problem with UTF-8. Do you think some slowdown is critical? This is a one-time operation, isn't it?
2nd: a special text field in the admin interface to submit/edit the url_alias manually.
---
Jan Borsodi wrote:
>A simple of solving this now is to turn the string into a Unicode array...

Or something more conventional, isn't it?
On one of my sites I'm using cp1251 in my templates and UTF-8 for the database/site, for historical reasons ;), so I've tested this code:

    function convertToAlias( $urlElement, $defaultValue = false )
    {
        include_once( 'lib/ezi18n/classes/eztextcodec.php' );
        $codec =& eZTextCodec::instance( false, 'cp1251' );
        $urlElementArray = $codec->convertString( $urlElement );
        $urlElementArray = strtr( $urlElementArray, "Aa...", "aa..." );
        $urlElementArray = strtr( $urlElementArray, array( "z"=>"zh" ));
        $reverseCodec =& eZTextCodec::instance( 'cp1251', false );
        $urlElement = $reverseCodec->convertString( $urlElementArray );
        ...

This works well enough... but it becomes too complicated to include as self-configuring code, because we would have to detect charsets, not just load transliteration strings/arrays.

--
mike
#6595551

Trond Åge Kvalø

Monday 22 March 2004 5:52:44 am

> >Am I correct or is there something I've misunderstood?
> In case you want to replace "Æ" with "a", not with "Aa",
> this is correct, but you need to remove one semicolon:
> $urlElement = strtr( $urlElement, "ÆæØøÅå","AaOoAa" );

Ok, thanks Mikhail.

Just one question, though. The way I've written it now, wouldn't "Æ" be replaced with "A" and "æ" with "a", etc.?

I did remove the semi-colon also, but it doesn't seem to work. Any ideas why?

best regards
trondåge

Mikhail Chekanov

Tuesday 23 March 2004 2:17:01 am

>Just one question, though. The way I've written it now, wouldn't "Æ" be replaced with "A" and "æ" with "a" etc..?
Yes, exactly.

>I did remove the semi-colon also, but it doesn't seem to work. Any ideas why?
What charset do you use? As you can see above, there is a problem with multi-byte encodings (e.g. UTF-8)...
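
To illustrate the multi-byte problem: in UTF-8, "å" is two bytes, so the string form of strtr, which works byte by byte, mangles it, while the array form replaces whole substrings and copes. A sketch, assuming PHP 7 or later for the "\u{...}" escape; the variable names are only for illustration:

```php
<?php
$aring = "\u{00E5}";   // "å" in UTF-8 is the two bytes C3 A5

// String form: bytes are paired by position, so only C3 gets a partner ("a")
// and A5 is left behind as a stray byte.
$broken = strtr( "bl{$aring}", $aring, "a" );

// Array form: the whole two-byte sequence is matched and replaced.
$fixed = strtr( "bl{$aring}", array( $aring => "a" ) );

echo bin2hex( $broken ), "\n";   // prints "626c61a5"
echo $fixed, "\n";               // prints "bla"
```

With a single-byte charset such as iso-8859-1 throughout, both forms behave the same, so this would not explain a failure there.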

--
mike
#6595551

Trond Åge Kvalø

Tuesday 23 March 2004 3:46:51 am

>> Just one question, though. The way I've written it now, wouldn't "Æ" be replaced
>> with "A" and "æ" with "a" etc..?
> Yes, exactly.

Ok, got that one then :-)

>> I did remove the semi-colon also, but it doesn't seem to work. Any ideas why?
> What charset do you use? As you can see above, there is a problem with multi-byte
> encodings (e.g. UTF-8)...

This is my charset in my pagelayout.tpl
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But on second thought you probably mean the charset in my site.ini.append file, right?

Now let's see.... I have #?ini charset="iso-8859-1"
and in the database settings I have Charset=iso-8859-1

Any other place I should look?

best regards
trondåge

Mikhail Chekanov

Wednesday 24 March 2004 2:35:29 am

> Any other place I should look?

Sorry, I haven't any useful ideas... If you have ISO everywhere, I don't see any errors... :(

--
mike
#6595551

Gunnstein Lye

Wednesday 24 March 2004 8:01:02 am

Try running update/common/scripts/updateniceurls.php

Trond Åge Kvalø

Wednesday 24 March 2004 10:09:28 am

> Try running update/common/scripts/updateniceurls.php

Total updates 0/99 ??

(After some tweaking of the code so that it found the includes, and moving the
$argv = $_SERVER['argv'];
to the top so the argv variable in line 124 wasn't undefined)

And nothing happens with my URLs. I am using nice URLs when I don't see the content/view/full/xyz, right?

trondåge

Gunnstein Lye

Friday 26 March 2004 3:30:45 am

> And nothing happens with my URL's. I am using nice urls when I don't see the
> content/view/full/xyz, right?

Yes. Well, I'm out of suggestions now.

Georg Franz

Sunday 28 March 2004 12:15:28 pm

Hi Trond,

I've altered the kernel/classes/ezurlalias.php:

function convertToAlias( $urlElement, $defaultValue = false )
{
    include_once ( 'path/to/gwf_textutils.php' );
    $urlElement = gwf_TextUtils::convertToAlias ( $urlElement );
    
    if ( strlen( $urlElement ) == 0 )
    {
        if ( $defaultValue === false )
            $urlElement = '-1';
        else
        {
            $urlElement = $defaultValue;
            $urlElement = gwf_TextUtils::convertToAlias ( $urlElement );
        }
    }
    return $urlElement;
}

You need my "text util" class which can be found at
http://ez.no/community/contributions/hacks/gwf_textutils

in gwf_TextUtils::convertToAlias

the main conversion is done with
$specialChars = array ( "à", "á", "â", "ã", "ä", "å", "æ", "è", "é", "ê", "ß", " ", "'", "&acute;", "`",
"ë", "Ç", "í", "ì", "ò", "ó", "ô", "õ", "ö", "ù", "ú", "û", "ü");
$normalChars = array ( "a", "a", "a", "a", "ae", "a", "ae", "e", "e", "e", "ss", "-", "", "", "",
"e", "c", "i", "i", "o", "o", "o", "o", "oe", "u", "u", "u", "ue");

So if you have additional characters that should be converted, you have to add them to both arrays.
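
One common way to apply two parallel arrays like these is str_replace, which replaces each entry of the first array with the entry at the same index of the second. The sketch below only illustrates that pattern with a shortened list; it is not the actual code of gwf_TextUtils::convertToAlias, and the function name `replaceSpecialChars` is hypothetical.

```php
<?php
// Shortened lists; both arrays must stay the same length and order.
$specialChars = array( "ä", "ö", "ü", "ß", " " );
$normalChars  = array( "ae", "oe", "ue", "ss", "-" );

function replaceSpecialChars( $text, $specialChars, $normalChars )
{
    // Replacements are applied one after another, in array order.
    return str_replace( $specialChars, $normalChars, $text );
}

echo replaceSpecialChars( "straße für bären", $specialChars, $normalChars ), "\n";
// prints "strasse-fuer-baeren"
```

The script file, the arrays and the input have to be in the same encoding for the matches to work.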

After doing the "hack", you have to run
update/common/scripts/updateniceurls.php

Kind regards,
Emil
(alias Georg :-)

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Gunnstein Lye

Wednesday 05 May 2004 8:12:15 am

I have made a locale-based fix for this that should work well with Unicode.

http://ez.no/community/contributions/hacks/url_alias_transliteration
