Non-ASCII characters in generated URLs


Erlend Halvorsen

Friday 09 November 2007 2:49:04 pm

Hi!

I've just installed eZ Publish 3.10.0, and I'm having some trouble with Norwegian characters. I've created a folder named Tilbehør, and the generated URL is http://somedomain.com/Tilbehør, which the browser then of course translates to /Tilbeh%F8r, which doesn't exist. If, however, I rewrite the URL by hand back to /Tilbehør, it's translated to /Tilbeh%C3%B8r, and that works. Seems to me to be some sort of UTF-8/ISO-8859-1 problem: %F8 is ø in ISO-8859-1, while %C3%B8 is ø in UTF-8. The site itself is encoded in ISO-8859-1. I've been screwed over by PHP and UTF-8 so many times now, I'm not doing that again.
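To make the mismatch concrete, here is a minimal Python sketch (not eZ Publish code, just an illustration using the standard library) that percent-encodes the same folder name under both charsets. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote

name = "Tilbehør"

# Percent-encode the same path segment under the two charsets involved.
latin1_path = quote(name, encoding="iso-8859-1")
utf8_path = quote(name, encoding="utf-8")

print(latin1_path)  # Tilbeh%F8r
print(utf8_path)    # Tilbeh%C3%B8r

# A server that expects UTF-8 cannot decode the ISO-8859-1 form:
# the byte 0xF8 is not valid UTF-8, so it becomes a replacement character.
misread = unquote(latin1_path, encoding="utf-8", errors="replace")
print(misread == name)  # False
```

The two encoded forms are different byte sequences, so a lookup keyed on one will not match a request that arrives as the other.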

Any ideas how to fix this?

Also, while we're on the topic, is it possible to have eZ Publish generate lowercase URLs?

-Erlend

Erlend Halvorsen

Saturday 10 November 2007 8:14:57 am

Ok, to answer my own question, adding

[URLTranslator]
TransformationGroup=urlalias

to settings/override/site.ini.append.php fixes the problem. Now the generated URL is /Tilbehoer, and /Tilbehør still works - perfect!

-Erlend

Ole Marius Smestad

Wednesday 28 November 2007 1:22:39 am

Hi Erlend,

I'm glad you found a solution which worked for your site. In 3.10.0 and in the 4.0.0 alpha and beta releases, the default URL transformation setting has been urlalias_iri, which as you've seen will include Unicode characters in the generated URLs.

For the final release of eZ Publish 4.0 and 3.10.1 we are considering changing the default to a more restrictive setting: either 'urlalias_compat' or 'urlalias'.

Do you and the rest of the community have any wishes in this regard?

For reference I am including a snippet from the feature documentation:

1. Only allow a restricted set of characters in the url, this means
   a to z, numbers and underscore. (This is the same behaviour as in
   3.9 and earlier.)

   The identifier for this is *urlalias_compat*

2. Allow more characters in the url, but still restrict it to the
   ASCII characters (with a few exceptions). Capitalization of words
   is now kept.

   The identifier for this is *urlalias*

3. Similar to #2 but allow all Unicode characters (with a few
   exceptions). This allows the text to be preserved as much as
   possible and is highly recommended for uni- or multi-lingual sites.
   The only changes to the text are removal of a few characters which
   are special to the urls on the Internet and trimming of multiple
   whitespaces to only one whitespace.
   It is recommended to use the utf-8 charset for the site when having
   this enabled (*i18n.ini*).

   The identifier for this is *urlalias_iri*

When the desired transformation is chosen, it must be configured in
*site.ini* by setting the TransformationGroup setting in the settings
group *URLTranslator* to contain the identifier of the chosen type,
e.g. if the third type was chosen::

  [URLTranslator]
  TransformationGroup=urlalias_iri

Advanced users might also want to take a look at *transform.ini* to
configure their own transformation group. Tweaking this file and adding
an extension to the transformation allows for full control over the
created URL aliases.

Note: #3 is referred to as IRI [1] (Internationalized Resource
      Identifiers) which is a specialization of URI/URL with Unicode
      support.

[1] http://www.w3.org/International/O-URL-and-ident.html


--
Ole Marius Smestad
Lead Engineer eZ Publish
Member of the Community Project Board

Peter Putzer

Wednesday 28 November 2007 2:16:04 am

I find 'urlalias' to be a reasonable compromise. Due to the way browsers encode Unicode characters, 'urlalias_iri' is not really an option IMHO.

However, I have an additional feature request: Please add an option to automatically generate urlalias_compat forwardings even for new objects. In practice, this makes URLs case-insensitive (but case-preserving) when using 'urlalias'. An example:

Object 'Über uns' (German for 'About Us')

Old urlalias (< 3.10): '/ueber_uns'
urlalias (>= 3.10): '/Ueber-uns'
urlalias_iri: '/Über uns'

If 'Über uns' already existed before running the upgrade script, '/ueber_uns' continues to work fine. However, if I create a new object 'Bla bla', only the new alias '/Bla-bla' is created. Using an old-style URL '/bla_bla' will not work.

This is a problem with URLs that are to be entered from memory (e.g. if you use them in printed media like leaflets). Changing the way URLs are generated should be as transparent as possible for users.
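To illustrate the request, here is a rough Python sketch of the two ASCII alias styles and the forwarding pair I have in mind. It only approximates the example outputs above; the transliteration table and all function names are hypothetical, and eZ Publish's real rules are defined in transform.ini, not here.

```python
import re

# Hypothetical, deliberately small transliteration table for the examples.
TRANSLIT = {
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss", "Ø": "Oe", "ø": "oe", "Å": "Aa", "å": "aa",
}


def to_ascii(name):
    """Replace known non-ASCII letters; unknown ones are left untouched."""
    return "".join(TRANSLIT.get(ch, ch) for ch in name)


def alias_compat(name):
    """Approximates 'urlalias_compat': lowercase a-z, digits, underscore."""
    text = to_ascii(name).lower()
    # Anything outside the allowed set (including untranslated letters)
    # collapses into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", text).strip("_")


def alias_ascii(name):
    """Approximates 'urlalias': ASCII only, capitalization kept."""
    text = to_ascii(name)
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def alias_with_forward(name):
    """Return (primary alias, compat alias that should forward to it)."""
    return "/" + alias_ascii(name), "/" + alias_compat(name)


print(alias_with_forward("Über uns"))  # ('/Ueber-uns', '/ueber_uns')
print(alias_with_forward("Bla bla"))   # ('/Bla-bla', '/bla_bla')
```

With a forwarding entry like the second element created for every new object, an old-style URL typed from memory would still reach the new alias.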

Accessible website starting from eZ publish 3.0 (currently: 4.1.0): http://pluspunkt.at

Erlend Halvorsen

Wednesday 28 November 2007 3:25:25 am

I can't really say I cared much for the Unicode URLs, since as soon as they are clicked the characters are converted to percent-encoded %g%a%r%b%a%g%e. This makes them hard to read, type, say, and remember. If one could come up with a solution where the generated URLs contained only ASCII characters, while still supporting UTF-8 characters in the URL (for instance for use in printed material), that would be the best solution in my opinion.

Update: Re-reading my own response, I see that this is exactly what urlalias does :) Now, if only I could get rid of those capitalized letters..
