Opened 12 years ago

Closed 10 years ago

#182 closed task (fixed)

UTF-8

Reported by: ken
Owned by: clfs-commits@…
Priority: minor
Milestone: CLFS Standard 1.2.0
Component: BOOK
Version: CLFS Standard GIT
Keywords:
Cc:

Description

Couldn't find a ticket for this, so starting a new one as an aide-memoire.

If people want to use UTF-8 (and so far, there seems a lack of consensus), the assumption is that it should be optional. So far, I've been using it for a couple of years or so, and I'm aware of at least the following additions (there are probably others):

  1. for glibc add libidn. Now that glibc no longer gets releases, I'm going to try this with upstream libidn (v1.9), but I haven't yet.
  2. for ncurses --enable-widec so that we build the ...w versions and remove/replace the non-wide versions, similar to LFS (ISTR the detail is slightly different for how to do this on multilib).
  3. perhaps a note that if procps fails to compile in a UTF-8 system, check what you did to ncurses.
  4. for groff, optionally sed characters U+2010, U+2018, U+2019 and U+2212 to ASCII characters more likely to be found in common screen fonts, as in LFS.
  5. for man, convert the message files from various legacy encodings to UTF-8, and similarly the supplied non-English man pages (apropos, makewhatis, etc). I don't know if any other core packages need this; the problem for each package is to find a message that has been translated, and work out how to generate that error so it can be tested to see if the translation appears or if a legacy encoding appears.
  6. follow man by groff-utf8 and sed man.conf to use it.
  7. alter vim to put UTF-8 pages (fr, it, pl, ru) into the language directory instead of fr.UTF-8 etc. My notes say that russian otherwise goes into ru.KOI8-R but I don't apparently do any recoding, so that needs to be checked again - certainly, with vim-7.1 I've got UTF-8 pages installed.
  8. At the moment, I don't think there are any UTF-8 pages shipped in any of the core packages. Shadow used to have loads, but those seem to have been dropped when debian rescued it. Perhaps we should have something a bit like what is in LFS explaining how to recode pages, but with the presumption that anyone doing this will be recoding to UTF-8. Maybe also a note that support for non-alphabetic scripts in groff-utf8 is not perfect - sometimes there are error messages about fitting the text to the line, e.g.

<standard input>:51: warning [p 1, 2.3i]: cannot adjust line

This applies particularly to japanese, but maybe also to chinese or korean (I can only trigger it for japanese). Doing the recoding of the man files apparently means that 'man' cannot use legacy encodings (e.g. latin2, koi8r) - even latin1 might have oddities.
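The recoding mentioned for man pages and message files can be sketched with iconv. This is only a minimal illustration, not the invocations actually used for man-1.6f: the temporary directory, the sample file and the `recode_to_utf8` helper are all made up for the example, and the source encoding has to be known in advance for each real page.

```shell
# Hypothetical scratch directory; a real run would point at the man tree instead.
pages_dir=$(mktemp -d)

# One sample "page" containing e-acute as the single Latin-1 byte 0xE9 (octal 351).
printf 'caf\351\n' > "$pages_dir/sample.1"

# Illustrative helper (not part of any package): recode one file in place.
recode_to_utf8() {
    src_enc=$1   # encoding the file is currently in, e.g. ISO-8859-1
    page=$2      # file to convert
    # Write to a temporary file first so a failed conversion leaves the original intact.
    if iconv -f "$src_enc" -t UTF-8 "$page" > "$page.utf8"; then
        mv "$page.utf8" "$page"
    else
        rm -f "$page.utf8"
        return 1
    fi
}

recode_to_utf8 ISO-8859-1 "$pages_dir/sample.1"

# Show the resulting bytes: the e-acute is now the two-byte sequence c3 a9.
od -An -tx1 "$pages_dir/sample.1"
```

Running the helper twice on the same file would double-encode it, so any real script needs to record which pages have already been converted.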

Note that man pages in UTF-8 alphabetic languages work in the console, provided you have a suitable font. For chinese, japanese, korean you need a graphical display - rxvt-unicode works, I assume gnome-terminal does too.

We would also need some explanation of why to use this (easy - supports multiple languages on screen at the same time, rather than just a number of neighbouring languages, and handles "fancy quotes" sometimes found in english pages, e.g. from smartmontools), and alternatively why to not use it (perhaps, for people who have a large amount of text in legacy encodings, or who need to use legacy encodings).

Discussion about the "should we do this" part on -dev, please.

Change History (36)

comment:1 Changed 12 years ago by ken

Re item 5, man: apparently, current versions of man-db expect man pages to be in UTF-8, and will convert them if they detect a legacy encoding. Also, man is now pretty much out on its own in using the obsolete catgets, so most packages will not need the message files to be recoded. It could be worth looking at current ubuntu and debian - apparently the non-English man pages there are converted to UTF-8.

comment:2 Changed 12 years ago by Joe Ciccone

Re item 5, I'd much rather see man & a utf8 groff than man-db.

comment:3 in reply to:  description Changed 12 years ago by ken

Replying to ken:

  1. for glibc add libidn. Now that glibc no longer gets releases, I'm going to try this with upstream libidn (v1.9), but I haven't yet.

in fact, this isn't a formally-released version of glibc, so the libidn part is already there, no need for an add-on.

comment:4 in reply to:  3 Changed 12 years ago by Joe Ciccone

Milestone: CLFS Standard 1.2.0
Version: CLFS Standard SVN

Replying to ken:

Replying to ken:

  1. for glibc add libidn. Now that glibc no longer gets releases, I'm going to try this with upstream libidn (v1.9), but I haven't yet.

in fact, this isn't a formally-released version of glibc, so the libidn part is already there, no need for an add-on.

When I created the 2.8 tarball I checked out the glibc_2_8 tag from cvs and created a tarball (after touching a few files to ensure proper timestamps). It may be a good thing that I left libidn in, then.

comment:5 Changed 12 years ago by Joe Ciccone

For #2, ncurses: I was looking at the different ways of doing this. From what I can tell, the way that causes the fewest compatibility issues is to compile two sets of ncurses libraries, one without widec and one with. I *might* be able to tackle this in the near future. Time will tell.

comment:6 Changed 12 years ago by willimm

I would consider replacing Man with Man-DB. Any comments?

comment:7 Changed 12 years ago by Joe Ciccone

Using Man-DB is not an option for me. A patched groff will let man work just fine.

comment:8 Changed 12 years ago by Jim Gifford

Man-DB is not acceptable. LFS implemented UTF-8 too quickly; even Alex has said it was not done properly. I don't want us to run into the same issues. Yes, Ryan, Joe, and myself want to support utf-8. But we want it done within the same packages we are currently using today.

comment:9 Changed 12 years ago by Jim Gifford

A solution is scheduled to be released on Dec 28th, 2008.

http://lists.gnu.org/archive/html/groff/2008-12/msg00013.html

comment:10 Changed 12 years ago by Joe Ciccone

New NCurses WideC Builds are in the book as of r4253. This is the start.

comment:11 Changed 12 years ago by Jim Gifford

I've been adding in the utf-8 support in various commits. Working on changing the way we currently build ncurses right now.

Ken, I would appreciate it if you could test the new groff and man capabilities to make sure they work as expected.

I've also been playing around with making the build UTF-8 only altogether, i.e. having glibc install only the UTF-8 locales. Any comments on this idea?

comment:12 in reply to:  11 Changed 12 years ago by ken

Replying to jim:

I've been adding in the utf-8 support in various commits. Working on changing the way we currently build ncurses right now.

Ken, I would appreciate it if you could test the new groff and man capabilities to make sure they work as expected.

I've also been playing around with making the build UTF-8 only altogether, i.e. having glibc install only the UTF-8 locales. Any comments on this idea?

Yeah, I saw you were busy! I've added the testing to my To Do List - will be a week or two before I get round to it.

For removing other locales, I'm not so sure. Aside from people who use other locales such as koi8, there might be an impact on test suites, either now or in the future. My own builds only support utf-8, but I install all locales and in fact I don't know of any convenient way of installing ALL the utf-8 locales except by installing every locale.

For glibc's own testsuite, looking at 2.8's localedata/Makefile I get the impression the LOCALES variable might create all the locales needed (if not cross-compiling), but every other package can add new test cases with extra locales at any time.

comment:13 Changed 12 years ago by Jim Gifford

Joe came up with a sed. Run this before you switch directories to compile glibc:

sed -i -e '/\(SUPPORTED\|UTF-8\)/!d' \
    -e '/zu_ZA/s/\\$//' localedata/SUPPORTED
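To see what a sed of this shape does, it can be tried on a small stand-in for localedata/SUPPORTED. The sketch below rests on assumptions: the sample file is invented and far shorter than the real one, and the second expression is taken to strip the trailing continuation backslash from the zu_ZA line, which becomes the last entry once the non-UTF-8 lines are deleted. It relies on GNU sed (`-i` and `\|`).

```shell
# Scratch copy so nothing in a real glibc tree is touched (illustrative paths).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/localedata"

# Invented sample in the same layout: a header line, then one locale per line,
# each ending in a continuation backslash.
cat > "$demo_dir/localedata/SUPPORTED" <<'EOF'
# illustrative sample
SUPPORTED-LOCALES=\
aa_DJ.UTF-8/UTF-8 \
aa_DJ/ISO-8859-1 \
zu_ZA.UTF-8/UTF-8 \
zu_ZA/ISO-8859-1 \
EOF

# Keep only the header and the UTF-8 entries, then drop the backslash that
# would otherwise dangle at the end of the last remaining line.
sed -i -e '/\(SUPPORTED\|UTF-8\)/!d' \
    -e '/zu_ZA/s/\\$//' "$demo_dir/localedata/SUPPORTED"

cat "$demo_dir/localedata/SUPPORTED"
```

On the sample, three lines survive: the `SUPPORTED-LOCALES=\` header and the two UTF-8 entries, with no continuation backslash after the zu_ZA one.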

comment:14 Changed 12 years ago by Jim Gifford

Also, there is a new bootscript that replaces console. It's in the bootscripts repo. Here are links to it.

For /etc/rc.d/init.d http://svn.cross-lfs.org/svn/repos/bootscripts/trunk/standard/clfs/init.d/i18n

For /etc/sysconfig http://svn.cross-lfs.org/svn/repos/bootscripts/trunk/standard/clfs/sysconfig/i18n

Install cp init.d/i18n /etc/rc.d/init.d chmod 754 /etc/rc.d/init.d/i18n cp sysconfig/i18n /etc/sysconfig rm -f /etc/rc.d/rcsysinit.d/S70console ln -sf ../init.d/i18n /etc/rc.d/rcsysinit.d/S70i18n

comment:15 Changed 12 years ago by Jim Gifford

Forgot the wiki formatting. Install:
cp init.d/i18n /etc/rc.d/init.d
chmod 754 /etc/rc.d/init.d/i18n
cp sysconfig/i18n /etc/sysconfig
rm -f /etc/rc.d/rcsysinit.d/S70console
ln -sf ../init.d/i18n /etc/rc.d/rcsysinit.d/S70i18n

comment:16 Changed 12 years ago by ken

I had 14 failures in glibc/localedata tests. Possibly related to the sed for UTF-8, don't know. Haven't really looked at the other testsuites yet, beyond a quick glance at gcc (failures in mudflap, I think - similar to 4.3.2) and binutils (more failures than before, probably because this is x86_64-64). Maybe, if I can find the enthusiasm to rebuild the final system without that sed, I might do so to see what changes in the tests.

The variables in the i18n script were mostly new to me - I've never encountered people adding extra partial maps, I thought normal maps already included things like the euro - but not a problem. I do have a problem with defaulting to the 'windowkeys', but that isn't major and is easily altered by the user. I'll also note that the Compose key in my local keymap doesn't seem to be working on this build - the dead keys for latin1-compatible accents are still working, but I forgot to test the things I've added to AltGr.

On to the main test of man pages: Perhaps I'm missing something, e.g. maybe groff now needs some sort of override or post-install configuration to use the new functionality? For the moment, my test results in everything except en_GB.UTF-8 are labelled as "unusable".

I wanted to test the following areas:

possible unrendered text in the error message for 'man foo' because man-1.6f provides localized messages in legacy encodings. I have some (long) iconv invocations that I normally use to fix this.

possible unrendered text for 'man apropos' and the other pages supplied with man-1.6f in legacy encodings. Again, I have some iconv invocations to fix this.

man-pages for vim might need to be moved to be found (fr,it,pl,ru).

man-pages from shadow (typically, tested with passwd (5)) are supplied as UTF-8. The scripts (ja, ko, zh_*) can't be rendered in the console, and in the event I didn't think it was worthwhile to build xorg because of the errors I found.

Results (testing with LC_ALL=whatever man something):

'gibberish' means the text rendered, typically as a mixture of accented vowels and occasional icelandic letters, but was clearly nonsensical for the language.

'invalid UTF-8' means one or more inverse '?' symbols, caused by invalid UTF-8 characters arriving at the pager.

'<xx>' means the pager displayed these hex values in inverse video.

These were all tested using my (UTF-8) sigma-general 'font', which can handle the main latin and cyrillic languages as well as monotonic greek.

bg_BG.UTF-8 man foo invalid UTF-8, man 1 apropos gibberish

cs_CZ.UTF-8 man foo invalid UTF-8, man 1 apropos invalid UTF-8, man 5 passwd mix of invalid UTF-8 and gibberish

da_DK.UTF-8 man foo ok (all ascii), man 1 apropos ok

de_DE.UTF-8 man foo ok, man 1 apropos ok, man 5 passwd gibberish and invalid UTF-8

el_GR.UTF-8 man foo invalid UTF-8, man 1 apropos gibberish

es_ES.UTF-8 man foo invalid UTF-8, man 1 apropos and man 5 passwd <C3><93> etc

fi_FI.UTF-8 similar to es_ES.UTF-8

fr_FR.UTF-8 man foo ok (ascii), man 1 apropos, man 1 vim, man 5 passwd <C3><A9> etc. Interestingly, if I point man to fr_FR.ISO8859-1/man1/vim.1 it renders correctly.

At this point I gave up. I guess I can fix the 'man foo' issue, but the rest is so broken I don't know where to begin looking.
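One way to narrow down results like these is to check whether a page on disk is valid UTF-8 at all before blaming the renderer. The following is a rough sketch, assuming iconv is available; the `is_valid_utf8` helper and the two sample files are invented for the illustration.

```shell
# Illustrative helper: succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    # iconv exits non-zero when it meets a byte sequence that is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

check_dir=$(mktemp -d)
# "résumé" encoded as UTF-8 (e-acute = c3 a9, written in octal as 303 251).
printf 'r\303\251sum\303\251\n' > "$check_dir/good.1"
# The same word in Latin-1 (e-acute = single byte e9, octal 351).
printf 'r\351sum\351\n' > "$check_dir/legacy.1"

for page in "$check_dir"/*.1; do
    if is_valid_utf8 "$page"; then
        echo "$page: valid UTF-8"
    else
        echo "$page: not valid UTF-8"
    fi
done
```

A check like this only separates legacy-encoded files from UTF-8 ones. A page that has been converted twice is still valid UTF-8 and would pass, even though it renders as gibberish.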

comment:17 Changed 12 years ago by Jim Gifford

Ken, I have confirmed your findings. I have been reading about the preconv feature of groff, and wondering if we need to implement it in some fashion with man. Will update this ticket with any findings.

comment:18 in reply to:  17 Changed 12 years ago by willimmn

Suggestion: Use Man-DB. It's maintained, has better UTF-8 support, etc. Please use that instead of man.

comment:19 Changed 12 years ago by Jim Gifford

Man-db is not an option for CLFS. Man-DB would require significant changes to the book, and those changes are unwarranted. Berkeley DB would be adding unnecessary bloat to a standard system. Distros have used just man for this; we just need to get the right combination of configuration and patches.

comment:20 in reply to:  19 Changed 12 years ago by willimmn

Replying to jim:

Man-db is not an option for CLFS. Man-DB would require significant changes to the book, and those changes are unwarranted. Berkeley DB would be adding unnecessary bloat to a standard system. Distros have used just man for this; we just need to get the right combination of configuration and patches.

Well, there is another way to use Man-DB without having to add BDB. It is GDBM, GNU's db system.

Why not copy and paste the CBLFS instructions into a new GDBM page? (BTW, GDBM is the default for Man-DB.)

comment:21 Changed 12 years ago by Jim Gifford

Where is the technical merit in adding this to the book? There is none.

comment:22 Changed 12 years ago by Jim Gifford

A few of us have done some testing. With the current i18n script we can support various locales. The problem seems to be with the encodings of the man pages. We will have to address these as they come up. But the base for UTF-8 support in CLFS is solid. I do believe there is more work to do on this topic; a lot of it is documenting, in the /etc/sysconfig/i18n script, how to specify the options for locales.

Also, the more exotic locales need to go into either the cblfs wiki or clfs hints, for support outside CLFS development.

comment:23 in reply to:  21 Changed 12 years ago by willimmen

Replying to jim:

Where is the technical merit in adding this to the book? There is none.

What the heck are you talking about?!

comment:24 Changed 12 years ago by Jonathan

willimmen, In case you haven't noticed your previous accounts have been removed. Maybe you should take that as a sign.

You can not suggest that something should be put into the book without doing research into it. Installing Man-DB requires dependencies that are not already in the book. As Jim said, significant changes would have to be made in order to get Man-DB into the book. As it is, Man supports most UTF-8 locales at the moment and we are working on the rest.

Also please note that CLFS and LFS are not related apart from by name. You can not make suggestions simply because somebody else uses it. Please do more research before commenting on the book.

comment:25 in reply to:  24 Changed 12 years ago by willimmen

Replying to Cosmo:

willimmen, In case you haven't noticed your previous accounts have been removed. Maybe you should take that as a sign.

Then DON'T EVER REMOVE ANY OF MY ACCOUNTS AGAIN!!!!!

You can not suggest that something should be put into the book without doing research into it. Installing Man-DB requires dependencies that are not already in the book. As Jim said, significant changes would have to be made in order to get Man-DB into the book. As it is, Man supports most UTF-8 locales at the moment and we are working on the rest.

The only package that also needs to be installed is GDBM, and that is pretty simple to compile, unlike BDB. The only other change, apart from the Man-DB switch itself, is to rework the book's Man page (the book page, not the UNIX manpages, mind you) to look like LFS's man-db page. And why Man-DB instead of Man? Lots of reasons: man is abandoned upstream, Man-DB needs no patch to support UTF-8, supports bzip2- and lzma-compressed man pages (very nifty for saving disk space), needs no silly seds for building, etc, etc.

Also please note that CLFS and LFS are not related apart from by name. You can not make suggestions simply because somebody else uses it. Please do more research before commenting on the book.

Then I now know.

comment:26 Changed 12 years ago by willimmen

And yes, I did put a lot of research into it.

comment:27 Changed 12 years ago by Jim Gifford

Then post the research instead of saying "because I say so". The CLFS requirements for UTF-8 are no additional packages and no downgrading of packages. Stay within these guidelines and show proof that it works, along with test cases, and I will consider putting it in the book.

comment:28 in reply to:  27 Changed 12 years ago by willimmen

Replying to jim:

Then post the research instead of saying "because I say so". The CLFS requirements for UTF-8 are no additional packages and no downgrading of packages. Stay within these guidelines and show proof that it works, along with test cases, and I will consider putting it in the book.

Well, even I admit, there is ONE new package for CLFS to support Man-DB: GDBM, which is lightweight and simple to compile, unlike BDB. Groff doesn't have to be downgraded any more, thanks to preconv support in Man-DB. But there are two minor things:

  1. Man-DB doesn't normally think Groff 1.20.1 is multibyte capable, resulting in messed up man pages. That is a Man-DB issue, reported upstream, and for now, use --enable-mb-groff.
  2. Still, many screen fonts don't have Unicode single dashes and quotes in them, meaning the ASCII equivalents are needed. This sed worked for Groff 1.18.1.4:
sed -i -e 's/2010/002D/' -e 's/2212/002D/' \
    -e 's/2018/0060/' -e 's/2019/0027/' font/devutf8/R.proto

...but not for 1.20.1, because the format has changed. Something needs to be figured out about this.

And yes, I will show test cases, with a modified CLFS SVN X86_64 Multilib system, with Man-DB instead of Man.

comment:29 Changed 12 years ago by Jim Gifford

You didn't follow our requirements. No additional packages, which means no man-db. Work within the development guidelines.

comment:30 in reply to:  29 Changed 12 years ago by willimmen

Replying to jim:

You didn't follow our requirements. No additional packages, which means no man-db. Work within the development guidelines.

I KNOW, but GDBM is a simple package.

As for compilation instructions, they are here:

./configure --prefix=/usr &&
make &&
make BINOWN=root BINGRP=root install

The BINOWN and BINGRP settings for make install override the ownership of the GDBM files, setting it to root instead of bin.

You see why it's simple? But for compatibility with older apps, you may need to install the NDBM and DBM compatibility headers:

make BINOWN=root BINGRP=root install-compat

comment:31 Changed 12 years ago by willimmen

comment:32 Changed 12 years ago by Jim Gifford

Your proposal is rejected. Reason: not following guidelines of no additional packages added to CLFS build.

Please do not post further on this.

Jim Gifford - Lead Developer

comment:33 Changed 12 years ago by ken

Partial update on this:

(i) The book (man-1.6f) still mentions "-Tlatin1" but we set it to "-Tutf8" in the i18n patch. However, since the current rendering is so broken, I'm not willing to alter that.

(ii) If I change the NROFF entry in man.conf to "-c -Tutf8" then the latin1 manpages (N.W. European pages from man-1.6f, also 'man top') render correctly.

Note that for 'man top' without this change, section 3a has a lot of '<B4>' - with the change, they seem to be some sort of fancy quotes so the result might vary depending on your console font (I use sigma-consolefonts, of course!).
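The man.conf change in (ii) could be scripted along the following lines. This is only a sketch on a scratch file: the sample NROFF line is an assumption about what an unpatched man.conf contains, and a real file may already carry -Tutf8 or other options, in which case the pattern would need adjusting. GNU sed is assumed for `-i`.

```shell
# Work on a scratch copy rather than the real /etc/man.conf (illustrative path).
conf_dir=$(mktemp -d)

# Assumed sample entry; the real line in man.conf may differ.
printf 'NROFF\t\t/usr/bin/nroff -Tlatin1 -mandoc\n' > "$conf_dir/man.conf"

# On the NROFF line only, swap the latin1 output device for "-c -Tutf8".
sed -i 's/^\(NROFF[[:space:]].*\) -Tlatin1/\1 -c -Tutf8/' "$conf_dir/man.conf"

grep '^NROFF' "$conf_dir/man.conf"
```

Anchoring the pattern on `^NROFF` keeps the substitution away from other entries in the file that may also mention an output device.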

I don't regard this as an adequate fix - the UTF-8 pages from shadow (that is, ALL the man pages from shadow, except English) are still trashed, as are all other non-latin1 pages (e.g the bulgarian and latin2 pages from man-1.6f). With the previous version of groff plus groff-utf8 these can all be rendered (if iconv'd to UTF-8) so for me this is still a regression.

Also, on a UTF-8 console the messages (e.g. "No entry for foo in section N of the manual") need to be converted to UTF-8, except for the German messages which are already in UTF-8. I found that in what will be fc11 - hadn't noticed it earlier, but I guess it explains why on an older system with all-UTF-8 and man-1.6f I get

$ LC_ALL=de_DE.UTF-8 man foo
Keine Handbuchseite für foo

- that is, f, A-tilde, half, r instead of f, u-umlaut (or diaeresis), r.
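The symptom described here is what appears when text that is already UTF-8 gets converted to UTF-8 a second time on the assumption that it is latin1. A small sketch of the effect, using only printf, iconv and od (the variable names are made up for the illustration):

```shell
# "für" as UTF-8: u-umlaut is the two bytes c3 bc (octal 303 274).
utf8_word=$(printf 'f\303\274r')

# Treat those bytes as Latin-1 and convert "again": each of the two bytes
# becomes its own character, so the word grows from 4 bytes to 6.
misread=$(printf '%s' "$utf8_word" | iconv -f ISO-8859-1 -t UTF-8)

# Compare the byte sequences before and after.
printf '%s' "$utf8_word" | od -An -tx1
printf '%s' "$misread" | od -An -tx1
```

The first dump shows `66 c3 bc 72`; the second shows `66 c3 83 c2 bc 72`, where `c3 83` is A-tilde and `c2 bc` is the stray fraction sign that follows it.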

So far, I see no way of fixing the non-latin1 pages.

Trying man-db is next on my To Do list. The current version supposedly makes it all just work (and gets rid of recoding messages from the obsolete catgets form). I'll be sorry to drop man-1.6f if man-db works, man has been a good friend to me, and last time I looked man-db didn't have the more exotic pages (e.g. no bulgarian) but man is no longer maintained. Of course, testing man-db may show other issues.

Non-alphabetic scripts (chinese, japanese, korean) are another matter - they need xorg, and on my most recent build I can still see no point in building xorg until I can fix the text console - I suspect that man-db will not support these at all.

Meanwhile, I still regard UTF-8 in trunk as broken.

comment:34 Changed 12 years ago by ken

OK. There was a small issue with the man-db testsuite (see an LFS ticket if you care) - the maintainer has fixed it and all the alphabetic pages I can find to test it with render well on a UTF-8 terminal, whether they are in UTF-8 or legacy encodings (no need for separate directories for different character sets, provided the pages are in the expected encoding or UTF-8). The pages for script languages look good, but I can't make a comment on whether they are "correct".

The only additional dependency is gdbm. At the moment, I haven't worked out where that should fit in the build order.

For my own builds I'm replacing man-1.6f by man-db.

Just out of interest, how did dhcpcd and xz-utils get past the "no new packages" rule ?

comment:35 in reply to:  34 Changed 11 years ago by willimm

Replying to ken:

OK. There was a small issue with the man-db testsuite (see an LFS ticket if you care) - the maintainer has fixed it and all the alphabetic pages I can find to test it with render well on a UTF-8 terminal, whether they are in UTF-8 or legacy encodings (no need for separate directories for different character sets, provided the pages are in the expected encoding or UTF-8). The pages for script languages look good, but I can't make a comment on whether they are "correct".

The only additional dependency is gdbm. At the moment, I haven't worked out where that should fit in the build order.

For my own builds I'm replacing man-1.6f by man-db.

Just out of interest, how did dhcpcd and xz-utils get past the "no new packages" rule ?

Count me stuck with my old options, but with Ken backing me, I can safely say that Man-DB should be in CLFS.

comment:36 Changed 10 years ago by Joe Ciccone

Resolution: fixed
Status: new → closed

Resolving this ticket. By waiting it out, the packages in the base system are now UTF-8 compatible. If a specific package has an issue, please open an individual ticket for that package.
