~ireas/public-inbox


i18n Support

Message ID: <CAJuWSy+kXQp=RjigmFAXr0JyM_GLdL2GJ6isUSCeHbEkexjRdg@mail.gmail.com>
Hi,

I'm not sure whether this is just me or I'm missing something, but
genpdf seems to lack i18n support: I'm trying to use Chinese, and the
text does not show up in the output at all:

https://gist.github.com/HangingClowns/bed7737f5886f0660a001343f574adda
Message ID: <20210621070815.GA1149@ireas.org>
In-Reply-To: <CAJuWSy+kXQp=RjigmFAXr0JyM_GLdL2GJ6isUSCeHbEkexjRdg@mail.gmail.com>
Hi Allen,

I think there are two different aspects:

One aspect is using characters from non-Latin scripts.  This should
already work.  I’ll try to investigate why the characters are missing in
your example.

The other aspect is proper i18n that takes into account the writing
direction and other language-specific properties.  This is currently not
implemented in genpdf (other than hyphenation).  I’m happy to accept
patches that improve this, but it is not on my roadmap to implement that
myself in the near future.

Best,
Robin
Message ID: <CAJuWSyLJC_PK9qRAZ91pPx+PHSKtA5vofaewxjAayqf5YeK1qQ@mail.gmail.com>
In-Reply-To: <20210621070815.GA1149@ireas.org>
Hi Robin,

Good point, I totally forgot about RTL support. For me, it's just that
the characters aren't coming in at all. I'm more than happy to help with
testing. I'm still a bit of a beginner in Rust.

Message ID: <20210621072818.GB1149@ireas.org>
In-Reply-To: <CAJuWSyLJC_PK9qRAZ91pPx+PHSKtA5vofaewxjAayqf5YeK1qQ@mail.gmail.com>
Hi Allen,

I think I found the reason why your example doesn’t work:
	https://gist.github.com/HangingClowns/bed7737f5886f0660a001343f574adda#file-main-rs-L20
Uncomment that line and it prints the characters.  (There seems to be another
issue if the string is too long and can’t be split up using the naive
implementation – please try shorter strings first.)

/Robin
Message ID: <CAJuWSyJah-qwGnXrOVbQRG5KZ1evMTZUUWm+GnTpRVZ4HLzYbQ@mail.gmail.com>
In-Reply-To: <20210621072818.GB1149@ireas.org>
Shorter strings seem to help. Thanks!

Message ID: <CAJuWSyKp9NpRg4Ak-v2uyT-6Rnt3jAmJH0HDv-uqHUJ2XV0yiQ@mail.gmail.com>
In-Reply-To: <CAJuWSyJah-qwGnXrOVbQRG5KZ1evMTZUUWm+GnTpRVZ4HLzYbQ@mail.gmail.com>
To follow up: any ideas on how to do the text wrapping, or even a
workaround? It seems kind of strange that there's no text at all
instead of, say, the text just running off the page.

Message ID: <20210621111301.GA1027@ireas.org>
In-Reply-To: <CAJuWSyKp9NpRg4Ak-v2uyT-6Rnt3jAmJH0HDv-uqHUJ2XV0yiQ@mail.gmail.com>
On 2021-06-21 18:53:57, Allen Wyma wrote:
> to follow up: any ideas on how to do the text wrapping? or even a work
> around? Seems kind of strange that there's 0 text instead of say the
> text just traveling off the page?

There seems to be a bug because usually genpdf returns an error if the
content does not fit on the page.  Just breaking the string at some
arbitrary point might be a better solution, but I haven’t implemented
that yet.  I’ve created some tickets for that:
	https://todo.sr.ht/~ireas/genpdf-rs/54
	https://todo.sr.ht/~ireas/genpdf-rs/55
	https://todo.sr.ht/~ireas/genpdf-rs/56

Currently the wrapping logic is something like this:  First, split the
text into words at spaces.  If a word does not fit on the current line,
the next step depends on whether a hyphenator is set
(Document::set_hyphenator).  If one is set, genpdf tries to split the
word using that hyphenator.
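
The following is a minimal sketch of that kind of wrapping logic, not genpdf's actual code. It assumes widths can be counted in characters (a real layout measures glyph widths), and the names `wrap`, `Splitter`, and `naive_split` are hypothetical. `naive_split` only stands in for a hyphenator by cutting the word wherever it fits. Letting an unsplittable word overflow the line is a choice made for this sketch only.

```rust
/// Hypothetical stand-in for a hyphenator: given a word and the room left
/// on the line, return a head (ending in '-') and the remaining tail.
type Splitter = dyn Fn(&str, usize) -> Option<(String, String)>;

/// Naive splitter for illustration: cuts anywhere, ignoring syllables.
fn naive_split(word: &str, room: usize) -> Option<(String, String)> {
    let len = word.chars().count();
    if room < 2 || len < 2 {
        return None;
    }
    // Leave one column for the hyphen and keep the tail non-empty.
    let take = (room - 1).min(len - 1);
    let head: String = word.chars().take(take).collect();
    let tail: String = word.chars().skip(take).collect();
    Some((format!("{}-", head), tail))
}

/// Greedy wrapping: split at spaces, and fall back to the splitter (if
/// any) when a word does not fit on the current line.
fn wrap(text: &str, width: usize, splitter: Option<&Splitter>) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    for raw in text.split(' ').filter(|w| !w.is_empty()) {
        let mut word = raw.to_string();
        loop {
            let used = line.chars().count();
            let sep = usize::from(used > 0); // one space between words
            if used + sep + word.chars().count() <= width {
                if sep == 1 {
                    line.push(' ');
                }
                line.push_str(&word);
                break;
            }
            // The word does not fit: try to split it into the remaining room.
            let room = width.saturating_sub(used + sep);
            let split = splitter
                .and_then(|s| s(&word, room))
                .filter(|(head, _)| !head.is_empty());
            if let Some((head, tail)) = split {
                if sep == 1 {
                    line.push(' ');
                }
                line.push_str(&head);
                lines.push(std::mem::take(&mut line));
                word = tail;
            } else if used == 0 {
                // Nothing to split at, even on an empty line: overflow.
                lines.push(std::mem::take(&mut word));
                break;
            } else {
                // Start a new line and retry the same word there.
                lines.push(std::mem::take(&mut line));
            }
        }
    }
    if !line.is_empty() {
        lines.push(line);
    }
    lines
}
```

In this sketch, a Chinese string without spaces is a single "word": calling `wrap("電話電話電話", 4, None)` has no break opportunity to use and returns the whole string as one overlong line.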

Unfortunately, I have no idea how the concept of word breaks and
hyphenation applies to Chinese.  Could you give me a short summary?

genpdf currently uses the hyphenation crate [0] for word splitting.  It
seems to support Chinese [1], so you can try to experiment with that.
We are probably migrating to the textwrap crate [2] soon, which also
uses the hyphenation crate.

[0] https://lib.rs/crates/hyphenation
[1] https://docs.rs/hyphenation/0.8.3/hyphenation/enum.Language.html#variant.Chinese
[2] https://lib.rs/crates/textwrap

It would be very helpful for me if you could prepare a small example
that shows how to properly split Chinese text using the hyphenation
or textwrap crate (without genpdf, maybe based on this example [3]).
Let me know if you have any questions.

[3] https://github.com/mgeisler/textwrap/blob/master/examples/hyphenation.rs

Best,
Robin
Message ID: <CAJuWSyLrYFqNaRfo3ra1nQBVrPE3UjJdxEee-YrdA73YwKtk7g@mail.gmail.com>
In-Reply-To: <20210621111301.GA1027@ireas.org>
Hi Robin,

Thanks for the detailed message. Basically, for Chinese, breaking after
any character is fine. There's no concept of "words", just
characters. Although a word like telephone is written as 電話, it's okay
to break after 電 because seeing the character 話 afterwards makes it
clear that they belong together.

Message ID: <20210621115102.GC1027@ireas.org>
In-Reply-To: <CAJuWSyLrYFqNaRfo3ra1nQBVrPE3UjJdxEee-YrdA73YwKtk7g@mail.gmail.com>
On 2021-06-21 19:17:00, Allen Wyma wrote:
> thanks for the detailed message: basically for chinese, you can break
> after each character is fine. There's no concept of "words", just
> characters. Although a word like telephone is written as 電話, it's okay
> to break after 電 because seeing the character 話 afterwards makes it
> clear that they belong together.

Thanks for the explanation!  So the problem is my naive approach of
splitting words at spaces.  The proper solution would be to use the
Unicode word splitting algorithm, but I’ll have to do some research on
how to properly apply that.  I have some other tasks that I want to
finish first.
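
To illustrate the difference between splitting at spaces and allowing a break after every Chinese character, here is a rough sketch. It is not the Unicode algorithm itself: a full solution would follow the Unicode line breaking rules (UAX #14), which also cover punctuation and many more scripts. The function names are hypothetical, and only the CJK Unified Ideographs block (U+4E00 to U+9FFF) is handled.

```rust
/// Returns true for code points in the CJK Unified Ideographs block.
/// This is a deliberately small subset of what UAX #14 classifies.
fn is_cjk_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Splits text into pieces that must not be broken further:
/// whitespace-separated words, except that every CJK ideograph
/// becomes a piece of its own.
fn break_pieces(text: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace ends the current word and is dropped.
            if !current.is_empty() {
                pieces.push(std::mem::take(&mut current));
            }
        } else if is_cjk_ideograph(c) {
            // An ideograph ends any pending word and stands alone.
            if !current.is_empty() {
                pieces.push(std::mem::take(&mut current));
            }
            pieces.push(c.to_string());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        pieces.push(current);
    }
    pieces
}
```

The sketch drops the information about whether two pieces were separated by a space, so a real wrapper would have to track that in order to join Latin words with a space and Chinese characters without one.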

In the meantime, may I suggest the following workaround:  If a
Paragraph is composed of multiple StyledStrings, it inserts word
boundaries between these strings.  Since Chinese text may be broken
after every character, you could replace a paragraph built from a
single string:

    let cn_text = genpdf::elements::Paragraph::new(cn_para);

by a paragraph with one string per character:

    let cn_text: genpdf::elements::Paragraph = cn_para
        .chars()
        .map(String::from)
        .map(genpdf::style::StyledString::from)
        .collect();
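
As a standard-library-only illustration of the splitting step in this workaround (without genpdf, and with a hypothetical helper name), the sketch below shows what `.chars().map(String::from)` produces. It splits at Unicode scalar values, which is what is wanted for Chinese characters, but it would also separate a base letter from a combining mark, so it is probably best applied only to text where every character is a valid break point.

```rust
/// Splits a string into one owned `String` per Unicode scalar value,
/// mirroring the `.chars().map(String::from)` step of the workaround.
fn one_string_per_char(text: &str) -> Vec<String> {
    text.chars().map(String::from).collect()
}
```

For "電話" this yields the two strings "電" and "話", even though the input is six bytes long in UTF-8.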

/Robin
Message ID: <20210621193121.GD1027@ireas.org>
In-Reply-To: <20210621115102.GC1027@ireas.org>
I’ve experimented a bit with possible solutions both for the case where
a long string cannot be fitted into a line and for the wrong word
splitting for Chinese text.  It is not entirely trivial to fix these
with the current wrapping system.  I’d rather not do that work right now
as we are migrating to textwrap soon, which will fix both issues.

My current plan is to finish the textwrap migration after the next
release v0.3.0.  I hope this is acceptable for you.

/Robin