~codegouvfr/codegouvfr-devel

problem running codegouvfr-fetch-data on Windows

Details
Message ID
<245775c365224704853e96b45c4bd2e5@sip.etat.lu>
Sender timestamp
1638345370
Hello, 

I have run some tests with codegouvfr-fetch-data on Linux and Windows.
On Windows, the script fails when the description of a repository contains emojis.
The problem seems to be an encoding issue: on Windows, the script tries to write its output files using CP-1252, which cannot represent these characters.

Here is an example of stack trace:
Traceback (most recent call last):
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\fetch.py", line 41, in <module>
    save_repos(all_repos)
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\storage.py", line 42, in save_repos
    save_data(data, "repo")
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\storage.py", line 31, in save_data
    w.writerows(set(zip(*data.values())))
  File "C:\Users\IFI774\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000130e1' in position 54: character maps to <undefined>

\U000130e1 is an emoji-like character (an Egyptian hieroglyph).
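
The failure can be reproduced in isolation, independently of the script. The following is a minimal sketch using only the standard library; the helper name can_encode is made up for illustration. CP-1252 is a single-byte code page with no mapping for U+130E1, whereas UTF-8 can encode any Unicode code point.

```python
# U+130E1 is the character reported in the traceback.
ch = "\U000130e1"


def can_encode(text, encoding):
    """Hypothetical helper: True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# CP-1252 has no slot for this code point; UTF-8 needs four bytes for it.
cp1252_ok = can_encode(ch, "cp1252")
utf8_bytes = ch.encode("utf-8")
print(cp1252_ok, len(utf8_bytes))
```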

I used Python 3.9, and the environment is MINGW64 / Bash ("Git Bash"):
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32

With the same configuration, I have no problem on Linux.

Could it be a configuration issue? Do you think there is a way to force the encoding of the output files to UTF-8 to avoid this issue?
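
One possible way to do this, sketched under assumptions: if the CSV files are opened with the built-in open(), passing encoding="utf-8" explicitly overrides the locale default (CP-1252 on many Windows setups). The helper save_rows, the file name and the sample row below are hypothetical and are not the project's actual save_data code.

```python
import csv
import os
import tempfile


def save_rows(path, header, rows):
    """Hypothetical helper (not the project's save_data): write rows as UTF-8 CSV."""
    # An explicit encoding avoids depending on the platform's locale default.
    # newline="" is what the csv module documentation recommends for writers.
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)


# Illustrative data only: one row whose description contains U+130E1.
path = os.path.join(tempfile.mkdtemp(), "repo.csv")
save_rows(path, ["name", "description"], [("demo", "glyph \U000130e1")])

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
```

Without touching the code, Python 3.7+ also offers a UTF-8 mode (environment variable PYTHONUTF8=1, or python -X utf8) that makes open() default to UTF-8; whether that suits this project is a judgment call.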

Thanks,

Alain Vagner
Référent accessibilité numérique
Division Open Data et accès à l'information

LE GOUVERNEMENT DU GRAND-DUCHÉ DE LUXEMBOURG
Service information et presse 

33, bd Roosevelt . L-2450 Luxembourg 
Tél. (+352) 247-82182 
E-mail : alain.vagner@sip.etat.lu
www.gouvernement.lu . www.luxembourg.lu
Details
Message ID
<87pmqgk8cr.fsf@data.gouv.fr>
In-Reply-To
<245775c365224704853e96b45c4bd2e5@sip.etat.lu>
Sender timestamp
1638352836
-- 
 Bastien Guerry
Details
Message ID
<875ys8cvps.fsf@data.gouv.fr>
In-Reply-To
<87pmqgk8cr.fsf@data.gouv.fr>
Sender timestamp
1638368127
Bastien Guerry <bastien.guerry@data.gouv.fr> writes:

> Alain Vagner <Alain.Vagner@sip.etat.lu> writes:
>
>> Do you think there is a way to
>> force the encoding of the output files to UTF-8 to avoid this issue?
>
> Probably - can you try this patch?

Fix confirmed, I just applied the patch.

Thanks!

-- 
 Bastien Guerry