Hello,
I have run some tests with codegouvfr-fetch-data on Linux and Windows.
On Windows, I ran into problems when emojis are present in the description of a repository.
The problem seems to be an encoding issue: on Windows, the script tries to write its output files using CP-1252.
Here is an example stack trace:
Traceback (most recent call last):
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\fetch.py", line 41, in <module>
    save_repos(all_repos)
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\storage.py", line 42, in save_repos
    save_data(data, "repo")
  File "C:\Users\IFI774\Projects\codegouvfr-fetch-data\storage.py", line 31, in save_data
    w.writerows(set(zip(*data.values())))
  File "C:\Users\IFI774\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000130e1' in position 54: character maps to <undefined>
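\U000130e1 is an Egyptian hieroglyph (strictly speaking not an emoji, but used like one in the repository description). CP-1252 has no mapping for this character.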
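
The error can probably be reproduced on any platform without running the whole script. The following is only a minimal sketch: the sample description text and the helper name can_encode are made up for illustration and are not taken from codegouvfr-fetch-data.

```python
# Minimal reproduction sketch. The description text below is hypothetical;
# only the character \U000130e1 comes from the stack trace.
description = "Sample repository description \U000130e1"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# CP-1252 has no mapping for this character, UTF-8 does.
print(can_encode(description, "cp1252"))
print(can_encode(description, "utf-8"))
```

If this matches what happens inside the script, the character itself is fine and only the codec chosen for the output file is the problem.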
I used Python 3.9, and the environment is MINGW64 / Bash ("Git Bash"):
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
With the same configuration, I have no problem on Linux.
Could it be a configuration issue? Do you think there is a way to force the encoding of the output files to UTF-8 to avoid this issue?
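
For instance, I assume (without having checked the details of storage.py) that save_data opens the CSV file with the built-in open() and no explicit encoding, in which case Python falls back to the locale encoding, which is CP-1252 on my Windows machine. A sketch of what I have in mind is below. The function name save_rows, the file name and the sample data are hypothetical, not the project's actual code.

```python
import csv
import locale
import os
import tempfile

# Encoding that open() uses when no `encoding` argument is given
# (unless Python's UTF-8 mode is enabled).
print(locale.getpreferredencoding(False))


def save_rows(path, header, rows):
    """Write rows to a CSV file, forcing UTF-8 regardless of the platform."""
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.writer; encoding="utf-8" overrides the locale default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)


# Hypothetical usage: write one row containing the problematic character
# to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "repo.csv")
    save_rows(csv_path, ["name", "description"], [("demo", "test \U000130e1")])
    with open(csv_path, encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))

print(len(read_back))
```

Alternatively, without touching the code, setting the environment variable PYTHONUTF8=1 (or running with python -X utf8) enables Python's UTF-8 mode, which should make UTF-8 the default for open(). I have not tested this with the script.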
Thanks,
Alain Vagner
Digital accessibility officer
Open Data and Access to Information Division
LE GOUVERNEMENT DU GRAND-DUCHÉ DE LUXEMBOURG
Service information et presse
33, bd Roosevelt . L-2450 Luxembourg
Tel. (+352) 247-82182
E-mail: alain.vagner@sip.etat.lu
www.gouvernement.lu . www.luxembourg.lu