Documentation/Labs/VTK-String
Problem
Currently, the encoding of strings in VTK is not explicitly specified. When a string is received from an external library or passed to an operating system call (e.g., when reading or writing files), the behavior is often incorrect.
For example:
- files that have non-ASCII characters in their name cannot be opened
- when the application locale is changed (so that some necessary special characters can be stored in a single byte), generated files may become invalid (e.g., because the decimal point is replaced by a decimal comma)
- Python and Qt store strings with a known encoding, but there is no way to convert them to/from VTK strings without loss of information
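The decimal comma problem can be reproduced without VTK. The following is a minimal sketch in standard C++; the CommaDecimal facet and the Format* helpers are hypothetical names that only mimic what a locale with a decimal comma would do to a stream.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that mimics a locale using a decimal comma.
struct CommaDecimal : std::numpunct<char>
{
protected:
  char do_decimal_point() const override { return ','; }
};

// Format a value with the given locale imbued on the stream.
std::string FormatWith(const std::locale& loc, double value)
{
  std::ostringstream out;
  out.imbue(loc);
  out << value;
  return out.str();
}

// What a writer produces if it inherits a decimal-comma locale.
std::string FormatLocalized(double value)
{
  // The locale object takes ownership of the facet.
  return FormatWith(std::locale(std::locale::classic(), new CommaDecimal), value);
}

// What a file format that expects a decimal point needs.
std::string FormatClassic(double value)
{
  return FormatWith(std::locale::classic(), value);
}
```

A writer that does not pin its streams to the classic locale emits the first form, which a reader expecting a decimal point will most likely reject or misparse.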
There is a vtkUnicodeString class in VTK that stores a string with a known encoding. It can store and provide strings in UTF-8 and UTF-16 encoding. It is already used extensively in text rendering, arrays, tables, and certain file export/import code, but the majority of VTK still uses const char* and get/set macros.
Using const char* for string storage, managing memory with vtkSetStringMacro/vtkGetStringMacro, and processing strings with C string functions are all very outdated programming practices. VTK-based applications must either follow VTK's approach and remain stuck with these outdated practices, or break away from it, live with inconsistencies in the code base, and manage strings cautiously (additional conversions and null-pointer checks are needed). Neither option is good.
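To illustrate the difference, here is a simplified sketch that contrasts manual const char* storage with a std::string member. LegacyReader only approximates the kind of code a set/get string macro generates; it is not the actual macro expansion, and both class names are hypothetical.

```cpp
#include <cstring>
#include <string>

// Approximation of the legacy pattern: raw buffer, manual ownership.
class LegacyReader
{
public:
  LegacyReader() = default;
  ~LegacyReader() { delete[] this->FileName; }
  LegacyReader(const LegacyReader&) = delete;
  LegacyReader& operator=(const LegacyReader&) = delete;

  void SetFileName(const char* arg)
  {
    // Nothing to do if both are null or both hold the same text.
    if (this->FileName == nullptr && arg == nullptr) { return; }
    if (this->FileName && arg && std::strcmp(this->FileName, arg) == 0) { return; }
    delete[] this->FileName;
    this->FileName = nullptr;
    if (arg)
    {
      const std::size_t n = std::strlen(arg) + 1; // include terminator
      this->FileName = new char[n];
      std::memcpy(this->FileName, arg, n);
    }
  }
  // May return nullptr, so every caller has to check.
  const char* GetFileName() const { return this->FileName; }

private:
  char* FileName = nullptr;
};

// Same attribute with a value-type string: no ownership code, no null state.
class ModernReader
{
public:
  void SetFileName(const std::string& arg) { this->FileName = arg; }
  const std::string& GetFileName() const { return this->FileName; }

private:
  std::string FileName;
};
```

Neither variant records the encoding of the stored bytes, which is the gap the proposal below addresses.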
It would be possible to state that "all strings in VTK are UTF-8 encoded", but it would be hard to enforce this requirement throughout VTK and in all classes of VTK-based applications. It would be especially difficult to ensure that strings are transcoded whenever data is exchanged between VTK and other libraries and whenever VTK calls system APIs. It could also have a performance impact: if every string had to be converted to UTF-8 before being stored in VTK, then storing a non-UTF-8 string would always require an extra encoding step on the way in and a decoding step on the way out.
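The extra encode/decode step can be sketched as follows for a Latin1 string that has to pass through UTF-8 storage. Latin1ToUtf8 and Utf8ToLatin1 are hypothetical helper names, and the decoder assumes well-formed UTF-8 input; characters outside Latin1 are replaced by '?'.

```cpp
#include <string>

// Encode: every Latin1 byte >= 0x80 becomes a two-byte UTF-8 sequence.
std::string Latin1ToUtf8(const std::string& latin1)
{
  std::string utf8;
  utf8.reserve(latin1.size());
  for (unsigned char c : latin1)
  {
    if (c < 0x80)
    {
      utf8.push_back(static_cast<char>(c));
    }
    else
    {
      utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
      utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return utf8;
}

// Decode: only code points up to U+00FF fit; anything else becomes '?'.
std::string Utf8ToLatin1(const std::string& utf8)
{
  std::string latin1;
  for (std::size_t i = 0; i < utf8.size(); ++i)
  {
    const unsigned char c = static_cast<unsigned char>(utf8[i]);
    if (c < 0x80)
    {
      latin1.push_back(static_cast<char>(c));
    }
    else if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size())
    {
      // Lead bytes 0xC2 and 0xC3 cover U+0080 to U+00FF.
      const unsigned char next = static_cast<unsigned char>(utf8[++i]);
      latin1.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
    }
    else
    {
      latin1.push_back('?');
      // Skip the continuation bytes of the unrepresentable character.
      while (i + 1 < utf8.size() &&
             (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80)
      {
        ++i;
      }
    }
  }
  return latin1;
}
```

Both functions walk and copy the whole string, so a UTF-8-only rule would add this cost to every set and get of a Latin1 string.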
See some more discussion here: https://discourse.vtk.org/t/proposal-should-we-replace-vtkstdstring-with-std-string/796/14
Proposal
Design
Use an encoding-aware string class in VTK to store all strings.
vtkUnicodeString is a good starting point, as it can store any string with known encoding and it is already in VTK, used in a number of VTK classes. vtkDICOMFilePath contains useful conversion code, too.
It could be renamed to vtkString to make the name shorter. The name is also clearer if it does not include a particular encoding (in the future we might support multiple encodings inside the string class, not necessarily just Unicode). This would also be consistent with how other libraries manage strings (see for example Qt's QString).
Maybe vtkDICOMFilePath should be added to VTK for storing file paths (as vtkFilePath, possibly with vtkString as its parent class). File paths need some extra features, such as conversion of slashes and handling of extended paths (\\?\) on Windows.
Plan
- Rename vtkUnicodeString to vtkString. Maybe improve the API by adding get-as/set-from Latin1 methods.
- Replace all string attributes in VTK classes with vtkString (add new get/set macros, create the object instance in the constructor)
- Update Python wrapping
- Add converters to/from QString
- Maybe add automatic converters to const char* (which can be disabled by CMake flags) to make updating application code easier
- Review all operating system calls and make sure strings are properly converted
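The shape of such a class can be sketched in a few lines. SketchString and the SKETCH_STRING_IMPLICIT_CSTR macro below are hypothetical stand-ins (the macro plays the role of a CMake-controlled option); this is not the vtkUnicodeString API, and the UTF-16 conversion assumes the stored bytes are well-formed UTF-8.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical encoding-aware string: stores UTF-8, converts on request.
class SketchString
{
public:
  static SketchString FromUtf8(std::string utf8)
  {
    SketchString s;
    s.Storage = std::move(utf8);
    return s;
  }

  const std::string& Utf8() const { return this->Storage; }

  // Decode each UTF-8 sequence to a code point, then emit UTF-16 units.
  std::u16string Utf16() const
  {
    std::u16string out;
    std::size_t i = 0;
    while (i < this->Storage.size())
    {
      const unsigned char lead = static_cast<unsigned char>(this->Storage[i]);
      // Number of continuation bytes implied by the lead byte.
      const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
      std::uint32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
      for (std::size_t k = 1; k <= extra && i + k < this->Storage.size(); ++k)
      {
        cp = (cp << 6) | (static_cast<unsigned char>(this->Storage[i + k]) & 0x3F);
      }
      i += extra + 1;
      if (cp < 0x10000)
      {
        out.push_back(static_cast<char16_t>(cp));
      }
      else
      {
        // Code points above the BMP need a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
      }
    }
    return out;
  }

#ifdef SKETCH_STRING_IMPLICIT_CSTR
  // Optional migration aid: lets legacy code keep passing the object
  // where a const char* is expected. Off unless the macro is defined.
  operator const char*() const { return this->Storage.c_str(); }
#endif

private:
  std::string Storage;
};
```

Keeping the implicit conversion behind a build option lets application code compile unchanged at first, then be switched to explicit conversions once the option is turned off.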