THE DIGITAL CURATION OF THE ROMANIAN INTERWAR NOVEL (1920-1940)

In recent years, Romanian literary studies has taken one of its major methodological turns toward distant reading, using quantitative analysis, computational analysis, or both. While quantitative analysis employed lexicographical instruments such as dictionaries and literary chronologies, computational analysis approached the issue from a “data-rich” historical perspective (Katherine Bode), while also attempting to build a digital corpus adapted to computational methods. This paper surveys the main research projects that deal with the computational analysis of Romanian literature in general and the Romanian novel in particular. The first part offers a succinct state of the art of past and ongoing endeavours concerned with digital approaches to the study of Romanian literature, their initial findings and their potential. The second part takes a more theoretical approach to some of the key concepts related to data-supported literary history. Finally, the last part tackles the main challenges of developing a digital corpus of a local literature and the shortcomings related to this literature’s “locality” in terms of computational approaches and compatibility with the tools developed by Western research projects.

Novel from Archive to Canon) reunites qualitative formal instruments (mainly genre theory) and computational tools (with a heavy emphasis on NLP technologies) in order to investigate the relationship between the literary canon of the period 1845-1947 and the "literary archive" of the era.
Before the appearance of these projects, however, the Romanian critical space went through a period of theoretical debate and metacritical reflection. Tracing the emergence and development of computational studies in the West, Alex Goldiș published one of the first articles concerned with Digital Humanities and quantitative studies. This early intervention is rightfully cautious about the possibility of importing distant reading methodologies into the study of Romanian literature: in the absence of a comprehensive digital literary archive, the type of macroanalysis proposed by Jockers is all but impossible to emulate. The following year, Mihaela Ursa asked whether Romanian culture is ready for "the digital turn", outlining the country's cultural and institutional shortcomings and the local academe's lack of flexibility toward intermediality and interdisciplinarity. Ursa states that terms such as multi-, post-, pluri-, trans-, and interdisciplinarity have become interchangeable and, thus, superfluous, and opts for Digital Humanities, which is inherently multi-, post-, pluri-, trans-, and interdisciplinary, as a means to overcome this handicap (Patraș et al 17-31, see also Nicolaescu and Mihai). Later studies built upon these first reflections (Bâlici 54-71, Olaru 30-37, Gârdan and Modoc 52-65) after the first autochthonous forays into distant reading had appeared in the country.

Archive, corpus, canon. Digital Humanities in the age of curation
In one of the pamphlets (Algee-Hewitt et al) later published in the volume Canon/Archive, the researchers at Stanford Literary Lab make an essential distinction between what they call "the published" (i.e., the total production of a given culture), "the archive" (i.e., what was conserved out of the total production) and "the corpus" (i.e., what made it into their final sample collection). "[T]he fundamental horizon of all quantitative work" (Algee-Hewitt 2), the published is a literary sample that can never be completely surveyed through direct means, but can be accessed through its preserved metadata (by way of lexicographical instruments, for instance).
The archive, on the other hand, is the sum total of literary works that were preserved in libraries, collections, and archives. It is the basis of any given corpus, the lowest common denominator of the three, and it invariably implies a process of selection, since the archive (in this particular sense) cannot be readily accessible at all times.
The difficulties surrounding the need to equalise the three types of collections are pertinently explored by the researchers at Stanford Literary Lab, who worked with a corpus of 4000 English novels published between 1750 and 1880, a relatively small number compared to the archive, let alone the published. Moreover, they also had to face a selection bias in the corpus, with significantly more gothic novels than historical ones and large quantitative differences across different timeframes. Thus, in order to compensate for the bias, Stanford Literary Lab opted, in their experiment, for a sample of 674 novels (Algee-Hewitt et al 2-3). Setting aside the difficulties related to the data mining of this corpus, one of the authors' conclusions in this case study was that any quantitative approach to a corpus of this magnitude relies heavily on the interaction between multiple institutions and research collectives, a notion that is, to this day, quite foreign to the Romanian humanist. In the case of the Romanian collection of novels accessible in digital format, the archive and the corpus present a much more generous ratio: 80% of the novels published in book form between 1845 and 1947 are currently available via the Astra Data Mining corpus, according to the coordinators (Baghiu et al, Geografia 1933-1947 1). In principle, this solves the selection bias encountered by the Stanford Literary Lab researchers in their corpus. The issue of bias is replaced, however, by that of the relation between the corpus and the canon, because the Romanian literary canon is highly restrictive, giving way to a "great unread" exponentially greater than any found in a culture with a larger literary production. All at once, the corpus made available by Astra Data Mining brings an abundance of data, which requires the ability to sort through the newly found knowledge in order to use it more effectively.
For this, I propose a typological approach to the curation of the Romanian novel using the cultural common denominators of the Romanian novel (i.e. the established canon) and a temporal frame that can be used as the milestone for the birth and peak [...] Another crucial aspect of my selection is that it presupposes an agreement about the specific subgenre classes in question. This is not as simple as it seems, since the only (somewhat) consistent source of traditional genre labelling in Romania remains DCRR 3. The only aspect that simplifies the question of label adequacy is the general consensus between the canonical novels and their respective labels, since one will be hard pressed to find scholars that do not deem Liviu [...] sentimental novels, and 50 social novels. The three subgenres that have been capped completely dominate the novelistic production and have quite an imbalanced canon-to-archive ratio, but in turn feature a very diverse selection of authors. This shows that, at least in part, subgenres that can permeate different groups of writers (specialized, amateur) and different groups of readers (high-, middle-, and low-brow alike) have the best quantitative representation in the field. Because most of the subgenres that display a high degree of canonicity rarely reach this number, my selection amounts to 331 novels out of a total of 859 published in the period (according to DCRR). With a coverage just shy of 40% of the total production, my selection is useful in covering the part of the archive that was most engaged in the dynamic cultural field of the interwar period.
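The capping logic and the coverage figure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy titles, and the cap of 2 used in the example are hypothetical and do not reproduce the actual DCRR-based selection. Only the totals 331 and 859 come from the text.

```python
from collections import defaultdict


def cap_by_subgenre(novels, cap):
    """Keep at most `cap` novels per subgenre, preserving input order.

    `novels` is a list of (title, subgenre) pairs; both the record shape
    and the function name are assumptions made for this sketch.
    """
    kept = []
    counts = defaultdict(int)
    for title, subgenre in novels:
        if counts[subgenre] < cap:
            counts[subgenre] += 1
            kept.append((title, subgenre))
    return kept


def coverage(selected, total):
    """Share of the total production covered by the selection."""
    return selected / total


# Hypothetical toy metadata, not real DCRR entries.
toy_novels = [
    ("Novel A", "social"),
    ("Novel B", "social"),
    ("Novel C", "social"),
    ("Novel D", "sentimental"),
    ("Novel E", "historical"),
]

sample = cap_by_subgenre(toy_novels, cap=2)
print(len(sample))  # 4: the third "social" novel is dropped by the cap

# Figures reported in the text: 331 selected novels out of 859 published.
print(round(coverage(331, 859) * 100, 1))  # 38.5
```

Capping only the dominant subgenres keeps them from swamping the smaller, more canonical ones, at the cost of a selection that no longer mirrors the raw proportions of the archive.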

An abundance of data, a lack of tools
Following the main digitization projects mentioned above, Romanian culture has quickly gained access to an abundance of data. Before that, scholars relied heavily (and many still do) on lexicographical proxies in order to compensate for the lack of data with metadata. Subgenres, years of publication, publishing houses, the main cities housing most of the Romanian editorial landscape, etc. were systematically used to survey the Romanian novel in its absence. Even now, the archive still needs work. As yet, no project has been able to deliver a sufficiently consistent corpus of annotated novels or a sufficiently encompassing structured dataset.
It is an endeavour that takes a great deal of sustained collective effort and an even greater effort to raise awareness that digitization brings with it an urgent need for collective research groups and strong relational ties between institutions.
The limits of current efforts to employ computational analysis on a Romanian corpus are obvious: the lack of an encoded, machine-readable corpus, the limited implementation of NLP tools for writing with a high degree of historical character, and the absence of markup standards for Romanian (whether procedural, presentational or descriptive). One of the more basic instruments for computational textual analysis that also features NLP support for Romanian is TXM (Textométrie). However, the tool is highly dependent on a stable form of language and the presence of encoded text, as it is primarily a linguistics tool, capable of lexical operations such as progressions, concordances or co-occurrences. More complex operations such as topic modelling or sentiment analysis, methods that feature unsupervised data collocation, automated semantic clustering or machine learning integration, are all but impossible to implement at this time. Other methods that do not necessarily depend on NLP integration (since they rely on normalized word frequencies/z-scores), such as stylometry (computational stylistics), show some promise (see Modoc and Gârdan 48-63) 5, but nevertheless require further research in order to confirm their validity on Romanian-language corpora.
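To make the reliance of stylometry on normalized word frequencies and z-scores concrete, the following is a minimal sketch of Burrows's Delta, a standard stylometric distance, in plain Python. It is offered as an illustration only and is not claimed to be the procedure used in the study cited above. The three toy texts, the three-word stop-word list and all function names are hypothetical. Note that no lemmatizer, tagger or other NLP component is involved, only whitespace tokenization.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a text."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def z_scores(freq_table):
    """Column-wise z-scores; each row of freq_table is one text."""
    columns = list(zip(*freq_table))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(row, mus, sigmas)]
        for row in freq_table
    ]


def burrows_delta(z_a, z_b):
    """Mean absolute difference between two texts' z-score profiles."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Hypothetical toy texts and stop-word list, for illustration only.
vocab = ["și", "de", "în"]
texts = [
    "și el și ea de la munte în sat",
    "și ea și el de la mare în oraș",
    "de la de la de la în în sat",
]

table = [relative_freqs(t, vocab) for t in texts]
z = z_scores(table)

# The first two texts use the stop-words in identical proportions.
print(burrows_delta(z[0], z[1]) == 0)  # True
# The third text has a different stop-word profile.
print(burrows_delta(z[0], z[2]) > 0)   # True
```

In practice the vocabulary would be the few hundred most frequent words of the whole corpus rather than three hand-picked ones, and the pairwise distances would feed a clustering or network visualization.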
In conclusion, the domain of Romanian literary studies is still in its very early stages when it comes to employing methods of computational analysis. It cannot be overstated how large a part future institutional strategies and policies will play in the continued establishment of Digital Humanities in the Romanian academe. While the projects discussed above may seem modest in scope, they are indicative of a paradigm shift in the field of literary studies. It is expected that in the next few years even the shortcomings mentioned earlier will be overcome. The extent to which Digital Humanities will effect change in the autochthonous academic field will depend on the future outcomes that build upon these humble beginnings.
Finally, it should be remembered that, numbers-wise, the currently available digital archive of Romanian novels has a degree of coverage impossible to achieve in any Western culture. Perhaps for the first time in the history of literary studies,

5 In this article, we have not used a balanced corpus, nor was it representative of the entire literary production. Representing 13% of the period's production and with a 50% presence of canonical works, our study merely intended to test the merits and/or limits of stylometry on a literary corpus without NLP support. In some cases, authors marked as stylistically special by traditional literary historiography do end up in an isolated position within the network, while, in other cases (Camil Petrescu, Mircea Eliade or G. Călinescu), they are a great deal more closely linked to the rest of the authors. We believe that these results have the potential to reopen some debates concerning personal style using formal markers such as stop-words.