An Experiment in Natural Language Processing, Machine Learning, and Islamic Law :: Part 1 ::

 

By Mairaj Syed

Project Description and Goals

As I briefly indicated in my previous blog post, a fundamental desideratum for the field of Islamic law and ethics is a corpus of texts whose argumentation has been fully mined: conclusions would be distinguished from premises, and the premises categorized according to type of argument. The types of arguments could range from interpretations of scripture, to citations of canons, to empirical claims about the world, to analogy, just to name a few. Doing this manually would involve a large, complex, and coordinated endeavor. The amount of labor involved could be cut down substantially by applying recent advances in machine learning, natural language processing, and the burgeoning area of argument mining. This post explores the viability of applying recent advances in natural language processing to search for and discover the citation of canons in works of Islamic positive law (fiqh), which would be a small step toward the creation of a fully annotated corpus of Islamic legal texts. Given the experimental nature of this exploratory research task, which involved much trial and error, I decided to document some of the mistakes I made, the challenges I faced, and how I tried, sometimes in vain, to overcome them. You can find the Python notebook documenting the sequence of Python commands and scripts that ultimately yielded positive results here and you can watch a video walking through the notebook here.

At first glance, discovering the citation of canons in works of positive law does not seem too difficult. In fact, one possible solution would be to simply search for the canon in already digitized collections, such as those provided in al-Maktaba al-Shāmila, Maktaba Ahl al-Bayt, or al-Jāmiʿ al-Kabīr. But searching for each canon in potentially hundreds of volumes, one text at a time, through what are often clunky search interfaces, is time-consuming. This can be automated to a certain extent because of the recent standardization and digital publication of Islamic texts by the OpenITI initiative. One may simply write a script that searches all the texts in question and delivers the results in a more organized form, so that scholars can evaluate them more efficiently. But even this more automated approach has the fundamental drawback of returning only exact matches. If one were to search for the doubt canon, “tudraʾ al-ḥudūd bi al-shubuhāt”, only the passages that matched exactly that phrase would be discovered. Yet this canon has been phrased in a variety of ways by different authors through the long history of Islamic law. Moreover, something akin to its meaning, even when not encapsulated in a pithy formula, is probably present in a variety of texts, especially early ones that were authored before the canon acquired a maxim-like formulation.
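The limitation of exact matching can be seen in a few lines of Python. This is only an illustrative sketch, not the script used in this project: the function name `exact_matches` and the two sample passages are made up for the example, and the second passage is simply one possible rewording of the canon.

```python
def exact_matches(query, passages):
    """Return the positions of the passages that contain the query verbatim."""
    return [i for i, passage in enumerate(passages) if query in passage]

# The doubt canon, followed by two invented passages:
# the first quotes it word for word, the second rewords it.
canon = "تدرأ الحدود بالشبهات"
passages = [
    "قال تدرأ الحدود بالشبهات",
    "ادرؤوا الحدود بالشبهات",
]

hits = exact_matches(canon, passages)
print(hits)  # only the verbatim quotation is found
```

The reworded passage is missed even though it expresses the same idea, which is the gap that semantic similarity search is meant to close.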

Advances in NLP and ML allow us to search for texts that are semantically similar and not just identical to the one queried. Many of the algorithms developed to solve this problem are now readily available in easily deployable modules, especially in the Python programming language. Although these modules could initially only handle English or Chinese texts, more recently technology firms and open-source researchers have started adding the ability to search for semantically similar phrases to other languages, including Arabic.

Google has recently introduced an experimental cloud service that allows one to quantify the extent to which texts are similar to each other and can perform the operation on Arabic texts. Because the service is experimental at the moment, it is offered at no cost, on the condition that one’s proposed project is accepted.[1] When I saw a news article describing the service a couple of months ago, I decided to try my luck and applied to the program. Google got back to me within a couple of days of receiving my application and approved it.

At a basic level, the service receives two texts and returns a score representing the extent of the semantic similarity between them – the higher the number, the more similar the texts are to each other. Potentially, we can do this with a given canon and a corpus of fiqh texts, appropriately divided up into small, manageable chunks. After we receive a similarity score for the canon and each chunk of the fiqh corpus, we can do a sort on the similarity score, and the texts most similar to the canon will show up at the top. As a test, I initially sent the Google Semantic Similarity Service (GSSS), the following texts (the “input” would be the canon, while “candidate” would be text searched against):[2]

In the example above, I chose the intentions canon “inna-mā al-aʿmāl bi al-niyyāt (actions are judged by intentions)” and sent a random assortment of phrases, some of which I thought were semantically more similar to it than others. The results were promising. In the first batch, it correctly rated the phrase “al-umūr bi-maqāṣidihā (matters are to be considered according to their purposes)” as more semantically similar (≈0.45) to the intentions canon than the basmala (≈0.15)[3] and the doubt canon (≈0.22). The same thing can be noticed in the second batch. Broadly speaking, GSSS is able to separate semantically similar phrases from those that are not. GSSS rated the phrase “niyyat al-muʾmin khayr min ʿamalihi (the believer’s intention is better than his action)” as most similar to the intentions canon, then “al-umūr bi-maqāṣidihā”, and then the phrase I invented, “inna al-afʿāl bi-maqāṣidihā.” Intuitively, though, I would have expected the last phrase to be the most semantically similar to the intentions canon. Regardless, I thought the results were promising enough to pursue a more substantial experiment.
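The score-and-sort workflow can be sketched in Python as follows. Since the GSSS documentation is only available to accepted applicants, the scorer here is a deliberately crude stand-in (word overlap, which is not a semantic model at all); `toy_similarity` and `rank_candidates` are hypothetical names, and the English phrases are used only to keep the example readable.

```python
def toy_similarity(a, b):
    """Stand-in scorer: share of words the two texts have in common (Jaccard).
    This is NOT semantic similarity; it only gives the sort something to sort."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def rank_candidates(canon, candidates, score=toy_similarity):
    """Score every candidate against the canon and sort, highest score first."""
    scored = [(score(canon, candidate), candidate) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

canon = "actions are judged by intentions"
candidates = [
    "matters are judged by their purposes",
    "in the name of God",
    "actions are judged by intentions alone",
]

ranked = rank_candidates(canon, candidates)
for score, text in ranked:
    print(round(score, 2), text)
```

Swapping a real similarity service in for `toy_similarity` would leave the ranking step unchanged, which is why the scorer is passed in as a parameter.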

Text Selection and Preparation

After consultation with Dr. Intisar Rabb and Dr. Mariam Sheibani, we decided to search for two canons in Qarāfī’s (d. 684/1285) work of positive law, the Treasure of Mālikī Rulings (al-Dhakhīra fī Furūʿ al-Mālikīyya). The canons were the evidence canon (al-bayyina ʿalā al-muddaʿī wa al-yamīn ʿalā man ankara/البينة على المدعي واليمين على من أنكر/the plaintiff has the burden of proof and the defendant has the option to take an oath), and the doubt canon (tudraʾ al-ḥudūd bi al-shubuhāt/ تدرأ الحدود بالشبهات/punishments should be suspended in the presence of doubt). We selected the Treasure, because Qarāfī was the author of both a collection of canons and a work of positive law.[4] Moreover, we know that Qarāfī cited canons in his work of positive law, because he said so in the introduction to his work on canons:

I wrote this book specifically to cover legal canons, adding many canons that are not in the Dhakhīra, and including all those that appear in the Dhakhīra in simplified and clarified terms. In the Dhakhīra, I was attempting to draw on the abundance of foundational-source references (kathrat al-naql) for substantive law rulings (furūʿ), following the mode of writing substantive law [works]. I preferred not to combine [that form] with the simpler forms of works outlining legal permissions (mubāḥāt) and legal canons (qawāʿid). So, I wrote a separate book, given that [understanding] the canons from [the Dhakhīra alone] would otherwise be difficult.[5]

Since OpenITI has mined, cleaned, and structured thousands of Islamic texts found in other digital repositories and formatted them so that they can be easily manipulated in programming languages such as Python, I used the two versions of the Treasure available there.[6] One was drawn from the version found in the popular Islamic text repository al-Maktaba al-Shāmila, and the other from the Islamic digital library al-Jāmiʿ al-Kabīr. I had initially thought these versions of the Treasure were similar enough for our purposes that an investigation of their differences was unwarranted. This turned out to be incorrect. I initially decided to work with the version drawn from al-Maktaba al-Shāmila, simply because it appears to be the largest contributor of texts to OpenITI and because I was much more familiar with its texts than with those found in al-Jāmiʿ al-Kabīr. I also decided to use Python in a Jupyter notebook to prepare the texts, send them to Google’s semantic similarity service (GSSS), and receive the scores back.[7] Figuring out how all of this worked together involved many hours of trial and error, as is the nature of these projects, not to mention my own relative unfamiliarity with working in Python.

The most difficult part of the experiment consisted of preparing the list of pairs I would send to GSSS. How do I divide up the Treasure so that I can send GSSS a list of pairs, each consisting of two texts: the canon (say, the doubt canon) and a segment from the Treasure that may contain it? The instructions from GSSS were not at all clear. First, they did not numerically define how short or long the candidate texts needed to be, noting only that they can range from short phrases to no longer than a paragraph, without quantifying what they considered too long. Second, they suggested not sending the service more than 1000 pairs at a time, cautioning that sending more could time out the service.

A second ambiguity about how to divide up the Treasure stemmed not from GSSS, but from the difficulty of selecting a non-arbitrary way to partition the text into smaller components. Would those smaller components be phrases, sentences, or paragraphs? Ideally, we would want to partition it into fragments that are as small as possible under two constraints: they should not be smaller than the canon itself, and they should not lose semantic integrity, by which I mean their ability to convey something like a single idea. Sub-sentence phrases would probably be ideal, but there was no symbol (like a comma) that we could easily rely upon to identify them in the Treasure. If I were dealing with an English-language text, sentences would be the next best candidate for partition. They retain semantic integrity (they convey one idea), are usually relatively short, and can be easily located in a text because the period marks their ending. But the period as a convention to demarcate sentences was not one that premodern Islamic scholars used. The introduction of periods into modern edited versions of classical Islamic texts varies from one editor to the next, and the editor of the Treasure seems to have used them sparingly. I therefore decided to partition the Treasure by paragraphs, which thankfully were formally indicated in OpenITI’s versions by the hashtag sign (#).[8] Here’s a screenshot of what the beginning of a typical OpenITI text file looks like:

I ended up dividing the whole of the Treasure into 9,678 paragraphs.[9] However, I noticed the length of each paragraph varied widely. Some could be just 4-5 characters long, others could be as long as 52,120 characters.[10]
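A rough sketch of the paragraph split, under simplifying assumptions: every paragraph line begins with "# ", and a paragraph that wraps onto another line continues after "~~". The function name `split_paragraphs` and the sample string are invented for illustration; a real OpenITI file would also need its metadata header and page-number markers filtered out, which is not shown here.

```python
def split_paragraphs(raw_text):
    """Rejoin '~~' continuation lines, then collect lines that start with '# '."""
    # Assumption: a wrapped paragraph continues on the next line after '~~'.
    joined = raw_text.replace("\n~~", " ")
    paragraphs = []
    for line in joined.split("\n"):
        if line.startswith("# "):
            paragraphs.append(line[2:].strip())
    return paragraphs

# Made-up sample in the assumed layout, not an excerpt from the Treasure.
sample = "# first paragraph that runs\n~~onto a second line\n# second paragraph"

paragraphs = split_paragraphs(sample)
lengths = [len(p) for p in paragraphs]
print(len(paragraphs), "paragraphs; shortest", min(lengths), "longest", max(lengths))
```

Checking the shortest and longest lengths right after splitting is a cheap way to spot the kind of extreme variation described above before anything is sent to a service.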

In addition to dividing the Treasure into paragraphs, each paragraph needed to be stripped of any characters that would detract from the core goal of capturing the semantic similarity between two texts, the main determinant of which are words in a particular sequence. This meant I had to get rid of any punctuation (e.g. ‘.’, ’!’, ’:’, etc.) and any characters OpenITI introduced in their own notation of the texts, such as pound signs and tildes. This step, often referred to as “cleaning a text”, though relatively easy to perform programmatically, is immensely important for many natural language processing tasks.
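One way to do this kind of cleaning in Python is with a regular expression that keeps only Arabic letters and whitespace. This is a sketch under an assumption of my own, namely that the Unicode range U+0621 to U+064A is an adequate definition of "Arabic characters" for this purpose; that range drops vowel marks along with punctuation, which may or may not be desirable. The names `NON_ARABIC` and `clean` are hypothetical.

```python
import re

# Anything that is not a basic Arabic letter (U+0621-U+064A, which also
# includes the tatweel) or whitespace. Vowel marks, Arabic and Latin
# punctuation, digits, '#', '~' and '|' all fall outside the range.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

def clean(text):
    """Replace unwanted characters with spaces, then collapse repeated spaces."""
    stripped = NON_ARABIC.sub(" ", text)
    return " ".join(stripped.split())

# The evidence canon, wrapped in the kind of markup and punctuation to be removed.
sample = "# البينة على المدعي، ~~واليمين على من أنكر."
cleaned = clean(sample)
print(cleaned)
```

Replacing unwanted characters with a space rather than deleting them avoids gluing two words together when a symbol such as "~~" sits between them.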

After cleaning the paragraphs, I had 9,678 paragraphs that consisted only of words composed of Arabic characters, but of widely varying lengths. I was skeptical that this could work, especially because of the presence of the large paragraphs. I was also curious about how long it would take GSSS to return scores for all the paragraphs. Nevertheless, I persisted. I recited the basmala and attempted to send the paragraphs, 1000 at a time, to GSSS to see whether it could process the texts and return similarity scores in a timely manner. The service timed out. I kept reducing the number of pairs I sent the service until I was sending just 2 pairs. Most times I would receive a response; sometimes I would not. Each response took about 20-30 seconds to arrive. Given that I could not always get a response, and that even when I did, it came only after 20-30 seconds, I determined that it was impractical to send the entirety of the Treasure to the service, even one paragraph at a time, and I decided that the paragraphs needed to be broken down further.[11]
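Grouping the pairs into batches of a fixed size is straightforward to sketch. The call to the service itself is left out, since its interface is documented only for accepted applicants; `make_batches` is a hypothetical helper and the fragments are placeholders.

```python
def make_batches(canon, fragments, batch_size=1000):
    """Yield lists of (canon, fragment) pairs, each at most batch_size long."""
    for start in range(0, len(fragments), batch_size):
        chunk = fragments[start:start + batch_size]
        yield [(canon, fragment) for fragment in chunk]

# Placeholder data: 2,500 dummy fragments paired with one canon.
fragments = [f"fragment {i}" for i in range(2500)]
batches = list(make_batches("canon text", fragments))

print([len(batch) for batch in batches])
```

Because `batch_size` is a parameter, shrinking the batches after a time-out (as described above, all the way down to two pairs) only requires changing one number.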

Dr. Rabb suggested that, instead of sending the entirety of the Treasure, we pick one or two chapters, since all we were trying to do was test the idea. She suggested the chapter devoted to testimony (kitāb al-shahādāt) for the evidence canon and the chapter on criminal law (kitāb al-jināyāt) for the doubt canon. Eliminating the burden of sending the entirety of the Treasure left us with a much more manageable task, though it introduced the complexity of finding the relevant chapters. OpenITI does mark the sections in its texts, presumably by reproducing the chapter and section structure of the repositories that it mined.[12] This presented a problem in the al-Maktaba al-Shāmila version of the Treasure: its headings did not include the titles of chapters, so finding the chapters on criminal law and testimony would have required manual effort.[13] On a lark, I decided to see whether the version OpenITI mined from al-Jāmiʿ al-Kabīr had the chapter titles. Alḥamdulillah (all praise is God’s), it did! Using this version had an additional boon: I noticed that OpenITI used the pipe symbol “|”, perhaps in a way that represented the period, in the sense that each group of words separated by it seemed to represent a semantic unit.[14] Partitioning along the pipe symbol would result in fragments larger than sentences but smaller than paragraphs. I decided to start with the chapter on testimony and search for the evidence canon. The testimony chapter consisted of 330 paragraphs,[15] which, after further partitioning along the pipe symbol, yielded 945 fragments. The 945 fragments still varied greatly in length, from 6 characters to as long as 3,929, with an average length of 231 characters.
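Partitioning paragraphs along the pipe symbol and summarizing the resulting lengths might look like the sketch below. The helper names `split_on_pipe` and `length_summary` are hypothetical, and the sample strings are placeholders rather than text from the testimony chapter.

```python
def split_on_pipe(paragraphs):
    """Split each paragraph on '|' and drop pieces that are empty after trimming."""
    fragments = []
    for paragraph in paragraphs:
        for piece in paragraph.split("|"):
            piece = piece.strip()
            if piece:
                fragments.append(piece)
    return fragments

def length_summary(texts):
    """Return (shortest, longest, mean) length in characters."""
    lengths = [len(t) for t in texts]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)

sample_paragraphs = ["one | two two", "three three three"]
fragments = split_on_pipe(sample_paragraphs)

print(fragments)
print(length_summary(fragments))
```

A summary like this makes it easy to see, before sending anything, whether a partition has actually brought the longest fragments down to a manageable size.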

I recited the basmala and attempted to send the 945 fragments to the GSSS. It still didn’t work. Although GSSS noted that its service could handle up to 1000 pairs of text, I found that sending even one pair at a time took too long for it to be practicable. Even by limiting the number of pairs I sent, I would often not get responses at all. I had one last trick up my sleeve, and so, despite the failure, I persisted.

I suspected that some of the fragments were still entirely too long, perhaps outside the bounds of what GSSS considered a “short paragraph.” I needed to find a way to divide up the fragments even more. From a programmatic perspective, dividing up a paragraph into fragments of an equal number of words is pretty easy. But doing so ran the following risk: what if I divided up the text precisely in a place that contained something similar to a canon? Fortunately, NLP provides a solution to this problem in the form of what is called an n-gram, a term of art in that field. Partitioning a text using the n-gram approach ensures that there is always an overlap of words between each individual fragment. The ‘n’ in the term n-gram represents the number of words each fragment of a text will contain. The image below gives a sense of how n-grams accomplish this:

Using n-grams ensures that at least one fragment will contain, in full, any passage similar to the canon, provided the passage is no longer than n words. The downside of using n-grams is, of course, the proliferation of texts you feed the Google service, most of which will be redundant. In addition, depending on how large an n-gram you use to partition your text, the most similar fragments will probably refer to the same sentence or passage. These difficulties can be overcome, and they are immaterial for the purposes of this very small experiment, which merely tests the viability of using semantic-similarity-based search to identify canons.
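A word-level n-gram partition with a step of one word can be sketched as follows. The function name `word_ngrams` is hypothetical, and the choice to return a short text as a single fragment is my own simplification.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words, sliding forward one word at a time.
    A text of n words or fewer is returned as a single fragment."""
    words = text.split()
    if not words:
        return []
    if len(words) <= n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = word_ngrams("a b c d e", 3)
print(grams)  # ['a b c', 'b c d', 'c d e']
```

A text of w words yields w - n + 1 fragments, each sharing n - 1 words with its neighbor. That overlap is what guarantees no short phrase is cut in half, and it is also the source of the redundancy noted above.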

Notes:

[1] Note, the reader will not be able to use Google semantic similarity service unless they apply for access and get accepted. Only once access has been given does Google make the technical documentation on how to use the service available. This requires, amongst other things, signing up for their cloud services.

[2] I performed this initial test by logging onto GSSS and manually typing in the “Input” and “Candidate” texts.

[3] The basmala is shorthand for the phrase bi-ismillah al-raḥmān al-raḥīm (in the name of God, most Gracious, most Merciful).

[4] For an analysis of Qarāfī’s work on canons and its relationship to the work of his Shāfiʿī teacher, Ibn ʿAbd al-Salām, see Mariam Sheibani, “Innovation, Influence, and Borrowing in Mamluk-Era Legal Maxim Collections: The Case of Ibn ʿAbd al-Salām and al-Qarāfī,” Journal of the American Oriental Society Forthcoming (2020).

[5] I owe this point to Intisar Rabb, and the translation of the text above is hers. Aḥmad b. Idrīs Qarāfī, al-Furūq [Anwār al-burūq fī anwāʿ al-Furūq], ed. Khalīl Manṣūr (Beirut: Dār al-Kutub al-ʿIlmiyya, 1998), 9.

[6] For an easy to use catalog of Islamic texts that OpenITI has rendered machine readable according to a largely uniform structure, see: https://kitab-corpus-metadata.azurewebsites.net/.

[7] Jupyter has created a notebook interface especially useful for researchers interested in data exploration and the quick, ad hoc development of scripts.

[8] For a description of OpenITI’s tagging scheme, see here. The way OpenITI used the hashtag in the two versions of the Treasure was not consistent. In the al-Maktaba al-Shāmila version, OpenITI used the hashtag to indicate not only paragraphs, but also the page and volume numbers of the original edited edition. In order to get only the paragraphs of the Treasure, I first had to filter out the information indicating page numbers. A second difficulty was that, for purposes of readability, OpenITI did not confine one paragraph to a single line. Rather, they artificially divided paragraphs across lines using two tilde symbols (‘~~’). So I needed to delete these as well before partitioning into paragraphs. Once these two issues were taken care of, one could simply partition based on the newline character (‘\n’).

[9] This number includes paragraphs composed of two things, the metadata OpenITI places at the beginning of the file documenting details such as the author’s information, the original text repository they mined, published edition that repository relied on, etc. It also includes the paragraphs containing the editor’s introduction. I deleted these paragraphs from the list I sent to GSSS.

[10] The mean length was ~678.

[11] I have a fairly strong intuition that many of the paragraphs I sent to GSSS were too long and that was the reason for the time-out. But, there is one other possibility that I should rule out but haven’t gotten a chance to – that I had not yet learned how to log-in to Google’s cloud service and present the security credentials in the proper manner to get consistent responses from the service.

[12] Here there is some inconsistency between the way OpenITI says it identifies chapter, section, and sub-section headings and the way it actually did so in the texts that I examined. OpenITI claims to differentiate between chapter, section, and sub-section headings by the use of pipe symbols, ‘|’: the more pipe symbols that follow a hashtag, the lower the level of sub-section it is supposed to represent. This scheme was not followed in either of the versions of the Treasure. Chapter, section, and sub-section titles were all uniformly indicated by a single hashtag followed by a single pipe symbol. For their description of how they tagged section headers, see the “—Section headers” section of their description of their tagging scheme. You can find it here.

[13] Finding the beginning and end of two chapters in a single book, while frustrating, would have been not very time-consuming. Now imagine trying to scale up the procedure to cover the entire corpus of Islamic legal texts.

[14] I found no place in which OpenITI described this use of pipe symbol.

[15] The smallest relevant paragraph consisted of 11 characters, and the largest 30,135, with a mean of 696.833.
